Jing Gao and Lu Su
Department of Computer Science and Engineering
One important challenge in big data is information veracity, i.e., information sources might not be reliable. Before we can make good use of the data, we need to identify the true facts among the conflicting information from multiple data sources, where a data source can be a database, a website, a sensor, or even a person. Facing the daunting scale of data, it is unrealistic to expect humans to label or verify which piece of information is correct. Therefore, we have to aggregate the information from multiple sources, which will likely cancel out the errors of individual sources and reveal the true information. One straightforward aggregation approach is majority voting or averaging. The drawback of this simple approach is obvious: it treats all the sources equally and fails to capture the variation in their reliability. Intuitively, if we can identify and put more weight on the reliable sources, the aggregation accuracy can be significantly improved. To achieve this, a major challenge has to be addressed: there is usually neither prior knowledge nor training data from which source reliability can be derived.
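As a rough illustration of the baseline, a majority-voting aggregator might look like the following sketch. The `claims` structure and function name are hypothetical, introduced only for this example; note that every source counts equally, which is exactly the drawback described above.

```python
from collections import Counter

def majority_vote(claims):
    """Aggregate conflicting claims per object by simple majority voting.

    claims: dict mapping object -> list of values reported by sources.
    All sources are treated equally, regardless of their reliability.
    """
    return {obj: Counter(values).most_common(1)[0][0]
            for obj, values in claims.items()}

# Three sources report weather conditions for two cities; an unreliable
# source's vote counts just as much as a reliable one's.
claims = {
    "city_a": ["sunny", "sunny", "rain"],
    "city_b": ["rain", "snow", "snow"],
}
print(majority_vote(claims))  # {'city_a': 'sunny', 'city_b': 'snow'}
```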
In light of this challenge, the topic of truth discovery has gained increasing popularity recently due to its ability to estimate source reliability degrees and infer true information. In truth discovery, the following two processes are tightly coupled: The sources that provide true information more often will be assigned higher weights, and the information that is supported by reliable sources will be regarded as truths. The success of truth discovery methods has been clearly demonstrated in a wide variety of tasks where decisions have to be made based on the correct information from diverse sources. Typical examples include the integration of Web information for structured knowledge base construction and the aggregation of user-contributed information on crowdsourcing platforms. These and other applications demonstrate the broader impact of truth discovery on multi-source information integration.
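The two tightly coupled processes can be sketched as a minimal iterative algorithm that alternates weighted voting with source re-weighting. This is a generic illustration of the idea, not the specific formulation of any published truth discovery method; the data layout and names are assumptions made for this example.

```python
def truth_discovery(claims, iters=10):
    """Minimal iterative truth discovery sketch (illustrative only).

    claims: dict source -> dict object -> claimed value.
    Alternates between (1) weighted voting to estimate truths and
    (2) re-weighting each source by how often it matches the truths.
    """
    sources = list(claims)
    weights = {s: 1.0 for s in sources}          # start with equal trust
    objects = {o for c in claims.values() for o in c}
    truths = {}
    for _ in range(iters):
        # Step 1: weighted vote for each object's truth.
        for o in objects:
            votes = {}
            for s in sources:
                if o in claims[s]:
                    v = claims[s][o]
                    votes[v] = votes.get(v, 0.0) + weights[s]
            truths[o] = max(votes, key=votes.get)
        # Step 2: a source's weight is the fraction of its claims
        # that agree with the current truth estimates.
        for s in sources:
            hits = sum(truths[o] == v for o, v in claims[s].items())
            weights[s] = hits / len(claims[s])
    return truths, weights

# Sources A and B are mostly reliable; source C is not.
claims = {
    "A": {"o1": "x", "o2": "y", "o3": "z"},
    "B": {"o1": "x", "o2": "y", "o3": "w"},
    "C": {"o1": "q", "o2": "r", "o3": "w"},
}
truths, weights = truth_discovery(claims)
```

After a few iterations, C ends up with the lowest weight, and its claims carry little influence on the estimated truths, illustrating how the two processes reinforce each other.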
To apply truth discovery to spatial-temporal data, it is important to take the spatial-temporal relationships between objects into consideration in the truth discovery process. Such correlation information can greatly benefit truth discovery: information obtained from reliable sources can be propagated over all correlated objects, so that the aggregated information is more trustworthy. In particular, taking correlations into consideration is especially helpful when the coverage rate is low, i.e., when sources provide observations for only a small portion of the objects. In such cases, many objects may receive observations only from unreliable sources, and reliable information borrowed from correlated objects becomes important.
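One way this propagation might be folded in for continuous observations (e.g., temperatures) is to blend each object's estimate with the estimates of its correlated neighbors after every weighted-averaging step. The `neighbors` structure, the blending coefficient `alpha`, and the inverse-error weighting below are illustrative assumptions for this sketch, not the actual TD-corr formulation.

```python
def td_with_correlation(claims, neighbors, alpha=0.3, iters=10):
    """Sketch of correlation-aware truth discovery for continuous values.

    claims:    dict source -> dict object -> observed value.
    neighbors: dict object -> list of spatially/temporally correlated
               objects whose estimates should influence this object.
    After each weighted-average step, every object's estimate is blended
    with its neighbors' estimates, so information from reliable sources
    propagates to objects with sparse or unreliable coverage.
    """
    sources = list(claims)
    objects = {o for c in claims.values() for o in c}
    weights = {s: 1.0 for s in sources}
    truths = {o: 0.0 for o in objects}
    for _ in range(iters):
        # Step 1: weighted average of the claims on each object.
        for o in objects:
            num = den = 0.0
            for s in sources:
                if o in claims[s]:
                    num += weights[s] * claims[s][o]
                    den += weights[s]
            if den > 0:
                truths[o] = num / den
        # Step 2: propagate estimates over correlated objects.
        smoothed = {}
        for o in objects:
            nbr = [truths[n] for n in neighbors.get(o, []) if n in truths]
            if nbr:
                smoothed[o] = (1 - alpha) * truths[o] + alpha * sum(nbr) / len(nbr)
            else:
                smoothed[o] = truths[o]
        truths = smoothed
        # Step 3: weight each source inversely to its mean squared error.
        for s in sources:
            err = sum((truths[o] - v) ** 2
                      for o, v in claims[s].items()) / len(claims[s])
            weights[s] = 1.0 / (err + 1e-6)
    return truths, weights

# Object o3 is covered only by the unreliable source C, but it is
# correlated with o1 and o2, so their estimates pull o3 toward
# a more trustworthy value.
claims = {
    "A": {"o1": 20.0, "o2": 21.0},
    "B": {"o1": 20.2, "o2": 20.8},
    "C": {"o1": 26.0, "o3": 30.0},
}
neighbors = {"o3": ["o1", "o2"]}
truths, weights = td_with_correlation(claims, neighbors)
```

Without the neighbor-blending step, o3's estimate would be determined entirely by the unreliable source C; with it, the estimate moves toward the values of the correlated, better-covered objects.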
A preliminary study on weather condition estimation is demonstrated in Figure 1. We collect weather forecast data from three weather forecast platforms (Wunderground, HAM Weather, and World Weather Online) for 147 locations within the New York City area. From each platform, we collect temperature forecasts for different days; the whole collection process lasts over a month. On this dataset, we conduct truth discovery considering spatial-temporal correlations (referred to as TD-corr) and compare it with a baseline that does not consider the correlations. Results show the improved performance of TD-corr: with the correlation information, the proposed TD-corr approach can better estimate the truths. Due to the intertwined estimation of truths and source reliability, the improved truth estimation in turn leads to an improved estimation of source reliability.