Data Quality Challenges in Community Systems
AnHai Doan, University of Wisconsin-Madison
Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton

Numerous Web Communities
– academic domains: database researchers, bioinformaticians
– infotainment: movie fans, mountain climbers, fantasy football
– scientific data management: biomagnetic databank, E. coli community
– business: enterprise intranets, tech support groups, lawyers
– CIA / homeland security: Intellipedia

Much Effort to Build Community Portals
– initially taxonomy-based (e.g., Yahoo style)
– but now many structured data portals, which capture the key entities and relationships of a community
– no general solution yet on how to build such portals

Cimple Project @ Wisconsin / Yahoo! Research
– develops such a general solution using extraction + integration + mass collaboration
[Cimple architecture diagram: Web pages and text documents (researcher homepages such as Jim Gray's, group pages, mailing lists, conference pages, DBworld, DBLP) feed extraction of entities and relations (e.g., give-talk at SIGMOD-04); the resulting portal supports keyword search, SQL querying, question answering, browsing, mining, alert/monitor, news summaries, and mass collaboration; sources are maintained and more are added over time.]

Prototype System: DBLife
– integrates data of the DB research community
– 1,164 data sources, crawled daily: 11,000+ pages = 160+ MB / day
– data extraction: e.g., mentions of Raghu Ramakrishnan
– data integration: e.g., co-authors = A. Doan, Divesh Srivastava, ...

Resulting ER Graph
[Example ER graph: "write" and "coauthor" edges connect Shivnath Babu and Pedro Bizarro to the paper "Proactive Re-optimization"; "advise" edges connect Jennifer Widom and David DeWitt to their students; "PC-member" and "PC-Chair" edges connect researchers to SIGMOD 2005.]

Provide Services
– the DBLife system builds services over this ER graph

Mass Collaboration: Voting
– e.g., a picture on a researcher's superhomepage is removed if enough users vote "no"
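The "vote to remove" mechanism above can be sketched as a simple threshold rule. This is a minimal sketch; the threshold value and data structures are illustrative assumptions, not DBLife's actual implementation.

```python
# Minimal sketch of the "vote to remove" mechanism described above.
# The threshold and data structures are illustrative assumptions,
# not DBLife's actual implementation.

REMOVE_THRESHOLD = 3  # assumed: item is removed after this many "no" votes


def vote(item_votes, item_id, user_vote):
    """Record a yes/no vote; return True if the item should be removed."""
    yes, no = item_votes.get(item_id, (0, 0))
    if user_vote == "no":
        no += 1
    else:
        yes += 1
    item_votes[item_id] = (yes, no)
    # remove once "no" votes pass the threshold and outnumber "yes" votes
    return no >= REMOVE_THRESHOLD and no > yes


votes = {}
vote(votes, "picture:jim_gray", "no")
vote(votes, "picture:jim_gray", "no")
print(vote(votes, "picture:jim_gray", "no"))  # True: enough users voted "no"
```

In a real system the threshold would likely be weighted by user reliability, which is exactly the "noisy users" challenge discussed later in the talk.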
Mass Collaboration via Wiki

Summary: Community Systems
– data integration systems + extraction + Web 2.0: manage both data and users in a synergistic fashion
– in sync with current trends:
  – manage unstructured data (e.g., text, Web pages)
  – get more structure (IE, Semantic Web)
  – engage more people (Web 2.0)
  – best-effort data integration, dataspaces, pay-as-you-go
– numerous potential applications
– but raises many difficult data quality challenges

Rest of the Talk
Data quality challenges in:
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration
Conclusions and ways forward

1. Source Selection
[Cimple architecture diagram, as before; the "maintain and add more sources" component is highlighted.]

Current Solutions vs. Cimple
– current solutions: find all relevant data sources (e.g., using focused crawling, search engines); maximize coverage; end up with lots of noisy sources
– Cimple: starts with a small set of high-quality "core" sources, then incrementally adds more sources, only from "high-quality" places or as suggested by users (mass collaboration)

Start with a Small Set of "Core" Sources
– key observation: communities often follow the 80-20 rule: 20% of the sources cover 80% of the interesting activities
– an initial portal over these 20% is often already quite useful
– how to select these 20%: collect as many sources as possible, then evaluate and select the most relevant ones

Evaluate the Relevancy of Sources
– use PageRank + virtual links across entities + TF/IDF, ...
– e.g., virtual links connect sources that mention the same entity, such as "Gerhard Weikum" and "G. Weikum"
– see [VLDB-07a]

Add More Sources over Time
– key observation: the most important sources will eventually be mentioned within the community
– so monitor certain "community channels" to find them, e.g., DBworld messages of type "conference announcement":
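The relevance computation on the "Evaluate the Relevancy of Sources" slide runs a PageRank-style score over a source graph whose edges include virtual links between sources that mention the same entity. Below is a minimal sketch under that idea; the adjacency-list graph, damping factor, and iteration count follow the standard PageRank formulation and are assumptions, not the exact algorithm of [VLDB-07a].

```python
# Minimal PageRank sketch over a source graph whose edges include
# "virtual links": two sources mentioning the same entity (e.g.,
# "Gerhard Weikum" and "G. Weikum") are linked even without a hyperlink.
# Illustrative only; not the exact algorithm of [VLDB-07a].

def source_relevance(links, damping=0.85, iters=50):
    """links: dict mapping a source to the sources it points to
    (via real hyperlinks or virtual entity links)."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank


# hypothetical graph: both homepage variants point to DBLP via
# virtual links (shared entity mentions), so DBLP ranks highest
graph = {
    "weikum_homepage": ["dblp"],
    "g_weikum_page": ["dblp"],
    "dblp": ["weikum_homepage"],
}
scores = source_relevance(graph)
print(max(scores, key=scores.get))  # dblp
```

The point of the virtual links is that a high-quality community source accumulates relevance from every source that discusses the same entities, even when no explicit hyperlinks exist between them.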
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation
Workshop on "Management of Uncertain Data"
in conjunction with VLDB 2007
http://mud.cs.utwente.nl ...

Also allow users to suggest new sources
– e.g., the Silicon Valley Database Society

Summary: Source Selection
– sharp contrast to current work: start with highly relevant sources, expand carefully, minimize "garbage in, garbage out"
– need a notion of source relevance
– need a way to compute it

2. Extraction and Integration
[Cimple architecture diagram, as before.]

Extracting Entity Mentions
Key idea: reasonable plan, then patch.
Reasonable plan:
– collect person names, e.g., David Smith
– generate variations, e.g., D. Smith, Dr. Smith, etc.
– find occurrences of these variations
– i.e., apply ExtractMbyName to Union(s1, ..., sn)
Works well, but can't handle certain difficult spots.

Handling Difficult Spots
– example: given the list "R. Miller, D. Smith, B. Jones", if "David Miller" is in the dictionary, the plan will wrongly flag "Miller, D." as a mention of that name
– solution: patch such spots with stricter plans, e.g., FindPotentialNameLists routes name lists to ExtractMStrict, while ExtractMbyName handles the rest

Matching Entity Mentions
Key idea: reasonable plan, then patch.
Reasonable plan:
– two mentions match if their names are the same (modulo some variation), e.g., David Smith and D. Smith
– i.e., apply MatchMbyName to the output of the extraction plan over Union(s1, ..., sn)
Works well, but can't handle certain difficult spots.

Handling Difficult Spots
– example: the DBLP page for Chen Li mixes publications of two different people:
  41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
  38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
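The name-only matching idea above can be sketched as normalizing each mention to a canonical form; the same sketch shows why it merges the two distinct Chen Lis, motivating a stricter matcher. The normalization rules here are illustrative assumptions, not Cimple's actual plan.

```python
# Minimal sketch of a MatchMbyName-style matcher: two mentions match if
# their names agree modulo simple variations (abbreviated first name).
# The normalization rules are illustrative, not Cimple's actual plan.

def normalize(name):
    """Reduce a name to (first-initial, last-name), lowercased."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def match_by_name(m1, m2):
    return normalize(m1) == normalize(m2)


print(match_by_name("David Smith", "D. Smith"))  # True: intended match
# The difficult spot: two different researchers both named Chen Li
# (VLDB 2007 vs. Applied Mathematics and Computation) also "match",
# so a stricter matcher (e.g., one that also compares co-authors)
# is needed on ambiguous sources such as DBLP.
print(match_by_name("Chen Li", "Chen Li"))       # True, though wrong here
```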
– solution: apply MatchMbyName to {s1, ..., sn} \ DBLP, and a stricter matcher MatchMStrict to DBLP
– estimate the semantic ambiguity of data sources using social-networking techniques [see ICDE-07a]
– apply stricter matchers to more ambiguous sources

Going Beyond Sources: Difficult Data Spots Can Cover Any Portion of Data
– e.g., an even stricter matcher MatchMStrict2 is applied just to the mentions that match "J. Han", on top of MatchMStrict over DBLP and MatchMbyName over the rest

Summary: Extraction and Integration
– most current solutions try to find a single good plan, applied to all of the data
– Cimple solution: reasonable plan, then patch
– so the focus shifts to: how to find a reasonable plan? how to detect problematic data spots? how to patch them?
– need a notion of semantic ambiguity, different from the notion of source relevance

3. Detecting Problems and Providing Feedback
[Cimple architecture diagram, as before.]

How to Detect Problems?
– after extraction and matching, build services, e.g., superhomepages
– many such homepages contain minor problems, e.g.:
  – X graduated in 19998
  – X chairs SIGMOD-05 and VLDB-05
  – X published 5 SIGMOD-03 papers
– intuitively, something is semantically incorrect
– to fix this, let's build a Semantic Debugger:
  – learns what a normal profile looks like for a researcher, paper, etc.
  – alerts the builder to potentially buggy superhomepages, so feedback can be provided

What Types of Feedback?
– say that a certain data item Y is wrong
– provide the correct value for Y, e.g., Y = SIGMOD-06
– add domain knowledge, e.g., no researcher has ever published 5 SIGMOD papers in a year
– add more data, e.g., X was advised by Z, or here is the URL of another data source
– modify the underlying algorithm, e.g., pull out all data involving X, then match using names and co-authors, not just names

How to Make Providing Feedback Very Easy?
– "providing feedback" for the masses, in sync with current trends of empowering the masses
– extremely crucial in the DBLife context: if feedback can be provided easily, we can get more feedback and leverage the mass of users
– but this turned out to be very difficult
– form interfaces can cover saying an item is wrong, correcting its value, and adding domain knowledge or data; a wiki interface can cover modifying the underlying algorithm
– critical in our experience, but unsolved; some recent interest on how to mass-customize software
– see our IEEE Data Engineering Bulletin paper on user-centric challenges, 2007

What Feedback Would Make the Most Impact?
– I have one hour of spare time and would like to "teach" DBLife: what problems should I work on? what feedback should I provide?
– need a Feedback Advisor:
  – define a notion of system quality Q(s)
  – define questions q1, ..., qn that DBLife can ask users
  – for each qi, evaluate its expected improvement in Q(s)
  – pick the question with the highest expected quality improvement
– observations: a precise notion of system quality is now crucial, and it should model the expected usage

Summary: Detection and Feedback
– how to detect problems? a Semantic Debugger
– what types of feedback, and how to provide them easily? critical, largely unsolved
– what feedback would make the most impact? crucial in large-scale systems; need a Feedback Advisor and a precise notion of system quality

4. Mass Collaboration
[Cimple architecture diagram, as before; the maintenance-and-expansion components are highlighted.]

Mass Collaboration: Voting
– can be applied to numerous problems
– example: does "Dell laptop X200 with mouse ..." match "Mouse for Dell laptop 200 series ..." or "Dell X200; mouse at reduced price ..."? hard for machines, but easy for humans

Challenges
– how to detect and remove noisy users? evaluate them using questions with known answers
– how to combine user feedback? e.g., # of yes votes vs. # of no votes
– see [ICDE-05a, ICDE-08a]

Mass Collaboration: Wiki
– community wikipedia: built by machine + human, backed by a structured database
[Diagram: data sources feed machine-generated wiki pages V1/W1, V2/W2, V3/W3; a user u1 edits V3/W3 into V3'/W3', which updates the underlying tuples T3'.]

Mass Collaboration: Wiki (example)
The machine generates a page with embedded structured markup:
  <# person(id=1){name}=David J. DeWitt #>
  <# person(id=1){title}=Professor #>
  <strong>Interests:</strong>
  <# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
A human edits the title to <# person(id=1){title}=John P. Morgridge Professor #> and adds <# person(id=1){organization}=UW #> since 1976.
The machine then refreshes the page, normalizing the organization to <# person(id=1){organization}=UW-Madison #> and adding a newly extracted interest <# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>.
Rendered page: David J. DeWitt, John P. Morgridge Professor, UW-Madison since 1976. Interests: Parallel Database, Privacy.

Sample Data Quality Challenges
How to detect noisy users?
– no clear solution yet; for now, limit editing to trusted editors, and modify the notion of system quality to account for this
How to combine feedback and handle inconsistent data?
– user vs. user, user vs. machine
How to verify claimed ownership of data portions?
– e.g., "this superhomepage is about me, so only I can edit it"
See [ICDE-08b].

Summary: Mass Collaboration
– what can users contribute?
– how to evaluate user quality?
– how to reconcile inconsistent data?

Additional Challenges
– dealing with evolving data (e.g., matching)
– iterative code development
– lifelong quality improvement
– querying over inconsistent data
– managing provenance and uncertainty
– generating explanations
– undo

Conclusions
– community systems: data integration + IE + Web 2.0; potentially very useful in numerous domains
– such systems raise myriad data quality challenges: they subsume many current challenges and suggest new ones
– they can provide a unifying context for us to make progress: building systems has been a key strength of our field, and we need a community effort, as always
– search for "cimple wisc" for more detail; let us know if you want code/data