Research Internships Advanced Research and Modeling Research Group ADREM – What? • Research group that deals with computational aspects of data – databases – data mining – Information retrieval ADREM – Who? DB/DM/IR • Floris Geerts • Bart Goethals • Martin Theobald Bioinf • Kris Laukens • Tim Van den Bulcke + Phd students and postdoctoral researchers http://adrem.ua.ac.be/adrem Internships – What? • 2 research internships (15 credits each) • Msc thesis (30 credits). Goal: internships are an initiation to research and is in collaboration with researchers in ADReM • 15 credits is a lot = internship is time consuming! • 1 credit = 15 hour work… • Balance your course load and internship well. • Internships are not necessarily related to your Msc thesis (but it can) • In a Msc thesis your ability to independently do research plays an important role. Internships – Who? • Everyone who follows the research option in the database Msc program Research In an internship you need to: 1.Understand a specific problem 2.Implement an (existing) method for solving the problem 3.Test and evaluate 4.Write a report (Msc thesis: you have to solve the problem as well by designing new methods…) Internships in a company • It is allowed to do a internship in a company but you have to ask permission • Also, you have to find the company yourself and convince us that there is research involved • You can’t receive any money from the company during your internship Databases, data mining, information retrieval • These are not separate research domains • The topics for internships that each of us will present next are usually on the intersection of these areas. • Let’s see some example topics…. Bart Goethals Recommender Systems • • • • • Implement state of the art recommenders Pattern mining for better recommendations Interactive Recommendation Explaining recommendations Test recommenders for real data Visual Instant Interactive Pattern Mining • Study Visualizations enabling Interactive Pattern Mining • Implement and Experiment with novel instant mining methods Pattern based Clustering • Implement and evaluate different techniques for clustering based pattern mining, and pattern based clustering Data Mining for Cleaning • Study and experiment with data mining methods for data cleaning. Martin Theobald Information Extraction (I): Wikipedia Infoboxes Information Extraction (I): Infoboxes YAGO/DBpedia et al. bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) >120 M facts for YAGO2 (mostly from Wikipedia infoboxes) Information Extraction (II): Wikipedia Categories Information Extraction (II): Wikipedia Categories RDF Knowledge Bases 3 Mio. entities, 120 Mio. facts 100 relations, 200k classes Entity subclass subclass Organization accuracy 95% subclass Person Location subclass subclass Scientist subclass subclass Biologist subclass Politician instanceOf subclass subclass State instanceOf Physicist Country instanceOf City instanceOf Germany instanceOf Oct 23, 1944 instanceOf Max_Planck Society instanceOf Erwin_Planck diedOn Nobel Prize Kiel hasWon FatherOf locatedIn locatedIn bornIn SchleswigHolstein citizenOf Oct 4, 1947 Apr 23, 1858 diedOn Max_Planck bornOn means means Angela Merkel “Max Planck” http://www.mpi-inf.mpg.de/yago-naga/ means “Max Karl Ernst Ludwig Planck” means “Angela Merkel” means “Angela Dorothea Merkel” Linked Open Data As of Sept. 2011: > 200 sources > 30 billion RDF triples > 400 million links http://linkeddata.org/ As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase IBM Watson: Deep Question Answering William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel This town is known as "Sin City" & its downtown is "Glitter Gulch" As of 2010, this is the only former Yugoslav republic in the EU 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain question classification & decomposition knowledge back-ends D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010. YAGO www.ibm.com/innovation/us/watson/index.htm Jeopardy! A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Structured Knowledge Queries A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Select Distinct ?c Where { ?c type City . ?c locatedIn USA . ?a1 type Airport . ?a2 type Airport . ?a1 locatedIn ?c . ?a2 locatedIn ?c . ?a1 namedAfter ?p . ?p type WarHero . ?a2 namedAfter ?b . ?b type BattleField . } • Use manually created templates for mapping sentence patterns to structured queries. • Works for factoid and list questions. Mining Rules from RDF Knowledge Bases Goal: Inductively learn (soft) rules: livesIn(x,y) :- bornIn(x,y) R Ground truth for livesIn (only partially known) Knowledge base for livesIn (known positive examples) Facts produced by the rule (only partially correct) KB G | Head Body | accuracy ( S ) P( Head | Body ) | Body | • A-priori-style pre-filtering of low-support join patterns • Dynamic programming ILP algorithm • Learning with constants and type constraints Rule-based Reasoning (Soft) Deduction Rules vs. (Hard) Consistency Constraints • People may live in more than one place livesIn(x,y) marriedTo(x,z) livesIn(z,y) [0.8] livesIn(x,y) hasChild(x,z) livesIn(z,y) [0.5] • People are not born in different places/on different dates bornIn(x,y) bornIn(x,z) y=z • People are not married to more than one person (at the same time, in most countries?) marriedTo(x,y,t1) marriedTo(x,z,t2) y≠z disjoint(t1,t2) Probabilistic RDF Database Query graduatedFrom(Surajit, y) 0.7x(1-0.888)=0.078 graduatedFrom (Surajit, Q1 Princeton) (1-0.7)x0.888=0.266 graduatedFrom (Surajit, Q2 Stanford) A(B (CD)) A(B (CD)) 1-(1-0.72)x(1-0.6) =0.888 \/ 0.8x0.9 =0.72 A graduatedFrom (Surajit, Princeton)[0.7] C hasAdvisor (Surajit,Jeff)[0.8] B graduatedFrom /\ D (Surajit, Stanford)[0.6] worksAt (Jeff,Stanford)[0.9] Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z) [0.4] graduatedFrom(x,y) graduatedFrom(x,z) y=z Base Facts graduatedFrom(Surajit, Princeton) [0.7] graduatedFrom(Surajit, Stanford) [0.6] graduatedFrom(David, Princeton) [0.9] hasAdvisor(Surajit, Jeff) [0.8] hasAdvisor(David, Jeff) [0.7] worksAt(Jeff, Stanford) [0.9] type(Princeton, University) [1.0] type(Stanford, University) [1.0] type(Jeff, Computer_Scientist) [1.0] type(Surajit, Computer_Scientist) [1.0] type(David, Computer_Scientist) [1.0] Temporal Knowledge Probabilistic-Temporal Consistency Reasoning Derived Facts t3 teamMates(Beckham, Ronaldo, Ronaldo, Tt3) State Relation 0.08 ‘03 0.4 Base Facts ‘04 0.16 playsFor(Beckham, Real, T1) playsFor(Ronaldo, Real, T2) overlaps(T1,T2) 0.12 ‘05 ‘07 0.6 ‘05 ‘07 ‘03 playsFor(Beckham, Real, T1) 0.1 0.2 0.4 0.2 ‘00 ‘02 ‘07 ‘04 ‘05 playsFor(Ronaldo, Real, T2) Topics for Internships & Master Theses Research Internships • Preparation & Integration of Linked Data Sources for Scientific Experiments (SQL/Java/Python) • Mining Association Rules from Linked Data (Java/C++) • Visualization Frontend for Linked Data (ActionScript & Adobe Flash) Master Theses • Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface) • Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface) • Faceted Search and Interactive Browsing for Linked Data Floris Geerts RDBMS-based recommendation systems EDI NY Find top-3 flights from Edi to NYC with at most one stop Items: flights Selection criteria: relational queries Utility function: in terms of price and duration (for ranking) Top-k item selection Selection criteria Utility function top-k items … items 32 Books, music, news, Web sites, research papers,….. Query relaxation Query for 5-day holiday Q(f#, name,type,ticket, time) = ∃DT, AT, AD, xTo ( flight ( f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr ) ∧ POI ( name, xTo, type, ticket, time) ∧xTo= NYC ) Relaxation: cities There is no direct flight within 15 miles of EDI from EDI to NYC or NYC are acceptable E = { EDI,NYC,4/1/2012 }, X = { xTo } Q1(f#, name, type, ticket, time) =∃DT, AT, AD, uTo, wEdi, wNYC,wDD ( flight ( f#, wEdi, xTo, DT ,wDD, AT,A D, Pr ) ∧ xTo= wNYC ∧ POI( name, uTo, type, ticket, time) ∧ wDD=5/19/2012 ∧ dist(wNYC,NYC)≤15 ∧ dist(wEdi,EDI) ≤15 ∧ xTo=uTo) Further relaxation: departure dates within 3 days of 5/19/2012 arequery relaxation dist(wDD,5/10/2012 ) ≤ 3 valid 33 acceptable Topics Top-k query answering algorithm on top of RDBMS Query relaxation approaches and query completion Data quality • Detecting and correcting inconsistencies • Finding duplicates • Finding most up-to-date information Semantic errors Yahoo! Finance Day’s Range: 93.80-95.71 Nasdaq 52wk Range: 25.38-95.71 Day’s Range: 93.80-95.71 52 Wk: 25.38-93.72 Instance ambiguity Out-of-Date Data 4:05 pm 3:57 pm Unit errors 76.82B 76,821,000 Topics Fast inconsistency detection Duplication elimination algorithms Automated repairing algorithms Mining of “data quality rules”