Web Information Retrieval Projects
Ida Mele

Rules
• Students can work in teams (max 3 people)
• The project must be delivered by the deadline that will be published on my web site. Usually the project discussion takes place on the same day as the written exam. Students who register for the first exam call can present the software project in the first or in the second exam call
• The project score is from 0 to 10. The professor decides the final mark
• The same project can be assigned to max 2 groups
• For any question/doubt/problem, send me an email

Project Request
• Students have to send me an email with subject: WebIR project request, specifying:
  • Name and last name of each student in the group
  • Title of the project and dataset the students intend to use
  • Short description of what the students intend to do (up to 250 words)
  Important: all the members of the group should be cc-ed in the email
• If everything is OK, you will receive a confirmation email
• There is no deadline for the request of the project

Project Delivery
• The presentation of the project takes 15 minutes
• The presentation should contain:
  • the description of the problem and of the dataset
  • the most important issues related to the implementation, and how they have been addressed
  • the results achieved
• Students can use slides for their presentations and, if they want, they can prepare a demo as well
• Deadline and more instructions about the project delivery will be published on my web site

List of Projects
1) Analyze the link structure of a large graph from the Web
2) Find circles in a social network through link analysis
3) Find communities in a network of users
4) Classification of online reviews
5) Topic classification of tweets
6) Personalized ranking of query results
7) Hadoop implementation of a link-based ranking algorithm
8) Hadoop implementation of an inverted index

Projects
1) Analyze the link structure of a large
graph from the Web
• Create the web graph and analyze its link structure by computing degree, in-degree, out-degree, PageRank, TruncatedPageRank, edge reciprocity, graph assortativity, number of triangles, etc. Plot the distributions of the features
• List of datasets you can use:
  • http://law.di.unimi.it/datasets.php use one of the graphs available in Section Larger crawls
  • http://snap.stanford.edu/data/index.html use graphs in Section Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)
  • http://webdatacommons.org/hyperlinkgraph/ use the graph representing subdomains

2) Find circles in a social network through link analysis
• Create the graph of the users of a popular social network (e.g., Twitter, Facebook, or Google+). Analyze the network and apply link-based features to identify circles. Check whether the circles you get match the ones obtained from the analysis of common features
• List of datasets you can use:
  • http://snap.stanford.edu/data/index.html use one of the ego graphs available in Section Social networks: ego-Facebook, ego-Gplus, or ego-Twitter. Each dataset is made of the ego network, the set of circles for the ego node, and the connections among ego networks. You can use the file with the set of circles as a ground truth

3) Find communities in a network of users
• Create a graph where nodes are people and a link between two people represents the fact that they have something in common. For example, they are collaborators (DBLP co-authorship network) or they have bought the same product (Amazon product co-purchasing network), etc.
Use this graph to find communities of people and check the results against the ground truth provided in the dataset
• List of datasets you can use:
  • http://snap.stanford.edu/data/index.html use one of the graphs available in Section Networks with ground-truth communities (e.g., com-DBLP, com-Amazon, com-YouTube, com-Friendster)

4) Classification of online reviews
• Given a set of user reviews about products (food, wine, etc.), analyze the text and other features to create a classification of the reviews. Some possible classifications are dividing reviews by kind/brand of product, by judgment (positive/neutral/negative), by helpfulness, etc.
• List of datasets you can use:
  • http://snap.stanford.edu/data/index.html use data available in Section Online Reviews (e.g., CellarTracker, Amazon reviews, Fine Foods, Movies)

5) Topic classification of tweets
• Given a set of English tweets, implement a topic-classification algorithm which divides tweets into categories. Possible categories are personal updates, news, politics, economics, sports, music, gossip, etc. You can also use ODP categories (http://www.dmoz.org/) for creating the list of possible topics
• List of datasets you can use:
  • Send me an email, and I will give you the link to the dataset you can download

6) Personalized ranking of query results
• Create a system for query-result personalization. The users of the system can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). You can use an HTML form for the registration to the system.
• Crawl a portion of the web (e.g., news websites) and create the corresponding web graph. Use a personalized ranking algorithm, for example, Topic-Specific PageRank, for ranking the pages according to user interests, and compare the personalized ranking against the non-personalized one.
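
As an illustration of the personalized ranking mentioned for project 6, here is a minimal sketch of Topic-Specific PageRank computed by power iteration. It is only one possible formulation, not a required implementation: the function name `personalized_pagerank`, the damping value 0.85, the number of iterations, and the tiny `web_graph` are all assumptions made for the example. The only difference from plain PageRank is that the random jump lands on the pages matching the user's interests instead of on any page.

```python
def personalized_pagerank(graph, topic_pages, damping=0.85, iterations=50):
    """Topic-Specific PageRank by power iteration (illustrative sketch).

    graph: dict mapping each page to the list of pages it links to.
    topic_pages: pages matching the user's interests; random jumps land only here.
    """
    # Collect every page, including pages that appear only as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    topic_pages = set(topic_pages) & nodes
    if not topic_pages:
        raise ValueError("topic_pages must contain at least one page of the graph")

    # Teleport vector: uniform over the topic pages, zero elsewhere.
    teleport = {n: (1.0 / len(topic_pages) if n in topic_pages else 0.0) for n in nodes}
    scores = dict(teleport)

    for _ in range(iterations):
        new_scores = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * scores[node] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Dangling page: hand its score back through the teleport vector.
                for n in topic_pages:
                    new_scores[n] += damping * scores[node] * teleport[n]
        scores = new_scores
    return scores


# Hypothetical three-page graph, used only to compare the two rankings.
web_graph = {
    "news/home": ["news/sport", "news/politics"],
    "news/sport": ["news/home"],
    "news/politics": ["news/home"],
}
personalized = personalized_pagerank(web_graph, {"news/sport"})
non_personalized = personalized_pagerank(web_graph, set(web_graph))
```

Passing all pages as `topic_pages` gives the non-personalized ranking, so the same function can produce both sides of the comparison. In this toy graph the sport page scores higher than the politics page only in the personalized run.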

7) Hadoop implementation of a link-based ranking algorithm
• Given a web graph, where nodes represent web pages and the edge between two nodes u and v represents the link from the source page u to the target page v, implement in Hadoop a ranking algorithm (PageRank or HITS) to compute the scores of the nodes. Plot and analyze the distribution of the obtained scores
• List of datasets you can use:
  • http://law.di.unimi.it/datasets.php use one of the graphs available in Section Larger crawls
  • http://snap.stanford.edu/data/index.html use graphs in Section Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)

8) Hadoop implementation of an inverted index
• Given a large collection of documents, create the inverted index, which is made of a dictionary and the posting lists. The dictionary contains indexed terms (remove stop-words and use stemming for preprocessing). For each term in the dictionary, the posting list contains information about documents where the term appears.
Each posting has the ID of the document, the frequency of the term in the document, and the positions of the occurrences of the term in the document
• List of datasets you can use:
  • Gutenberg project (http://www.gutenberg.org/) offers free ebooks that can be used for creating the document collection

Important Information
• Students can choose one of the projects in the list, or they can propose a different project
• There are no constraints on the datasets to use: the students can use the datasets suggested in the list of projects or different datasets available on the Web, or they can even create a new dataset for their project
• Links to other dataset sources:
  • http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm
  • http://www.trustlet.org/wiki/Repositories_of_datasets
  • http://www-personal.umich.edu/~mejn/netdata/
• There are no constraints on programming languages, libraries, and tools to use
• Links to some tools/libraries for working with graphs:
  • Graph visualization: Gephi (http://gephi.org/), Graphviz (http://www.graphviz.org/)
  • Large-graph partitioning: METIS (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview)
  • Java libraries: WebGraph (http://webgraph.di.unimi.it/), JUNG (http://jung.sourceforge.net/)
  • Python library: NetworkX (http://networkx.github.io/)
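
For project 8, the posting layout described in the text (document ID, term frequency, and positions of the occurrences) can be sketched on a single machine as follows. This is not a Hadoop implementation and makes several assumptions: the names `build_index` and `postings`, the tiny `STOP_WORDS` set, and the two sample documents are hypothetical, and stemming is left out to keep the sketch short. In a MapReduce version, the inner loop would roughly correspond to the map step (emit a term with its document ID and position) and the grouping by term to the reduce step.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; a real project would use a full one).
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}


def build_index(documents):
    """Build a positional inverted index.

    documents: dict mapping doc_id to the document text.
    Returns a dict mapping each term to {doc_id: [positions]}.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            term = token.strip(".,;:!?\"'()")
            if not term or term in STOP_WORDS:
                continue  # stop-words are skipped but still count as positions
            index[term].setdefault(doc_id, []).append(position)
    return index


def postings(index, term):
    """Return the posting list as (doc_id, term_frequency, positions) tuples."""
    return [
        (doc_id, len(positions), positions)
        for doc_id, positions in sorted(index.get(term, {}).items())
    ]


# Hypothetical two-document collection.
docs = {1: "The whale and the sea", 2: "The sea, the sea"}
index = build_index(docs)
sea_postings = postings(index, "sea")
# sea_postings == [(1, 1, [4]), (2, 2, [1, 3])]
```

Positions are token offsets in the original text, so they stay valid for phrase queries even though stop-words are not indexed.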