Notes on Final Project of MIR Course
Part I: Crawling Phase
Modern Information Retrieval Course, Semantic Web Research Laboratory

Crawling Phase

Crawling the Dmoz directory
- It has a taxonomic (tree-like) structure
- Each subdirectory is handled by one group

This tree-like structure has two important components:
- Internal nodes (also known as “topics”)
- Leaves (also known as “pages”)
[Figure: tree diagram with labels “Topics” and “Pages”]

Each topic has:
- a list of children (subtopics)
- a unique path to the root node (supertopics)
- a description
- a list of related pages
And each page has:
- a topic

Each topic has some characteristics:
- Description of the current topic
- The current topic (node)
- List of supertopics
- List of subtopics
- List of related pages (leaves)

Deliverables for the first phase:
- TopicNames.txt
  Each line contains a topic number and the full name of that topic, separated by a tab character (e.g. 46 Top/Science/Agriculture).
- TopicDescs.txt
  Each line contains a topic number and the description of that topic, separated by a tab character. For some topics, the description is a zero-length string.
- TopicHierarchy.txt
  Each line contains a pair of topic numbers, separated by a tab character. The first of these two topics is the parent of the second. Each topic has exactly one parent, except for the root (topic 0), which has no parent.
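The tab-separated topic files described above can be read back and sanity-checked with a few lines of Python. This is only a minimal sketch: the function names (parse_numbered_lines, parse_hierarchy, every_topic_has_parent) and the sample rows other than "46 Top/Science/Agriculture" are assumptions made for illustration, not part of the assignment.

```python
def parse_numbered_lines(lines):
    """Parse 'number<TAB>text' lines (TopicNames.txt / TopicDescs.txt style)."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        number, _, text = line.partition("\t")
        result[int(number)] = text  # text may be a zero-length string
    return result


def parse_hierarchy(lines):
    """Parse 'parent<TAB>child' lines (TopicHierarchy.txt style) into child -> parent."""
    parent_of = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parent, child = (int(field) for field in line.split("\t"))
        if child in parent_of:
            raise ValueError(f"topic {child} has more than one parent")
        parent_of[child] = parent
    return parent_of


def every_topic_has_parent(topic_ids, parent_of, root=0):
    """Check that the root has no parent and every other topic has one."""
    if root in parent_of:
        return False
    return all(topic in parent_of for topic in topic_ids if topic != root)


# Hypothetical sample rows; only the line for topic 46 comes from the slides.
sample_names = ["0\tTop\n", "1\tTop/Science\n", "46\tTop/Science/Agriculture\n"]
sample_hierarchy = ["0\t1\n", "1\t46\n"]

topics = parse_numbered_lines(sample_names)
parents = parse_hierarchy(sample_hierarchy)
print(every_topic_has_parent(topics, parents))
```

Running a check like this before submission catches topics that are missing from TopicHierarchy.txt or that appear with two parents.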
Deliverables for the first phase (continued):
- DocUrls.txt
  Each line contains a document number and its URL, separated by a tab character.
- DocTitles.txt
  Each line contains a document number and its title, separated by a tab character.
- DocTopics.txt
  Each line contains a document number and a topic number, separated by a tab character. This indicates that the document belongs to the given topic.
- Documents.zip
  The contents of the documents, separately.

Samples of each output file (for the “Science” subdirectory) have been added to the Assignments page.

Naming convention:
Names in each subdirectory start with a special character:

Subdirectory      Char    Subdirectory    Char
Arts              A       Recreation      I
Business          B       Reference       J
Computers         C       Regional        K
Games             D       Science         L
Health            E       Shopping        M
Home              F       Society         N
Kids and Teens    G       Sports          O
News              H

Then, for each subtree, generate numeric names for the children in BFS order, e.g. in the Science subdirectory:
[Figure: sample tree with a legend (Sample Topic, Sample Page) and node labels 1 to 5 and L1 to L8]

Assignments of subdirectories to groups:

Subdir.           Group                         Subdir.       Group
Arts              Abbasi / Kord-Zadeh           Recreation    Mirjalaali / Sayyedi
Business          Ahangaraan / Samad-Zadegan    Reference     Nokhbe-Zaeim / Tabaatabaaei
Computers         Ashraf / Rahimi (M.)          Regional      Omid / Arab
Games             Darvishi / Rahimi (A.)        Science       Qaderian / KhorramZadeh
Health            Falaki / Vaezi                Shopping      Qazvinian / Rsoulian
Home              Fathi / Sadjadi               Society       Saremi / Mashayekhi
Kids and Teens    Iranmanesh / Takhtaai         Sports        Shafi'i-Nowroozi
News              Kazemi-Tabar / Parsa          Computers     Vafaai / Jalili
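
The BFS naming rule for a subtree can be illustrated with a short sketch. The slides do not spell out whether topics and pages share one counter or whether the subtree root itself gets a name, so this version makes its own assumptions: a single running counter, names given only to descendants of the root, and the subdirectory character (here L for Science) used as a prefix. The function name bfs_names and the sample tree are hypothetical.

```python
from collections import deque


def bfs_names(children, root, prefix):
    """Name every descendant of root as prefix + running number, in BFS order.

    children maps a node to the ordered list of its child nodes.
    Assumption: one shared counter, and the root itself is not named.
    """
    names = {}
    counter = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            counter += 1
            names[child] = f"{prefix}{counter}"
            queue.append(child)  # visit this child's own children later
    return names


# Hypothetical subtree; node names are placeholders, not real Dmoz topics.
sample_tree = {
    "Science": ["child_a", "child_b"],
    "child_a": ["grandchild_a1"],
    "child_b": [],
}

science_names = bfs_names(sample_tree, "Science", "L")
print(science_names)
```

Because a queue is used, all children of the root are named before any grandchild, which is what distinguishes BFS order from a depth-first numbering.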