UNITEC
Data Mining Case Study: Academic Networking Research
Peng Huang, Muyang He
ISCG 8042 Data Mining & Industry Applications, Assignment 2

Submitted by:
Peng Huang (Student ID 1425922)
Muyang He (Student ID 1434735)
Due: 12:00 p.m., 10 November 2014
Lecturer: Paul Pang

Table of Contents
Introduction
1. Data Crawler with Java
1.1. Structure of Crawler Program
1.2. Configure part
1.3. Original data gathering part
1.4. Task creation part and Task execution part
1.5. Dataset creation part and Dataset modification part
2. Data Visualization with Gephi
3. Back end server

Introduction
The first assignment presented the process of collecting the conference members' information, comprising data design, implementation and analysis. This report introduces the detailed steps of that procedure and the crucial parameters we encountered.

Social networks are among the most important components of the academic community, as of any other. However, the fact that the relevant data is scattered throughout the internet makes collecting, sorting and analysing it very difficult, let alone developing an industry-level application on top of it. This research addresses these problems by drawing on various internet sources, sorting the information and joining it together to generate a feasible data set that reflects the social network among scholars.

The goal of this research is to develop a system for network searching. After a user inputs the name of a scholar he knows or wants to know, the system returns the relationship between that scholar and the user as well as detailed information such as institute, interests and title. It also provides functions such as filtering and real-time sociogram generation.

For data collection, numerous technologies and languages were tried; eventually we chose Python as the primary language because of its conciseness and suitability for automatic scraping. For the data retrieval service, we use C++ to develop both the full-text search engine and the network interface because of its performance. The application is developed on the iOS platform because of its superiority in API consistency and user experience; iOS is also the preferred platform for the first version of most startup apps, for the same reasons.

1. Data Crawler with Java
1.1. Structure of Crawler Program
The aim of the crawler application is to crawl the target web pages and store the useful information for later data mining and analysis.
However, the target web pages vary widely and are difficult to handle uniformly, so no single perfect solution can solve all the problems we would meet. The appropriate solution therefore has to be both reliable and extensible. The Java source code consists of six main parts: the configure part, the original data gathering part, the task creation part, the task execution part, the dataset creation part and the dataset modification part.

Main workflow: original data gathering → task creation → task execution → dataset creation → dataset modification.

1.2. Configure part
Since the crawler is built on Java and MySQL, we first introduce the configuration: the database authentication (username, password, JDBC driver, database URL), the regular expressions, the interval times and so on, as in the code below.

// database configuration
final static String JDBC_DRIVER = "com.mysql.jdbc.Driver";
final static String DB_URL = "jdbc:mysql://localhost:3306";
final static String USER = "root";
final static String PASS = "root";

// HTML saving paths and scraping configuration
final static String URL_HEAD = "http://scholar.google.co.nz";
final static String SOURCE_FILE_LOCATION = "./html/";
final static String SOURCE_USER_LOCATION = "./html/";
final static int DELAYTIME = 1000;
final static int PAUSE_INTERVAL = 20 * 60 * 1000; // pause time
static int pauseCount = 0; // pause counter
final static int GRASP_COUNT_MAX = 350; // the maximum number of pages per run

// regular expressions
final static Pattern PT_NEXT = Pattern.compile(
        "<button type=\"button\" onclick=\"window.location='(.{1,140})'\" class=\"gs_btnPR gs_in_ib gs_btn_half gs_btn_srt\"><span class=\"gs_wr\"><span class=\"gs_bg\"></span>",
        Pattern.CASE_INSENSITIVE);
// person name
final static Pattern PT_PERSON_NAME = Pattern.compile(
        "<div class=\"gsc_1usr_text\"><h3 class=\"gsc_1usr_name\"><a href=\"/citations\\?user=(.{1,40})&amp;hl=en\"><span class='gs_hlt'>(.[^>]{1,40})</span></a></h3>",
        Pattern.CASE_INSENSITIVE);
final static Pattern PT_PROFILE_NEXT = Pattern.compile(
        "</span><a href=\"(.{1,100})\" class=\"cit-dark-link\">Next &gt;</a></td></tr></table>",
        Pattern.CASE_INSENSITIVE);
final static Pattern PT_PROFILE_NAME = Pattern.compile(
        "<a href=\"(.[^<]{1,200}):(.{1,18})\" class=\"gsc_a_at\">(.[^<]{1,800})</a><div class=\"gs_gray\">(.[^<]{1,800})</div>",
        Pattern.CASE_INSENSITIVE);
final static Pattern PT_519_NAME = Pattern.compile("#t(.{1,10})left:519px;", Pattern.CASE_INSENSITIVE);
final static Pattern PT_535_NAME = Pattern.compile("#t(.{1,10})left:535px;", Pattern.CASE_INSENSITIVE);
final static Pattern PT_474_NAME = Pattern.compile("#t(.{1,10})left:474px;", Pattern.CASE_INSENSITIVE);

Since the target web pages may change their HTML structure (this happened once during the crawling period), keeping the regular expressions in the configuration lets us modify them clearly and accurately, without having to find the right position in the rest of the code.

1.3. Original data gathering part
Because the list of conference participants comes from PDF files, converting PDF to HTML has to happen before the crawling work. There are many online tools for this, but the resulting HTML pages still have to be parsed and stored in the database, so we need to find the extraction rules and check the data repeatedly.
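As a side note, the PDF-to-text conversion could also be scripted instead of done with an online tool. The sketch below uses Apache PDFBox purely as an illustration; PDFBox was not used in this project, and the input file name is hypothetical.

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file; the project used an online converter instead.
        PDDocument doc = PDDocument.load(new File("source/list.pdf"));
        String text = new PDFTextStripper().getText(doc);
        doc.close();
        System.out.println(text);
    }
}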
The corresponding code is shown below:

String fileLocation = "/source/list.htm";
String outputLocation = "/dataFromPdf.sql";
StringBuilder contentHtml = new StringBuilder("");
StringBuilder strSQL = new StringBuilder("");
contentHtml.append(readFileUrl(fileLocation));

// split the converted HTML into blocks, one per programme entry
String[] parts = contentHtml.toString().replace("0PM", "QQQ")
        .replace("0AM", "QQQ").replace(">P1", "QQQ")
        .replace(">P2", "QQQ").replace(">P3", "QQQ")
        .replace(">P4", "QQQ").replace(">P5", "QQQ")
        .replace(">P6", "QQQ").replace(">P7", "QQQ").split("QQQ");

int iCount = 0, iPerson = 0, idPerson = 0;
String participantGroup = "", sessions = "";
String[] tempPerson;

// regular expressions for participants and sessions
Pattern ptPerson = Pattern
        .compile("<P\\p{Space}class=\"p(\\d+)\\p{Space}ft(7|10|14)\">(.*?)</P>");
Pattern ptSession = Pattern.compile("</SPAN>(.*?)</P>");
Pattern ptSessionbak = Pattern
        .compile("<P\\p{Space}class=\"p(\\d+)\\p{Space}ft(8|11|16)\">(.*?)</P>");
Matcher mtPerson;
Matcher mtSession;

while (iCount < parts.length) {
    // extract the session name and save it to the database
    mtSession = ptSession.matcher(parts[iCount]); // try the </SPAN> pattern first
    if (mtSession.find()) {
        sessions = mtSession.group(1);
        sessions = sessions.replace("<NOBR>", "")
                .replace("</NOBR>", "").replace("<nobr>", "")
                .replace("</nobr>", "");
        // System.out.println("Session:" + sessions);
    } else {
        mtSession = ptSessionbak.matcher(parts[iCount]); // fall back to the ft8/ft11/ft16 pattern
        if (mtSession.find()) {
            sessions = mtSession.group(3);
            sessions = sessions.replace("<NOBR>", "")
                    .replace("</NOBR>", "").replace("<nobr>", "")
                    .replace("</nobr>", "");
            // System.out.println("Session:" + sessions);
        } else {
            sessions = "";
        }
    }
    if (!sessions.isEmpty()) {
        strSQL.append("INSERT INTO wccinew.wccinew_tempSessions (topicId,topicName,topicCategory,status) "
                + "VALUES (" + iCount + ",'"
                + sessions.replace("'", "''") + "','IT',0);\n");
        // extract the participant group of this session
        mtPerson = ptPerson.matcher(parts[iCount]);
        if (mtPerson.find()) {
            participantGroup = mtPerson.group(3).replace(", and", ",")
                    .replace(" and", ",").replace(", ", ",")
                    .replace("<NOBR>", "").replace("</NOBR>", "")
                    .replace("<nobr>", "").replace("</nobr>", "") + ",";
            System.out.println("participant:" + participantGroup);
            tempPerson = participantGroup.split(",");
            while (iPerson < tempPerson.length
                    && tempPerson[iPerson] != null
                    && !tempPerson[iPerson].isEmpty()) {
                strSQL.append("INSERT INTO wccinew.wccinew_tempUsers (userId,sessionID,name,status)"
                        + " VALUES ('" + idPerson + "','" + iCount + "','"
                        + tempPerson[iPerson].replace("'", "''") + "',0);\n");
                iPerson++;
                idPerson++;
            }
            iPerson = 0;
        }
    }
    iCount++;
}
System.out.println(strSQL.toString());

As we can see, this part of the code uses 'replace' calls and regular expressions to extract the correct participant names.

1.4. Task creation part and Task execution part
Once the participant names are collected, the next step is to create a set of tasks for each person in order to fetch the target profile pages and publication pages. Each person generates several search-result page tasks, one profile task and several publication-page tasks, depending on the actual number of publications. Furthermore, if our application threads visited the target website too frequently, we would put an unreasonable load on it and risk being blocked. Our solution is therefore to download the HTML to the local disk and keep an interval between visits; a short sketch of that idea follows, and then the project's task-execution code.
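The sketch below illustrates the throttled download loop described above, reusing the DELAYTIME, PAUSE_INTERVAL and GRASP_COUNT_MAX constants from the configure part. The fetchAndSave method name and the exact control flow are our assumptions for illustration, not the project's code.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ThrottledDownloader {
    static final int DELAYTIME = 1000;                 // wait 1 s between pages
    static final int PAUSE_INTERVAL = 20 * 60 * 1000;  // long pause after each batch
    static final int GRASP_COUNT_MAX = 350;            // pages per batch

    static int pageCount = 0;

    // Download one page to the local disk, then sleep to keep the visiting interval.
    static void fetchAndSave(String url, String localPath) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, Paths.get(localPath), StandardCopyOption.REPLACE_EXISTING);
        }
        pageCount++;
        if (pageCount % GRASP_COUNT_MAX == 0) {
            Thread.sleep(PAUSE_INTERVAL);   // long pause after every batch of pages
        } else {
            Thread.sleep(DELAYTIME);        // normal interval between requests
        }
    }
}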
The project's task-execution code, which parses each downloaded search page, is shown below:

mtPersonName = PT_PERSON_NAME.matcher(sContent);
if (!mtPersonName.find()) {
    // fall back to a looser pattern if the strict one does not match
    mtPersonName = Pattern.compile(
            "<div class=\"gsc_1usr_text\"><h3 class=\"gsc_1usr_name\"><a href=\"/citations\\?user=(.{1,40})&amp;hl=en\"><span class='gs_hlt'>(.{1,100})</a></h3>",
            Pattern.CASE_INSENSITIVE).matcher(sContent.toString());
} else {
    mtPersonName = PT_PERSON_NAME.matcher(sContent);
}
boolean isMatched = false;
while (mtPersonName.find()) {
    isMatched = true;
    // get the Google Scholar id string and the name
    System.out.println(mtPersonName.group(1));
    System.out.println(mtPersonName.group(2));
    String fixName = htmlRemoveTag(mtPersonName.group(2).toString());
    if (!name.toLowerCase().equals(fixName.toLowerCase())
            && !isInName(name, fixName))
        continue; // skip this result if the name does not match, exactly or fuzzily

    // save the user
    strDataSQL.append(" insert into wccinew.users(name,gid,Status) VALUES ('"
            + fixName.replace("'", "''") + "','"
            + mtPersonName.group(1).replace("'", "''") + "',0);\n");
    executeSql(strDataSQL.toString());
    strDataSQL = new StringBuffer("");

    // create a task for the profile URL
    strDataSQL.append(" insert into wccinew.task(name,gid,url,type,Status) VALUES ('"
            + fixName.replace("'", "''") + "','"
            + mtPersonName.group(1).replace("'", "''") + "','"
            + URL_HEAD + "/citations?hl=en&user="
            + mtPersonName.group(1).replace("'", "''").replace("&amp;", "&")
            + "&view_op=list_works&pagesize=100&sortby=pubdate"
            + "',1,0);\n");
    executeSql(strDataSQL.toString());
    strDataSQL = new StringBuffer("");
    mtPersonName.appendReplacement(sb, "*****");
}
mtPersonName.appendTail(sb);
if (isMatched)
    executeSql("update wccinew.task set status=1 where id = " + taskId + ";");
else
    executeSql("update wccinew.task set status=-1 where id = " + taskId + ";");

// check whether there is another result page
mtNext = PT_NEXT.matcher(sContent);
if (mtNext.find()) {
    pageNumber++;
    return strDataSQL.append(getPageDataSql(taskId, name, pageNumber)).toString();
} else {
    return strDataSQL.toString();
}

1.5. Dataset creation part and Dataset modification part
Once we have the information, we can store it in the database, but we still need to clean the data, as in the cleaning code below.
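The cleaning code below calls a helper removeDuplicateWithOrder(...) whose definition is not included in the listing. A minimal sketch of what such a helper presumably does (remove duplicates while preserving the order of first occurrence) is:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class ListUtils {
    // Remove duplicate entries while keeping the first occurrence of each value.
    static void removeDuplicateWithOrder(ArrayList<String> list) {
        Set<String> seen = new HashSet<String>();
        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            if (!seen.add(it.next())) {
                it.remove();   // already seen, drop the later duplicate
            }
        }
    }
}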
String sqlfilters = "select a.id,a.name,b.name as belongtoname,a.type from wccinew.filter as a "
        + "INNER JOIN wccinew.filter as b ON a.belongtoid=b.id order by a.type asc; ";
String sqlusers = "select id,institute,positiontype from wccinew.users where gid is not null ;";
try {
    Class.forName(JDBC_DRIVER);
    System.out.println("Connecting to database...");
    conn = DriverManager.getConnection(DB_URL, USER, PASS);
    System.out.println("begin to query statement...");
    stmtfilters = conn.createStatement();
    stmtusers = conn.createStatement();
    ResultSet rsfilters = stmtfilters.executeQuery(sqlfilters);
    ResultSet rsusers = stmtusers.executeQuery(sqlusers);

    ArrayList<String> filtersArrList = new ArrayList<String>();
    ArrayList<String> nounArrList = new ArrayList<String>();
    String[] filtersArr = null;
    String[] nounArr = null;
    String[] tempArr = null;

    // extract the filter definitions from the result set
    while (rsfilters.next()) {
        filtersArrList.add(rsfilters.getString("name") + ":"
                + rsfilters.getString("belongtoname") + ":"
                + rsfilters.getString("type"));
    }
    filtersArr = (String[]) filtersArrList.toArray(new String[filtersArrList.size()]);

    // process every user record
    while (rsusers.next()) {
        String userid = rsusers.getString("id");
        String[] nameSource = rsusers.getString("institute")
                .replace(",", " ").replace("(", " ").replace(")", " ")
                .replace("%20", " ").replace("&amp;", " ")
                .replace("/", " ").toLowerCase().split(" ");
        System.out.println("****userid=" + userid + " ,"
                + rsusers.getString("institute")
                + "------------------------------------------\n");
        if (userid.equals("124")) // debugging output for one specific record
            System.out.println("****userid=" + userid + "-"
                    + rsusers.getString("institute") + "---\n");

        nounArrList = new ArrayList<String>();
        boolean isGet = false;
        boolean isNoun = false;
        for (int i = 0; i < nameSource.length; i++) {
            String tt = nameSource[i].trim();
            System.out.println("The word extracted is :" + tt + "\n");
            if (tt.length() > 0) {
                for (int j = 0; j < filtersArr.length; j++) {
                    tempArr = filtersArr[j].split(":");
                    if (tt.equals(tempArr[0])) {
                        if (tempArr[2].equals("1"))
                            isNoun = true;
                        nounArrList.add(tempArr[1] + ":" + tempArr[2]);
                        isGet = true;
                    }
                }
                if (!isGet)
                    nounArrList.add(":");
                isGet = false; // reset for the next word
            }
        }

        // skip users whose institute text contains no type-1 (noun) match at all
        Iterator<String> sListIterator = nounArrList.iterator();
        int i = 0;
        while (sListIterator.hasNext()) {
            String e = sListIterator.next();
            if (e.contains(":1")) {
                i++;
            }
        }
        if (i == 0)
            continue;

        // remove duplicates while keeping the original order
        removeDuplicateWithOrder(nounArrList);

        // drop a type-2 entry when the same term already exists as type 1
        sListIterator = nounArrList.iterator();
        while (sListIterator.hasNext()) {
            String e = sListIterator.next();
            if (e.contains(":2")
                    && nounArrList.contains(e.replace(":2", ":1"))) {
                sListIterator.remove();
            }
        }

        // build the cleaned position string and queue the update statement
        nounArr = (String[]) nounArrList.toArray(new String[nounArrList.size()]);
        StringBuffer sb = new StringBuffer("");
        for (int t = 0; t < nounArr.length; t++) {
            tempArr = nounArr[t].split(":");
            if (tempArr.length > 0 && tempArr[0].length() > 0) {
                sb.append(tempArr[0] + " ");
            }
        }
        if (sb.toString().trim().length() > 0)
            sqllist.add(" update wccinew.users set positiontype='"
                    + sb.toString().substring(0, sb.toString().length() - 1)
                    + "' where id=" + userid + "; ");
    }

The cleaning relies on a filter table that maps each raw keyword (name) to the canonical term it belongs to (belongtoname) together with a type flag; creating such a table gives us the filtered data.
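A minimal sketch of how such a filter table could be created and seeded is given below. The exact schema and the sample rows are our assumptions, inferred only from the columns used in the query above (name, belongtoid, type).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateFilterTable {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306", "root", "root");
        Statement stmt = conn.createStatement();
        // Assumed schema: each row maps a raw keyword to the id of its canonical term,
        // with type 1 for nouns (e.g. institution words) and type 2 for qualifiers.
        stmt.executeUpdate("CREATE TABLE IF NOT EXISTS wccinew.filter ("
                + " id INT PRIMARY KEY AUTO_INCREMENT,"
                + " name VARCHAR(200) NOT NULL,"
                + " belongtoid INT NOT NULL,"
                + " type INT NOT NULL)");
        // Hypothetical sample rows: 'univ' maps to the canonical row 'university'.
        stmt.executeUpdate("INSERT INTO wccinew.filter (id, name, belongtoid, type)"
                + " VALUES (1, 'university', 1, 1), (2, 'univ', 1, 1), (3, 'assistant', 3, 2)");
        stmt.close();
        conn.close();
    }
}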
The final step of dataset creation builds the Nodes and Edges tables used for visualization:

Connection conn = null;
Statement stmt = null;
Statement stmtRSsource = null;
Statement stmtRStarget = null;
try {
    Class.forName(JDBC_DRIVER);
    System.out.println("Connecting to database...");
    conn = DriverManager.getConnection(DB_URL, USER, PASS);
    System.out.println("Creating statement...");
    stmt = conn.createStatement();
    stmtRSsource = conn.createStatement();
    stmtRStarget = conn.createStatement();

    // insert one node per author (gid) who shares at least one topic with another author
    String sqlinsertNodes = "insert into wccinew.nodes (nodes) "
            + "select distinct(gid) as gid from wccinew.topicuser where hashid in "
            + " (select hashid from wccinew.topicuser "
            + "  group by hashid having count(distinct(gid))>1 "
            + " ) order by username ; ";
    stmt.executeUpdate(sqlinsertNodes);

    // update the node labels with the user names
    sqlinsertNodes = "UPDATE wccinew.nodes t1 "
            + "INNER JOIN wccinew.topicuser t2 "
            + " ON t1.nodes = t2.gid "
            + "SET t1.label = t2.username ; ";
    stmt.executeUpdate(sqlinsertNodes);

    // optionally set class=1 for every node
    sqlinsertNodes = "UPDATE wccinew.nodes set class=1;";
    // stmt.executeUpdate(sqlinsertNodes);

    String sqluserlist = "select nodes as gid, id from wccinew.nodes ";
    ResultSet rs = stmt.executeQuery(sqluserlist);
    ResultSet rstargetList = null;

    // for every node, create the edges to its co-authors and set the edge weights
    while (rs.next()) {
        System.out.println(rs.getString("id") + " processing...");
        String source = rs.getString("gid");
        // get the list of authors sharing a topic with this source
        String sqlgetTarget = "select distinct(gid) as gid from wccinew.topicuser where hashid in "
                + " (select hashid from wccinew.topicuser where gid='"
                + source + "') order by username;";
        rstargetList = stmtRSsource.executeQuery(sqlgetTarget);
        while (rstargetList.next()) {
            String target = rstargetList.getString("gid");
            String sqlinsert = "insert into wccinew.edges (source, target,weight,type) values ('"
                    + source + "','" + target + "', 1,1); ";
            stmtRStarget.executeUpdate(sqlinsert);
            sqlinsert = "update wccinew.edges set weight = "
                    + "(select count(*) from wccinew.topicuser where gid='"
                    + target + "' and hashid in "
                    + "(select hashid from wccinew.topicuser where gid='"
                    + source + "')) "
                    + "where source='" + source + "' and target='" + target + "';";
            stmtRStarget.executeUpdate(sqlinsert);
        }
    }

    // remove self-loops and translate gid references into node ids
    stmtRStarget.executeUpdate("delete from wccinew.edges where source=target;");
    stmtRStarget.executeUpdate("UPDATE wccinew.edges t1 INNER JOIN wccinew.nodes t2 "
            + "ON t1.source = t2.nodes SET t1.sid = t2.id;");
    stmtRStarget.executeUpdate("UPDATE wccinew.edges t1 INNER JOIN wccinew.nodes t2 "
            + "ON t1.target = t2.nodes SET t1.tid = t2.id;");

    // clean-up environment
    rstargetList.close();
    rs.close();
    stmt.close();
    conn.close();

2. Data Visualization with Gephi
Gephi is a real-time visualization tool, available from www.gephi.org. It is a tool for exploring and understanding graphs built from network data, helping data analysts to make hypotheses, intuitively discover patterns and isolate structural singularities or faults during data sourcing; a tool for visual thinking with interactive interfaces; and a tool for exploratory data analysis.

The resulting graph shows the connections between the conference participants. Zooming in and out clearly presents individual names and their relationships.
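Gephi can import node and edge lists from CSV files. The sketch below exports the wccinew.nodes and wccinew.edges tables built above into two such files; the file names, the chosen columns and the CSV headers are our assumptions for illustration, not part of the project code.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExportForGephi {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306", "root", "root");
        Statement stmt = conn.createStatement();

        // nodes.csv with an Id,Label header, as commonly used for Gephi node lists
        try (PrintWriter out = new PrintWriter("nodes.csv")) {
            out.println("Id,Label");
            ResultSet rs = stmt.executeQuery("select id, label from wccinew.nodes");
            while (rs.next()) {
                out.println(rs.getInt("id") + ",\"" + rs.getString("label") + "\"");
            }
        }

        // edges.csv with a Source,Target,Weight header
        try (PrintWriter out = new PrintWriter("edges.csv")) {
            out.println("Source,Target,Weight");
            ResultSet rs = stmt.executeQuery("select sid, tid, weight from wccinew.edges");
            while (rs.next()) {
                out.println(rs.getInt("sid") + "," + rs.getInt("tid") + "," + rs.getInt("weight"));
            }
        }
        conn.close();
    }
}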
Import data
The original dataset has three main tables:

CREATE TABLE `topicuser` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(1000) COLLATE utf8_unicode_ci NOT NULL,
  `hashid` int(11) DEFAULT NULL COMMENT 'topic name plus authors',
  `gid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
  `username` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL,
  `valid` int(11) DEFAULT '1',
  `userid` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index2` (`hashid`),
  KEY `index3` (`gid`)
) ENGINE=MyISAM AUTO_INCREMENT=122253 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `gid` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `name` varchar(500) COLLATE utf8_bin NOT NULL,
  `namebase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
  `arealabel` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `arealabelbase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
  `institute` varchar(327) COLLATE utf8_bin DEFAULT NULL,
  `institutebase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
  `positiontype` varchar(300) COLLATE utf8_bin DEFAULT NULL,
  `status` int(11) NOT NULL DEFAULT '0' COMMENT '0 means active',
  `verifiedemailat` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `verifiedemailatbase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
  `homepageurl` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `userphotourl` varchar(600) COLLATE utf8_bin DEFAULT NULL,
  `citationsall` int(11) DEFAULT NULL,
  `citationhindex` int(11) DEFAULT NULL,
  `citationI10index` int(11) DEFAULT NULL,
  `citationssince2009` int(11) DEFAULT NULL,
  `citationshindexsince2009` int(11) DEFAULT NULL,
  `citationI10indexsince2009` int(11) DEFAULT NULL,
  `valid` int(11) DEFAULT '0',
  `oldname` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8202 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

CREATE TABLE `task` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(1000) COLLATE utf8_unicode_ci NOT NULL,
  `namebase64` varchar(3000) COLLATE utf8_unicode_ci DEFAULT NULL,
  `hashid` int(11) DEFAULT NULL,
  `gid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
  `gtid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
  `url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `type` int(11) NOT NULL DEFAULT '0' COMMENT '0 means author profile page',
  `status` int(11) NOT NULL DEFAULT '0' COMMENT '-1 means disable status; 0 means active status;',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=233737 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Before we illustrate the data as a graph, we have to import it. We mainly need two tables, Nodes and Edges.

Table Nodes: id, nodes, label, and other properties (for example the h-index, an index that attempts to measure both the productivity and the citation impact of a scholar's publications).

Table Edges: source, target, weight, and other properties.

Partitioning
We can use the Modularity statistic to partition the network data, and the resulting partition can even be shown as a pie chart. Furthermore, different colours can be used to present the different groups clearly.

Interaction
The data laboratory supports filtering, adding, removing, merging and duplicating data. Node search and edge search are two methods for presenting a specific person and his or her connections.

Ranking
Based on node or edge data, ranking with labels helps us form an ordered list.

3. Back end server
The backend server is a full-text search engine built on Xapian.
A CGI program implemented in C++ provides the HTTP interface to the app, Linked Conference. Although the C++-based Xapian has a performance advantage, it is planned to be replaced by Apache Lucene because of practical drawbacks such as the learning curve and the availability of developers. For the same reason, the CGI is being replaced by a JSP/Servlet-based web server.

The Xapian project is located at /root/sttc. The project tree is shown below:

|-- STTCApi.cpp
|-- lib
|   |-- STTC.cpp
|   |-- STTC.h
|   |-- comm.cpp
|   |-- comm.h
|   |-- indexer.cpp
|   |-- libsttc.a
|   |-- makefile
|-- log
|-- makefile
|-- nginx.conf
|-- sql
|   |-- Dump20141019-6.sql
|   |-- Dump20141020-1.sql
|   |-- Dump20141020-1.sql.zip
|   |-- Dump20141020.sql
|   `-- Dump20141020.sql.zip
|-- test
|-- test.cpp
|-- xa.db
    |-- flintlock
    |-- iamchert
    |-- position.DB
    |-- position.baseA
    |-- position.baseB
    |-- postlist.DB
    |-- postlist.baseA
    |-- postlist.baseB
    |-- record.DB
    |-- record.baseA
    |-- record.baseB
    |-- termlist.DB
    |-- termlist.baseA
    `-- termlist.baseB

STTCApi.cpp is the CGI source file that handles all HTTP requests and responses by communicating with nginx; the configuration file defining this communication is nginx.conf. The core part of the STTC backend lives in lib/, and its interface is compiled into a static library (.a file) that serves the upper, non-essential part of the system, i.e. the CGI server. The core of the STTC backend could therefore be released either as a closed-source component or as an open-source project, depending on our future plans for STTC.

Some important files in the project directory:
comm.cpp contains the commonly used functions and utilities for the whole project, including JSON string generation, database operations, and base64 encoding and decoding.
STTC.cpp is the actual service provider and is compiled into the .a library.
indexer.cpp builds the program that indexes the raw data collected by the crawler into the Xapian database for later searching.
xa.db is the full-text search database produced by the indexer.
test.cpp is the unit-testing program, which is especially useful when developing a new interface.

The process by which the backend server handles a user request and produces a response is as follows: the HTTP request reaches nginx, which talks to the CGI (STTCApi.cpp) over a local socket; the CGI calls into the underlying library (STTC.cpp), which in turn calls Xapian. On the data side, the crawler(s) write into the MySQL database, from which the indexer builds the Xapian database that Xapian reads.

The data source is generated in two phases:
1. The crawlers scrape the source web site over several rounds, fetch all the data and store it in a relational database; MySQL is currently used for this purpose.
2. The raw data is further processed by the indexer and stored in the Xapian database, which is in fact a binary file containing all the key-value maps produced by the indexer. The file format is designed specifically for full-text searching, so lookups are much faster than in a traditional relational database.

The handling of a data-fetch request can be divided into the following steps:
1. The HTTP request is sent to the nginx instance running on the backend server.
2. nginx communicates with the CGI we built.
3. The CGI contains only the code for parsing the business protocol, and
4. calls the functions provided by the underlying library, which contains all the core logic and the communication with Xapian.
5. Xapian reads the pre-generated binary file mentioned above, performs the search according to the parameters passed in and generates the results.
6. The results are passed all the way back to the front end.
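Since the plan is to migrate the search engine from Xapian to Apache Lucene, the sketch below shows what indexing and searching a scholar record might look like with Lucene. The field names, the sample data and the index directory are our assumptions for illustration, not part of the current system.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));

        // Index one hypothetical scholar record (field names are our assumptions).
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("name", "Jane Doe", Field.Store.YES));
        doc.add(new TextField("institute", "Example University", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search by name and print the stored fields of each hit.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query q = new QueryParser("name", analyzer).parse("jane");
        TopDocs hits = searcher.search(q, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document hit = searcher.doc(sd.doc);
            System.out.println(hit.get("name") + " - " + hit.get("institute"));
        }
        reader.close();
    }
}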