UNITEC
Data Mining Case Study
Academic Networking Research
Peng Huang, Muyang He
ISCG 8042
Data Mining & Industry Applications
Assignment 2
Submitted By:
Peng Huang (Student ID: 1425922)
Muyang He (Student ID: 1434735)
Due Time: 12:00 p.m.
Due Date: Nov 10th, 2014
Lecturer
Paul Pang
Table of Contents
Introduction
1. Data Crawler with Java
1.1. Structure of Crawler Program
1.2. Configuration part
1.3. Original data gathering part
1.4. Task creation part and Task execution part
1.5. Dataset creation part and Dataset modification part
2. Data Visualization with Gephi
3. Back-end server
Introduction
The first assignment presented the process of collecting conference member information, covering data design, implementation and analysis. This report introduces the detailed steps of the procedure and the crucial parameters we encountered. The social network is one of the most important components of the academic community, as it is in other communities, but the fact that the relevant data is scattered throughout the internet makes collecting, sorting and analysing such data very difficult, let alone developing an industry-level application on top of it. This research addresses these problems by resorting to various sources on the internet, sorting the information and joining it together to generate a usable data set that reflects the social network among scholars. The goal of this research is to develop a system for network searching: after a user inputs the name of a scholar he knows or wants to know, the system returns the relationship of that scholar to the user as well as detailed information such as institute, interests and title. It also provides functions such as filtering and real-time sociogram generation. For data collection, numerous technologies and languages were tried, and we eventually chose Python as the primary language because of its conciseness and suitability for automated scraping; for the data retrieval service, we use C++ to develop both the full-text search engine and the network interface because of its performance; and the application is developed on the iOS platform because of its superior API consistency and user experience. iOS is also the preferred platform for the first version of most startup apps for the same reasons.
1. Data Crawler with Java
1.1. Structure of Crawler Program
The aim of the crawler application is to crawl the target webpages and store the useful information for data mining and analysis. However, the target web pages vary in structure and are difficult to parse, and there is no single perfect solution for every problem we would meet. Therefore, an appropriate solution should be both reliable and extendible. The Java source code consists of six main parts: the configure part, the original data gathering part, the task creation part, the task execution part, the dataset creation part and the dataset modification part.
Main workflow: original data gathering part → task creation part → task execution part → dataset creation part → dataset modification part.
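The sketch below shows how these six parts are intended to fit together in one driver program. The method names are hypothetical (the real project spreads the work over several classes); the actual code of each part is listed in the following subsections.

// Sketch only: a driver that runs the six parts in order. Method names are
// placeholders for the code shown in sections 1.2-1.5.
public class CrawlerMain {

    public static void main(String[] args) throws Exception {
        // 1. configure part: DB credentials, regular expressions, delays (1.2)
        //    - held as static final constants, nothing to do at runtime
        gatherOriginalData(); // 2. parse the conference PDF/HTML source (1.3)
        createTasks();        // 3. one search/profile/publication task per participant (1.4)
        executeTasks();       // 4. download pages politely and parse them (1.4)
        createDataset();      // 5. build the users, nodes and edges tables (1.5)
        modifyDataset();      // 6. clean and filter the collected data (1.5)
    }

    // Placeholder bodies; see the corresponding subsections for the real code.
    static void gatherOriginalData() { /* see 1.3 */ }
    static void createTasks()        { /* see 1.4 */ }
    static void executeTasks()       { /* see 1.4 */ }
    static void createDataset()      { /* see 1.5 */ }
    static void modifyDataset()      { /* see 1.5 */ }
}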
1.2. Configuration part
Because the crawler is built on Java and MySQL, we first introduce its configuration: the database authentication (username, password, JDBC driver, database URL), the regular expressions, the interval times and so on, as in the code below.
// database configuration
final static String JDBC_DRIVER = "com.mysql.jdbc.Driver";
final static String DB_URL = "jdbc:mysql://localhost:3306";
final static String USER = "root";
final static String PASS = "root";

// scraping target and local saving path for the downloaded HTML pages
final static String URL_HEAD = "http://scholar.google.co.nz";
final static String SOURCE_FILE_LOCATION = "./html/";
final static String SOURCE_USER_LOCATION = "./html/";
final static int DELAYTIME = 1000;                  // delay between requests (ms)
final static int PAUSE_INTERVAL = 20 * 60 * 1000;   // pause time after a batch (ms)
static int pauseCount = 0;                          // pause counter
final static int GRASP_COUNT_MAX = 350;             // maximum pages fetched each time

// regular expressions
// "next page" button on the search result page
final static Pattern PT_NEXT = Pattern.compile(
        "<button type=\"button\" onclick=\"window.location='(.{1,140})'\" class=\"gs_btnPR gs_in_ib gs_btn_half gs_btn_srt\"><span class=\"gs_wr\"><span class=\"gs_bg\"></span>",
        Pattern.CASE_INSENSITIVE);

// person name: group(1) is the Scholar id, group(2) the displayed name
final static Pattern PT_PERSON_NAME = Pattern.compile(
        "<div class=\"gsc_1usr_text\"><h3 class=\"gsc_1usr_name\"><a href=\"/citations\\?user=(.{1,40})&hl=en\"><span class='gs_hlt'>(.[^>]{1,40})</span></a></h3>",
        Pattern.CASE_INSENSITIVE);

final static Pattern PT_PROFILE_NEXT = Pattern.compile(
        "</span><a href=\"(.{1,100})\" class=\"cit-dark-link\">Next ></a></td></tr></table>",
        Pattern.CASE_INSENSITIVE);

final static Pattern PT_PROFILE_NAME = Pattern.compile(
        "<a href=\"(.[^<]{1,200}):(.{1,18})\" class=\"gsc_a_at\">(.[^<]{1,800})</a><div class=\"gs_gray\">(.[^<]{1,800})</div>",
        Pattern.CASE_INSENSITIVE);

final static Pattern PT_519_NAME = Pattern.compile(
        "#t(.{1,10})left:519px;", Pattern.CASE_INSENSITIVE);
final static Pattern PT_535_NAME = Pattern.compile(
        "#t(.{1,10})left:535px;", Pattern.CASE_INSENSITIVE);
final static Pattern PT_474_NAME = Pattern.compile(
        "#t(.{1,10})left:474px;", Pattern.CASE_INSENSITIVE);
Since the target webpages may change their HTML structure (we encountered this once during the crawling period), keeping the regular expressions in this configuration section lets us modify them clearly and accurately without having to find the right position in the whole code base.
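For instance, the following minimal, self-contained illustration shows how such a centralised pattern is applied. The HTML snippet is made up and the pattern is a simplified form of PT_PERSON_NAME; group(1) captures the Google Scholar user id and group(2) the displayed name.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: applies a simplified person-name pattern to a made-up snippet.
public class RegexConfigDemo {
    public static void main(String[] args) {
        Pattern personName = Pattern.compile(
                "<a href=\"/citations\\?user=(.{1,40})&hl=en\"><span class='gs_hlt'>(.[^>]{1,40})</span></a>",
                Pattern.CASE_INSENSITIVE);

        String sampleHtml = "<a href=\"/citations?user=AbCdEf123&hl=en\">"
                + "<span class='gs_hlt'>Jane Doe</span></a>";

        Matcher m = personName.matcher(sampleHtml);
        if (m.find()) {
            System.out.println("gid  = " + m.group(1));  // AbCdEf123
            System.out.println("name = " + m.group(2));  // Jane Doe
        }
    }
}

If Google Scholar changes its markup, only the pattern constant has to be updated; the extraction code that reads the capture groups stays the same.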
1.3. Original data gathering part
Because the list of conference participants originally comes from a PDF file, converting the PDF to HTML has to be done before the crawling work. There are many online tools that achieve this, but the resulting HTML pages still have to be parsed and stored in the database, so we need to find the extraction rules and check the data repeatedly. The corresponding code is:
String fileLocation = "/source/list.htm";
String outputLocation = "/dataFromPdf.sql";
StringBuilder contentHtml = new StringBuilder("");
StringBuilder strSQL = new StringBuilder("");
contentHtml.append(readFileUrl(fileLocation));

// split the converted HTML into blocks, using the markers below as delimiters
String[] parts = contentHtml.toString().replace("0PM", "QQQ")
        .replace("0AM", "QQQ").replace(">P1", "QQQ")
        .replace(">P2", "QQQ").replace(">P3", "QQQ")
        .replace(">P4", "QQQ").replace(">P5", "QQQ")
        .replace(">P6", "QQQ").replace(">P7", "QQQ").split("QQQ");

int iCount = 0, iPerson = 0, idPerson = 0;
String participantGroup = "", sessions = "";
String[] tempPerson;

// regular expressions for participants and sessions
Pattern ptPerson = Pattern
        .compile("<P\\p{Space}class=\"p(\\d+)\\p{Space}ft(7|10|14)\">(.*?)</P>");
Pattern ptSession = Pattern.compile("</SPAN>(.*?)</P>");
Pattern ptSessionbak = Pattern
        .compile("<P\\p{Space}class=\"p(\\d+)\\p{Space}ft(8|11|16)\">(.*?)</P>");
Matcher mtPerson;
Matcher mtSession;

while (iCount < parts.length) {
    // take out the session of this block and save it to the database
    mtSession = ptSession.matcher(parts[iCount]); // use </SPAN>
    if (mtSession.find()) {
        sessions = mtSession.group(1);
        sessions = sessions.replace("<NOBR>", "").replace("</NOBR>", "")
                .replace("<nobr>", "").replace("</nobr>", "");
        // System.out.println("Session:" + sessions);
    } else {
        mtSession = ptSessionbak.matcher(parts[iCount]); // use ft8
        if (mtSession.find()) {
            sessions = mtSession.group(3);
            sessions = sessions.replace("<NOBR>", "").replace("</NOBR>", "")
                    .replace("<nobr>", "").replace("</nobr>", "");
            // System.out.println("Session:" + sessions);
        } else {
            sessions = "";
        }
    }

    if (!sessions.isEmpty()) { // was `sessions != ""`, which compares references
        strSQL.append("INSERT INTO wccinew.wccinew_tempSessions (topicId,topicName,topicCategory,status) "
                + "VALUES (" + iCount + ",'" + sessions.replace("'", "''") + "','IT',0);\n");

        // get the participant group of this session
        mtPerson = ptPerson.matcher(parts[iCount]);
        if (mtPerson.find()) {
            participantGroup = mtPerson.group(3).replace(", and", ",")
                    .replace(" and", ",").replace(", ", ",")
                    .replace("<NOBR>", "").replace("</NOBR>", "")
                    .replace("<nobr>", "").replace("</nobr>", "") + ",";
            System.out.println("participant:" + participantGroup);

            tempPerson = participantGroup.split(",");
            while (iPerson < tempPerson.length
                    && tempPerson[iPerson] != null
                    && !tempPerson[iPerson].isEmpty()) {
                strSQL.append("INSERT INTO wccinew.wccinew_tempUsers (userId,sessionID,name,status)"
                        + " VALUES ('" + idPerson + "','" + iCount + "','"
                        + tempPerson[iPerson].replace("'", "''") + "',0);\n");
                iPerson++;
                idPerson++;
            }
            iPerson = 0;
        }
    }
    iCount++;
}
System.out.println(strSQL.toString());
As we can see, this part of the code uses String.replace and regular expressions to extract the correct participant names.
1.4. Task creation part and Task execution part
Once the participants' names are collected, the next step is to create a number of tasks for each person in order to crawl the target profile pages and publication pages. Each person generates several search-page grabbing tasks, one profile task and several publication-page tasks, depending on the actual number of publications.
Furthermore, if our application used many threads to visit the target website at a high frequency, the load on the target website would spike instantly. Therefore, the adopted solution is to download the HTML pages to the local disk while keeping an interval between visits, as in the code below.
mtPersonName = PT_PERSON_NAME.matcher(sContent);
if (!mtPersonName.find()) {
    // fall back to a looser pattern when the default one does not match
    mtPersonName = Pattern.compile(
            "<div class=\"gsc_1usr_text\"><h3 class=\"gsc_1usr_name\"><a href=\"/citations\\?user=(.{1,40})&hl=en\"><span class='gs_hlt'>(.{1,100})</a></h3>",
            Pattern.CASE_INSENSITIVE).matcher(sContent.toString());
} else
    mtPersonName = PT_PERSON_NAME.matcher(sContent);

boolean isMatched = false;
while (mtPersonName.find()) {
    isMatched = true;
    // group(1) is the Scholar id, group(2) the displayed name
    System.out.println(mtPersonName.group(1));
    System.out.println(mtPersonName.group(2));
    String fixName = htmlRemoveTag(mtPersonName.group(2).toString());
    if (!name.toLowerCase().equals(fixName.toLowerCase())
            && !isInName(name, fixName))
        continue; // skip results that do not (fuzzily) match the queried name

    // user
    strDataSQL.append(" insert into wccinew.users(name,gid,Status) VALUES ('"
            + fixName.replace("'", "''") + "','"
            + mtPersonName.group(1).replace("'", "''") + "',0);\n");
    executeSql(strDataSQL.toString());
    strDataSQL = new StringBuffer("");

    // profile url task
    strDataSQL.append(" insert into wccinew.task(name,gid,url,type,Status) VALUES ('"
            + fixName.replace("'", "''") + "','"
            + mtPersonName.group(1).replace("'", "''") + "','"
            + URL_HEAD + "/citations?hl=en&user="
            + mtPersonName.group(1).replace("'", "''").replace("&amp;", "&")
            + "&view_op=list_works&pagesize=100&sortby=pubdate"
            + "',1,0);\n");
    executeSql(strDataSQL.toString());
    strDataSQL = new StringBuffer("");
    mtPersonName.appendReplacement(sb, "*****");
}
mtPersonName.appendTail(sb);

if (isMatched)
    executeSql("update wccinew.task set status=1 where id = " + taskId + ";");
else
    executeSql("update wccinew.task set status=-1 where id = " + taskId + ";");

// check whether there is another result page
mtNext = PT_NEXT.matcher(sContent);
if (mtNext.find()) {
    pageNumber++;
    return strDataSQL.append(getPageDataSql(taskId, name, pageNumber)).toString();
} else
    return strDataSQL.toString();
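The code above assumes the result pages have already been downloaded to the local disk. A minimal sketch of the throttled download loop described at the beginning of this subsection is shown below; it reuses the constants from the configuration part, while the task list structure and helper naming are hypothetical.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch only: downloads each pending task URL to local disk and sleeps between
// requests so the target site is not overloaded. Each task is represented here
// as a String[] of {taskId, url}, which is an assumption for illustration.
public class PoliteDownloader {

    static final String SOURCE_FILE_LOCATION = "./html/";
    static final int DELAYTIME = 1000;                 // short delay between pages (ms)
    static final int PAUSE_INTERVAL = 20 * 60 * 1000;  // long pause after a batch (ms)
    static final int GRASP_COUNT_MAX = 350;            // pages per batch

    static void downloadPendingTasks(List<String[]> tasks) throws Exception {
        int grabbed = 0;
        for (String[] task : tasks) {
            // save the page as <taskId>.html under the local HTML directory
            try (InputStream in = new URL(task[1]).openStream()) {
                Files.copy(in, Paths.get(SOURCE_FILE_LOCATION, task[0] + ".html"));
            }
            grabbed++;
            if (grabbed >= GRASP_COUNT_MAX) {
                Thread.sleep(PAUSE_INTERVAL);   // long pause after a full batch
                grabbed = 0;
            } else {
                Thread.sleep(DELAYTIME);        // polite delay between single pages
            }
        }
    }
}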
1.5. Dataset creation part and Dataset modification part
Once we have the information we can store it in the database, but we do need to clean the data first, as in the cleaning code below.
String sqlfilters = "select a.id,a.name,b.name as belongtoname,a.type from wccinew.filter as a "
        + "INNER JOIN wccinew.filter as b ON a.belongtoid=b.id order by a.type asc; ";
String sqlusers = "select id,institute,positiontype from wccinew.users where gid is not null ;";
try {
    Class.forName(JDBC_DRIVER);
    System.out.println("Connecting to database...");
    conn = DriverManager.getConnection(DB_URL, USER, PASS);
    System.out.println("begin to query statement...");
    stmtfilters = conn.createStatement();
    stmtusers = conn.createStatement();
    ResultSet rsfilters = stmtfilters.executeQuery(sqlfilters);
    ResultSet rsusers = stmtusers.executeQuery(sqlusers);
    ArrayList<String> filtersArrList = new ArrayList<String>();
    ArrayList<String> nounArrList = new ArrayList<String>();
    String[] filtersArr = null;
    String[] nounArr = null;
    String[] tempArr = null;

    // extract the filter keywords from the result set as "name:belongtoname:type"
    while (rsfilters.next()) {
        filtersArrList.add(rsfilters.getString("name") + ":"
                + rsfilters.getString("belongtoname") + ":"
                + rsfilters.getString("type"));
    }
    filtersArr = (String[]) filtersArrList.toArray(new String[filtersArrList.size()]);

    // extract the users from the result set and clean their institute field
    while (rsusers.next()) {
        String userid = rsusers.getString("id");
        String[] nameSource = rsusers.getString("institute")
                .replace(",", " ").replace("(", " ").replace(")", " ")
                .replace("%20", " ").replace("&", " ")
                .replace("/", " ").toLowerCase().split(" ");
        System.out.println("****userid=" + userid + " , "
                + rsusers.getString("institute") + "\n");

        nounArrList = new ArrayList<String>();
        boolean isGet = false;
        boolean isNoun = false;
        for (int i = 0; i < nameSource.length; i++) {
            String tt = nameSource[i].trim();
            System.out.println("The word extracted is :" + tt + "\n");
            if (tt.length() > 0) {
                // match the word against the filter keywords
                for (int j = 0; j < filtersArr.length; j++) {
                    tempArr = filtersArr[j].split(":");
                    if (tt.equals(tempArr[0])) {
                        if (tempArr[2].equals("1"))
                            isNoun = true;
                        nounArrList.add(tempArr[1] + ":" + tempArr[2]);
                        isGet = true;
                    }
                }
                if (!isGet)
                    nounArrList.add(":");
                isGet = false; // reset for the next word
            }
        }

        // skip users whose institute contains no type-1 keyword at all
        Iterator<String> sListIterator = nounArrList.iterator();
        int i = 0;
        while (sListIterator.hasNext()) {
            String e = sListIterator.next();
            if (e.contains(":1")) {
                i++;
            }
        }
        if (i == 0)
            continue;

        removeDuplicateWithOrder(nounArrList);

        // drop type-2 entries that duplicate an existing type-1 entry
        sListIterator = nounArrList.iterator();
        while (sListIterator.hasNext()) {
            String e = sListIterator.next();
            if (e.contains(":2") && nounArrList.contains(e.replace(":2", ":1"))) {
                sListIterator.remove();
            }
        }

        // rebuild the cleaned position string and queue the update statement
        nounArr = (String[]) nounArrList.toArray(new String[nounArrList.size()]);
        StringBuffer sb = new StringBuffer("");
        for (int t = 0; t < nounArr.length; t++) {
            tempArr = nounArr[t].split(":");
            if (tempArr.length > 0 && tempArr[0].length() > 0) {
                sb.append(tempArr[0] + " ");
            }
        }
        if (sb.toString().trim().length() > 0)
            sqllist.add(" update wccinew.users set positiontype='"
                    + sb.toString().substring(0, sb.toString().length() - 1)
                    + "' where id=" + userid + "; ");
    }
The filter table holds the keywords used for this cleaning step. The following code then builds the nodes and edges tables used for visualization.
Connection conn = null;
Statement stmt = null;
Statement stmtRSsource = null;
Statement stmtRStarget = null;
try {
    Class.forName(JDBC_DRIVER);
    System.out.println("Connecting to database...");
    conn = DriverManager.getConnection(DB_URL, USER, PASS);
    System.out.println("Creating statement...");
    stmt = conn.createStatement();
    stmtRSsource = conn.createStatement();
    stmtRStarget = conn.createStatement();

    // insert the nodes: authors appearing in topics with more than one author
    String sqlinsertNodes = "insert into wccinew.nodes (nodes) "
            + "select distinct(gid) as gid from wccinew.topicuser where hashid in "
            + " (select hashid from wccinew.topicuser "
            + "  group by hashid having count(distinct(gid))>1 "
            + " ) order by username ; ";
    stmt.executeUpdate(sqlinsertNodes);

    // update label
    sqlinsertNodes = "UPDATE wccinew.nodes t1 "
            + "INNER JOIN wccinew.topicuser t2 ON t1.nodes = t2.gid "
            + "SET t1.label = t2.username ; ";
    stmt.executeUpdate(sqlinsertNodes);

    // update class=1
    sqlinsertNodes = "UPDATE wccinew.nodes set class=1;";
    // stmt.executeUpdate(sqlinsertNodes);

    String sqluserlist = "select nodes as gid, id from wccinew.nodes ";
    ResultSet rs = stmt.executeQuery(sqluserlist);
    ResultSet rstargetList = null;

    // extract data from the result set
    while (rs.next()) {
        System.out.println(rs.getString("id") + " processing...");
        String source = rs.getString("gid");

        // get the list of co-authors (targets) of this source node
        String sqlgetTarget = "select distinct(gid) as gid from wccinew.topicuser where hashid in "
                + " (select hashid from wccinew.topicuser where gid='"
                + source + "') order by username;";
        rstargetList = stmtRSsource.executeQuery(sqlgetTarget);

        while (rstargetList.next()) {
            String target = rstargetList.getString("gid");
            String sqlinsert = "insert into wccinew.edges (source, target,weight,type) values ('"
                    + source + "','" + target + "', 1,1); ";
            stmtRStarget.executeUpdate(sqlinsert);

            // set the edge weight to the number of topics shared by source and target
            sqlinsert = "update wccinew.edges set weight = "
                    + "(select count(*) from wccinew.topicuser where gid='" + target
                    + "' and hashid in "
                    + "(select hashid from wccinew.topicuser where gid='" + source + "')) "
                    + "where source='" + source + "' and target='" + target + "';";
            stmtRStarget.executeUpdate(sqlinsert);
        }
    }

    // remove self-loops
    stmtRStarget.executeUpdate("delete from wccinew.edges where source=target;");

    // map the source/target gid values to node ids
    stmtRStarget.executeUpdate("UPDATE wccinew.edges t1 "
            + "INNER JOIN wccinew.nodes t2 ON t1.source = t2.nodes SET t1.sid = t2.id;");
    stmtRStarget.executeUpdate("UPDATE wccinew.edges t1 "
            + "INNER JOIN wccinew.nodes t2 ON t1.target = t2.nodes SET t1.tid = t2.id;");

    // clean-up environment
    rstargetList.close();
    rs.close();
    stmt.close();
    conn.close();
2. Data Visualization with Gephi
Gephi is a real-time visualization tool that can be downloaded from www.gephi.org. It is:
• a tool for exploring and understanding graphs built from network data;
• a tool that helps data analysts make hypotheses, intuitively discover patterns, and isolate structural singularities or faults during data sourcing;
• a tool for visual thinking with interactive interfaces;
• a tool for Exploratory Data Analysis.
This graph shows the connections between the conference participants.
By zooming in and out, the graph clearly presents specific names and relationships.
Import data
The original dataset has three main tables:
CREATE TABLE `topicuser` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(1000) COLLATE utf8_unicode_ci NOT NULL,
`hashid` int(11) DEFAULT NULL COMMENT 'topic name plus authors',
`gid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
`username` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL,
`valid` int(11) DEFAULT '1',
`userid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index2` (`hashid`),
KEY `index3` (`gid`)
) ENGINE=MyISAM AUTO_INCREMENT=122253 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gid` varchar(100) COLLATE utf8_bin DEFAULT NULL,
`name` varchar(500) COLLATE utf8_bin NOT NULL,
`namebase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
`arealabel` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`arealabelbase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
`institute` varchar(327) COLLATE utf8_bin DEFAULT NULL,
`institutebase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
`positiontype` varchar(300) COLLATE utf8_bin DEFAULT NULL,
`status` int(11) NOT NULL DEFAULT '0' COMMENT '0 means active',
`verifiedemailat` varchar(100) COLLATE utf8_bin DEFAULT NULL,
`verifiedemailatbase64` varchar(1000) COLLATE utf8_bin DEFAULT NULL,
Page | 18
`homepageurl` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`userphotourl` varchar(600) COLLATE utf8_bin DEFAULT NULL,
`citationsall` int(11) DEFAULT NULL,
`citationhindex` int(11) DEFAULT NULL,
`citationI10index` int(11) DEFAULT NULL,
`citationssince2009` int(11) DEFAULT NULL,
`citationshindexsince2009` int(11) DEFAULT NULL,
`citationI10indexsince2009` int(11) DEFAULT NULL,
`valid` int(11) DEFAULT '0',
`oldname` varchar(500) COLLATE utf8_bin DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8202 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `task` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(1000) COLLATE utf8_unicode_ci NOT NULL,
`namebase64` varchar(3000) COLLATE utf8_unicode_ci DEFAULT NULL,
`hashid` int(11) DEFAULT NULL,
`gid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
`gtid` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
`url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`type` int(11) NOT NULL DEFAULT '0' COMMENT '0 means author profile page',
`status` int(11) NOT NULL DEFAULT '0' COMMENT '-1 means disable status; 0 means active
status;',
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=233737 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Before we can illustrate the data as a graph, we have to import it. Gephi mainly needs two tables, Nodes and Edges; a sketch of exporting them follows the lists below.
Table Nodes:
• id
• nodes
• label
• other properties (e.g. h-index, an index that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar)
Table Edges:
• source
• target
• weight
• other properties
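As one possible way to hand the data to Gephi (a sketch only: the queries follow the nodes/edges columns described above, while the CSV file names and the simple quoting are arbitrary choices), the two tables can be exported to CSV files accepted by Gephi's spreadsheet importer:

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: export wccinew.nodes and wccinew.edges to CSV files that can be
// loaded in Gephi via Data Laboratory -> Import Spreadsheet. Connection
// settings reuse the configuration-part constants.
public class GephiCsvExport {
    static final String DB_URL = "jdbc:mysql://localhost:3306";
    static final String USER = "root";
    static final String PASS = "root";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);
             Statement stmt = conn.createStatement();
             PrintWriter nodesCsv = new PrintWriter("nodes.csv");
             PrintWriter edgesCsv = new PrintWriter("edges.csv")) {

            // Gephi expects an "Id" column and an optional "Label" column for nodes
            nodesCsv.println("Id,Label");
            ResultSet rs = stmt.executeQuery("select id, label from wccinew.nodes");
            while (rs.next()) {
                nodesCsv.println(rs.getInt("id") + ",\"" + rs.getString("label") + "\"");
            }

            // Gephi expects "Source", "Target" and an optional "Weight" column for edges
            edgesCsv.println("Source,Target,Weight");
            rs = stmt.executeQuery("select sid, tid, weight from wccinew.edges");
            while (rs.next()) {
                edgesCsv.println(rs.getInt("sid") + "," + rs.getInt("tid") + "," + rs.getInt("weight"));
            }
        }
    }
}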
Partitioning
We can use the Modularity statistic to partition the network data and present the result even as a pie chart. Furthermore, different colours can be used to clearly present the different groups.
Interaction
• data laboratory (filtering, adding, removing, merging and duplicating data)
• node search
• edge search
• two methods to present a specific person and his or her connections
Ranking
Based on node or edge data, ranking with labels helps us form an ordered list.
3. Back-end server
The back-end server is a full-text search engine built on Xapian, and a CGI program implemented in C++ provides the HTTP interface to the app Linked Conference. Although the C++ implementation of Xapian has a performance advantage, it is planned to be replaced by Apache Lucene because of practical drawbacks such as the learning curve and available human resources. The CGI is likewise being replaced by a JSP/Servlet web server for the same reason.
The location of Xapian is /root/sttc. The project tree is shown below:
|-- STTCApi.cpp
|-- lib
|   |-- STTC.cpp
|   |-- STTC.h
|   |-- comm.cpp
|   |-- comm.h
|   |-- indexer.cpp
|   |-- libsttc.a
|   `-- makefile
|-- log
|-- makefile
|-- nginx.conf
|-- sql
|   |-- Dump20141019-6.sql
|   |-- Dump20141020-1.sql
|   |-- Dump20141020-1.sql.zip
|   |-- Dump20141020.sql
|   `-- Dump20141020.sql.zip
|-- test
|-- test.cpp
`-- xa.db
    |-- flintlock
    |-- iamchert
    |-- position.DB
    |-- position.baseA
    |-- position.baseB
    |-- postlist.DB
    |-- postlist.baseA
    |-- postlist.baseB
    |-- record.DB
    |-- record.baseA
    |-- record.baseB
    |-- termlist.DB
    |-- termlist.baseA
    `-- termlist.baseB
STTCApi.cpp is the CGI source file that handles all the HTTP requests and responses by communicating with nginx. The configuration file defining this communication is nginx.conf. The core part of the STTC back-end server is located in lib/, and the core interface is compiled into a static .a library that provides its service to the upper, non-essential parts of the system, i.e. the CGI server. Thus, the core part of the STTC back end could become either a closed-source component or an open-source project, depending on our future plans for STTC.
Here are some important files in the project directory:
• comm.cpp includes the commonly used functions and utilities shared by the whole project, including JSON string generation, database operations, and base64 encoding and decoding.
• STTC.cpp is the actual live service provider that is compiled into the static library ".a" file.
• indexer.cpp is a program that indexes the raw data collected by the crawler into the Xapian database for further searching.
• xa.db is the full-text search database generated by the indexer.
• test.cpp is a program for unit testing, which is especially useful when developing a new interface.
The process by which the back-end server handles a user request and response is depicted below:
[Architecture diagram: an HTTP request reaches nginx, which forwards it over a local socket to the CGI (STTCApi.cpp); the CGI makes function calls into the underlying library (STTC.cpp), which in turn calls Xapian; Xapian reads the Xapian DB files. On the data side, the crawler(s) write into the MySQL DB, which the indexer reads to produce the Xapian DB.]
The data source is generated in two phases:
1. crawlers scrape the source website over several rounds, fetch all the data and store it in a relational database; currently MySQL is used for this purpose.
2. the raw data is further processed by the indexer and stored in Xapian's database, which is in fact a binary file containing all the key-value maps formatted by the indexer. The format of this file is designed specifically for full-text searching, so it is much faster than traditional relational databases.
The data fetching can be divided into six steps:
1. an HTTP request is sent to the nginx instance running on the back-end server;
2. nginx communicates with the CGI crafted by us;
3. the CGI, which only contains code for business-protocol parsing,
4. calls the functions provided by the underlying library, which contains all the core logic and the communication with Xapian;
5. Xapian reads the pre-generated binary file mentioned above, performs the search according to the passed parameters and generates the results;
6. the results are passed all the way back to the front end.