Deep Web Mining and Sentimental Learning for Local Search
(Group 17 in AIA8803 class)
Dan Hou, Zhigang Hua, Yu Liu, Xin Sun, Yanbing Yu
{houdan, hua, yuliu, xinsun, yyu}@cc.gatech.edu
College of Computing, Georgia Tech
1. Introduction
Very large amounts of information are now available in the local search field, and we have seen rapid
growth in online discussion groups and reviews in recent years. A crucial characteristic of a
posted review is its sentiment, or overall opinion towards the subject matter, for example,
whether a product review is positive or negative. Labeling reviews with their sentiment would
provide succinct summaries to users. In our work, we present a sentimental learning model
that adopts a supervised learning approach to automatically learn the sentiment level of a
local search entity from its user reviews and comments.
The fast-growing World Wide Web contains a large amount of semi-structured HTML
information about real-world objects. Various kinds of real-world objects are embedded in
dynamic Web pages generated by online backend databases. Such dynamic content is called
the hidden Web or deep Web, which refers to World Wide Web content that is not part of the
surface Web indexed by search engines. This provides a great opportunity for the database
research community to extract and integrate the related deep Web information about each
object into a single information unit. Typical Web objects include products, people,
conferences/papers, and so on. Generally, deep Web objects of the same type follow a similar
structure or schema. Accordingly, once these deep Web objects are extracted and integrated,
a large warehouse can be constructed to perform further knowledge discovery tasks on the
structured data. Furthermore, we believe that building such a large-scale database through
deep Web mining allows us to build advanced search applications that can help improve
next-generation Web search performance.
However, little work to date has been dedicated to algorithms or methods for crawling and
mining deep Web sources. This leads us to the idea of building a Web database to store the
data records that can be crawled and extracted from the deep Web. With this database, we
build an advanced local search application for users. In our work, we implement a complete
solution for building advanced search applications on top of the extracted Web object-level
data. We describe several essential components: a deep Web crawler, deep Web mining and
database construction, sentimental learning, and an advanced search application built on the
deep Web data.
2. System Architecture
The whole system is designed with the traditional browser/server architecture. The following
figure shows the server-side architecture of the system.
Server-side Architecture
First, the query-based crawler probes the dynamic web sites (such as Yahoo!Local and
www.yelp.com) by submitting queries to their internal search engines and saves the
dynamically generated web pages to the local disk. The HTML parser then parses these web
pages, extracts the useful information, and stores it in the database through JDBC. After
enough information has been gathered in the database, the sentimental learner performs
machine learning tasks to analyze the information and provide more advanced, intelligent
information to users. The super local-search module running in the Apache server provides
most of the user-visible and tangible functions, which rely on the data in the database.
The modules shown in the above figure are loosely coupled, which means they do not have to
run on a single server machine. To improve system performance, for example, the database,
the super local-search module with the Apache server, the query-based crawler, and the
HTML parser could run on different servers or even clusters of servers.
3. Query-based Crawler
Since this course project focuses on local search, which is the natural evolution of "yellow
pages" directory advertising moving to the internet [1], the goal of the query-based crawler is
to deeply crawl local search web sites, such as Yahoo!Local and www.yelp.com, and collect
as many "yellow pages" as we can. Useful information is then extracted from these "yellow
pages" for use by our system.
There is a common pattern among local search web sites such as Yahoo!Local and
www.yelp.com: their internal search engines need only two kinds of parameters, a keyword
and an address. The query-based crawler utilizes this pattern to submit queries to their
internal search engines. Note that it is impossible to enumerate all possible combinations of
keywords and addresses, because they could be astronomically numerous. However, it is
possible to get all the "yellow pages" behind these internal search engines, given enough time.
To keep the problem as simple yet effective as possible, our query-based crawler submits a
query combining only one keyword and one address at a time. It maintains a list of keywords
and a list of addresses (in practice, cities). We obtained the city list, ranked by population,
from [2], and the initial keyword list from http://yp.yahoo.com/.
During probing, the query-based crawler tries each combination of keyword and city. After
receiving the dynamically generated web pages corresponding to a query, the crawler
extracts every item (or "yellow page") and saves it to the local disk. At the same time, it
updates the keyword list whenever it finds new keywords in the "yellow pages"; in other words,
the query-based crawler performs some initial HTML parsing to keep the keyword list up to
date. This process is repeated until no new keywords are found.
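The core of this crawling loop can be sketched as follows. This is a minimal Python sketch; the endpoint URL, the page-saving helper, and the keyword-extraction helper are hypothetical stand-ins for the Yahoo!Local and yelp.com specifics, which are not reproduced in this report.

import time
import urllib.parse
import urllib.request

def probe(keyword, city):
    # Hypothetical query endpoint; the real crawler fills in the keyword and
    # address parameters of each site's internal search form.
    url = "http://local-search.example.com/search?" + urllib.parse.urlencode(
        {"keyword": keyword, "location": city})
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def crawl(keywords, cities, save_page, extract_keywords):
    # keywords: a set seeded from yp.yahoo.com; cities: the city list from [2].
    # save_page(keyword, city, html) stores the page on local disk;
    # extract_keywords(html) does the light HTML parse for new category keywords.
    queue = list(keywords)
    while queue:                          # repeat until no new keywords appear
        kw = queue.pop(0)
        for city in cities:
            html = probe(kw, city)
            save_page(kw, city, html)
            for new_kw in extract_keywords(html):
                if new_kw not in keywords:
                    keywords.add(new_kw)
                    queue.append(new_kw)
            time.sleep(1)                 # throttle requests to the target site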
In theory, the query-based crawler should stop automatically, because there are only tens of
thousands of keywords and a few thousand cities. However, in our project, due to constraints
of time and storage, we manually stopped the crawler once we had collected enough web
pages. Although we did not let the crawler try all possible combinations of keywords and cities,
we can still estimate how the keyword count will grow by looking at how it increases in the
early stage. The following figure shows the keyword trend for Yahoo!Local, where the x-axis
represents the first 20 minutes after we started the crawler and the y-axis represents the
number of keywords at each time point. The slope of the curve becomes smaller and smaller
over time. We can predict that after a relatively long time (perhaps several hours or even
days), the curve will become almost flat, because the list will contain almost all the keywords.
The figure also suggests that the curve can indicate how good a local search web site is: for
example, a web site whose curve flattens quickly is richer and more diversified than one
whose curve flattens slowly.
Yahoo!Local Keyword Trend (number of keywords versus time in minutes, over the first 20 minutes of crawling)
3.1. HTML Parser
The web pages collected by the query-based crawler contain a lot of useful information.
Consider a Yahoo!Local "yellow page" as an example. (We obtained this page by querying
Yahoo!Local with "car" as the keyword and "Atlanta" as the city; at the time of writing, it is the
first item shown in the results page.) The information in the first rectangle contains the
company name, telephone number, web site, address, etc. The second rectangle contains
general information about the company such as its business description, business hours, and
categories. The third rectangle contains web users' ratings and comments about the company.
All of this information is useful and interesting to web users. Our HTML parser extracts all of it
and saves it into our database, which is used later by the sentimental learner and the super
local-search.
To store the information extracted by the HTML parser, we designed the following database
with five main relational tables. The Business table contains the general information about a
company, such as name, phone, and address. Its rating field is later updated by the
sentimental learner, which computes a score for the company by analyzing the reviews and
ratings given by web users. The latitude and longitude fields are computed by invoking the
Google Maps API with the company address. The Category table records each company's
category information. The Review table contains all the reviews for each company in the
Business table. The Reviewer table contains the reviewers' information, which may be used
later (for example, to evaluate how good a reviewer is according to the quality and quantity of
the reviews he or she writes). The Term table is the inverted index table and the core table of
the system: it contains the term itself (simply put, a word), the business in which the term
occurs, the frequency of the term for that business, and so on. The Term table is heavily used
by both the sentimental learner and the super local-search.
A Yahoo!Local Yellow Page
The Database Design
To extract the information for the Term table, we need to analyze the review content, the
business description, and so on. Fortunately, English words can easily be separated by
delimiters such as spaces and commas. However, there are still two common problems to
consider. First, we need to filter out common words such as 'we', 'the', and 'for', because
these words carry little information for searching. To solve this problem, we employ a
stop-word list [3] and discard any word occurring in it. Second, the same English word can
appear in several different forms; take the word 'study' for example, which also appears as
'studies', 'studied', and 'studying'. Therefore, we need an algorithm to map all of these forms
to a common root. We employ the well-known Porter stemming algorithm [4] for this purpose.
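As an illustration, the two steps can be combined as below. This is a sketch only: it assumes NLTK's implementation of the Porter stemmer (the report specifies the algorithm, not a particular implementation) and shows just an excerpt of the stop-word list.

import re
from nltk.stem import PorterStemmer   # assumed implementation of the Porter algorithm

STOP_WORDS = {"we", "the", "for", "a", "an", "and", "of", "to", "in", "is"}  # excerpt only
stemmer = PorterStemmer()

def extract_terms(text):
    # Split on any non-letter separator (' ', ',', '.', ...), lower-case,
    # drop stop words, and map inflected forms to a common stem.
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t and t not in STOP_WORDS]

# extract_terms("We studied the studies") -> ['studi', 'studi']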
4. Sentimental Learning
Sentiment analysis focuses on the attitude of an author. It is often conducted within a specific
topic, for example, the degree of satisfaction with a type of product, or political viewpoints.
More specifically, the attitude usually includes the author's judgment, evaluation, affective
state, or intended emotional communication. Learning such sentiment can help us understand
the author's intention and position, which facilitates further analysis and applications. Let's
take an example from the web pages we have crawled: the following is a review page for an
auto body shop.
From the review page, we can see several reviews posted by different customers. Among the
words customers use, we find both positive ones (e.g. excellent, great) and negative ones
(e.g. liar, sloppy). Each review is also labeled with stars: the more stars a customer gives, the
better he or she thinks the store is. If we can derive a general score for the store, we not only
save customers the time and energy of reading through all the available reviews, but also
enable the design of more sophisticated ranking functions that integrate the recommendation
scores of previous customers. More specifically, we represent the crawled review corpus as
bags of terms and use normalized TF-IDF features. Again, our goal is to identify the positive
or negative opinion about a store.
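A minimal sketch of this representation, using scikit-learn's TfidfVectorizer for illustration (the report does not name the toolkit used, and the review strings below are toy examples, not crawled data):

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "Excellent service and great prices.",
    "Sloppy work, and the manager is a liar.",
]
# L2-normalized TF-IDF features over the bag-of-terms representation.
vectorizer = TfidfVectorizer(norm="l2", stop_words="english")
X = vectorizer.fit_transform(reviews)   # shape: (documents, terms), sparse matrix
print(X.shape)
print(vectorizer.get_feature_names_out())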
4.1. Sentimental Concepts
We use the Vector Space Model to represent term features that are later used for sentimental
learning. These features are extracted from the user reviews and comments of each local
search entity, which are believed to contain sentiment signals useful for machine learning. For
example, some keywords are related to positive sentiments and some to negative sentiments.
The following table shows example terms classified into different sentiment categories.
Sentiment category      Example words
Positive sentiment      good, happy, wonderful, excellent, awesome, great, ok, okey, nice, brilliant, gorgeous, etc.
Negative sentiment      bad, ugly, outdated, shabby, tacky, stupid, wrong, awful, bullshit, stereotyping, etc.
Neutral sentiment       buy, take, shop, etc.
However, apart from these sentimental words, most words are neutral and unrelated to
sentiment, such as buy, take, go, iPod, apple, comment, etc. These sentimentally unrelated
terms make up a large majority of the term vocabulary, which introduces a considerable
amount of noise for sentimental learning. As a result, these words need to be removed from
our sentimental learning model.
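The report does not specify how these neutral terms are filtered (the PCA step in Section 4.3 is what ultimately suppresses most of this noise); as a toy illustration only, one straightforward option is to keep just the tokens found in small positive/negative word lists such as those in the table above:

POSITIVE = {"good", "happy", "wonderful", "excellent", "awesome", "great", "nice"}
NEGATIVE = {"bad", "ugly", "outdated", "shabby", "tacky", "stupid", "wrong", "awful"}
SENTIMENT_LEXICON = POSITIVE | NEGATIVE   # tiny excerpt, for illustration only

def keep_sentiment_terms(tokens):
    # Discard sentimentally neutral tokens such as "buy", "take", "ipod".
    return [t for t in tokens if t in SENTIMENT_LEXICON]

# keep_sentiment_terms(["buy", "great", "ipod", "awful"]) -> ['great', 'awful']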
4.2. Manual Labeling versus Average Rating
In our approach, we treat the sentimental learning problem from a supervised learning
perspective. Naturally, a supervised learning algorithm requires labels for a set of training
data, and the models learned from these training data can then be used to predict labels for
other data. As a result, we need to identify the sentiment levels for a training set of local
search entities. In our work, we use five sentiment levels that are easily distinguished by
human perception (most rating systems provide a 5-level judgment), in which 1 represents
least satisfactory (negative) and 5 represents most satisfactory (positive). With these labels,
we can treat sentimental learning as a classical classification or regression problem in
machine learning.
Generally, there are two approaches to identifying the sentiment level of a local search entity
(e.g. a bookstore or a shoe shop): manual labeling and average rating.
(1) In the manual labeling approach, human labelers assign one of the five sentiment levels
based on all the user reviews or comments of a local search entity.
(2) Alternatively, each user review or comment carries a rating entered by its author. A local
search entity may include several or even hundreds of reviews/comments, and the average
rating of these reviews is also a measure of the sentiment level of an online entity.
In our approach, we did not rely purely on the existing average ratings; we also adopted
human labeling for sentimental training. The reason is that averaging loses information in
quantitative analysis: for example, if 5 users rate an online entity as 4 (positive) and 2 users
rate it as 1, the entity averages to roughly 3, which suggests neutral sentiment. This will hurt
the accuracy of the trained model. In the manual labeling approach, three users judge the
sentiment level of an online entity after viewing all of its comments or reviews, and the final
sentiment is selected by a vote among these users, which effectively avoids the bias of
individual ratings.
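A toy numerical illustration of the difference (the three manual labels here are hypothetical):

from collections import Counter
from statistics import mean

user_ratings = [4, 4, 4, 4, 4, 1, 1]            # 5 users rate 4, 2 users rate 1
print(round(mean(user_ratings), 2))             # ~3.14: the clear positive majority washes out

manual_labels = [4, 4, 5]                        # three human labelers vote after reading the reviews
print(Counter(manual_labels).most_common(1)[0][0])   # majority label: 4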
4.3. Principal Component Analysis
Before learning the sentiments, we note that the dimensionality of the feature vectors in our
problem is high: there are 6857 tokens in total. If we use the normalized TF-IDF features
directly, memory consumption is large, and current scientific computing software may not be
able to support the calculation. Moreover, there are many unrelated features in the raw
TF-IDF vectors; if we train our machine learning model directly on the raw features, the
resulting model may fit poorly.
Therefore, we first conduct Principal Component Analysis (PCA) on the raw TF-IDF features.
PCA is a technique for reducing a multidimensional dataset to lower dimensions for analysis.
Mathematically, PCA is an orthogonal linear transformation that maps the data to a new
coordinate system such that the greatest variance of any projection of the data lies on the first
coordinate, the second greatest variance on the second coordinate, and so on. Evaluated by
the mean squared error, PCA is theoretically the optimal transform for a given dataset.
PCA can be used for dimensionality reduction in a data set by retaining those
characteristics of the data set that contribute most to its variance, by keeping lower-order
principal components and ignoring higher-order ones. Such low-order components often
contain the "most important" aspects of the data.
For a data matrix, XT, with zero empirical mean (the empirical mean of the distribution has
been subtracted from the data set), where each row represents a different repetition of the
experiment, and each column gives the results from a particular probe, the PCA
transformation is given by:
Y^T = X^T W = V Σ
where X^T = V Σ W^T is the singular value decomposition (SVD) of X^T.
PCA has the distinction of being the optimal linear transformation for keeping the subspace
with the largest variance. This advantage, however, comes at the price of greater
computational requirements.
4.3.1. Linear algebra trick in PCA
For the consideration of efficiency, we use a standard linear algebra trick in the process. If
we represent the document using column vectors and the stack them as a matrix D, the
rank of D is bounded by the lowest dimension of the two. Usually the size of the vector is
way bigger than the number of training samples. Therefore, even the covariance matrix of
B=AAT has a large size, the rank of it is pretty low, and we are wasting lots of resource in
maintaining such a matrix and perform singular value decomposition (SVD) to obtain the
eigenvalues. Instead, performing SVD on matrix C=ATA is much cheaper and can be
easily fit in the primary physical memory. Following the derivation in the figure below, we
see that B and C share the same set of eigenvalue, while the eigenvectors of B can be
easily computed from those of C.
The relationship between the eigenvalues and eigenvectors of B and C
We preserve 95% of the covariance in the documents. Note that because we use the
covariance matrix rather than the correlation matrix for the PCA computation, we do not need
to remove the most dominant component (which corresponds to the mean vector).
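A NumPy sketch of this trick, under the same setup as above: A holds one mean-centered document vector per column, C = A^T A is decomposed instead of B = AA^T, and enough components are kept to preserve 95% of the variance. This is an illustrative sketch, not the project's actual code.

import numpy as np

def pca_small_covariance(A, variance_kept=0.95):
    # A: float array of shape (num_features, num_documents), num_features >> num_documents.
    A = A - A.mean(axis=1, keepdims=True)          # zero empirical mean per feature
    C = A.T @ A                                     # small n x n matrix instead of d x d
    eigvals, V = np.linalg.eigh(C)                  # eigendecomposition of the symmetric C
    order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
    eigvals, V = eigvals[order], V[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(ratio, variance_kept)) + 1   # components covering 95% of the variance
    U = A @ V[:, :k]                                # map eigenvectors of C to those of B = A A^T
    U /= np.linalg.norm(U, axis=0, keepdims=True)   # normalize the principal directions
    return U, eigvals[:k]

# A mean-centered document vector x is projected into the reduced space as U.T @ x.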
4.4. Sentimental Learning Model
Because human sentiment is a very important measure for a local search entity, the learned
sentiments can be used to judge the quality or popularity of the entity. In our work, we
designed a sentimental learning model that can be used to rank local search results by a
combined metric that simultaneously evaluates content-based relevance and sentiment.
Similar to the well-known PageRank [], our model can be represented as a combination of
content-based similarity and sentimental value as follows:
SentiRank = a * ContentSim + (1 - a) * SentiValue
where a is a tunable parameter that controls the relative importance of the content similarity
and the sentimental value of an online entity. If a is set to 1.0, the model reduces to a
content-based Term Vector Space (TVS) model; if a is set to 0, it becomes a pure sentimental
ranking model. In our experiments, we empirically set a to 0.5, as we believe both values are
significant for the final ranking of a local search entity.
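In code form, the combination is simply the following sketch (both inputs are assumed to be pre-scaled to comparable ranges before mixing):

def senti_rank(content_sim, senti_value, a=0.5):
    # a = 1.0 -> pure content-based (TVS) ranking; a = 0.0 -> pure sentimental ranking.
    return a * content_sim + (1.0 - a) * senti_value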
5. Sentimental Learning Results
In the following subsections, we report our experimental results on the PCA dimensionality
reduction of the term space model, and present the performance of different supervised
learning approaches for sentimental learning.
5.1. Data Description
As we mentioned before, the dataset we use is the review documents crawled from Yahoo!
Local Search and Yelp. We first pre-process the documents to get rid of the stop words.
We also run stemming to get the tokens. Then we calculate the TF-IDF features for the
document corpus, and therefore convert each document into a vector where the
components are the TF-IDFs of the tokens. After this pre-processing, the review document
corpus can be represented as a matrix, where the rows correspond to different documents
and the columns to the different tokens appearing in the corpus.
The total number of review documents we crawled is 2576, and the number of features, i.e.
the number of distinct tokens, is 6857. Although the feature space is large, most of the
features turn out to be useless. The reason is easy to see: most of the words appearing in the
review corpus carry little sentimental information. As a result, if we build a classifier on such
features, the accuracy will be much lower because of the noisy, non-relevant features.
Moreover, if we use the raw features directly, most current software cannot support such a
large memory requirement. For these reasons, we perform PCA before training the classifier.
5.2. Principal Component Analysis Results
In our experiments, we conduct PCA for both the manual labeling model and the score
averaging model. The resulting dimensions of the feature vectors are shown below. In both
models, the downsampling rates are set to 40%, 20%, and 4%, respectively.
Result dimension after PCA for the manual labeling model:
Downsampling rate    40%    20%    4%
Result dimension     231    135    41

Result dimension after PCA for the score averaging model:
Downsampling rate    40%    20%    4%
Result dimension     344    211    37
The reason for experimenting with different downsampling rates is to obtain feature vectors of
varying dimensionality: if the number of samples varies, the rank of the document TF-IDF
matrix changes correspondingly, which results in a different number of features in the vector
space after PCA. By using different downsampling rates, we can test the performance of the
subsequent classifiers more thoroughly.
5.3. Supervised Sentimental Learning
In our approach, we use the Vector Space Model (VSM) to represent a local search entity.
With this model, we extract all user comments/reviews for an entity and break them into a bag
of terms. Stop words are then removed, and a stemming algorithm maps the terms into a
unified term space. The frequently used TF-IDF model is then employed to compute a value
for each term.
As a result, for each local search entity we generate a feature vector containing the TF-IDF
values of all the term tokens it includes. These vectors, however, are very large in the unified
term space and also very sparse, because the term space is very big while the number of
terms included in a single entity is comparatively small. It is therefore very difficult to
hand-craft heuristic rules that decide sentiment from such sparse feature vectors. Fortunately,
machine learning techniques can deal with such problems effectively: PCA can transform the
high-dimensional feature space into a much smaller one with little loss of information.
In our approach, we adopted several classical supervised learning algorithms to learn a
reliable sentimental model. For each learning algorithm, we used leave-one-out estimation to
evaluate the learning results, and we compared the learning performance between manual
labeling and average rating, and between the original large feature space and different levels
of PCA-reduced feature space.
5.3.1. Support Vector Machine
In support vector regression, the input x is first mapped onto a high-dimensional feature
space using some nonlinear mapping, and a linear model is then constructed in this feature
space. Support vector regression uses the ε-insensitive loss function
L_ε(y, f(x)) = 0 if |y − f(x)| ≤ ε, and |y − f(x)| − ε otherwise,
and tries to minimize ||w||^2. Deviations of training samples outside the ε-tube are measured
by introducing non-negative slack variables ξ_i and ξ_i*, so the support vector machine is
formalized as the minimization of
(1/2) ||w||^2 + C Σ_i (ξ_i + ξ_i*)
subject to y_i − f(x_i) ≤ ε + ξ_i, f(x_i) − y_i ≤ ε + ξ_i*, and ξ_i, ξ_i* ≥ 0.
(a) SVM for sentimental learning on the average rating dataset: leave-one-out performance (%) of the linear, RBF, and polynomial kernels on the high-dimensional features and on PCA-37, PCA-211, and PCA-344.
(b) SVM for sentimental learning on the manual labeling dataset: leave-one-out performance (%) of the linear, RBF, and polynomial kernels on the high-dimensional features and on PCA-41, PCA-135, and PCA-231.
As shown in the figures, the SVM achieves better performance on the PCA-reduced feature
spaces (e.g. PCA-344 and PCA-211 in (a), and PCA-135 and PCA-231 in (b)) than on the
original high-dimensional feature space. The results also show the nonlinearity of the features.
We further see that the learning performance on the manual labeling dataset is better than on
the average rating dataset; the reason is that manual labeling effectively avoids the bias
incurred by averaging ratings. Among the kernels we tried, the polynomial kernel achieves the
best performance, and the best overall result is the SVM with a polynomial kernel on the
manual labeling dataset with the PCA-231 reduction. Our sentimental model adopts this
learning approach as the final candidate.
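A sketch of this experimental setup using scikit-learn for illustration (the report does not state which SVM implementation was used); X_pca and y stand for the PCA-reduced feature matrix and the sentiment labels, which are not reproduced here.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def evaluate_svm_kernels(X_pca, y):
    # Leave-one-out estimation over the three kernels compared in the figures.
    loo = LeaveOneOut()
    for kernel in ("linear", "rbf", "poly"):
        clf = SVC(kernel=kernel, C=1.0)
        scores = cross_val_score(clf, X_pca, y, cv=loo)
        print(kernel, "leave-one-out accuracy:", np.mean(scores))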
5.3.2. Neural Network
When the labels are continuous real numbers, neural network learning can be applied to learn
the optimal f*, which is given by minimizing the following cost function:
J(f) = (1/m) Σ_{i=1..m} ( f(x_i) − y_i )^2
where m is the number of samples in the training dataset. Clearly, this is a multivariate
nonparametric regression problem, since there is no a priori knowledge about the form of the
true regression function being estimated. In our experiments, we adopted two typical neural
networks to train a sentimental model: the multi-layer perceptron (MLP) and the Radial Basis
Function (RBF) network. The MLP training results are shown in (a) below, and the RBF
training results in (b).
As shown in both figures, the manual labeling data works better than the average rating data.
However, the neural network results show that a higher-dimensional feature space generally
works better than a smaller one. This is natural, because an advantage of neural networks is
their ability to incorporate more features; it might, however, also cause problems such as
overfitting to the training dataset.
(a) Multi-layer perceptron network: precision versus sampling feature size (0 to 8000) for the manual labeling and average rating datasets.
(b) Radial basis function network: precision versus sampling feature size (0 to 8000) for the manual labeling and average rating datasets.
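As an illustration, the MLP variant could be run as below with scikit-learn's MLPRegressor (scikit-learn has no built-in RBF network, so that variant is omitted here); the hidden-layer size is an arbitrary choice for the sketch, not the setting used in the report.

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def mlp_loo_predictions(X, y):
    # Leave-one-out predictions of the real-valued sentiment labels.
    mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
    preds = cross_val_predict(mlp, X, y, cv=LeaveOneOut())
    return preds   # rounded to the nearest of the 5 sentiment levels when measuring precision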
5.3.3. K-Nearest Neighbor (KNN)
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying
objects based on the closest training examples in the feature space. k-NN is a type of
instance-based learning, or lazy learning, where the function is only approximated locally and
all computation is deferred until classification; it can also be used for regression. The following
experimental results show that the PCA-reduced feature space achieves better or comparable
performance relative to the original high-dimensional feature space.
KNN results: precision versus sampling feature size (0 to 8000) for the manual labeling and average rating datasets.
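To make the "lazy learning" idea concrete, here is a tiny from-scratch k-NN classifier (Euclidean distance and majority vote); a sketch for illustration, not the project's implementation.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # No training step: each query is labeled by its k closest training samples.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]    # majority label among the neighbors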
6. Building Local Search Engine
Our website is built on the LAMP open source Web platform: Linux, Apache, MySQL, and
PHP. In addition, we use ELGG, an open source social networking framework, as our
application server. PHP is used mainly for the server-side application software, and Ajax is
adopted in our framework to utilize the Google Maps API.
For example, to retrieve our addresses, we need to connect to our MySQL server, select the
database in question, and run a query to return the rows of interest. To do so, we use
mysql_connect to establish a connection, mysql_select_db to select our particular database,
and mysql_query to perform the address lookup. We then use mysql_fetch_assoc to retrieve
an associative array representing a row of the query results.
The HTTP geocoder can respond with either XML or CSV output, selected through the
"output" query parameter; examples of both responses are shown in the Maps API
documentation. Since our server is enabled with PHP 5 and the SimpleXML extension, it is
preferable to retrieve XML output. We use the PHP simplexml_load_file function to create an
XML object from the geocoder response, and XPath expressions to extract the coordinates
and the status code.
To find locations in our business table that are within a certain radius of a given
latitude/longitude, we can use a SELECT statement based on the Haversine formula. The
Haversine formula is generally used for computing great-circle distances between two pairs of
coordinates on a sphere; an in-depth mathematical explanation is given on Wikipedia, and a
good discussion of the formula as it relates to programming can be found on Movable Type's
site.
Here is the SQL statement that finds the closest 20 locations within a radius of 25 miles of the
center coordinate. It computes the distance between the latitude/longitude of each row and
the target latitude/longitude, keeps only rows whose distance is less than 25, orders the
results by our ranking score, and limits them to 20. To search by kilometers instead of miles,
we simply replace 3959 with 6371.
// Build the search query: `distance` is the great-circle distance (miles) from the
// center point, and RANKALL weights the learned rating by the inverse distance.
$query = sprintf("SELECT DISTINCT BUSINESS.BUSINESS_ID, BUSINESS.ADDRESS,
    BUSINESS.BUSINESS_NAME, BUSINESS.LATITUDE, BUSINESS.LONGTITUDE,
    BUSINESS.BUSINESS_PHONE, BUSINESS.RATING, BUSINESS.DESCRIPTION,
    TERM_BUSINESS.TERM, TERM_BUSINESS.NORM_FREQUENCY,
    BUSINESS_CATEGORY.CATEGORY,
    ( 3959 * acos( cos( radians('%s') ) * cos( radians( LATITUDE ) )
        * cos( radians( LONGTITUDE ) - radians('%s') )
        + sin( radians('%s') ) * sin( radians( LATITUDE ) ) ) ) AS distance,
    ( ( BUSINESS.RATING + 0.01 ) / ( 3959 * acos( cos( radians('%s') ) * cos( radians( LATITUDE ) )
        * cos( radians( LONGTITUDE ) - radians('%s') )
        + sin( radians('%s') ) * sin( radians( LATITUDE ) ) ) ) ) AS RANKALL
    FROM TERM_BUSINESS, BUSINESS, BUSINESS_CATEGORY
    WHERE BUSINESS.BUSINESS_ID = TERM_BUSINESS.BUSINESS_ID
    AND BUSINESS.BUSINESS_ID = BUSINESS_CATEGORY.BUSINESS_ID
    AND MATCH(BUSINESS_NAME) AGAINST (" . $dbname1 . ")
    AND MATCH(ADDRESS) AGAINST (" . $dbname2 . ")
    GROUP BY LATITUDE, LONGTITUDE, RATING
    HAVING distance < '%s'
    ORDER BY RANKALL DESC LIMIT 0, 20",
    mysql_real_escape_string($center_lat),
    mysql_real_escape_string($center_lng),
    mysql_real_escape_string($center_lat),
    mysql_real_escape_string($center_lat),
    mysql_real_escape_string($center_lng),
    mysql_real_escape_string($center_lat),
    mysql_real_escape_string(25));
6.1. User Interface
In the search field, we can input a keyword of interest, such as auto, pet care, etc. In the
address field, we can input various kinds of addresses, including a city name, zip code, or
street address. The option selector currently offers two options, person and business; our
framework is a social network for both businesses and people, so we can search for either.
For instance, we input "auto" and "Chicago" in the first two text fields and choose "business"
from the option selector. Below, we show the user interface and the search results interface.
6.2. Geo-coding of Location information
In contrast to global search, location information is a crucial part of our local business search
system and supports both the search and ranking processes. We use the Google Maps API
(geocoding) to convert location information into geographic coordinates. The resulting pair of
coordinates corresponds to the latitude and longitude of the geo-center of the address. For
example, given the address of the store AA National Auto Parts, 3410 Washington St,
Phoenix, AZ 85009, we can extract the corresponding coordinate pair (33.447708, -112.13246).
Therefore, for each store we obtain exact geo-center coordinates from its address. A user's
query, however, might be incomplete and contain only a city name or a zip code; in that case
we simply use the latitude and longitude of the city or of the area corresponding to the zip
code as the geo-center of the user's desired area. The figure below shows examples of
geo-centers for different types of inputs.
One aspect of the current system that needs improvement is interfacing our geocoding
functionality with location detection devices (e.g. GPS), so that users can obtain more exact
location coordinates instead of typing an address.
6.3. Geo-distance Computation
Once we have the geo-coordinates of two locations, it is convenient to compute the
geo-distance between them. We used the Haversine-based formula below to compute the
great-circle distance, in miles, between two pairs of coordinates on a sphere:
great-circle distance = 3959 * acos( cos(radians(lat1)) * cos(radians(lat2)) * cos(radians(lng2) - radians(lng1)) + sin(radians(lat1)) * sin(radians(lat2)) )
One defect of the great-circle distance is that it does not indicate the actual travel distance
between two locations, which is usually larger and not proportional to the great-circle distance.
Switching to a driving-distance representation of geo-distance is therefore one direction for
our next step.
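For reference, the same computation as the SQL expression in Section 6, written as a small Python helper (3959 is the Earth radius in miles; use 6371 for kilometers):

import math

def great_circle_miles(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    return 3959 * math.acos(
        math.cos(lat1) * math.cos(lat2) * math.cos(lng2 - lng1)
        + math.sin(lat1) * math.sin(lat2))

# e.g. great_circle_miles(33.7490, -84.3880, 33.447708, -112.13246)
# gives the distance from downtown Atlanta to the store geocoded above.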
6.4. Geo-Sentimental Ranking Model (GSRM)
To serve users' queries for local businesses, we design a geo-sentimental ranking model
(GSRM) to perform searching and ranking over our local business database. Our GSRM has
the following three factors:
6.4.1. Content Similarity
Content similarity indicates the relevance of the user's query to a returned result. We
pre-compute the frequency of the user's query terms for each business store over all the text
information we extract for that store, including the business title, categories, descriptions, and
users' reviews. This is because the business title and description alone may not cover all the
products or services a store provides, and users' reviews can greatly enrich the store's
product and service information.
6.4.2. Sentimental Value
The sentimental value is the recommendation score we assign to each business store,
obtained from our sentimental learning. We score each store from 1 to 5 to indicate how
strongly it is favored by other customers and whether it is recommended by our system.
6.4.3. Geo-distance
Geo-distance is the geographic information indicating how far one location is from another,
computed via the Google Maps API as described in the previous section. Together, these
three factors are comprehensive in terms of the information they capture, since they encode
the user's needs, other users' evaluations, and the geographic information of the business
stores themselves. We combine them to construct our Geo-Sentimental Ranking Model
(GSRM) as follows:
rank = ( a * ContentSim + (1 - a) * SentiValue ) / GeoDistance^b
where a ∈ [0, 1] and b ∈ {0, 1}.
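A minimal sketch of this scoring function (ContentSim and SentiValue are assumed to be pre-computed and scaled to comparable ranges, and GeoDistance is in miles):

def gsrm_score(content_sim, senti_value, geo_distance, a=0.5, b=1):
    # a in [0, 1] weights content vs. sentiment; b in {0, 1} toggles geo-weighting.
    assert 0.0 <= a <= 1.0 and b in (0, 1)
    return (a * content_sim + (1.0 - a) * senti_value) / (geo_distance ** b)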
Our GSRM model is very flexible and can serve users' different search preferences through
different combinations of the values of a and b. For example:
1) a = 1, b = 0 gives pure content-based ranking, which returns results to users in order of
decreasing content relevance.
2) a = 0.5, b = 0 gives hybrid content-sentimental ranking, which combines the content
similarity and our sentimental recommendation value to sort the returned results.
3) a = 1, b = 1 gives geo-distance-weighted ranking, which returns results in order of
decreasing content similarity weighted by the geo-distance.
4) a = 0.5, b = 1 gives hybrid geo-content-sentimental ranking, which combines all three
factors to serve users who want highly recommended, relevant stores nearby.
Several aspects need improvement in future work. First, content similarity based on term
frequency can easily be spammed by malicious customers who repeat the same term in a
review thousands of times. Second, we have not compared our ranking model with the ones
used by Yahoo!Local or Yelp, because we have not yet built a database containing enough
business stores to serve arbitrary user queries.
References
[1] Local Search. http://en.wikipedia.org/wiki/Local_search_(Internet)
[2] United States Cities.
http://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
[3] Stop Words. http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
[4] Porter Stemming Algorithm. http://tartarus.org/~martin/PorterStemmer/
[5] Google Map API: http://code.google.com/apis/maps/documentation/
[6] Open source Social Network Platform: http://elgg.org/index.php
[7] D. Tebbutt, Search moves up a notch with emotional feedback. Information World Review,
July 2006.
[8] I.T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer,
NY, 2002, XXIX, 487 p. 28.
[9] Fukunaga, Keinosuke Introduction to Statistical Pattern Recognition, Elsevier, 1990.
[10] H. Zha, C. Ding, M. Gu, X. He and H.D. Simon, Spectral Relaxation for K-means
Clustering, NIPS 2001, vol. 14, pp. 1057-1064.
[11] M. Turk and A. Pentland (1991). "Eigenfaces for recognition". Journal of Cognitive
Neuroscience 3 (1): 71–86
[12] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan: Thumbs up? Sentiment Classification
using Machine Learning Techniques CoRR cs.CL/0205070: (2002).
[13] Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai: Topic sentiment
mixture: modeling facets and opinions in weblogs. WWW 2007: 171-180.
[14] K. Eguchi and V. Lavrenko. Sentiment retrieval using generative models. In Proceedings
of the 2006 Conference on Empirical Methods in Natural Language Processing, pages
345-354, July 2006.
[15] C. Engström. Topic Dependence in Sentiment Classification. Master's thesis, University
of Cambridge, 2004.
[16] B. Liu, M. Hu, and J. Cheng. Opinion observer: analyzing and comparing opinions on the
web. In WWW '05: Proceedings of the 14th international conference on World Wide Web,
pages 342-351, 2005.
[17] http://www.g1.com/Images/Products/BusinessGeographics/Geocoding_Accuracy_1_Web
_350.png
[18] http://www.astro.uu.nl/~strous/AA/pic/Figure_4.jpg
[19] http://bp2.blogger.com/_Gct8lVAxKqQ/RmgL6wLR6dI/AAAAAAAAACY/BL1KdCuNmRc/s
1600-h/GoogleEarth_Image.jpg