Deep Web Mining and Sentimental Learning for Local Search
(Group 17 in the AIA8803 class)
Dan Hou, Zhigang Hua, Yu Liu, Xin Sun, Yanbing Yu
{houdan, hua, yuliu, xinsun, yyu}@cc.gatech.edu
College of Computing, Georgia Tech

1. Introduction

With the very large amount of information available in the local search field, we have seen rapid growth in online discussion groups and reviews in recent years. A crucial characteristic of these posted reviews is their sentiment, or overall opinion towards the subject matter, for example, whether a product review is positive or negative. Labeling reviews with their sentiment would provide succinct summaries to users. In our work, we present a sentimental learning model that adopts a supervised learning approach to automatically learn the sentiment level of a local search entity based on the user reviews or comments it contains.

The fast-growing World Wide Web contains a large amount of semi-structured HTML information about real-world objects. Various kinds of real-world objects are embedded in dynamic Web pages generated by online back-end databases. Such dynamic content is called the hidden Web or deep Web, which refers to World Wide Web content that is not part of the surface Web indexed by search engines. This provides a great opportunity for the database research community to extract and integrate the related deep Web information about an object into a single information unit. Typical Web objects include products, people, conferences/papers, etc. Generally, deep Web objects of the same type follow a similar structure or schema. Accordingly, once these deep Web objects are extracted and integrated, a large warehouse can be constructed to perform further knowledge discovery tasks on the structured data. Furthermore, we believe that constructing such a large-scale database through deep Web mining allows us to build advanced search applications that can help improve next-generation Web search performance.

However, there is still little work dedicated to exploring algorithms or methods for crawling and mining deep Web sources. This leads us to the idea of building a Web database to store the data records that can be crawled and extracted from the deep Web. On top of this database, we build an advanced local search application for users. In our work, we implement a complete solution for building advanced search applications based on the extracted Web object-level data. We describe several essential components: the deep Web crawler, deep Web mining and database construction, sentimental learning, and an advanced search application built on the deep Web data.

2. System Architecture

The whole system is designed with the traditional browser/server architecture. The following figure shows the server-side architecture of the system.

[Figure: Server-side architecture of the system]

First, the query-based crawler probes dynamic web sites (such as Yahoo!Local and www.yelp.com) by submitting queries to their internal search engines and saves the dynamically generated web pages to the local disk. The HTML parser then parses the web pages collected by the crawler, extracts useful information, and stores it in the database through JDBC. After enough information has accumulated in the database, the sentimental learner performs machine learning tasks to analyze the information and provide more advanced, intelligent information to users.
The super local-search module running in the Apache server provides most of the user-visible and tangible functions, which rely on the data in the database. The modules shown in the figure above are loosely coupled, which means they do not have to run on a single server machine. To improve system performance, for example, the database, the super local-search module with the Apache server, the query-based crawler, and the HTML parser could run on different servers or even clusters of servers.

3. Query-based Crawler

Since this course project focuses on local search, which is the natural evolution of "yellow pages" directory advertising moving to the Internet [1], the goal of the query-based crawler is to crawl deeply into local search web sites, such as Yahoo!Local and www.yelp.com, to collect as many "yellow pages" as we can. Useful information is then extracted from these "yellow pages" for use by our system.

There is a common pattern among local search web sites such as Yahoo!Local and www.yelp.com: their internal search engines need only two kinds of parameters, a keyword and an address. The query-based crawler utilizes this pattern to submit queries to their internal search engines. Note that it is impossible to enumerate all possible combinations of keywords and addresses, because they could be astronomically numerous. However, it is possible to retrieve all the "yellow pages" behind these internal search engines, given enough time.

To keep the problem as simple yet effective as possible, our query-based crawler submits a query combining only one keyword and one address each time. It maintains a list of keywords and a list of addresses (in practice, cities). We get the city list from Wikipedia, ranked by population [2], and the initial keyword list from http://yp.yahoo.com/. During probing, the query-based crawler tries each combination of keyword and city. After getting the dynamically generated web pages corresponding to the queries, the crawler retrieves every item (or "yellow page") and saves it to the local disk. At the same time, it updates the keyword list whenever it finds new keywords in the "yellow pages"; in other words, the query-based crawler performs some initial HTML parsing to keep the keyword list updated. This process is repeated until there are no more new keywords. In theory, the query-based crawler should stop automatically, because there are only tens of thousands of words and a few thousand cities. However, in our project, due to time and storage constraints, we manually stopped the crawler once we had collected enough web pages.
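The following Python sketch illustrates this keyword-by-city probing loop. It is only an illustration, not the crawler we actually ran: the query URL template, its parameter names, and the extract_new_keywords helper are placeholders that would have to be adapted to the result-page format of each target site.

    import os
    import time
    import urllib.parse
    import urllib.request

    # Hypothetical query template; real parameter names depend on the target site.
    QUERY_URL = "http://local.search.example.com/search?keyword={kw}&location={city}"

    def fetch(keyword, city):
        """Fetch one dynamically generated result page for a (keyword, city) query."""
        url = QUERY_URL.format(kw=urllib.parse.quote(keyword), city=urllib.parse.quote(city))
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="ignore")

    def extract_new_keywords(page):
        """Placeholder for the light-weight HTML parsing that discovers new keywords
        (e.g. category links) on a result page."""
        return set()

    def crawl(seed_keywords, cities, out_dir="pages", delay=1.0):
        os.makedirs(out_dir, exist_ok=True)
        keywords = list(seed_keywords)
        seen = set(keywords)
        for kw in keywords:                      # the list grows while we iterate
            for city in cities:
                page = fetch(kw, city)
                fname = "%s/%s_%s.html" % (out_dir, kw.replace(" ", "_"), city.replace(" ", "_"))
                with open(fname, "w", encoding="utf-8") as f:
                    f.write(page)
                for new_kw in extract_new_keywords(page) - seen:
                    seen.add(new_kw)             # schedule the newly found keyword
                    keywords.append(new_kw)
                time.sleep(delay)                # be polite to the target site

The loop terminates exactly when no query yields new keywords, matching the stopping condition described above.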
Although we did not let the crawler try all possible combinations of keywords and cities, we can still estimate how the keyword list will grow by looking at how the number of keywords increases in the early stage. The following figure shows the keyword trend for Yahoo!Local, where the x axis represents the first 20 minutes after we started the crawler and the y axis represents the number of keywords at each point in time. The slope of the curve gets smaller and smaller as time goes on. We can predict that after a relatively long time (perhaps several hours or even days) the curve will become almost flat, because the list will then contain almost all the words. The figure also suggests that the curve can indicate how good a local search web site is: a web site whose curve flattens quickly is richer and more diversified than one whose curve flattens slowly.

[Figure: Yahoo!Local keyword trend. x axis: time (minutes, 1-20); y axis: number of keywords (0-16000).]

3.1. HTML Parser

The web pages collected by the query-based crawler contain a lot of useful information. Consider a Yahoo!Local "yellow page" as an example. (We obtained this page by querying Yahoo!Local with "car" as the keyword and "Atlanta" as the city; at the time of this report, it is the first item shown in the result page.) The information in the first rectangle contains the company name, telephone number, web site, address, etc. The second rectangle contains general information about the company, such as its business description, hours of operation, and categories. The third rectangle contains web users' ratings and comments about the company. All of this information is useful and interesting to web users. Our HTML parser extracts all of it and saves it into our database, where it is later used by the sentimental learner and the super local-search module.

[Figure: A Yahoo!Local yellow page]

To store the information extracted by the HTML parser, we designed the following database with five main relational tables. The Business table contains the general information related to a company, such as name, phone, and address. Its rating field is later updated by the sentimental learner, which computes a fair score for the company by analyzing the reviews and ratings given by web users. The latitude and longitude fields are computed by invoking the Google Map API with the company address. The Category table records each company's category information. The Review table contains all the reviews for each company in the Business table. The Reviewer table contains the reviewers' information, which may be used later (for example, to evaluate how good a reviewer is according to the quality and quantity of the reviews he or she writes). The Term table is the inverted index table and the core table of the system. It contains the term itself (simply, a word), the business where the term occurs, the frequency of the term for that business, etc. The Term table is heavily used by both the sentimental learner and the super local-search module.

[Figure: The database design]

To extract the information for the Term table, we need to analyze the review content, the business description, and so on. Fortunately, English words can easily be separated by delimiters such as spaces and commas. However, there are still two common problems to consider. First, we need to filter out common words such as 'we', 'the', and 'for', because these words carry little information for searching. To solve this problem, we employ a stop-word list and discard any word that occurs in the list. Second, the same English word can appear in several different forms. Take the word 'study' for example: it can also appear as 'studies', 'studied', and 'studying'. Therefore, we need an algorithm to map all these forms to a common base form. We employ the well-known Porter stemming algorithm for this purpose.
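A minimal Python sketch of this term extraction step is shown below. It is illustrative only: the stop-word list is abbreviated, and the PorterStemmer implementation from the NLTK library is used as a stand-in for the Porter algorithm.

    import re
    from collections import Counter
    from nltk.stem.porter import PorterStemmer   # Porter stemming algorithm

    # Abbreviated stop-word list for illustration; the real list is much longer.
    STOP_WORDS = {"we", "the", "for", "a", "an", "and", "of", "to", "in", "is", "it"}

    stemmer = PorterStemmer()

    def extract_terms(text):
        """Split text on non-letter separators, drop stop words, and stem the rest."""
        words = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

    def term_rows(business_id, review_texts):
        """Produce (term, business_id, frequency) rows destined for the Term table."""
        counts = Counter()
        for text in review_texts:
            counts.update(extract_terms(text))
        return [(term, business_id, freq) for term, freq in counts.items()]

    # Example: 'studied', 'studies' and 'studying' all map to the same stem.
    print(term_rows(42, ["We studied their studies while studying the menu."]))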
4. Sentimental Learning

Sentiment analysis focuses on the attitude of an author. It is often conducted within a specific topic, for example, the degree of satisfaction with a type of product, or political viewpoints. More specifically, the attitude usually includes the author's judgment, evaluation, affectional state, or intended emotional communication. Learning such sentiment can help us understand the author's intention and position, which facilitates further analysis and application.

Let us take an example from the web pages we have crawled. The following is a review page for an auto body shop. From the review page, we can see that there are several reviews posted by different customers. From the words the customers use, we can find both positive ones (e.g. excellent, great) and negative ones (e.g. liar, sloppy). Each review is also labeled with stars: the more stars a customer gives, the better he or she thinks the store is. If we can derive a general score for the store, we can not only save customers the time and energy of looking through all the available reviews, but also enable the design of more sophisticated ranking functions that integrate the recommendation score from previous customers. More specifically, we represent the crawled review corpus as bags of terms and use normalized TF-IDF features. Again, our goal is to identify the positive or negative opinion of a store.

4.1. Sentimental Concepts

We use the Vector Space Model to represent the term features that are later used for sentimental learning. These features are extracted from the user reviews and comments of each local search entity, which are believed to contain sentiment-sensitive signals usable for machine learning. For example, some keywords are related to positive sentiments, while others are related to negative sentiments. The following table presents some example terms classified into different sentiment categories.

Sentiment category      Example words
Positive sentiment      good, happy, wonderful, excellent, awesome, great, ok, okay, nice, brilliant, gorgeous, etc.
Negative sentiment      bad, ugly, outdated, shabby, tacky, stupid, wrong, awful, bullshit, stereotyping, etc.
Neutral sentiment       buy, take, shop, etc.

However, apart from these sentimental words, most words are neutral and unrelated to sentiment, such as buy, take, go, iPod, apple, and comment. These sentimentally unrelated terms make up a large majority of the term vocabulary, which causes a considerable level of noise for sentimental learning. As a result, these words need to be removed from our sentimental learning model.
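The sketch below illustrates one way such a filtering step could look, assuming the preprocessing from Section 3.1 and using scikit-learn's TfidfVectorizer for the normalized TF-IDF features. The small positive/negative lexicon here is only illustrative; it is not the vocabulary used in our experiments.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative sentiment lexicon (a real lexicon would be far larger).
    SENTIMENT_TERMS = {"good", "happy", "wonderful", "excellent", "awesome", "great",
                       "nice", "brilliant", "gorgeous",
                       "bad", "ugly", "outdated", "shabby", "tacky", "stupid",
                       "wrong", "awful", "liar", "sloppy"}

    reviews = [
        "Excellent work, great service, very happy with the result.",
        "Sloppy paint job and the manager is a liar. Awful experience.",
        "They take walk-ins and you can buy parts in the shop.",
    ]

    # Normalized (L2) TF-IDF over the whole vocabulary.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", norm="l2")
    X = vectorizer.fit_transform(reviews)

    # Keep only the columns whose terms appear in the sentiment lexicon.
    vocab = vectorizer.get_feature_names_out()
    keep = [i for i, term in enumerate(vocab) if term in SENTIMENT_TERMS]
    X_sentiment = X[:, keep]

    print("full vocabulary size:", len(vocab))
    print("sentiment-bearing terms kept:", [vocab[i] for i in keep])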
4.2. Manual Labeling versus Average Rating

In our approach, we consider the sentimental learning problem from a supervised learning perspective. A supervised learning algorithm requires labels for a set of training data, and the model learned from the training data can then be used to predict labels for other data. As a result, we need to identify the sentiment levels for a training set of local search entities. In our work, we use five sentiment levels that are easily distinguished by human perception (most rating systems provide a 5-level judgment), in which 1 represents least satisfactory (negative) and 5 represents most satisfactory (positive). With these labels, we can treat sentimental learning as a classical classification or regression problem in machine learning. Generally, there are two approaches to identifying the sentiment level of a local search entity (e.g. a bookstore or a shoe shop): manual labeling and average rating.

(1) The manual labeling approach lets human labelers assign one of the five sentiment levels based on all the user reviews or comments belonging to a local search entity.

(2) Additionally, each user review or comment comes with a rating filled in by the user who wrote it. A local search entity might include a handful or hundreds of reviews/comments, and the average of these ratings is also a measure of the sentiment level of the entity.

In our approach, we did not rely purely on the existing average ratings; we also adopted human labeling for sentimental training. The reason is the loss of information that averaging introduces in quantitative analysis: for example, if 5 users rate an online entity as 4 (positive) and 2 users rate it as 1, the average is about 3, which reads as a neutral sentiment. This will clearly influence the accuracy of the trained model. In the manual labeling approach, by contrast, three labelers judge the sentiment level of an online entity after viewing all of its comments or reviews, and the final sentiment is selected by voting among these labelers, which effectively avoids the bias introduced by individual ratings.

4.3. Principal Component Analysis

Before learning the sentiments, we note that the dimensionality of the feature vectors is high for our problem: there are 6857 tokens in total. If we used the normalized TF-IDF features directly, memory consumption would be large, and current scientific computing software would probably not be able to support the calculation. Moreover, there are many unrelated features in the raw TF-IDF representation; if we trained our machine learning model directly on the raw features, the resulting model might fit poorly. Therefore, we first conduct Principal Component Analysis (PCA) on the raw TF-IDF features.

PCA is a technique used to reduce a multidimensional dataset to lower dimensions for analysis. It is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. Evaluated by mean-square loss, PCA is theoretically the optimal transform for a given dataset. PCA can be used for dimensionality reduction by retaining the characteristics of the data set that contribute most to its variance, i.e. keeping the lower-order principal components and ignoring the higher-order ones; such low-order components often contain the "most important" aspects of the data. For a data matrix X^T with zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), where each row represents a different repetition of the experiment and each column gives the results from a particular probe, the PCA transformation is given by

    Y^T = X^T W = V Σ

where V Σ W^T is the singular value decomposition (SVD) of X^T. PCA has the distinction of being the optimal linear transformation for keeping the subspace with the largest variance. This advantage, however, comes at the price of greater computational requirements.
4.3.1. Linear Algebra Trick in PCA

For efficiency, we use a standard linear algebra trick in this process. If we represent each document as a column vector and stack the vectors into a matrix A, the rank of A is bounded by the smaller of its two dimensions. Usually the size of each vector is far larger than the number of training samples. Therefore, even though the covariance matrix B = A A^T is very large, its rank is low, and we would waste a lot of resources maintaining such a matrix and performing a singular value decomposition (SVD) on it to obtain the eigenvalues. Instead, performing the SVD on the much smaller matrix C = A^T A is far cheaper and fits easily in physical memory. Following the derivation illustrated in the figure below, B and C share the same set of non-zero eigenvalues, and the eigenvectors of B can be computed directly from those of C: if C v = λ v, then B (A v) = A (A^T A v) = λ (A v), so A v is an eigenvector of B with the same eigenvalue λ.

[Figure: The relationship between the eigenvalues and eigenvectors of B and C]

We preserve 95% of the covariance in the documents. Note that because we use the covariance matrix, not the correlation matrix, for the PCA computation, we do not need to remove the most dominant component (which corresponds to the mean vector).

4.4. Sentimental Learning Model

Because human sentiment is a very important measure of a local search entity, the learned sentiments can be used to judge the quality or popularity of that entity. In our work, we designed a sentimental learning model that ranks local search results with a combined metric that simultaneously evaluates content-based relevance and sentiment-based ranking. In a spirit similar to the well-known PageRank, our model can be represented as a combination of content-based similarity and sentimental value:

    SentiRank = a * ContentSim + (1 - a) * SentiValue

where a is a tunable parameter that controls the relative importance of content similarity and the sentimental value of an online entity. For example, if a is set to 1.0, the model reduces to a content-based Term Vector Space (TVS) model, and it becomes a pure sentimental ranking model if a is set to 0. In our experiments, we empirically set a to 0.5, since we believe both values are significant in determining the final ranking of a local search entity.

5. Sentimental Learning Results

In the following subsections, we report our experimental results on PCA dimensionality reduction of the term space model and present the performance of different supervised learning approaches for sentimental learning.

5.1. Data Description

As mentioned before, the dataset we use consists of the review documents crawled from Yahoo!Local and Yelp. We first pre-process the documents to remove stop words and run stemming to obtain the tokens. We then calculate the TF-IDF features for the document corpus, converting each document into a vector whose components are the TF-IDF values of the tokens. After this pre-processing, the review document corpus can be represented as a matrix: the rows correspond to the different documents, while the columns correspond to the different tokens appearing in the corpus. The total number of review documents we crawled is 2576, and the number of features, i.e. the number of distinct tokens, is 6857.

Although the feature space is large, most of the features turn out to be useless. The reason is simple: most of the words appearing in the review corpus carry little sentimental information, so a classifier built on such features would have much lower accuracy because of the noisy, non-relevant features. Moreover, if we used the raw features directly, most current software would not be able to support such a large memory requirement. For these reasons, we apply PCA before training the classifier.
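As an illustration, the following sketch reduces a TF-IDF matrix using the Gram-matrix trick of Section 4.3.1 (with documents as rows rather than columns) and keeps enough components to preserve 95% of the variance. It is a simplified stand-in for our actual pipeline: the input matrix here is random, whereas in practice it would be the 2576 x 6857 TF-IDF matrix or a downsampled subset of its rows.

    import numpy as np

    def pca_95(X, var_ratio=0.95):
        """PCA via the Gram-matrix trick: eigendecompose the small n x n matrix
        C = A A^T instead of the d x d covariance matrix, where A is the centered
        data with documents as rows (n documents, d term features, n << d)."""
        A = X - X.mean(axis=0)                 # zero empirical mean
        C = A @ A.T                            # small n x n Gram matrix
        eigvals, U = np.linalg.eigh(C)         # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]
        eigvals, U = eigvals[order], U[:, order]
        eigvals = np.clip(eigvals, 0, None)
        # Map eigenvectors of A A^T back to eigenvectors of A^T A (the components).
        W = A.T @ U
        W /= np.linalg.norm(W, axis=0) + 1e-12
        # Keep the smallest k preserving the requested share of the variance.
        ratio = np.cumsum(eigvals) / eigvals.sum()
        k = int(np.searchsorted(ratio, var_ratio)) + 1
        return A @ W[:, :k]                    # reduced n x k representation

    X = np.random.rand(100, 6857)              # placeholder for the TF-IDF matrix
    Y = pca_95(X)
    print("reduced dimensionality:", Y.shape[1])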
5.2. Principal Component Analysis Results

In our experiments, we conduct PCA for both the manual labeling model and the score averaging model. The resulting dimensions of the feature vectors are given below; in both models, the downsampling rates are set to 40%, 20% and 4% respectively.

Result dimension after PCA (manual labeling model)
  Downsampling rate    40%    20%    4%
  Result dimension     231    135    41

Result dimension after PCA (score averaging model)
  Downsampling rate    40%    20%    4%
  Result dimension     344    211    37

The reason for experimenting with different downsampling rates is to obtain result vectors of varying feature dimensionality. If the number of samples varies, the rank of the document TF-IDF matrix changes correspondingly, which results in a different number of new features in the vector space after PCA. Using different downsampling rates lets us test the performance of the subsequent classifiers more thoroughly.

5.3. Supervised Sentimental Learning

In our approach, we use the Vector Space Model (VSM) to represent a local search entity. Under this model, we extract all user comments/reviews for an entity and break them into bags of terms. Stop words are removed, and a stemming algorithm maps the terms into a unified term space. The frequently used TF-IDF model is then employed to compute a value for each term. As a result, for each local search entity we generate a feature vector containing the TF-IDF values of all the term tokens it includes. The problem is that these vectors are very large in the unified term space; they are also very sparse, because the term space is very large while the number of terms included in any one entity is comparatively small. It is therefore very difficult to devise heuristic rules that decide the sentiment from these sparse feature vectors. Fortunately, machine learning techniques can deal with such problems effectively, and PCA can transform the high-dimensional feature space into a much smaller one without losing generality.

In our approach, we adopted several classical supervised learning algorithms to learn a reliable sentimental model. For each algorithm, we used leave-one-out estimation to evaluate the learning results, and we compared the learning performance between manual labeling and average rating, and between the original large feature space and different levels of PCA-reduced feature space.

5.3.1. Support Vector Machine

In support vector regression, the input x is first mapped onto a high-dimensional feature space using some nonlinear mapping, and a linear model is then constructed in this feature space. Support vector regression uses a loss function called the ε-insensitive loss:

    L_ε(y, f(x)) = 0                  if |y - f(x)| ≤ ε
    L_ε(y, f(x)) = |y - f(x)| - ε     otherwise

Support vector regression tries to minimize ||ω||², which can be described by introducing non-negative slack variables ξ_i, ξ_i* to measure the deviation of training samples outside the ε-insensitive zone. The support vector machine is then formalized as the minimization of

    (1/2) ||ω||² + C Σ_i (ξ_i + ξ_i*)

subject to the corresponding ε-insensitive constraints.
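The sketch below shows how such a comparison could be run with scikit-learn: an SVM with linear, RBF and polynomial kernels, evaluated by leave-one-out cross-validation on a PCA-reduced feature matrix. The data here is synthetic, and the hyperparameters are illustrative defaults rather than the settings used in our experiments.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Synthetic stand-ins: X_pca would be the PCA-reduced TF-IDF features,
    # y the 1-5 sentiment labels (manual labels or rounded average ratings).
    rng = np.random.default_rng(0)
    X_pca = rng.random((120, 41))            # e.g. the PCA-41 feature space
    y = rng.integers(1, 6, size=120)

    loo = LeaveOneOut()
    for kernel in ("linear", "rbf", "poly"):
        clf = SVC(kernel=kernel, C=1.0)      # illustrative hyperparameters
        scores = cross_val_score(clf, X_pca, y, cv=loo)
        print("%-6s kernel: leave-one-out accuracy = %.3f" % (kernel, scores.mean()))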
[Figure (a): SVM accuracy on the average rating dataset with linear, RBF and polynomial kernels, on the original high-dimensional features and the PCA-37, PCA-211 and PCA-344 feature spaces.]

[Figure (b): SVM accuracy on the manual labeling dataset with linear, RBF and polynomial kernels, on the original high-dimensional features and the PCA-41, PCA-135 and PCA-231 feature spaces.]

As the figures show, the SVM achieves better performance on the PCA-reduced feature spaces (e.g. PCA-344 and PCA-211 in (a), and PCA-135 and PCA-231 in (b)) than on the original high-dimensional feature space. The results also show the nonlinearity of the features. We can also see that the learning performance on the manual labeling dataset is better than on the average rating dataset; the reason is that manual labeling effectively avoids the bias incurred by averaging ratings. Among the different kernels we adopted, the polynomial kernel achieves the best performance, and the overall best result is obtained by the polynomial-kernel SVM on the manual labeling dataset with the PCA-231 reduction. Our sentimental model adopts this learning approach as the final candidate.

5.3.2. Neural Network

When the labels are continuous real numbers, neural network learning can be applied to find the optimal f*, which is given by minimizing the following cost function:

    f* = argmin_f Σ_{i=1}^{m} (y_i - f(x_i))²

where m is the number of samples in the training dataset. Clearly, this is a multivariate nonparametric regression problem, since there is no a priori knowledge about the form of the true regression function being estimated. In our experiments, we adopted two typical neural networks to train a sentimental model: the multi-layer perceptron (MLP) and the radial basis function network (RBF). The MLP training results are shown in (a) below and the RBF training results in (b). As both figures show, the manual labeling data works better than the average rating data. However, in the neural network results a higher-dimensional feature space generally works better than a smaller one. This is natural, because the strength of a neural network lies in incorporating more features; on the other hand, it may also cause problems such as overfitting the training dataset.

[Figure (a): Multi-layer perceptron network - precision versus sampling feature size (0-8000), for manual labeling and average rating.]

[Figure (b): Radial basis function network - precision versus sampling feature size (0-8000), for manual labeling and average rating.]

5.3.3. K-Nearest Neighbor (KNN)

In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification; it can also be used for regression. The following experimental result shows that the PCA-reduced feature space achieves better or comparable performance relative to the original high-dimensional feature space.

[Figure: KNN results - precision versus sampling feature size (0-8000), for manual labeling and average rating.]
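For completeness, a corresponding k-NN sketch, again with scikit-learn and leave-one-out evaluation on synthetic stand-in data; the values of k tried here are illustrative only.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(1)
    X_pca = rng.random((120, 41))           # PCA-reduced features (stand-in)
    y = rng.integers(1, 6, size=120)        # 1-5 sentiment labels (stand-in)

    for k in (1, 3, 5, 9):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X_pca, y, cv=LeaveOneOut())
        print("k = %d: leave-one-out accuracy = %.3f" % (k, scores.mean()))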
6. Building Local Search Engine

Our website is based on the LAMP open source web platform: Linux, Apache, MySQL and PHP. In addition, we use ELGG, an open source social networking framework, as our application server. PHP is used mainly for the server-side application software, and Ajax is adopted in our framework to utilize the Google Map API.

For example, to retrieve our addresses we need to connect to our MySQL server, select the database in question, and run a query on the database to return the rows of interest. To do so, we use mysql_connect to establish a connection, mysql_select_db to select our particular database, and mysql_query to perform the address lookup on the database. We then use mysql_fetch_assoc to retrieve an associative array representing a row of the query results.

The HTTP geocoder can respond with either XML or CSV output, selected by changing the "output" query parameter; examples of both responses are shown in the Maps API documentation. Since our server is enabled with PHP 5 and the SimpleXML extension, it is preferable to retrieve XML output. We use the PHP simplexml_load_file function to create an XML object from the geocoder response, and XPath expressions to extract the coordinates and status code.

To find locations in our markers table that are within a certain radius of a given latitude/longitude, we can use a SELECT statement based on the Haversine formula. The Haversine formula is generally used to compute great-circle distances between two pairs of coordinates on a sphere. An in-depth mathematical explanation is given by Wikipedia, and a good discussion of the formula as it relates to programming is on Movable Type's site. The SQL statement below finds the closest 20 locations within a radius of 25 miles of the center coordinate. It calculates the distance between the latitude/longitude of each row and the target latitude/longitude, keeps only the rows whose distance is less than 25, orders the result by the ranking value, and limits it to 20 rows. To search by kilometers instead of miles, we can simply replace 3959 with 6371.

$query = sprintf("SELECT DISTINCT BUSINESS.BUSINESS_ID, BUSINESS.ADDRESS, BUSINESS.BUSINESS_NAME,
    BUSINESS.LATITUDE, BUSINESS.LONGTITUDE, BUSINESS.BUSINESS_PHONE, BUSINESS.RATING, BUSINESS.DESCRIPTION,
    TERM_BUSINESS.TERM, TERM_BUSINESS.NORM_FREQUENCY, BUSINESS_CATEGORY.CATEGORY,
    ( 3959 * acos( cos( radians('%s') ) * cos( radians( LATITUDE ) )
        * cos( radians( LONGTITUDE ) - radians('%s') )
        + sin( radians('%s') ) * sin( radians( LATITUDE ) ) ) ) AS distance,
    ( (BUSINESS.RATING + 0.01)
      / ( 3959 * acos( cos( radians('%s') ) * cos( radians( LATITUDE ) )
          * cos( radians( LONGTITUDE ) - radians('%s') )
          + sin( radians('%s') ) * sin( radians( LATITUDE ) ) ) ) ) AS RANKALL
    FROM TERM_BUSINESS, BUSINESS, BUSINESS_CATEGORY
    WHERE BUSINESS.BUSINESS_ID = TERM_BUSINESS.BUSINESS_ID
      AND BUSINESS.BUSINESS_ID = BUSINESS_CATEGORY.BUSINESS_ID
      AND MATCH(BUSINESS_NAME) AGAINST (" . $dbname1 . ")
      AND MATCH(ADDRESS) AGAINST (" . $dbname2 . ")
    GROUP BY LATITUDE, LONGTITUDE, RATING
    HAVING distance < '%s'
    ORDER BY RANKALL DESC
    LIMIT 0, 20",
    mysql_real_escape_string($center_lat), mysql_real_escape_string($center_lng),
    mysql_real_escape_string($center_lat), mysql_real_escape_string($center_lat),
    mysql_real_escape_string($center_lng), mysql_real_escape_string($center_lat),
    mysql_real_escape_string(25));
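For reference, here is a small Python sketch of the same great-circle filter and ranking, convenient for checking the formula outside MySQL. The (rating + 0.01) / distance ranking mirrors the query above; the business tuples are made up, and the small guards against domain and divide-by-zero errors are additions of the sketch.

    import math

    def great_circle_miles(lat1, lng1, lat2, lng2):
        """Spherical law-of-cosines distance in miles, as used in the SQL above."""
        lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
        return 3959 * math.acos(
            min(1.0, math.cos(lat1) * math.cos(lat2) * math.cos(lng2 - lng1)
                     + math.sin(lat1) * math.sin(lat2)))

    def radius_search(businesses, center_lat, center_lng, radius=25.0, limit=20):
        """businesses: list of (business_id, lat, lng, rating). Returns the top
        `limit` businesses within `radius` miles, ranked by (rating + 0.01) / distance."""
        ranked = []
        for business_id, lat, lng, rating in businesses:
            d = great_circle_miles(center_lat, center_lng, lat, lng)
            if d < radius:
                ranked.append(((rating + 0.01) / max(d, 1e-6), d, business_id))
        ranked.sort(reverse=True)
        return ranked[:limit]

    # Made-up sample data: two shops near Atlanta and one far away.
    sample = [(1, 33.7490, -84.3880, 4.2), (2, 33.8000, -84.4000, 2.5), (3, 40.71, -74.00, 5.0)]
    print(radius_search(sample, 33.7490, -84.3880))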
6.1. User Interface

In the search field, we can input the keyword we are interested in, such as auto, pet care, etc. In the address text field, we can input all kinds of addresses, including a city name, zip code, or street address. In the option selector we currently have two options, person and business, and we can search both: our framework is a social network for both businesses and people. For instance, we can input "auto" and "Chicago" in the first two text fields and choose "business" from the option selector. The figures below show the user interface and the search results interface.

6.2. Geo-coding of Location Information

In contrast to global search, location information is a crucial aspect of our local business search system, assisting both the searching and ranking processes. We use the Google Map API (geo-coding) to convert location information into geometric coordinates. The obtained pair of coordinates corresponds to the latitude and longitude of the geo-center of the address. For example, given the address of the store AA National Auto Parts, 3410 Washington St, Phoenix, AZ 85009, we can extract the corresponding coordinate pair (33.447708, -112.13246). Therefore, for each store we can obtain the exact geo-center coordinates from its address. For a user's input query, which might be incomplete and may contain only a city name or zip code, we simply extract the longitude and latitude of the city or of the area corresponding to the zip code as the geo-center of the user's desired area. The figure below shows examples of geo-centers for different types of inputs. One aspect of our current system that could be improved is interfacing our geo-coding functionality with location detection devices (e.g. GPS), so that users can obtain more exact location coordinates from such devices instead of typing an address.

6.3. Geo-distance Computation

Once we have the geo-coordinates of two locations, it is convenient to compute the geo-distance between them. We use the Haversine formula to compute the great-circle distance between two pairs of coordinates on a sphere:

    great-circle distance = 3959 * acos( cos(radians(lat1)) * cos(radians(lat2)) * cos(radians(lng2) - radians(lng1)) + sin(radians(lat1)) * sin(radians(lat2)) )

One defect of the great-circle distance is that it does not reflect the actual traveling distance between two locations, which is usually larger and not proportional to the great-circle distance. Switching to a driving-distance representation of geo-distance is therefore one direction for our next step.

6.4. Geo-Sentimental Ranking Model (GSRM)

To serve users' queries for local businesses, we design a geo-sentimental ranking model (GSRM) to perform searching and ranking over our local business database. Our GSRM has the following three factors.

6.4.1. Content Similarity

This indicates the relevance of the user's input query to a search result. We pre-compute the term frequency of the user's query terms for each business store from all the text information we extract for that store, including the business title, categories, descriptions, and users' reviews. The reason is that the business title and description alone may not cover all the products or services a store provides, while users' reviews can considerably enrich the store's product and service information.

6.4.2. Sentimental Value

This is the recommendation score we give each business store, obtained from our sentimental learning. We score each store from 1 to 5 to indicate whether it is favored by other customers and whether it is recommended by our system.

6.4.3. Geo-distance

This is the geographical information indicating how far one particular location is from another, computed via the Google Map API as described in the previous section.

These three factors are comprehensive in terms of the information they collect, since together they capture the user's requirements, other users' evaluations, and the geo-information of the business stores themselves. We combine them to construct our Geo-Sentimental Ranking Model (GSRM) as follows:

    rank = ( a * ContentSim + (1 - a) * SentiValue ) / GeoDistance^b

where a ∈ [0,1] and b ∈ {0,1}. Our GSRM model is very flexible and can serve users' different search preferences through different combinations of the values of a and b. For example:

1) a = 1, b = 0 gives pure content-based ranking, which returns results to the user in order of decreasing content relevance.
2) a = 0.5, b = 0 gives hybrid content-sentimental ranking, which combines the content similarity and our sentimental recommendation value to sort the returned results.
3) a = 1, b = 1 gives geo-distance weighted ranking, which returns results in order of decreasing content similarity weighted by the geo-distance.
4) a = 0.5, b = 1 gives hybrid geo-content-sentimental ranking, which combines all three factors in our system to serve users who want to find highly recommended, relevant stores nearby.
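A minimal Python sketch of this ranking function follows. The content-similarity and sentiment inputs are assumed to be precomputed, the example values are made up, and the small epsilon guard against a zero geo-distance is an addition of the sketch rather than part of the model.

    def gsrm_rank(content_sim, senti_value, geo_distance, a=0.5, b=1, eps=1e-6):
        """Geo-Sentimental Ranking Model score:
        (a * ContentSim + (1 - a) * SentiValue) / GeoDistance^b,
        with a in [0, 1] and b in {0, 1}."""
        assert 0.0 <= a <= 1.0 and b in (0, 1)
        score = a * content_sim + (1 - a) * senti_value
        return score / max(geo_distance, eps) ** b

    # Example: the four configurations discussed above for one candidate store.
    store = {"content_sim": 0.8, "senti_value": 4.0, "geo_distance": 2.5}  # made-up values
    for a, b in ((1, 0), (0.5, 0), (1, 1), (0.5, 1)):
        print("a=%.1f b=%d -> rank = %.3f" % (a, b, gsrm_rank(a=a, b=b, **store)))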
There are several aspects to be improved in our future work. First, the content similarity based on term frequency can easily be spammed by malicious customers who input the same term in a review thousands of times. Second, we have not yet compared our ranking model with the ones used by Yahoo!Local or Yelp, since we have not built a database large enough to contain a sufficient number of business stores to serve arbitrary user queries.

References

[1] Local Search. http://en.wikipedia.org/wiki/Local_search_(Internet)
[2] United States Cities. http://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
[3] Stop Words. http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
[4] Porter Stemming Algorithm. http://tartarus.org/~martin/PorterStemmer/
[5] Google Map API. http://code.google.com/apis/maps/documentation/
[6] Open Source Social Network Platform. http://elgg.org/index.php
[7] D. Tebbutt. Search Moves Up a Notch with Emotional Feedback. Information World Review, July 2006.
[8] I. T. Jolliffe. Principal Component Analysis, 2nd ed. Springer Series in Statistics, Springer, New York, 2002.
[9] K. Fukunaga. Introduction to Statistical Pattern Recognition. Elsevier, 1990.
[10] H. Zha, C. Ding, M. Gu, X. He, and H. D. Simon. Spectral Relaxation for K-means Clustering. NIPS 2001, vol. 14, pp. 1057-1064.
[11] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1): 71-86, 1991.
[12] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. CoRR cs.CL/0205070, 2002.
[13] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. WWW 2007, pp. 171-180.
[14] K. Eguchi and V. Lavrenko. Sentiment Retrieval using Generative Models. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 345-354, July 2006.
[15] C. Engström. Topic Dependence in Sentiment Classification. Master's thesis, University of Cambridge, 2004.
[16] B. Liu, M. Hu, and J. Cheng. Opinion Observer: Analyzing and Comparing Opinions on the Web. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pp. 342-351, 2005.
[17] http://www.g1.com/Images/Products/BusinessGeographics/Geocoding_Accuracy_1_Web_350.png
[18] http://www.astro.uu.nl/~strous/AA/pic/Figure_4.jpg
[19] http://bp2.blogger.com/_Gct8lVAxKqQ/RmgL6wLR6dI/AAAAAAAAACY/BL1KdCuNmRc/s1600-h/GoogleEarth_Image.jpg