YelpBusy: Prediction of Busyness Levels
Erdinc Korpeoglu, Jasen Hall
UC Santa Barbara
{korpeoglu, jasen}@cs.ucsb.edu

Abstract— Yelp is a popular local business directory service that incorporates reviewing and social networking features. It lets people see how well businesses in various categories, including restaurants, shopping malls, and cinemas, serve their customers. Although Yelp offers many capabilities, even reporting the health scores of restaurants, it cannot tell users how busy a business is at a given moment. Knowing whether a business is currently at or above its capacity to serve customers would be valuable, especially for potential patrons who need quick service because of time constraints. The main objective of this paper is to develop a model that dynamically predicts the busyness level, a measure of current congestion relative to the capacity to serve customers, for various businesses.

I. MOTIVATION AND PROBLEM DEFINITION

Online review services such as Yelp [8], TripAdvisor [7], Yahoo Local [4], Citysearch [2], and MerchantCircle [5] provide new forms of interaction between customers and businesses. While the information provided through reviews and ratings is useful, an interesting problem lies in extracting more information and predicting future behavior by analyzing and correlating the available data. Yelp currently dominates this market, yet even Yelp provides no information about the busyness levels of shops, restaurants, cinemas, hotels, or bars. For many customers, crowding and waiting times are as critical as the quality of service. Our goal is to produce dynamic and consistent predictions that classify businesses by busyness level.

II. BACKGROUND

Our work mainly involves sentiment analysis and forecasting over time series. The two main tasks in sentiment analysis are sentiment detection and polarity classification. Various sentiment analysis techniques exist, including n-gram, lexicon, and part-of-speech approaches. In general, sentiment classification is built from the semantic orientation of words, sentences, and documents. Support Vector Machines, Naive Bayes, and Maximum Entropy are among the machine learning algorithms used for sentiment analysis.

In terms of predictive analysis, prior work uses a variety of metrics to forecast future outcomes. Asur et al. [1] predict the opening-weekend box-office revenue of a movie from the tweets that refer to it prior to its release; all tweets sent in the week before release define a tweet rate, the number of tweets referring to a particular movie per hour. Zhang et al. [9] propose a news-based model to predict movie revenue. Nguyen et al. [6] aim to predict the dynamic change in sentiment of a given topic over time in time-series datasets from Twitter, where a history window determines the length of the data sequence used for prediction. Finally, Gruhl et al. [3] examine the dynamic correlation between online sales rank and blog mentions.

III. METHODOLOGY

Forecasting the busyness degree of a business at a specific time poses several challenges. First, busyness is not a precisely countable quantity like the number of transactions or the amount of revenue; to differentiate busyness degrees, we need reference data that can be interpreted as a proxy for busyness.
We therefore initially decided to use the hourly check-in counts as this reference, since they are the only hour-based information in the Yelp dataset. Analyzing the dataset, however, we found that check-ins for a specific business occur very rarely, roughly once or twice a week. This forced us to switch to forecasting the busyness degree on a daily basis rather than an hourly one. A second challenge is to explore how businesses of the same type in the same neighborhood affect one another.

A. Busyness Degree

Since busyness is not a countable quantity, we must make assumptions about the true busyness degree. To examine the accuracy of our prediction scheme, we apply the following heuristic, which combines the polarity and the number of reviews for a day. The coefficients a and b are determined through an empirical approach.

Busyness Degree = normalize(K)
K = a * polarity + b * normalize(numberOfReviews)
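The following is a minimal sketch of this heuristic. The min-max normalization and the example values of a and b are our own assumptions for illustration; the paper only states that a and b are chosen empirically.

```python
def normalize(values):
    """Min-max normalize a list of numbers into [0, 1] (an assumed choice of normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def busyness_degree(polarities, review_counts, a=0.5, b=0.5):
    """Combine daily polarity and normalized review counts into a busyness degree per day."""
    norm_reviews = normalize(review_counts)
    k = [a * p + b * r for p, r in zip(polarities, norm_reviews)]
    return normalize(k)

# Example: one week of daily polarity values and review counts for a single business.
polarity = [0.2, 0.4, 0.3, 0.6, 0.8, 0.9, 0.7]
reviews = [3, 5, 4, 8, 12, 15, 10]
print(busyness_degree(polarity, reviews))
```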
B. Linear Regression

The main elements of our prediction model are the polarity and the number of reviews. The polarity quantifies review sentiment about how busy a business is; it is measured as the ratio of negative reviews to positive and neutral reviews. Neutral reviews are counted together with positive reviews because positive reviews concerning busyness are rare. The polarity is obtained through sentiment analysis, and both the polarity and the number of reviews are considered on a daily basis.

Our prediction mechanism is a linear regression model with two components. As shown in Figure 1, these components are referred to as C1 and C2: C1 is the time series of the business whose busyness is being predicted, and C2 is the collection of time series for businesses of the same type in the same neighborhood. Each time series contains information from the n prior weeks for the same day of the week. Elements M1, M2, ..., Mn refer to the n prior Mondays used to estimate the busyness degree for the current Monday, and each element holds the polarity and the number of reviews for the corresponding day. The time series B1, B2, ..., Bk hold the information for the k businesses of the same type in the same neighborhood.

Figure 1: Application of the linear regression.

C. Sentiment Analysis

To extract time-series meaning from the Yelp reviews, we need to identify opinions that describe the time frames we are targeting. To do this, we construct dictionaries of words that describe these time frames. For example, words that correlate with the earlier target time might be 'lunch', 'luncheon', 'midday', or meal types such as 'buffet'; the later target time might be identified by words like 'dinner' and 'evening', or meals like 'banquet' or 'potluck'.

Once we have extracted and categorized the opinions specific to our target times, we need to gauge their sentiment with respect to the speed of service, that is, the degree to which the reviewer had to wait for a meal. To accomplish this we again create dictionaries of words, this time containing expressions that communicate positive ('quick', 'fast', 'empty') and negative ('wait', 'crowded', 'forever') opinions about the busyness of the reviewer's experience. With these dictionaries, we are able to use the Python Natural Language Toolkit (NLTK) to determine the sentiment of the extracted opinions.

For each business, we create a busyness sentiment for each day of the week. This sentiment rank ranges from 0 (not busy) to 1 (completely busy), with 0.5 representing a neutral opinion about the congestion and wait time of the reviewer's experience. To calculate these sentiment ranks, we assume that reviews were written on the day of the visit. We then count the positive, negative, and neutral opinions for both time targets on each day for all businesses, and normalize the results to fit the sentiment within our range.
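The sketch below illustrates the dictionary-based scoring described above. The word lists are taken from the examples in the text, but the tokenization, the per-review scoring rule, and the per-day averaging are our own simplifications (the paper uses NLTK for sentiment determination and does not spell out its exact procedure).

```python
from collections import defaultdict

# Time-frame dictionaries (examples from the text) and busyness-opinion dictionaries.
LUNCH_WORDS = {"lunch", "luncheon", "midday", "buffet"}
DINNER_WORDS = {"dinner", "evening", "banquet", "potluck"}
NOT_BUSY_WORDS = {"quick", "fast", "empty"}
BUSY_WORDS = {"wait", "crowded", "forever"}

def tokens(text):
    """Very simple lowercase tokenization; an assumed stand-in for an NLTK tokenizer."""
    return set(text.lower().replace(",", " ").split())

def mentions(text, time_frame_words):
    """True if the review mentions any word from the given time-frame dictionary."""
    return bool(tokens(text) & time_frame_words)

def busyness_opinion(text):
    """0.0 = not busy, 1.0 = busy, 0.5 = neutral for a single review."""
    t = tokens(text)
    busy, calm = len(t & BUSY_WORDS), len(t & NOT_BUSY_WORDS)
    if busy > calm:
        return 1.0
    if calm > busy:
        return 0.0
    return 0.5

def daily_sentiment_rank(reviews, time_frame_words):
    """reviews: list of (weekday, text) pairs, assumed written on the day of the visit.
    Returns weekday -> average busyness rank in [0, 1] for the given time target."""
    per_day = defaultdict(list)
    for weekday, text in reviews:
        if mentions(text, time_frame_words):
            per_day[weekday].append(busyness_opinion(text))
    return {day: sum(v) / len(v) for day, v in per_day.items()}

# Example usage with two toy reviews for the later (dinner) time target.
sample = [("Friday", "Great dinner but we had to wait forever, so crowded"),
          ("Tuesday", "Quick lunch, the place was almost empty")]
print(daily_sentiment_rank(sample, DINNER_WORDS))
```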
IV. EXPERIMENTAL EVALUATION

A. Yelp Dataset

The data released by Yelp for the Yelp Dataset Challenge spanned from March 2005 to January 2013 and included data for almost 12,000 businesses and 44,000 users in the city of Phoenix, Arizona. It also contained about 230,000 reviews of businesses written by real Yelp users. Altogether and uncompressed, this amounted to 232 MB of data.

As we investigated the distribution of the data, we became concerned. Over a time span of 8 years, there are on average only about 19 reviews per business, and most businesses in the dataset received just a single review in those 8 years. Fearing that this sparsity would frustrate our attempts to make sound predictions, we reduced the dataset to businesses identified as restaurants; considering the use cases for a busyness predictor, restaurants are more relevant than businesses such as auto parts stores. After pruning the non-restaurant businesses, we were left with 4,503 businesses and 158,430 reviews, which improved the average review-to-business count from 19 to 35. Repeating the analysis, we found that more than half of these restaurants had fewer than 10 reviews, so we pruned the dataset again to keep only restaurants with 100 or more reviews. This left 389 businesses and 72,441 reviews, an average of 186 reviews per business. At this point we decided that these were the best numbers we could get out of the data: reducing the set to businesses with 500 or more reviews would have left 10 businesses and just over 6,000 reviews, less than 1/1000th of the original businesses and 2% of the reviews. Because of this significant sparsity, we decided to generate our own datasets to evaluate the models we want to apply.

B. Dataset Generation

To explore the effect of locality on busyness degrees, we apply our methods to datasets of two categories: locality-unaware and locality-aware. The locality-unaware datasets are designed to show no correlation between businesses in either the number of reviews or the polarity; they serve as a control group to test whether the results obtained with our methods differ because of the locality effect. The locality-aware datasets are generated so that an increase in the number of reviews for a specific business causes, on average, a decrease for the surrounding businesses. Since we have no information about the capacity of the businesses, we do not model correlation between businesses in terms of polarity; both dataset categories are therefore locality-unaware with respect to polarity.

The locality-unaware and locality-aware datasets are each generated in two ways: empirically and with a normal distribution. Our empirical generation essentially repeats a trend defined over a single time period across multiple periods, whereas the normal-distribution-based generation draws values from a normal distribution over the whole time period. As a result, we obtain four dataset types: empirical locality-unaware, empirical locality-aware, normal-distribution locality-unaware, and normal-distribution locality-aware.
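As a rough illustration, the sketch below generates a normal-distribution, locality-aware dataset of daily review counts. The paper does not specify its generation parameters, so the neighborhood size, baseline mean, noise level, and redistribution rule here are all our own assumptions; the only property we try to preserve is that an increase for one business lowers the counts of its neighbors on average.

```python
import numpy as np

def generate_locality_aware(num_businesses=5, num_days=70, mean=10.0, std=3.0, seed=0):
    """Return a (num_businesses x num_days) array of synthetic daily review counts."""
    rng = np.random.default_rng(seed)
    counts = rng.normal(mean, std, size=(num_businesses, num_days))
    # Locality effect: on each day, boost one randomly chosen business and
    # spread an equal total decrease over the remaining neighbors.
    for day in range(num_days):
        winner = rng.integers(num_businesses)
        boost = abs(rng.normal(0.0, std))
        counts[winner, day] += boost
        counts[np.arange(num_businesses) != winner, day] -= boost / (num_businesses - 1)
    return np.clip(np.round(counts), 0, None).astype(int)

data = generate_locality_aware()
print(data[:, :7])  # first week of daily review counts for each business
```

A locality-unaware control dataset would simply skip the redistribution loop, so that the businesses' counts stay independent.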
C. Methods

We apply four methods in our evaluations: three based on linear regression and one based on a weighted moving average. The methods are applied to the number of reviews and to the polarity separately, so each application forecasts a single dependent variable. Linear regression is parameterized by three quantities: the training set size, the number of weeks, and the prediction set size. The training set size determines the number of observations, the number of weeks determines the number of predictors per observation used to fit the linear model, and the prediction set size is the number of observations predicted by the fitted model and examined for accuracy. Our accuracy measure is the mean squared error (MSE).

1) Simple Linear Model: The predictors are the number of reviews or the polarity values for the same day of the week in the previous weeks, corresponding to C1 in Figure 1. Locality is not used in the predictions. The simple linear model and the weighted moving average model serve as baselines for the two locality-based models below.

2) Linear Locality Total Average Model: In addition to the predictors of the simple linear model, the number of reviews of each neighboring business over the same set of days (C2 in Figure 1) is collected, and their overall average is added to the linear model. This model therefore has one more predictor than the simple linear model.

3) Linear Locality Daily Average Model: Unlike the total average model, this model uses the daily averages of the neighbors' review counts, so the number of predictors is twice that of the simple linear model (a sketch of the predictor construction is given below).

4) Weighted Moving Average Model: A weighted moving average is applied over a fixed-size history of observations, with the weights chosen so that the most recent observation receives the highest weight (also sketched below).
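The following sketch shows how the predictor matrices of the three linear models could be built for one business and fitted with ordinary least squares. The data layout (one row per target day, predictors from the same weekday in the n prior weeks) follows the description above, but the use of NumPy's least-squares solver and the toy data are our own assumptions; the paper does not specify its fitting procedure.

```python
import numpy as np

def build_design(target, neighbors, n_weeks, mode="simple"):
    """target: 1-D array of weekly values for one weekday (e.g. Monday review counts).
    neighbors: 2-D array (k businesses x weeks) of the same quantity for nearby businesses.
    Returns (X, y) where row t predicts target[t] from the n_weeks previous values."""
    rows, ys = [], []
    for t in range(n_weeks, len(target)):
        row = list(target[t - n_weeks:t])                          # C1: own history
        if mode == "total_avg":
            row.append(neighbors[:, t - n_weeks:t].mean())          # C2: one overall average
        elif mode == "daily_avg":
            row.extend(neighbors[:, t - n_weeks:t].mean(axis=0))    # C2: one average per week
        rows.append(row)
        ys.append(target[t])
    return np.array(rows), np.array(ys)

def fit_and_predict(X_train, y_train, X_test):
    """Fit ordinary least squares with an intercept and predict the test rows."""
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.hstack([X_test, np.ones((len(X_test), 1))]) @ coef

# Example: 20 weeks of Monday review counts for one business and 4 neighbors.
rng = np.random.default_rng(1)
own = rng.poisson(10, 20).astype(float)
nbrs = rng.poisson(10, (4, 20)).astype(float)
X, y = build_design(own, nbrs, n_weeks=3, mode="daily_avg")
pred = fit_and_predict(X[:-4], y[:-4], X[-4:])
print("MSE:", float(np.mean((pred - y[-4:]) ** 2)))
```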
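For completeness, here is a small sketch of the weighted moving average baseline. The linearly increasing weights are an assumption; the paper only states that the most recent observation receives the highest weight.

```python
def weighted_moving_average(history, window=4):
    """Predict the next value from the last `window` observations,
    giving the most recent observation the largest weight."""
    recent = history[-window:]
    weights = list(range(1, len(recent) + 1))  # oldest -> 1, newest -> window
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)

# Example: predict next Monday's review count from the previous Mondays.
print(weighted_moving_average([8, 12, 9, 15, 11, 14]))
```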
D. Results

Figure 2: Experimental results based on the mean squared error. Panels: (a) empirical locality-unaware MSE, (b) empirical locality-unaware CDF, (c) empirical locality-aware MSE, (d) empirical locality-aware CDF, (e) normal locality-unaware MSE, (f) normal locality-unaware CDF, (g) normal locality-aware MSE, (h) normal locality-aware CDF.

The experimental results obtained by applying the four methods to the four dataset types are shown in Figure 2. Since the number of reviews is the only quantity we consider to be affected by locality, the experiments are conducted only for this quantity. We believe the polarity does not reflect any strong correlation between businesses in the same neighborhood, so our approach is to apply the simple linear model to the polarity, and we do not report results for it. As mentioned before, our accuracy measure is the mean squared error (MSE), so our results are a collection of MSE measurements and evaluations. Note that MSE is used only to compare methods within the same dataset; different datasets may have different MSE ranges depending on the generation process. Figures 2(a), 2(c), 2(e), and 2(g) show the MSE measurements for 100 sample datasets, and Figures 2(b), 2(d), 2(f), and 2(h) show the corresponding CDFs.

As expected, on the empirical locality-unaware dataset the three linear models show no significant difference in MSE (Figures 2(a), 2(b)), because the locality effect is not included in the generation of this dataset. Although the empirical locality-aware dataset does not yield a significant CDF improvement for the locality-based linear models, the linear locality daily average model exhibits the most stable MSE behaviour among the linear models: as shown in Figure 2(c), both the linear locality total average model and the simple linear model show spikes in MSE while the linear locality daily average model does not. This suggests that the daily average model is the most reliable of the linear models on this dataset.

For the normal-distribution locality-unaware dataset, it is hard to draw a clear distinction in Figure 2(e); all linear models look similar. Figure 2(f) shows no significant difference between the two locality-based linear models, which perform slightly better than the simple linear model; however, it is hard to draw a strong conclusion here, and the difference is most likely due to the dataset generation process. Lastly, the normal-distribution locality-aware dataset yields the expected performance separation between the linear models (Figures 2(g), 2(h)): the linear locality daily average model performs significantly better than the other two linear models, and the linear locality total average model achieves better MSE than the simple linear model.

Overall, the experimental results show that the locality-based linear models, and especially the linear locality daily average model, are better tools for predicting future outcomes when a strong locality correlation exists. Apart from that, the weighted moving average model outperforms the linear models in every case.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we propose two approaches that exploit the locality effect among businesses of the same type in the same neighborhood to forecast busyness degrees. We express busyness levels through the number of reviews and the polarity. Since we believe the locality effect between businesses is more relevant for the number of reviews than for the polarity, we focus our effort on forecasting the number of reviews. As methods, we apply the simple linear model and the weighted moving average model as baselines for our linear locality total average and linear locality daily average models.

As future improvements, busyness levels should be measured with the heuristic given in the methodology section, and the relation between these measurements and both the number of reviews and the polarity should be explored. The weighted moving average could be modified to distribute its weights according to the locality relations, and other time-series models such as ARMA could be applied using the locality effect.

One of the difficulties we face is the significant sparsity of the dataset, which is a serious obstacle to a sound analysis. As future work, the locality relation should be explored in detail by examining trends in different factors, including the number of reviews, the polarity, and the ratings, among businesses in the same neighborhood. The proposed approaches performed reasonably well on the generated datasets, but we did not have the chance to apply them to real-world data, and we would like to determine whether a real effect between businesses exists. Since we use our own generated datasets, our work rests on several assumptions; nevertheless, we now know what to apply first once real data are available. Another challenge is the lack of publications that focus on prediction taking the locality effect into account, which forced us to rely on heuristics in our methodology. Finally, forecasting a quantity that is not directly countable prevents us from obtaining concrete accuracy results.

REFERENCES

[1] Sitaram Asur and Bernardo A. Huberman. "Predicting the Future with Social Media". In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. WI-IAT '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 492-499. ISBN: 978-0-7695-4191-4.
[2] Citysearch. 2013. URL: http://www.citysearch.com.
[3] Daniel Gruhl et al. "The predictive power of online chatter". In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. KDD '05. Chicago, Illinois, USA: ACM, 2005, pp. 78-87. ISBN: 1-59593-135-X.
[4] Yahoo! Local. 2013. URL: http://local.yahoo.com.
[5] MerchantCircle. 2013. URL: http://www.merchantcircle.com.
[6] Le T. Nguyen et al. "Predicting collective sentiment dynamics from time-series social media". In: Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining. WISDOM '12. Beijing, China: ACM, 2012, 6:1-6:8. ISBN: 978-1-4503-1543-2.
[7] TripAdvisor. 2013. URL: http://www.tripadvisor.com.
[8] Yelp. 2013. URL: http://www.yelp.com.
[9] Wenbin Zhang and Steven Skiena. "Improving Movie Gross Prediction through News Analysis". In: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. WI-IAT '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 301-304. ISBN: 978-0-7695-3801-3.