YelpBusy: Prediction of Busyness Levels

Erdinc Korpeoglu, Jasen Hall
UC Santa Barbara
{korpeoglu, jasen}@cs.ucsb.edu
Abstract— Yelp is a popular local business directory service
that incorporates reviewing and social networking features.
It lets people judge how well businesses perform in
categories such as restaurants, shopping malls, and
cinemas. Although it has many capabilities, even
reporting the health scores of restaurants, it cannot
tell users how busy a business currently is.
Knowing whether a business is currently
at or above its capacity to serve its customers would be
valuable, especially for potential patrons who need quick
service because of time constraints.
The main objective of this paper is to develop a model to
dynamically predict busyness level – a measure of the current
congestion versus capacity to service customers – for various
businesses.
I. MOTIVATION AND PROBLEM DEFINITION
Online review services including Yelp [8], TripAdvisor
[7], Yahoo Local [4], Citysearch [2], and MerchantCircle
[5] provide new forms of interaction between customers and
businesses. While the information provided through
reviews and ratings is useful, an interesting problem lies
in extracting more information and predicting future outcomes
through analysis and correlation of the available data. Yelp
currently dominates this market. However, even Yelp does
not provide any information about the busyness levels of
shops, restaurants, cinemas, hotels, or bars, although crowding
and waiting times are as critical as the quality
of service for many customers. Our intention is to produce
dynamic and consistent predictions that classify businesses in
terms of busyness levels.
II. BACKGROUND
Our work mainly requires effort in sentiment analysis and
forecasting over time-series. Two main tasks in sentiment
analysis are sentiment detection and polarity classification.
Sentiment analysis techniques include n-gram,
lexicon-based, and part-of-speech approaches. In general, sentiment
classification is built from the semantic orientation of
words, sentences, and documents. Support Vector Machines,
Naive Bayes, and Maximum Entropy are some of the machine
learning algorithms used in sentiment analysis. In
predictive analysis, prior work has used a variety of
signals to forecast future outcomes. Asur et al. [1]
predict the box-office revenue generated by a movie in its
opening weekend using the tweets referring to the movie
prior to its release. All tweets sent in the week prior to
release are considered to define a tweet-rate that is the
number of tweets referring to a particular movie per hour.
Zhang et al. [9] propose a news-based model to predict
movie revenue. Nguyen et al. [6] aim to predict the dynamic
change in sentiment of a given topic over time in time-series datasets from Twitter. A history window determines
the length of the data sequence that is used for the prediction.
Finally, Gruhl et al. [3] examine the dynamic correlation
between online sales rank and blog mentions.
III. METHODOLOGY
There are several challenges in forecasting the busyness
degree of a business at a specific time. First,
busyness is not a precisely countable quantity like the number of
transactions or the amount of revenue. To differentiate
busyness degrees, we need reference data
that can be interpreted as a measure of busyness. Therefore,
we initially decided to use the hourly number of check-ins
as the reference, since it is the only hour-based information
in the Yelp dataset. Through analysis of the dataset, we found
that check-ins for a specific business occur
very rarely, once or twice a week. This forced us
to switch to forecasting
the busyness degree on a daily basis rather than
hourly. A further challenge is to explore
the effect that businesses of the same type in the same
neighborhood have on each other.
A. Busyness Degree
Since busyness is not a countable concept, we need to
make some assumptions about the real busyness degree. We
apply the following equation, which combines
the polarity and the number of reviews for the day, to examine
the accuracy of our prediction scheme. Coefficients a and b
are to be determined empirically.
Busyness Degree = normalize(K)
K = a * polarity + b * normalize(numberOfReviews)
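As an illustration, the heuristic above could be computed as in the following Python sketch. The text does not define normalize, so min-max scaling to [0, 1] is assumed here; the function names and the example values of a and b are likewise illustrative only.

```python
def normalize(values):
    """Min-max scale values to [0, 1] (an assumed definition of normalize)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: there is no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def busyness_degree(polarities, review_counts, a=0.5, b=0.5):
    """Busyness Degree = normalize(K), K = a*polarity + b*normalize(numberOfReviews).

    polarities and review_counts hold one value per day. The defaults for
    a and b are placeholders, since the coefficients are to be set empirically.
    """
    scaled_counts = normalize(review_counts)
    k = [a * p + b * c for p, c in zip(polarities, scaled_counts)]
    return normalize(k)


# Three days with rising polarity and rising review counts (made-up values).
degrees = busyness_degree([0.2, 0.5, 0.9], [1, 3, 5])
```

Under these assumptions the least busy day maps to 0 and the busiest to 1, with the remaining days placed proportionally in between.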
B. Linear Regression
The main elements of our prediction model are the
polarity and the number of reviews. The polarity
quantifies the sentiment about a business in terms of how
busy it is. It is measured as the ratio of negative reviews
to positive and neutral reviews. Neutral reviews
are taken into account because positive reviews about
busyness are rare. The polarity is obtained
through sentiment analysis, and both the polarity and the
number of reviews are considered on a daily basis.
Our prediction mechanism uses a linear regression model
with two components. As shown in Figure 1, these
components are referred to as C1 and C2: C1 is the time series
of the business for which the prediction is
made, and C2 is the collection of time series for
businesses of the same type in the same neighborhood. Each
time series includes information from the n prior weeks for
the same day of the week. For example, elements M1, M2, ..., Mn refer to the n prior
Mondays used to estimate the busyness degree on the current
Monday, and each element holds the polarity and
number of reviews for the corresponding day. Similarly,
the time series B1, B2, ..., Bk hold this information for the k
businesses of the same type in the same neighborhood.
Figure 1: Application of the Linear Regression
C. Sentiment Analysis
To extract time-series meaning from the
Yelp reviews, we need to identify opinions that describe
the time frames we are targeting. To do this, we
construct dictionaries of words that describe these time
frames. For example, words that indicate the earlier
target time might be 'lunch', 'luncheon', or 'midday', or types
of meals such as 'buffet'. The later target time might
be identified by words like 'dinner' or 'evening', or by meals
like 'banquet' or 'potluck'.
Once we have extracted and categorized opinions specific
to our target times, we need to gauge the sentiment of
these opinions with respect to the speed of service or
how long the reviewer had to wait for a meal.
To accomplish this we again create dictionaries of
words, but this time the dictionaries contain expressions that
communicate positive ('quick', 'fast', 'empty') and negative
('wait', 'crowded', 'forever') opinions of the busyness of the
reviewer’s experience.
With these dictionaries, we are able to use the Python
Natural Language Toolkit (NLTK) to determine the sentiment of the extracted opinions. For each business, we
create a busyness sentiment for each day of the week. This
sentiment rank ranges from 0 (not busy) to 1 (completely
busy), with 0.5 being a neutral opinion regarding the congestion or wait time in a reviewer’s experience. To calculate these
sentiment ranks, we assume that reviews were
written on the day of the visit. We then count the
positive, negative, and neutral opinions for both time
targets on each day for all businesses, and normalize
the results to fit the sentiment within our range.
IV. EXPERIMENTAL EVALUATION
A. Yelp Dataset
The data released by Yelp for the Yelp Dataset Challenge
spanned from March 2005 to January 2013 and included
data for almost 12,000 businesses and 44,000 users in the
city of Phoenix, Arizona. It also contained about 230,000
reviews of businesses created by real Yelp users. Altogether
and uncompressed, we were dealing with 232 MB of data.
As we investigated the distribution of the data, we became
concerned. Over a time span of 8 years,
there were on average only about 19 reviews per business, and
on further investigation we found that most of the
businesses in the dataset had received only a single review in
those 8 years.
Fearing that the sparsity of this data would frustrate our
attempts to make sound predictions, we decided to reduce the dataset to businesses identified as
restaurants. For the use cases of a busyness predictor,
restaurants are more relevant than businesses
such as auto parts stores.
After pruning out the non-restaurant businesses, we were
left with 4503 businesses and 158,430 reviews that applied
to those businesses. This improved our average review-to-business count from 19 to 35. We then repeated our analysis
and found that more than half of these restaurants had fewer
than 10 reviews. We again pruned the dataset to include only
restaurants that had 100 or more reviews. This left 389
businesses and 72,441 reviews, an average of 186
reviews per business.
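The pruning steps above amount to counting reviews per business and keeping the businesses that meet a threshold. The sketch below shows this on a toy list with one business id per review; the data layout and the function name are assumptions and do not reflect the actual Yelp dataset schema.

```python
from collections import Counter


def prune_by_review_count(review_business_ids, min_reviews):
    """Keep only businesses with at least min_reviews reviews.

    review_business_ids has one business id per review (hypothetical layout).
    Returns (kept_counts, average_reviews_per_kept_business).
    """
    counts = Counter(review_business_ids)
    kept = {biz: n for biz, n in counts.items() if n >= min_reviews}
    average = sum(kept.values()) / len(kept) if kept else 0.0
    return kept, average


# Toy data: business "a" has 3 reviews, "b" has 1, and "c" has 2.
ids = ["a", "b", "a", "c", "a", "c"]
kept, average = prune_by_review_count(ids, min_reviews=2)
```

With the figures reported above, the same average is 72,441 / 389, or roughly 186 reviews per business.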
At this point we decided that these were the best numbers
we were going to get out of the data. Reducing the set further
to businesses with 500 or more reviews would have left us
with 10 businesses and just over 6000 reviews, representing
less than 1/1000th of the original businesses and 2% of the
reviews.
Given the significant sparsity of the dataset, we decided to
generate our own dataset to evaluate the models that we want
to apply.
B. Dataset Generation
To explore the effect of locality on busyness
degrees, we applied our methods to datasets of 2 categories:
locality-unaware and locality-aware.
Figure 2: Experimental results based on the Mean Squared Error. Panels: (a) Empirical Locality-Unaware MSE, (b) Empirical Locality-Unaware CDF, (c) Empirical Locality-Aware MSE, (d) Empirical Locality-Aware CDF, (e) Normal Locality-Unaware MSE, (f) Normal Locality-Unaware CDF, (g) Normal Locality-Aware MSE, (h) Normal Locality-Aware CDF.
The locality-unaware datasets are designed to show no
correlation between businesses in terms of the number
of reviews or the polarity. They serve
as a control group to check whether differences in the results of
our methods are due to the locality effect.
The locality-aware datasets are generated so that an increase
in the number of reviews for a specific business causes, on
average, a decrease for the other businesses nearby. Since
we have no information about the capacity of the
businesses, we do not model correlation between businesses
in terms of the polarity. Therefore, both kinds of
dataset are locality-unaware in terms of the
polarity.
Both the locality-unaware and the locality-aware datasets are
generated in 2 ways: empirical generation and generation using a normal distribution. Our empirical generation essentially repeats a trend observed over one time period
across multiple periods, while normal-distribution-based
generation draws values from a normal distribution
over the whole time span. As a result, we obtained 4
datasets: empirical locality-unaware, empirical
locality-aware, normal distribution locality-unaware, and
normal distribution locality-aware.
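The exact generation procedure is not specified here, so the following Python sketch is only one hypothetical way to produce normal-distribution datasets of both kinds. It assumes that locality-awareness can be modeled by subtracting, from each business, a fraction (coupling) of the average surplus of the other businesses; the function name and all parameters are illustrative.

```python
import random


def generate_review_levels(num_businesses, num_days, mean, std,
                           coupling=0.0, seed=0):
    """Draw daily review levels for each business from a normal distribution.

    coupling=0 gives independent (locality-unaware) series. With coupling>0,
    each business's value is lowered by the average surplus of the other
    businesses, so a rise at one business lowers the others on average
    (locality-aware). Requires num_businesses >= 2. Values are floats and
    are not clipped at zero, to keep the sketch short.
    """
    rng = random.Random(seed)
    series = [[] for _ in range(num_businesses)]
    for _ in range(num_days):
        noise = [rng.gauss(0.0, std) for _ in range(num_businesses)]
        total = sum(noise)
        for i, e in enumerate(noise):
            others = (total - e) / (num_businesses - 1)
            series[i].append(mean + e - coupling * others)
    return series


unaware = generate_review_levels(4, 28, mean=5.0, std=1.0, coupling=0.0)
aware = generate_review_levels(4, 28, mean=5.0, std=1.0, coupling=1.0)
```

With coupling=1.0 the daily total across businesses stays fixed at num_businesses * mean, which is the strongest form of the "one gains, the others lose" assumption; smaller values weaken the effect.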
C. Methods
We apply 4 methods in our evaluations: 3 are
based on linear regression and the last is a
weighted moving average. The methods are
applied to the number of reviews and the polarity separately;
therefore, we forecast a single dependent variable in
each application. Linear regression is governed by
3 parameters: training set size, number of weeks,
and prediction set size. The training set size determines the
number of observations, and the number of weeks determines
the number of predictors for each observation used to fit the
linear model. The prediction set size is
the number of observations that are predicted by the linear
model and checked for accuracy. Our accuracy measure is
the mean squared error (MSE).
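The roles of the three parameters can be illustrated with a small pure-Python sketch. It assumes ordinary least squares with an intercept, solved through the normal equations; the function names and the toy series are not from the paper, and a singular design (for example, perfectly collinear predictors) would raise a ZeroDivisionError in this simplified solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_model(rows, targets):
    """Least-squares coefficients [intercept, w1, ..., wn] via normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Apply the fitted linear model to one row of predictors."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))


def evaluate(series, num_weeks, train_size, predict_size):
    """Fit on train_size observations; return the MSE on the next predict_size.

    series holds one value per week for a fixed weekday (e.g. every Monday).
    Each observation uses the num_weeks preceding values as its predictors.
    """
    rows = [series[i:i + num_weeks] for i in range(len(series) - num_weeks)]
    targets = series[num_weeks:]
    coeffs = fit_linear_model(rows[:train_size], targets[:train_size])
    test_rows = rows[train_size:train_size + predict_size]
    test_targets = targets[train_size:train_size + predict_size]
    errors = [(predict(coeffs, r) - t) ** 2 for r, t in zip(test_rows, test_targets)]
    return sum(errors) / len(errors)


# Toy series that follows an exact linear recurrence (each value is the sum
# of the two before it), so a 2-predictor linear model can fit it exactly.
toy_series = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
mse = evaluate(toy_series, num_weeks=2, train_size=6, predict_size=2)
```

Here num_weeks fixes the number of predictors per observation, train_size the number of observations used for fitting, and predict_size the number of held-out observations scored by MSE. On real or noisy data the MSE would of course not be near zero as it is for this toy series.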
1) Simple Linear Model: The predictors are the
number of reviews or the polarity values for the
same day of the week in the previous weeks, shown as C1 in
Figure 1. Locality is not used in the predictions. The simple linear
model and the weighted moving average model serve
as baselines for the 2 locality-based models described
next.
2) Linear Locality Total Average Model: In addition
to the predictors of the simple linear model, the number of
reviews for each neighboring business over the same set of days is obtained,
shown as C2 in Figure 1, and their overall average is
added to the linear model. This model therefore has
one more predictor than the simple linear model.
3) Linear Locality Daily Average Model: Unlike
the total average model, this model uses the daily averages
of the number of reviews of the neighboring businesses. As a result, the
number of predictors is 2 times the number of predictors for
the simple linear model.
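The difference between the three sets of predictors can be sketched in Python as follows. The helper names and the toy values are hypothetical; the sketch only builds the predictor row for one observation and leaves the regression itself aside.

```python
def simple_predictors(target_history):
    """C1: the target business's values on the same weekday in the n prior weeks."""
    return list(target_history)


def total_average_predictors(target_history, neighbor_histories):
    """C1 plus one extra predictor: the average over all neighbors and all weeks."""
    values = [v for history in neighbor_histories for v in history]
    return list(target_history) + [sum(values) / len(values)]


def daily_average_predictors(target_history, neighbor_histories):
    """C1 plus, for each prior week, the average across neighbors on that day."""
    k = len(neighbor_histories)
    daily = [sum(day) / k for day in zip(*neighbor_histories)]
    return list(target_history) + daily


target = [4, 6, 5]                   # M1..M3 for the target business
neighbors = [[2, 4, 6], [4, 8, 6]]   # B1 and B2 over the same three days

simple = simple_predictors(target)
total = total_average_predictors(target, neighbors)
daily = daily_average_predictors(target, neighbors)
```

For n prior weeks this yields n, n + 1, and 2n predictors respectively, matching the counts stated above.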
4) Weighted Moving Average Model: A weighted moving
average is applied over a fixed-size history of observations,
with weights assigned so that the most recent observation
has the highest weight.
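A minimal Python sketch of such a forecast is given below. The weighting scheme is not specified in the text, so linearly increasing weights 1, 2, ..., w are assumed; the function name is illustrative.

```python
def weighted_moving_average(history, window):
    """Forecast the next value from the last `window` observations.

    Weights 1, 2, ..., window are assumed, so the most recent
    observation receives the highest weight.
    """
    recent = history[-window:]
    weights = range(1, len(recent) + 1)
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)


# Uses the last three values (20, 30, 40) with weights (1, 2, 3).
forecast = weighted_moving_average([10, 20, 30, 40], window=3)
```

Because recent values count more, the forecast for a rising series lies above the plain average of the window.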
D. Results
The experimental results obtained by applying the 4 methods to the 4 types of datasets are shown in Figure
2. Since the number of reviews is the only factor that we
consider to be affected by locality, our experiments
are conducted only for this element. We believe
that the polarity does not reflect any strong correlation
between businesses in the same neighborhood; therefore,
we would apply the simple linear model to the
polarity, and we do not show results for it. As
mentioned before, our accuracy measure is the mean squared
error (MSE), so our results are a collection of
measurements and evaluations of the MSE. Note that
MSE is used only to compare
methods on the same dataset; different datasets can have
different MSE ranges depending on the generation process.
Figures 2(a), 2(c), 2(e), and 2(g) depict the
MSE measurements for 100 sample datasets, and Figures
2(b), 2(d), 2(f), and 2(h) show the corresponding
cumulative distribution functions (CDFs).
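For reference, an empirical cdf over per-dataset MSE values can be computed as in the sketch below. The function names and the sample values are made up for illustration and are not measurements from the experiments.

```python
def empirical_cdf(values):
    """Pair each sorted value with its rank fraction (i + 1) / n for plotting."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def fraction_at_or_below(values, threshold):
    """Fraction of sample datasets whose MSE does not exceed threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)


mse_samples = [0.8, 0.2, 0.5, 0.3]   # made-up MSE values, one per sample dataset
cdf = empirical_cdf(mse_samples)
share = fraction_at_or_below(mse_samples, 0.5)
```

A method whose cdf curve rises faster reaches low MSE on a larger share of the sample datasets, which is how the cdf panels can be read when comparing methods.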
As expected, the empirical locality-unaware dataset shows
no significant difference between the 3 linear models
in terms of MSE, as shown in Figures 2(a) and 2(b). This is
because the locality effect is not taken into account in
the generation of this dataset.
Although the empirical locality-aware dataset does not
yield a significant cdf improvement for the
locality-based linear models, the linear locality daily average
model shows the most stable MSE behavior of all the
linear models. As shown in Figure 2(c), both the linear
locality total average model and the simple linear model
have some spikes in MSE, but the linear locality
daily average model does not. This suggests that the linear
locality daily average model is the most reliable of the
linear models for this dataset.
For the normal distribution locality-unaware
dataset, it is hard to make a clear distinction in Figure
2(e); the linear models appear similar.
Figure 2(f) shows no significant difference between the 2 locality-based linear models, and these 2
models perform slightly better than the simple linear model.
However, it is hard to draw a firm conclusion here;
most probably this is an artifact of the dataset generation process.
Lastly, the normal distribution locality-aware dataset produces the expected performance distinction between the linear
models, as shown in Figures 2(g) and 2(h). The linear
locality daily average model performs significantly better
than the other 2 linear models, and the linear locality total
average model has better MSE results than the simple linear
model.
Overall, the experimental results show that the locality-based
linear models, especially the linear locality daily
average model, are better tools for predicting future
outcomes when there is strong locality correlation. Apart from
that, the weighted moving average model outperforms the linear
models in each case.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we propose 2 approaches to explore
and benefit from the locality effect among businesses of the
same type in the same neighborhood when forecasting
busyness degrees. We express busyness levels through the
number of reviews and the polarity. Since we believe that
the locality effect between businesses is more relevant
to the number of reviews than to the polarity, we
focused our effort on forecasting the number of reviews.
As methods, we applied the simple linear model and
the weighted moving average model as baselines for our
linear locality total average and linear locality daily average
models.
As a future improvement, busyness level measurements
should be made using the heuristic given in the methodology
section, and the relation between these measurements and
both the number of reviews and the polarity should be
explored.
The weighted moving average could be modified to distribute
weights according to the locality relations.
Other time-series models such as ARMA could be
applied using the locality effect.
One of the difficulties we faced is the significant
sparsity of the dataset, which is a serious obstacle to a
sound analysis. As future work, the locality relation should
be explored in detail by examining trends in different
factors, including the number of reviews, the polarity, and the
ratings, among businesses in the same neighborhood.
The proposed approaches performed reasonably well on
the generated datasets; however, we did not have a chance
to apply them to real-world data. We would also like to
determine whether there is a real effect between businesses.
Since we used our own dataset, our work rests on some
assumptions. Overall, we know what to apply first once
we have real data.
Another challenge we encountered is the lack of publications that focus on prediction while taking the locality
effect into consideration. Therefore, we had to apply some
heuristics in our methodology. Apart from that, forecasting
a concept that is not countable prevents us from obtaining
concrete results in terms of accuracy.
REFERENCES
[1] Sitaram Asur and Bernardo A. Huberman. “Predicting the Future with Social Media”. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. WI-IAT ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 492–499. ISBN: 978-0-7695-4191-4.
[2] Citysearch. 2013. URL: http://www.citysearch.com.
[3] Daniel Gruhl et al. “The predictive power of online chatter”. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD ’05. Chicago, Illinois, USA: ACM, 2005, pp. 78–87. ISBN: 1-59593-135-X.
[4] Yahoo! Local. 2013. URL: http://local.yahoo.com.
[5] MerchantCircle. 2013. URL: http://www.merchantcircle.com.
[6] Le T. Nguyen et al. “Predicting collective sentiment dynamics from time-series social media”. In: Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining. WISDOM ’12. Beijing, China: ACM, 2012, 6:1–6:8. ISBN: 978-1-4503-1543-2.
[7] TripAdvisor. 2013. URL: http://www.tripadvisor.com.
[8] Yelp. 2013. URL: http://www.yelp.com.
[9] Wenbin Zhang and Steven Skiena. “Improving Movie Gross Prediction through News Analysis”. In: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. WI-IAT ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 301–304. ISBN: 978-0-7695-3801-3.