Uploaded by Sai Krishna Lanka

BI.006.11.halfway.decoding airbnb in the big apple

“Decoding Airbnb in the Big Apple”
Team:
MIS 6356 11 (Daksh Goyal, Nitin Kumar, Sai Krishna Lanka, Wen Lou)
Objective:
The aim of this proposal is to analyse the New York City Airbnb dataset and uncover insights into
the sharing economy in one of the biggest cities in the world.
Background:
Airbnb has been one of the most successful companies since its inception in 2008. According to the
Airbnb Newsroom, Airbnb currently has more than 6 million listings in more than 191 countries and
regions and operates in more than 100,000 cities. As one of the most popular cities in the world, New
York City has been one of the hottest markets for Airbnb. With close to 50,000 listings in the city,
Airbnb has become interwoven with the rental landscape within 10 years of its inception. Analyses of
such a dataset would not only provide intuition about rental metrics but also shed some light on the
socio-economic setting of the city.
Dataset:
The secondary dataset is taken from Inside Airbnb, which provides a non-commercial set of tools and
data that allows us to explore how Airbnb is really being used in cities around the world. The New York
Airbnb dataset was compiled on 12 September 2019.
Completed Tasks:
Task:
Analyse the growth of the Airbnb network in New York.
Response:
For this visualization task, we plotted a map of the hosts that have registered with Airbnb in New York
City from 2008 to the present. The plots suggest that Airbnb listings have risen steeply and
continuously, and that their popularity did not stall at any point.
Task:
Find the median price of Airbnb listings across boroughs and neighbourhoods.
Response:
Since quantities involving price or income generally tend to be right-skewed (outliers on the higher
end), the median is considered the best measure of central tendency.
To capture the data well, we take the logarithm of the listing price per single guest and draw a
boxplot for each of the five boroughs of New York City. Consistent with the intuition of a high cost of
living in Manhattan and Brooklyn, the Airbnb prices there are in line with expectations.
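The borough comparison above can be sketched as follows. This is a minimal illustration, not the authors' code: the rows are hypothetical toy values, the natural logarithm is an assumption (the report does not state the base), and price per guest is assumed to be “price” divided by “guests_included”.

```python
import math
from statistics import median

# Hypothetical toy rows: (borough, nightly price in USD, guests included).
listings = [
    ("Manhattan", 200.0, 2),
    ("Manhattan", 150.0, 1),
    ("Manhattan", 900.0, 3),
    ("Brooklyn", 120.0, 2),
    ("Brooklyn", 80.0, 1),
    ("Queens", 60.0, 1),
    ("Queens", 90.0, 2),
]


def log_price_per_guest(price, guests):
    """Natural log of the price for a single guest (assumed definition)."""
    return math.log(price / guests)


def median_log_price_by_borough(rows):
    """Group log price per guest by borough and take the median of each group."""
    groups = {}
    for borough, price, guests in rows:
        groups.setdefault(borough, []).append(log_price_per_guest(price, guests))
    return {borough: median(values) for borough, values in groups.items()}


medians = median_log_price_by_borough(listings)
for borough, value in sorted(medians.items()):
    print(f"{borough}: median log price per guest = {value:.3f}")
```

The same per-borough groups are what a boxplot would summarise; the median is used because a single very expensive listing shifts it far less than it shifts the mean.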
For a more in-depth pricing analysis of the neighbourhoods in each borough, we plot a heatmap of
prices on the map of New York City for neighbourhoods having a minimum of 10 listings. This provides
crucial insights into the median price range of neighbourhoods. The region around the East River,
including North Brooklyn and South Manhattan, is the costliest.
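The neighbourhood filter described above (keep only neighbourhoods with at least 10 listings, then take the median price) might look like the sketch below. The function name, the `min_listings` parameter and the prices are illustrative assumptions; only the threshold of 10 comes from the report.

```python
from statistics import median


def median_price_by_neighbourhood(rows, min_listings=10):
    """Median price per neighbourhood, skipping neighbourhoods with too few listings."""
    groups = {}
    for neighbourhood, price in rows:
        groups.setdefault(neighbourhood, []).append(price)
    return {
        name: median(prices)
        for name, prices in groups.items()
        if len(prices) >= min_listings
    }


# Hypothetical toy data: one neighbourhood with 10 listings, one with only 2.
rows = [("Williamsburg", 100.0 + 10 * i) for i in range(10)]
rows += [("Tribeca", 300.0), ("Tribeca", 500.0)]

result = median_price_by_neighbourhood(rows)
print(result)  # {'Williamsburg': 145.0}
```

Dropping sparsely listed neighbourhoods keeps a handful of unusual listings from dominating the colour of a region on the heatmap.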
Task:
Find the most common amenities provided across Airbnb listings.
Response:
Each listing includes a set of amenities that are available to guests. These amenities range from
basic necessities such as air conditioning to luxuries such as electric profiling beds. The picture on
the right shows the amenities provided in the first two listings. Our aim is to determine a set of five
common amenities that are mentioned across Airbnb listings. This provides business value for hosts who
are willing to sublet their apartment and who should include such amenities in their listing.
We are only interested in the support of the amenities, as other measures such as confidence and
lift would not make much sense in this scenario. From the results, we can clearly observe that
listings should have air conditioning, heating, a kitchen, a smoke detector, essentials (such as bathing
essentials and mattresses) and, most importantly, “WIFI”. In fact, more than 70% of listings have this
set of amenities.
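Support here is the fraction of listings that contain every amenity in a given set. A minimal sketch of that calculation follows; the four listings are hypothetical toy data and the helper name `support` is an illustrative choice, not taken from the report.

```python
def support(listings, itemset):
    """Fraction of listings whose amenities contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for amenities in listings if itemset <= amenities)
    return hits / len(listings)


# Hypothetical toy data: each listing is a set of amenity names.
listings = [
    {"Wifi", "Heating", "Kitchen", "Air conditioning"},
    {"Wifi", "Heating", "Kitchen"},
    {"Wifi", "Heating", "Smoke detector"},
    {"Wifi", "Kitchen", "Essentials"},
]

wifi_support = support(listings, {"Wifi"})             # 4 of 4 listings -> 1.0
pair_support = support(listings, {"Wifi", "Heating"})  # 3 of 4 listings -> 0.75
print(wifi_support, pair_support)
```

Confidence and lift describe rules of the form "A implies B"; since the goal is only to find which amenities co-occur most often, support alone is enough.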
Task:
What does it take to be an Airbnb Super Host?
Response:
According to Airbnb, hosts must meet the following criteria to become a Super Host:
• 4.8+ overall rating
• 10+ stays
• <1% cancellation rate
• 90% response rate
For our analysis, we would like to find the characteristics of a Super Host using the parameters in our
dataframe. The dataframe is cleaned under the following assumptions:
• Out of 48377 records, only 27123 records are complete cases, i.e., do not have any null
values.
• We do not consider 10+ stays and <1% cancellation rate in our analysis, as we do not have data
from Airbnb on those attributes.
• We introduce a new parameter “review_frequency_per_month”, which is a function of
three attributes, namely “first_review”, “last_review” and “number_of_reviews”. The
formula is given by,
$$\text{review\_frequency\_per\_month} = \frac{\text{number\_of\_reviews}}{\text{month}(\text{last\_review} - \text{first\_review})}$$

This attribute combines the three review attributes into a single metric for our analysis.
• We drop “first_review”, “last_review” and “number_of_reviews” from the dataframe.
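The derived attribute can be computed as in the sketch below. The report does not say how the month difference is measured, so two assumptions are made here and should be read as such: months are counted as calendar-month differences, and a span shorter than one month is treated as one month to avoid dividing by zero.

```python
from datetime import date


def months_between(first_review, last_review):
    """Calendar-month difference, floored at 1 (assumption: avoids division by zero)."""
    months = (last_review.year - first_review.year) * 12
    months += last_review.month - first_review.month
    return max(months, 1)


def review_frequency_per_month(number_of_reviews, first_review, last_review):
    """Reviews received per month between the first and the last review."""
    return number_of_reviews / months_between(first_review, last_review)


# Hypothetical listing: 24 reviews spread over 12 months.
freq = review_frequency_per_month(24, date(2018, 9, 1), date(2019, 9, 1))
print(freq)  # 2.0
```

Once this value is stored, the three source columns carry no additional information for the model, which is why they can be dropped.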
We split the data into 70% training and 30% validation to develop a classification tree and measure its
performance.
The decision tree shows that the most important features are “review_score_rating”,
“host_response_rate” and “review_frequency_per_month”. With 78.6% accuracy on the training set and
77.92% accuracy on the validation set, the model classifies Super Hosts reasonably well.
[Figures: classification results on the training set and the validation set]
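The split-and-evaluate workflow can be illustrated with the standard library alone. The sketch below is not the authors' model: it fits a one-split tree (a decision stump) chosen by Gini impurity, uses hypothetical toy rows, and assumes a 0 to 100 rating scale with a cut-off of 96 purely to generate labels.

```python
import random


def train_valid_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    for feature in rows[0][0]:
        for threshold in sorted({f[feature] for f, _ in rows}):
            left = [y for f, y in rows if f[feature] < threshold]
            right = [y for f, y in rows if f[feature] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                # Each side predicts its majority class (ties go to class 1).
                left_label = int(2 * sum(left) >= len(left))
                right_label = int(2 * sum(right) >= len(right))
                best = (score, feature, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]


def predict(stump, features):
    feature, threshold, left_label, right_label = stump
    return left_label if features[feature] < threshold else right_label


def accuracy(stump, rows):
    return sum(predict(stump, f) == y for f, y in rows) / len(rows)


# Hypothetical toy rows: (features, is_superhost).
rows = [
    ({"review_score_rating": r, "host_response_rate": 100 - (r % 7)}, int(r >= 96))
    for r in range(80, 100)
]
train, valid = train_valid_split(rows)
stump = fit_stump(train)
train_acc = accuracy(stump, train)
valid_acc = accuracy(stump, valid)
print(stump, train_acc, valid_acc)
```

Comparing the two accuracies is the point of holding out 30% of the data: a large gap between them would indicate overfitting, whereas the report's 78.6% and 77.92% are close.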
The second part of our analysis classifies Super Hosts using a random forest. With 80% accuracy on
the validation set, the random forest improves only marginally on the decision tree developed earlier
(77.92%). Since our aim is to provide a simple classification model, we stick with the decision tree,
which classifies well despite its simple design.
The random forest also provides a plot of the features that matter most for classification.
Supporting our simple model, the three features “review_score_rating”, “host_response_rate” and
“review_frequency_per_month” tend to be the most important predictors.
To-do Task:
• Predict price based on “neighborhood_group_cleansed” (Manhattan and Brooklyn),
“neighborhood_cleansed” (include only Manhattan and Brooklyn), “property_type”
(prominent ones), “room_type” (= Entire home/apt) with 1 bedroom and 1 bathroom
• Find correlation between point 5 and https://www.zumper.com/blog/2019/01/mapped-newyork-city-neighborhood-rent-prices-winter-2019/ in Manhattan and Brooklyn (real estate
setting)
• Word cloud generator for “host_about”
• Word cloud generator for different ranges of “review_scores_ratings” (sampling few ids from
the dataset per review bin)
• Recommender system for listings based on “neighborhood_group_cleansed”,
“neighborhood_cleansed” (2-mile radius), total number of guests (“guests_included”), price
per guest (“price”/“guests_included”), length of stay (>= “minimum_nights”)
References:
• Get the Data. Retrieved from http://insideairbnb.com/get-the-data.html
• About Us. Retrieved from https://press.airbnb.com/en-us/about-us/
• About Inside Airbnb. Retrieved from http://insideairbnb.com/about.html
• Airbnb Superhost. Retrieved from https://www.airbnb.com/superhost
• Airbnb Rental Listings Dataset Mining. Retrieved from https://towardsdatascience.com/airbnb-rental-listings-dataset-mining-f972ed08ddec
• New York Maps. Retrieved from https://rpubs.com/jhofman/nycmaps