Predict the booking price for a new property on Airbnb Seattle

Dec 8, 2020

Topics:

Introduction
What is the data we have?
What is our goal?
Business understanding of the data
Data cleaning
Correlation between data
Model building, evaluation, and prediction
Summary

Introduction

In this blog, we will analyze the Seattle Airbnb dataset from Kaggle and answer a few questions for various stakeholders. The original data is available here. The dataset can be used to predict either the occupancy of a property or the best price for a listing for the upcoming month or year. Here are the stakeholders who can use this dataset:

An existing host who already has one or more properties listed on Airbnb.
A new host who wants to list his/her property on Airbnb.
Airbnb itself, which wants to know how the listed properties will perform in the upcoming month/year. It can further use this to make recommendations to users or to give relevant search results.
Customers of Airbnb. Since this is open data, they can use it to book a property for the best price or to find the best spot to stay in Seattle.

In this blog, we will look at the data from the point of view of hosts who are adding new properties (the first two stakeholders above).

What is the data we have?

Reviews: This file has all the reviews for each of the listings. In this project, we will not be using the reviews file.
Listings: This file has one record per ‘id’. It has details of the listing such as the number of rooms, neighborhood details, description, and so on. There are 3818 listings present here.
Calendar: This is the daily booking data and has 365 records for each listing. Each property can be identified using “listing_id”, and the unique identifier of this file is listing_id + date. The file covers 4 January 2016 to 2 January 2017, so it has one full year of daily data. When a listing is available for booking, it has a price; otherwise the price is null.
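The structure described for the calendar file can be sanity-checked with a few lines of pandas. This is a minimal sketch on a toy stand-in for the file: the column names follow the ones mentioned above, while the values (and the numeric price type) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the calendar file (values are invented for illustration).
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2016-06-01", "2016-06-02", "2016-06-01", "2016-06-02"],
    "available": ["t", "f", "t", "t"],
    "price": [85.0, None, 1250.0, 1250.0],
})

# listing_id + date should identify every row uniquely.
has_duplicate_keys = calendar.duplicated(subset=["listing_id", "date"]).any()

# The price should be missing exactly when the listing is not available.
price_missing_iff_unavailable = (
    calendar["price"].isna() == (calendar["available"] == "f")
).all()

# Number of daily records per listing (365 in the real file, 2 in this toy).
records_per_listing = calendar.groupby("listing_id").size()
```

On the real file, the same three checks would confirm the key, the null-price rule, and the number of records per listing.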

What is our goal?

While we can predict various parameters, this blog explores how a host can use the historical booking data to predict the booking price for a property that is available for booking. The project aims to create a prediction model for new or existing hosts who want to add new properties to Airbnb. Hence, we will not use the “reviews” file, as new properties will not have any reviews yet. Along with predicting the booking price, we will also answer a few questions to get a business understanding of the data.

Business understanding of the data

The questions below give a business understanding of the data.

How are prices distributed across all the Airbnb Seattle properties?

From the above image, we can see that the booking price is strongly right-skewed. While most of the listed properties are priced around $100, there are some very expensive properties that cost more than $1000. We need to look into these further.

What are the most expensive/ cheapest neighbourhoods in Seattle?

Since there are numerous neighbourhoods in Seattle, we look at neighbourhood groups instead to see which are the most expensive. The graph below shows the mean listing price of properties in Seattle per neighbourhood group.

Magnolia is the most expensive place to stay with a mean price per night going above $175. Queen Anne and Downtown are just behind with a mean price just above $150. Delridge seems to be very cheap for renting Airbnb rooms/ apartments with a mean price of just above $75. Rainier Valley and Northgate are cheap too.
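A ranking like this can be produced with a simple group-by. The sketch below uses a toy listings table; the column name `neighbourhood_group` and all prices are assumptions made for illustration, not values from the real file.

```python
import pandas as pd

# Toy listings table; column names and prices are illustrative assumptions.
listings = pd.DataFrame({
    "neighbourhood_group": ["Magnolia", "Magnolia", "Delridge", "Delridge", "Downtown"],
    "price": [200.0, 160.0, 70.0, 90.0, 155.0],
})

# Mean nightly price per neighbourhood group, most expensive first.
mean_price = (
    listings.groupby("neighbourhood_group")["price"]
    .mean()
    .sort_values(ascending=False)
)

most_expensive = mean_price.index[0]
cheapest = mean_price.index[-1]
```

Plotting `mean_price` as a bar chart would give a graph of the kind described above.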

Assuming that a more expensive neighbourhood has more tourist attractions nearby, here are the top 10 neighbourhoods to stay in Seattle.

Do some property types cost more than others?

Looking at the image below, we can say that boats are significantly more expensive to stay in than other property types, while dormitories are the cheapest.

How do bookings vary by month?

We do not use the records where available = ‘f’ for the prediction, but we can use them to understand when properties are booked the most. This treats an unavailable day as a booked one, which is an approximation because a host may also block dates.

From the above image, it can be inferred that January is the month when the most properties are booked. December seems to be the month when the most properties are available for booking.

Is there a peak season for Airbnb bookings?

While January had the most Airbnb bookings, December and February did not see many. Hence, winter as a whole is not a peak season for Seattle Airbnb. Instead, summer seems to be the time when most Airbnb properties are booked. This is probably because the rain in Seattle eases during June, July, and August, so tourists can travel and enjoy the natural beauty of the city.
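The monthly view can be sketched as the share of records with available = ‘f’ per month. The toy calendar below is invented for illustration, and reading ‘f’ as "booked" is the same approximation used in the analysis above.

```python
import pandas as pd

# Toy calendar; dates and availability flags are invented for illustration.
calendar = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-01-10", "2016-01-11",
        "2016-07-10", "2016-07-11",
        "2016-12-10", "2016-12-11",
    ]),
    "available": ["f", "f", "f", "t", "t", "t"],
})
calendar["month"] = calendar["date"].dt.month

# Share of listing-days that are unavailable (used as a proxy for booked).
booked_share = (calendar["available"] == "f").groupby(calendar["month"]).mean()
```

Grouping by a season column instead of the month would give the corresponding seasonal comparison.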

Data cleaning

Calendar file: The following steps are taken as part of data cleaning.

‘month’ column: Extract month of the year, day of the week, season from the ‘date’ column and drop the original ‘date’ column. Also, convert the month column to categorical by mapping numbers to names.

import pandas as pd

# Parse the date column once and derive the calendar features from it
dates = pd.to_datetime(calendar['date'])
calendar['month'] = dates.dt.month
# dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
calendar['day'] = dates.dt.day_name()

# Map month numbers to seasons
map_season = {3:'spring', 4:'spring', 5:'spring',
              6:'summer', 7:'summer', 8:'summer',
              9:'autumn', 10:'autumn', 11:'autumn',
              12:'winter', 1:'winter', 2:'winter'}
calendar['season'] = calendar['month'].map(map_season)

‘available’ column: It has the value ‘t’ when the property is available for booking and ‘f’ when it is not. A property may be unavailable either because it is already booked or because the host has blocked it, and the data does not tell us which. All records with “available” = “f” are therefore dropped. Since every remaining record has “available” = “t”, the column itself is dropped as well.
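The filtering step can be sketched as follows, again on a toy calendar with invented values:

```python
import pandas as pd

# Toy calendar; values are invented for illustration.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "available": ["t", "f", "t"],
    "price": [85.0, None, 120.0],
})

# Keep only the bookable days, then drop the now-constant column.
calendar = (
    calendar[calendar["available"] == "t"]
    .drop(columns="available")
    .reset_index(drop=True)
)
```

After this step every remaining row has a price, which is what the model needs as its target.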

‘price’ column: This is the price at which the property can be booked on a particular day. Note that there is another price column in the “listings” file, which is the price at which the host has listed the property on Airbnb. The two prices may or may not be the same, since prices can go up or down with the season. Since our goal is to predict the booking price, this “price” column is renamed to “target”. Because the column is highly skewed, a log transformation is applied, after which its distribution looks much closer to normal.
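A minimal sketch of the rename and log transformation is shown below. The prices are toy values, and the choice of `np.log1p` (log of 1 + x) is an assumption; the exact log variant used in the project is not stated here.

```python
import numpy as np
import pandas as pd

# Toy prices with one expensive outlier, mimicking a right-skewed column.
calendar = pd.DataFrame({"price": [50.0, 80.0, 100.0, 120.0, 150.0, 1000.0]})

# Rename the booking price to "target".
calendar = calendar.rename(columns={"price": "target"})

skew_before = calendar["target"].skew()

# log1p is an assumed choice; it also handles a price of 0 safely.
calendar["target"] = np.log1p(calendar["target"])
skew_after = calendar["target"].skew()

# Predictions made on the log scale can be mapped back with expm1.
recovered_prices = np.expm1(calendar["target"])
```

Because the model is trained on the transformed target, its predictions need the inverse transformation before they can be read as prices.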

All categorical columns are converted to dummies and original columns are dropped.
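For the dummy encoding, `pd.get_dummies` with the `columns` argument creates the indicator columns and drops the originals in one call. The toy frame below is illustrative only.

```python
import pandas as pd

# Toy calendar with two categorical columns; values are illustrative.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "season": ["winter", "summer", "summer"],
    "day": ["Monday", "Tuesday", "Monday"],
})

# Each listed column is replaced by one 0/1 column per category.
encoded = pd.get_dummies(calendar, columns=["season", "day"], dtype=int)
```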

Listings file: The following steps are taken as part of data cleaning.

Rename “id” to “listing_id” in the listings file so that it is in sync with the calendar’s file.
Replace null values with mode for categorical columns and mean for numerical columns.
Drop all *review* columns. A new property will not have any reviews, so keeping these columns as independent variables to predict the booking price of a brand-new property does not make sense.
Columns with values ‘t’ and ‘f’ are mapped to 1 and 0 respectively.
A new column ‘total_rooms’ is created as a sum of bedrooms and bathrooms.
Drop any irrelevant columns. Since text analysis has not been performed in this project, all columns with heavy text data have been dropped.
Even though the ‘price’ column in the listings file will have a very high correlation with the target, we make the prediction in this project without using it. Hence, ‘price’ is dropped from the listings data.
Create dummies for categorical columns.
Create separate columns for each amenity present in the ‘amenities’ column. If a particular amenity is present, map it to 1 or else 0. Drop the original column ‘amenities’. Also, a new column ‘total_amenities’ is created to hold the total number of amenities available in each property.
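Several of the listings steps above can be sketched together on a toy table. The column names other than ‘id’, ‘bedrooms’, ‘bathrooms’, and ‘amenities’, the `amenity_` prefix, and the brace-delimited amenities format are assumptions made for illustration. The naive comma split would also mis-handle an amenity whose name itself contains a comma.

```python
import pandas as pd

# Toy listings table; values and the amenities format are assumptions.
listings = pd.DataFrame({
    "id": [10, 11, 12],
    "bedrooms": [1.0, None, 3.0],
    "bathrooms": [1.0, 2.0, 1.0],
    "room_type": ["Entire home/apt", None, "Entire home/apt"],
    "instant_bookable": ["t", "f", "t"],
    "amenities": ['{TV,"Wireless Internet"}', "{}", "{TV,Kitchen}"],
})

# Align the key with the calendar file.
listings = listings.rename(columns={"id": "listing_id"})

# Impute: mean for numerical columns, mode for categorical columns.
listings["bedrooms"] = listings["bedrooms"].fillna(listings["bedrooms"].mean())
listings["room_type"] = listings["room_type"].fillna(listings["room_type"].mode()[0])

# Map 't'/'f' flags to 1/0 and build the combined room count.
listings["instant_bookable"] = listings["instant_bookable"].map({"t": 1, "f": 0})
listings["total_rooms"] = listings["bedrooms"] + listings["bathrooms"]


def parse_amenities(raw):
    """Turn '{TV,"Wireless Internet"}' into ['TV', 'Wireless Internet']."""
    items = raw.strip("{}").replace('"', "").split(",")
    return [item.strip() for item in items if item.strip()]


amenity_lists = listings["amenities"].apply(parse_amenities)

# One 0/1 column per amenity, plus a total count, then drop the original.
for amenity in sorted({a for items in amenity_lists for a in items}):
    listings["amenity_" + amenity] = amenity_lists.apply(
        lambda items, a=amenity: int(a in items)
    )
listings["total_amenities"] = amenity_lists.apply(len)
listings = listings.drop(columns="amenities")
```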

Correlation between data

From the heatmap below, we can see that several numerical columns have a high correlation with the target (booking price): bedrooms, beds, and total rooms.
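The numbers behind such a heatmap come from a correlation matrix. Here is a small sketch on invented data that ranks columns by their correlation with the target; passing `df.corr()` to a plotting function such as seaborn's `heatmap` would draw the figure itself.

```python
import pandas as pd

# Toy merged data; all values are invented for illustration.
df = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 5],
    "beds": [1, 2, 4, 4, 6],
    "minimum_nights": [3, 1, 2, 1, 3],
    "target": [4.2, 4.6, 5.0, 5.3, 5.8],
})

# Pearson correlation of every numerical column with the target.
corr_with_target = (
    df.corr()["target"].drop("target").sort_values(ascending=False)
)
```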

Model building, evaluation, and prediction

I have used one linear model (Linear Regression) and one ensemble model (Random Forest Regressor) to train and test on the Airbnb dataset. The data was split into train and test sets, with the test set being 25% of the entire dataset.
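The setup described above can be sketched with scikit-learn as follows. The data here is synthetic, and the hyperparameters (`n_estimators`, `random_state`) are assumptions, so the scores it produces say nothing about the results on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and log-price target, purely for illustration.
rng = np.random.default_rng(0)
n = 400
bedrooms = rng.integers(1, 5, size=n)
entire_home = rng.integers(0, 2, size=n)
y = (
    4.0
    + 0.3 * bedrooms
    + 0.5 * entire_home * (bedrooms > 2)  # non-linear interaction
    + rng.normal(0, 0.05, size=n)
)
X = np.column_stack([bedrooms, entire_home])

# Hold out 25% of the data for testing, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and record R² on both splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": r2_score(y_train, model.predict(X_train)),
        "test": r2_score(y_test, model.predict(X_test)),
    }
```

Comparing the train and test R² of each model is a quick way to spot overfitting before trusting either score.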

Here is the result of the Linear Regression model:

Here is the result of the Random Forest Regressor:

The relationship between the target (dependent) variable and the independent variables does not appear to be strongly linear, which would explain the relatively low R² score of 0.75. The Random Forest Regressor has performed very well on both the train and test datasets. We do not have any new data to make fresh predictions on, so we wrap up the project here.

Below are the most important features that influence the booking price, according to the Random Forest Regressor model.
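One way such a ranking can be obtained is from the fitted model's `feature_importances_` attribute. The sketch below trains on synthetic data with hypothetical column names, so the ordering it produces is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; column names are hypothetical stand-ins.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "total_rooms": rng.integers(1, 8, size=n),
    "room_type_Entire home/apt": rng.integers(0, 2, size=n),
    "noise_feature": rng.normal(size=n),
})
y = (
    4.0
    + 0.25 * X["total_rooms"]
    + 0.6 * X["room_type_Entire home/apt"]
    + rng.normal(0, 0.05, size=n)
)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; larger means more influence.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

These importances are impurity-based, which can favour features with many distinct values, so they are best read as a rough ranking.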

Summary

Using the Random Forest Regressor model, we should be able to predict the booking price accurately when fresh listings and calendar data are given. We have inferred that the most important factors determining the booking price are the room type (entire home/apartment) and the total number of rooms (bedrooms plus bathrooms).

You can find the complete code here on GitHub.