A Comparative Analysis of Ridge Regression and Random Forest Models for Predicting Airbnb Prices with Exploratory Data Analysis¶

by

Imonikhe Ayeni¶

Abstract¶

This study conducts a comprehensive analysis of Airbnb pricing predictions using machine learning techniques, specifically comparing Ridge Regression and Random Forest models. The research utilizes a dataset containing key features such as neighborhood, location coordinates, room type, minimum nights, review metrics, and availability. Through extensive exploratory data analysis, we identified and preprocessed critical pricing factors, including the logarithmic transformation of price data to handle skewness.

The comparative analysis reveals that the Random Forest model outperforms Ridge Regression across all performance metrics. The Random Forest achieved an R-squared value of 0.585, indicating it explains 58.5% of price variance, compared to Ridge Regression's 52.6%. Furthermore, the Random Forest model demonstrated superior accuracy with a Mean Absolute Error of 0.317 (in log scale) versus Ridge's 0.346, and a Root Mean Squared Error of 0.450 compared to Ridge's 0.482.

Feature importance analysis from the Random Forest model provides insights into the key determinants of Airbnb pricing, helping hosts and stakeholders better understand price influencing factors. The study employs robust preprocessing techniques, including one-hot encoding for categorical variables and mean imputation for missing values. These findings contribute to the growing body of research on data-driven pricing strategies in the sharing economy and provide practical insights for Airbnb stakeholders.

Introduction¶

The sharing economy has revolutionized traditional business models, with Airbnb emerging as a dominant force in the hospitality industry. Since its inception in 2008, Airbnb has transformed the way people find and book accommodations worldwide, creating a complex marketplace where pricing decisions significantly impact both hosts and guests. Understanding and predicting property prices on Airbnb has become increasingly crucial for hosts seeking to maximize their revenue and for guests looking for fair market values.

This research focuses on developing and comparing machine learning models to predict Airbnb prices, specifically examining the effectiveness of Ridge Regression and Random Forest algorithms. The study utilizes a dataset encompassing various property characteristics, including location data (neighborhood groups, specific neighborhoods, latitude, and longitude), property features (room type, minimum nights), and performance metrics (number of reviews, reviews per month, host listing count, and availability).

The challenge of accurate price prediction in the Airbnb market is particularly complex due to several factors:

  1. Dynamic market conditions that affect pricing
  2. Geographic variations in property values
  3. Diverse property characteristics and amenities
  4. Seasonal fluctuations in demand
  5. Host-specific pricing strategies

Our approach combines exploratory data analysis with advanced machine-learning techniques to address these challenges. By logarithmically transforming the price data, we account for the typical right-skewed distribution of property prices. The comparative analysis of Ridge Regression and Random Forest models provides insights into both linear and non-linear relationships within the data, while feature importance analysis helps identify the most significant factors influencing price determination.

This research contributes to the growing literature on data-driven pricing strategies in the sharing economy and offers practical applications for Airbnb hosts and potential investors. The findings can help hosts optimize their pricing strategies and assist guests in understanding fair market values across different locations and property types.

Aim and Objectives¶

By combining EDA and robust machine learning models, this study aims to deliver a comprehensive analysis of Airbnb price prediction and provide actionable insights for improving prediction accuracy. This project shall achieve this by focusing on five key objectives:

  1. Analyze the dataset to uncover patterns, trends, and relationships between features and prices while addressing data quality issues such as missing values and outliers.
  2. Build and evaluate machine learning models (Ridge Regression and Random Forest) to accurately predict Airbnb listing prices based on relevant features.
  3. Evaluate and compare Ridge Regression and Random Forest using metrics such as MAE, MSE, RMSE, and R² to determine which model performs better for this task.
  4. Assess which features most significantly influence Airbnb prices, providing actionable insights into factors such as location, room type, and availability.
  5. Provide data-driven recommendations to stakeholders (e.g., hosts, analysts, and platform managers) to help optimize pricing strategies.
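The evaluation metrics named in objective 3 (MAE, MSE, RMSE, and R²) can be computed directly from predictions. Below is a minimal numpy sketch with toy values, not the Airbnb data:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))            # mean absolute error
    mse = np.mean(errors ** 2)               # mean squared error
    rmse = np.sqrt(mse)                      # root mean squared error
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                 # coefficient of determination
    return mae, mse, rmse, r2

# Toy example: every prediction is off by a constant 0.5
mae, mse, rmse, r2 = regression_metrics([4.0, 5.0, 6.0], [4.5, 5.5, 6.5])
```

In practice scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` (imported later in this notebook) compute the same quantities.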
In [89]:
!pip install category_encoders
Requirement already satisfied: category_encoders in c:\users\user\anaconda3\lib\site-packages (2.8.0)
In [90]:
#Import necessary libraries:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from category_encoders import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
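The imports above are the building blocks of a preprocessing-plus-model pipeline. As a rough sketch of how they might combine (this uses scikit-learn's own `OneHotEncoder` with a `ColumnTransformer` in place of `category_encoders`, and a small hypothetical frame rather than the real listing columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical mini-dataset mirroring the Airbnb feature layout
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "minimum_nights": [1, 2, None, 3],
    "log_price": [4.6, 5.4, 4.9, 4.0],
})

# One-hot encode the categorical column; mean-impute the numeric one
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ("impute", SimpleImputer(strategy="mean"), ["minimum_nights"]),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(df[["room_type", "minimum_nights"]], df["log_price"])
preds = model.predict(df[["room_type", "minimum_nights"]])
```

The same pattern works with `category_encoders.OneHotEncoder` inside `make_pipeline`, which accepts DataFrames directly without a `ColumnTransformer`.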

Load and Explore the Data¶

In [153]:
airbnb_df = pd.read_csv(r"C:\Users\User\Desktop\Projects\AIRBNB\AB_NYC_2019.csv") #load data
In [155]:
airbnb_df.head()
Out[155]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
In [156]:
# Display basic information about the dataset
print(airbnb_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
None
In [157]:
airbnb_df.shape
Out[157]:
(48895, 16)
In [158]:
airbnb_df.describe() #Basic statistics about data
Out[158]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
count 4.889500e+04 4.889500e+04 48895.000000 48895.000000 48895.000000 48895.000000 48895.000000 38843.000000 48895.000000 48895.000000
mean 1.901714e+07 6.762001e+07 40.728949 -73.952170 152.720687 7.029962 23.274466 1.373221 7.143982 112.781327
std 1.098311e+07 7.861097e+07 0.054530 0.046157 240.154170 20.510550 44.550582 1.680442 32.952519 131.622289
min 2.539000e+03 2.438000e+03 40.499790 -74.244420 0.000000 1.000000 0.000000 0.010000 1.000000 0.000000
25% 9.471945e+06 7.822033e+06 40.690100 -73.983070 69.000000 1.000000 1.000000 0.190000 1.000000 0.000000
50% 1.967728e+07 3.079382e+07 40.723070 -73.955680 106.000000 3.000000 5.000000 0.720000 1.000000 45.000000
75% 2.915218e+07 1.074344e+08 40.763115 -73.936275 175.000000 5.000000 24.000000 2.020000 2.000000 227.000000
max 3.648724e+07 2.743213e+08 40.913060 -73.712990 10000.000000 1250.000000 629.000000 58.500000 327.000000 365.000000
In [159]:
# Check for missing values
airbnb_df.isnull().sum()
Out[159]:
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

Data Cleaning and Preprocessing¶

Drop Redundant Columns

In [162]:
airbnb_df.drop(columns=['id', 'name', 'host_id', 'host_name',  'last_review'], inplace= True) #Drop redundant columns

Handle Missing Values

In [164]:
# Check for missing values
airbnb_df.isnull().sum()
Out[164]:
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

After dropping the redundant columns, only one column (reviews_per_month) still has missing values: 10,052 in total. I will fill these with 0, since a missing value implies the listing has received no reviews.

In [166]:
airbnb_df['reviews_per_month'] = airbnb_df['reviews_per_month'].fillna(0) # This code will fill the missing values with 0
In [167]:
airbnb_df["reviews_per_month"].isna().sum() #Check for null values in the column of interest alone
Out[167]:
0
In [168]:
airbnb_df.isnull().sum() # checking for null values in the entire dataset
Out[168]:
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

Exploratory Data Analysis¶

In [170]:
airbnb_df.columns
Out[170]:
Index(['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
       'room_type', 'price', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Spatial Count of Airbnb Listings in New York

The table and graph below show the five neighbourhood groups in New York City. Manhattan has the highest number of Airbnb listings (21,661), and Staten Island has the fewest (373).

In [172]:
airbnb_df['neighbourhood_group'].value_counts(ascending=False)
Out[172]:
neighbourhood_group
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: count, dtype: int64

Graph Showing Spatial Count of Airbnb Listings in New York¶

In [174]:
plt.figure(figsize=(10, 6))
sns.countplot(x='neighbourhood_group', data=airbnb_df, palette="husl", order=airbnb_df['neighbourhood_group'].value_counts().index)
plt.title('Count of Listings by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count of Listings')
# Display values on top of each bar
# Get the counts for each neighborhood group
counts = airbnb_df['neighbourhood_group'].value_counts()
for i, value in enumerate(counts.index):
    count = counts[value]  # Get the count for the current neighborhood group
    plt.text(i, count + 1, f'{count:.0f}', ha='center', va='bottom') # Display the count as an integer
plt.show()
[Figure: count of listings by neighbourhood group]

Spatial Distribution of Airbnb Listings by Neighborhood Group in New York¶

The figure below shows the geographical spread of Airbnb listings across New York City.

In [176]:
# Spatial Distribution of Airbnb Listings
map = px.scatter_mapbox(
    airbnb_df,  # Our DataFrame
    lat='latitude',  # Latitude column
    lon='longitude',  # Longitude column
    color='neighbourhood_group',  # Different colors for each neighbourhood group
    center={"lat": 40.75362, "lon": -73.98377},  # Map will be centered on Midtown, Manhattan
    width=1000,  # Width of map
    height=800,  # Height of map
    hover_data=["neighbourhood"],  # When you hover your mouse over the house, it will display the neighborhood.
    size_max=5,  # Maximum marker size
    opacity=0.5,  # Adjust opacity for better visualization
    zoom=10  # Adjust the zoom level
)
# Add mapbox_style to figure layout
map.update_layout(mapbox_style="open-street-map")

# Show figure
map.show()

Spatial Distribution of Airbnb Listings by Neighbourhood in Manhattan¶

In [178]:
# Spatial Distribution airbnb Listing in Manhattan
filtered_data = airbnb_df[airbnb_df['neighbourhood_group'] == 'Manhattan']

map = px.scatter_mapbox(
    filtered_data,
    lat='latitude',
    lon='longitude',
    center={"lat": 40.75362, "lon": -73.98377},
    width=800,
    height=800,
    color='neighbourhood',
    hover_data=["price"],
    size_max=5,
    opacity=0.5,
)
# Add mapbox_style to figure layout
map.update_layout(mapbox_style="open-street-map", mapbox_zoom=12)

# Show figure
map.show()

Room Type, Minimum Nights, and Availability Analysis¶

Let's profile the dataset by room type, minimum-night requirements, and listing availability.

Count of Listings by Room Type¶

In [182]:
airbnb_df['room_type'].value_counts(ascending=False)
Out[182]:
room_type
Entire home/apt    25409
Private room       22326
Shared room         1160
Name: count, dtype: int64

Graph Showing Count of Listings by Room Type

In [184]:
# Room type analysis
plt.figure(figsize=(10, 6))
sns.countplot(x='room_type', data=airbnb_df, palette="husl", order=airbnb_df['room_type'].value_counts().index)
plt.title('Count of Listings by Room Type')
counts = airbnb_df['room_type'].value_counts()
for i, value in enumerate(counts.index):
    count = counts[value]
    plt.text(i, count + 1, f'{count:.0f}', ha='center', va='bottom')

plt.show()
[Figure: count of listings by room type]

Average Availability by Neighbourhood Group¶

In [186]:
average_avail=airbnb_df.groupby('neighbourhood_group')['availability_365'].mean().sort_values(ascending=False).round(2).reset_index()
average_avail
Out[186]:
neighbourhood_group availability_365
0 Staten Island 199.68
1 Bronx 165.76
2 Queens 144.45
3 Manhattan 111.98
4 Brooklyn 100.23
In [187]:
# Create a bar chart with Plotly Express
fig = px.bar(
    average_avail,
    x='neighbourhood_group',  # Neighbourhood group on the x-axis
    y='availability_365',    # Average availability on the y-axis
    text ='availability_365',  # Text labels for each bar
    title='Average Availability by Neighbourhood Group',
    labels={'availability_365': 'Average Availability (Days)', 'neighbourhood_group': 'Neighbourhood Group'},
    color='neighbourhood_group',  # Add color differentiation for each neighbourhood group
)


fig.update_layout(
    xaxis_title='Neighbourhood Group',
    yaxis_title='Average Availability (Days)',
    showlegend=False  # Removing legend since it might be redundant in this case
)


fig.show()

Average Availability by Room Type and Neighbourhood Group¶

In [189]:
# Calculate the average availability for each neighbourhood group and room type
average_avail_gr=airbnb_df.groupby(["neighbourhood_group", "room_type"])['availability_365'].mean().reset_index().sort_values('availability_365',ascending=False).round(2)
average_avail_gr
Out[189]:
neighbourhood_group room_type availability_365
13 Staten Island Private room 226.36
11 Queens Shared room 192.19
12 Staten Island Entire home/apt 178.07
5 Brooklyn Shared room 178.01
1 Bronx Private room 171.33
0 Bronx Entire home/apt 158.00
2 Bronx Shared room 154.22
10 Queens Private room 149.22
8 Manhattan Shared room 138.57
9 Queens Entire home/apt 132.27
6 Manhattan Entire home/apt 117.14
7 Manhattan Private room 101.85
4 Brooklyn Private room 99.92
3 Brooklyn Entire home/apt 97.21
14 Staten Island Shared room 64.78
In [190]:
# Create a bar chart with Plotly Express
fig = px.bar(
    average_avail_gr,
    x="room_type",                # Room type on x-axis
    y="availability_365",         # Average availability on y-axis
    color="neighbourhood_group",  # Different colors for neighbourhood groups
    barmode="group",              # Group bars by neighbourhood group
    title="Average Availability by Room Type and Neighbourhood Group",
    labels={
        "availability_365": "Average Availability (Days)",
        "room_type": "Room Type",
        "neighbourhood_group": "Neighbourhood Group"
    }
)

# Customize the layout
fig.update_layout(
    xaxis_title="Room Type",
    yaxis_title="Average Availability (Days)",
    legend_title="Neighbourhood Group",
    width=900,
    height=600
)

# Show the plot
fig.show()

Average Minimum Nights by Neighbourhood Group and Room Type¶

In [192]:
average_minimum_nights=airbnb_df.groupby(["neighbourhood_group", "room_type"])['minimum_nights'].mean().reset_index().sort_values('minimum_nights',ascending=False).round(2)
average_minimum_nights
Out[192]:
neighbourhood_group room_type minimum_nights
6 Manhattan Entire home/apt 10.54
5 Brooklyn Shared room 7.75
8 Manhattan Shared room 6.77
3 Brooklyn Entire home/apt 6.53
12 Staten Island Entire home/apt 6.24
0 Bronx Entire home/apt 5.96
4 Brooklyn Private room 5.54
7 Manhattan Private room 5.45
9 Queens Entire home/apt 5.37
10 Queens Private room 5.12
11 Queens Shared room 4.23
1 Bronx Private room 3.86
13 Staten Island Private room 3.63
2 Bronx Shared room 3.37
14 Staten Island Shared room 2.33
In [193]:
fig = px.bar(
    average_minimum_nights,
    x="room_type",
    y="minimum_nights",           # Average minimum nights on the y-axis
    color="neighbourhood_group",  # Color bars by neighbourhood group
    barmode="group",              # Group bars for each room type
    text='minimum_nights',        # Text labels for each bar
    title="Average Minimum Nights by Neighbourhood Group and Room Type",
    labels={
        "minimum_nights": "Average Minimum Nights",
        "room_type": "Room Type",
        "neighbourhood_group": "Neighbourhood Group"
    }
)

# Customize the layout
fig.update_layout(
    xaxis_title="Room Type",
    yaxis_title="Average Minimum Nights",
    legend_title="Neighbourhood Group",
    width=1000,  # Set the width of the chart
    height=600   # Set the height of the chart
)

# Show the figure
fig.show()

Price Analysis of Airbnb Listings¶

In [195]:
plt.hist(airbnb_df['price'], bins=1000, color='skyblue')
plt.title('Distribution of Airbnb Prices')
plt.xlabel('Price')
plt.show()
[Figure: histogram of listing prices]
In [196]:
ax = sns.histplot(airbnb_df, x='price', bins=50, stat="density", color='skyblue')
ax1 =sns.kdeplot(airbnb_df['price'], color='red')
ax1.set_title('Distribution of Airbnb Prices')
ax1.set_xlabel('Price');
[Figure: price histogram with KDE overlay]

Top 5 Price Listings¶

In [198]:
top_price = airbnb_df.price.value_counts().nlargest(5).reset_index()
top_price
Out[198]:
price count
0 100 2051
1 150 2047
2 50 1534
3 60 1458
4 200 1401
In [199]:
plt.figure(figsize=(10, 6))
sns.barplot(x='count', y='price', data=top_price, orient='h', palette='plasma', order=top_price['price'])  # Changed x and y, and order
plt.title('Top 5 Price Listings')
plt.xlabel('Count')  # xlabel remains 'Count' as it represents the frequency
plt.ylabel('Price (USD)')
Out[199]:
Text(0, 0.5, 'Price (USD)')
[Figure: top 5 price listings bar chart]

The plot above shows that the most common listing prices are round numbers ending in zero; $100 is the most frequent, appearing in 2,051 listings.
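The round-number clustering can be quantified directly. A minimal sketch on a hypothetical price sample (the real column would be `airbnb_df['price']`):

```python
# Hypothetical sample of listing prices illustrating the round-number effect
prices = [100, 150, 50, 60, 200, 99, 75, 149, 101, 120]

# Fraction of prices that are exact multiples of 10
round_share = sum(p % 10 == 0 for p in prices) / len(prices)
```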

Top 20 Neighbourhoods by Mean Price¶

In [202]:
top_neighbourhoods=airbnb_df.groupby(['neighbourhood', 'neighbourhood_group'])['price'].mean().reset_index().sort_values('price',ascending=False).head(20).round(2)
top_neighbourhoods
Out[202]:
neighbourhood neighbourhood_group price
82 Fort Wadsworth Staten Island 800.00
219 Woodrow Staten Island 700.00
197 Tribeca Manhattan 490.64
174 Sea Gate Brooklyn 487.86
167 Riverdale Bronx 442.09
157 Prince's Bay Staten Island 409.50
6 Battery Park City Manhattan 367.56
75 Flatiron District Manhattan 341.92
161 Randall Manor Staten Island 336.00
144 NoHo Manhattan 295.72
178 SoHo Manhattan 287.10
127 Midtown Manhattan 282.72
139 Neponsit Queens 274.67
209 West Village Manhattan 267.68
92 Greenwich Village Manhattan 263.41
34 Chelsea Manhattan 249.74
215 Willowbrook Staten Island 249.00
191 Theater District Manhattan 248.01
145 Nolita Manhattan 230.14
73 Financial District Manhattan 225.49

Heat Map Showing Top 20 Neighbourhoods by Mean Price¶

In [204]:
# Group by neighbourhood and neighbourhood_group, calculate the mean price
heatmap_data = airbnb_df.groupby(['neighbourhood', 'neighbourhood_group']).price.mean()

# Sort values and select the top 20 by price
heatmap_data = heatmap_data.sort_values(ascending= False).head(20).round(0)

# Unstack the data to create a pivot table (matrix format required for heatmaps)
heatmap_data_matrix = heatmap_data.unstack()

# Create the heatmap with Plotly Express
fig = px.imshow(
    heatmap_data_matrix,
    labels=dict(x="Neighbourhood Group", y="Neighbourhood", color="Mean Price"),
    title="Top 20 Neighbourhoods by Mean Price",
    text_auto=True,
    color_continuous_scale='viridis',  # Viridis colour scale
)

# Customize the layout
fig.update_layout(
    width=1200,  # Adjust figure size
    height=1000,  # Adjust figure size
    xaxis_title="Neighbourhood Group",
    yaxis_title="Neighbourhood",
    coloraxis_colorbar=dict(title="Mean Price")  # Customize colorbar label
)

# Show the heatmap
fig.show()

OpenStreetMap Showing Top 20 Neighbourhoods by Mean Price¶

In [206]:
# Group by neighbourhood and neighbourhood_group, calculate the mean price
top_neighbourhoods = (
    airbnb_df.groupby(['neighbourhood', 'neighbourhood_group'])['price']
    .mean()
    .sort_values(ascending=False)
    .head(20)
    .reset_index()
)

# Merge with the original airbnb_df dataset to include latitude and longitude
top_neighbourhoods = top_neighbourhoods.merge(airbnb_df[['neighbourhood', 'latitude', 'longitude']].drop_duplicates(),
                                              on='neighbourhood',
                                              how='left')

# Create a scatter mapbox
fig = px.scatter_mapbox(
    top_neighbourhoods,
    lat='latitude',                # Latitude for map points
    lon='longitude',               # Longitude for map points
    size='price',                  # Use price to size the markers
    color='neighbourhood_group',   # Color points by neighbourhood_group
    hover_name='neighbourhood',    # Display neighbourhood on hover
    hover_data={'price': True, 'latitude': False, 'longitude': False},  # Show price, hide lat/lon
    title="Top 20 Neighbourhoods by Mean Price",
    color_continuous_scale='viridis',  # Viridis colour scale
    zoom=10,                     # Adjust map zoom level
    height=600                   # Set map height
)

# Set map style
fig.update_layout(mapbox_style="open-street-map")

# Show the map
fig.show()

Average Price by Neighbourhood Group and Room Type¶

In [208]:
average_price=airbnb_df.groupby(["neighbourhood_group", "room_type"])['price'].mean().reset_index().sort_values('price',ascending=False).round(2)
average_price
Out[208]:
neighbourhood_group room_type price
6 Manhattan Entire home/apt 249.24
3 Brooklyn Entire home/apt 178.33
12 Staten Island Entire home/apt 173.85
9 Queens Entire home/apt 147.05
0 Bronx Entire home/apt 127.51
7 Manhattan Private room 116.78
8 Manhattan Shared room 88.98
4 Brooklyn Private room 76.50
10 Queens Private room 71.76
11 Queens Shared room 69.02
1 Bronx Private room 66.79
13 Staten Island Private room 62.29
2 Bronx Shared room 59.80
14 Staten Island Shared room 57.44
5 Brooklyn Shared room 50.53
In [209]:
fig = px.bar(
    average_price,
    x="room_type",                # Room type on the x-axis
    y="price",                    # Average price on the y-axis
    color="neighbourhood_group",  # Color bars by neighbourhood group
    barmode="group",              # Group bars for each room type
    title="Average Price by Neighbourhood Group and Room Type",
    labels={
        "price": "Average Price (USD)",
        "room_type": "Room Type",
        "neighbourhood_group": "Neighbourhood Group"
    }
)

# Customize the layout
fig.update_layout(
    xaxis_title="Room Type",
    yaxis_title="Average Price (USD)",
    legend_title="Neighbourhood Group",
    width=1000,  # Adjust chart width
    height=600   # Adjust chart height
)

# Show the figure
fig.show()

Across every neighbourhood group, entire homes/apartments command the highest average prices, led by Manhattan at $249.24, while shared rooms are consistently the cheapest room type.

Machine Learning¶

In [212]:
airbnb_df['price'].describe()
Out[212]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

This shows that the mean price of an Airbnb listing is about $153. The minimum listed price is $0 and the maximum is $10,000, with 75% of listings priced at $175 or below. The price distribution is heavily right-skewed with large outliers.

In [214]:
plt.hist(airbnb_df['price'], bins=1000, color='skyblue')
plt.title('Distribution of Airbnb Prices')
plt.xlabel('Price')
plt.show()
[Figure: histogram of listing prices]

Logarithm Transformation of Price¶

The listing prices are heavily right-skewed, so I will remove the zero-price rows and apply a log transformation to the remaining prices. Many models do not perform well on skewed data.

Remove Rows Where Price Equals Zero

In [217]:
airbnb_df['price'].loc[airbnb_df['price'] == 0].count() #count zero values in price column
Out[217]:
11
In [218]:
airbnb_df = airbnb_df[airbnb_df['price'] != 0].copy()  # .copy() avoids SettingWithCopyWarning on later column assignments
In [219]:
airbnb_df['price'].loc[airbnb_df['price'] == 0].count()
Out[219]:
0
In [220]:
airbnb_df['log_price'] = np.log(airbnb_df['price']) #log transform and create a new column
In [221]:
airbnb_df.head()
Out[221]:
neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 log_price
0 Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 0.21 6 365 5.003946
1 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 2 355 5.416100
2 Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 0.00 1 365 5.010635
3 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 1 194 4.488636
4 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 1 0 4.382027
In [222]:
ax = sns.histplot(airbnb_df, x='log_price', bins=50, stat="density", color='skyblue')
ax1 =sns.kdeplot(airbnb_df['log_price'], color='red')
ax1.set_title('Distribution of Airbnb Log Prices')
ax1.set_xlabel('Log of Price');
[Figure: log-price histogram with KDE overlay]

Let's place the plots side by side and compare the distribution before and after log transformation

In [224]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Pass ax=axs[0] to plot on the first subplot
sns.histplot(airbnb_df, x='price', bins=50, stat="density", color='skyblue', ax=axs[0])
sns.kdeplot(airbnb_df['price'], color='red', ax=axs[0])
axs[0].set_title('Distribution of Airbnb Prices')
axs[0].set_xlabel('Price');

# Pass ax=axs[1] to plot on the second subplot
sns.histplot(airbnb_df, x='log_price', bins=50, stat="density", color='skyblue', ax=axs[1])
sns.kdeplot(airbnb_df['log_price'], color='red', ax=axs[1])
axs[1].set_title('Distribution of Airbnb Log Prices')
axs[1].set_xlabel('Log of Price');
[Figure: side-by-side price and log-price distributions]
In [225]:
airbnb_df["log_price"].describe().round(2)
Out[225]:
count    48884.00
mean         4.73
std          0.70
min          2.30
25%          4.23
50%          4.66
75%          5.16
max          9.21
Name: log_price, dtype: float64

From the above, the log transformation has given us the approximately normal (bell-curve) distribution needed for the analysis to be statistically meaningful, and it has greatly reduced the influence of outliers. When the target variable is continuous, a log transformation can help normalise the data; other common normalising transformations include the square-root and cube-root transformations.
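As a quick, self-contained illustration of this effect (using synthetic log-normally distributed prices, not the Airbnb data), the skewness before and after the transformation can be compared:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal, roughly matching the
# mean ≈ 4.7 and std ≈ 0.7 of log_price reported below)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=4.7, sigma=0.7, size=10_000))

log_prices = np.log(prices)
print(f"Skewness before log transform: {prices.skew():.2f}")
print(f"Skewness after log transform:  {log_prices.skew():.2f}")
```

The raw prices show strong positive skew, while the log-transformed values are close to symmetric.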


Check Price Outliers¶

I will use quantiles to show the bottom 10% and top 10% of listing prices; this will further help determine whether we have outliers in price.

In [228]:
low, high = airbnb_df["log_price"].quantile([0.1, 0.9]) # Take quantile values and assign to variables name
In [229]:
print(low,
high)
3.8918202981106265 5.594711379601839
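A minimal sketch of how these quantile bounds could be used to trim outliers (shown here on a synthetic stand-in series rather than airbnb_df itself):

```python
import numpy as np
import pandas as pd

# Stand-in for airbnb_df["log_price"]; the real bounds computed above
# were low ≈ 3.89 and high ≈ 5.59
rng = np.random.default_rng(0)
log_price = pd.Series(rng.normal(loc=4.73, scale=0.70, size=5_000))

low, high = log_price.quantile([0.1, 0.9])
trimmed = log_price[log_price.between(low, high)]

# between() is inclusive, so roughly the central 80% of rows remain
print(f"Kept {len(trimmed) / len(log_price):.0%} of listings")
```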

Most Common Log-Price Values¶

In [232]:
top_price_log = airbnb_df.log_price.value_counts().nlargest(5).reset_index()
top_price_log
Out[232]:
log_price count
0 4.605170 2051
1 5.010635 2047
2 3.912023 1534
3 4.094345 1458
4 5.298317 1401
In [233]:
fig = px.bar(
    top_price_log,
    x='count',                  # x-axis represents 'count' (frequency)
    y='log_price',                  # y-axis represents 'price'
    orientation='h',            # Horizontal orientation
    title='Top 5 Price Listings',  # Title of the chart
    color='log_price',              # Color bars by 'price' using a gradient
    color_continuous_scale='plasma'  # Use the 'plasma' color scale
)

# Customize the layout
fig.update_layout(
    xaxis_title='Count',         # Label for the x-axis
    yaxis_title='Log of Price',         # Label for the y-axis
    yaxis=dict(categoryorder='total ascending'),  # Order bars by price
    width=800,                   # Set chart width
    height=600                   # Set chart height
)

# Show the figure
fig.show()
In [236]:
# Calculate the average price for each neighbourhood group
average_price=airbnb_df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False).round(2)
average_price
Out[236]:
neighbourhood_group
Manhattan        196.88
Brooklyn         124.44
Staten Island    114.81
Queens            99.52
Bronx             87.58
Name: price, dtype: float64

Manhattan commands the highest average nightly price, followed by Brooklyn, Staten Island, Queens, and the Bronx.

In [238]:
# Scatter plot of prices vs. number of reviews
plt.figure(figsize=(10, 6))
plt.scatter(airbnb_df['number_of_reviews'], airbnb_df['price'], alpha=0.5)
plt.xlabel('Number of Reviews')
plt.ylabel('Price')
plt.title('Price vs. Number of Reviews')
plt.show()
[Figure: scatter plot of price vs. number of reviews]

Price Correlation

The figure below shows the correlation coefficients between the target variable, log of price, and the predictor variables (features). Among the numeric features, the count of host listings has the highest correlation with log price: the higher the count of listings, the higher the price tends to be.

In [240]:
plt.figure(figsize=(4,8))
# Calculate the correlation before dropping the 'price' column
correlation = airbnb_df.select_dtypes('number').drop(columns='price').corr()['log_price'].sort_values(ascending=False).to_frame()
# Now you can drop 'price' if you don't want it in the heatmap itself
correlation = correlation.drop(index='log_price')
# Plot heatmap of `correlation`
sns.heatmap(correlation, annot=True, linewidth=2)
Out[240]:
<Axes: >
[Figure: heatmap of feature correlations with log price]

Model Building¶


Create Feature Matrix X and Target Vector y

Creating my feature matrix X and target vector y. My target is "log_price"

In [243]:
target = "log_price"
X = airbnb_df.drop(columns=[target, 'price'])
y = airbnb_df[target]
In [244]:
X.columns
Out[244]:
Index(['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
       'room_type', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')

Split Dataset (80%)¶

Divide data (X and y) into training and test sets using randomized train-test split. Test set is 20% of total data. Random_state for reproducibility is 42

In [247]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
X_train shape: (39107, 10)
y_train shape: (39107,)
X_test shape: (9777, 10)
y_test shape: (9777,)
In [248]:
y_train.mean()
Out[248]:
4.727598311272868

Baseline Mean Absolute Error¶

What is Baseline MAE? The Baseline MAE serves as a benchmark for evaluating how much better (or worse) a regression model performs compared to its baseline.

If a model's MAE is significantly lower than the Baseline MAE, it indicates that the model has learned meaningful patterns from the features; if a model's MAE is close to or worse than the Baseline MAE, it suggests that:

  1. The model is not effectively learning patterns.
  2. The features might not be informative.
  3. The model might require improvement (e.g., feature engineering, hyperparameter tuning, or switching to a more complex model).
In [250]:
# Baseline MAE using mean of y
baseline_mae = abs(y_test - y_test.mean()).mean()
print(f"Baseline MAE: {baseline_mae}")
Baseline MAE: 0.5563692349799404

Machine Learning Pipeline¶

Ridge Regression¶

In [253]:
# Build model
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    SimpleImputer(strategy='mean'),  # for categorical columns, SimpleImputer(strategy='most_frequent') would be used instead
    Ridge(),
)
# Fit model to training data
model.fit(X_train, y_train)
Out[253]:
Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['neighbourhood_group', 'neighbourhood',
                                     'room_type'],
                               use_cat_names=True)),
                ('standardscaler', StandardScaler()),
                ('simpleimputer', SimpleImputer()), ('ridge', Ridge())])
In [254]:
## Evaluate
In [255]:
#Recall Baseline
print(f"Baseline MAE: {baseline_mae}")
Baseline MAE: 0.5563692349799404
In [256]:
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_r2 = model.score(X_test, y_test)  # Pipeline.score returns R² for regressors

print("Training MAE:", round(train_mae, 2))
print("Test R²:", round(test_r2, 2))
Training MAE: 0.34
Test R²: 0.53
In [257]:
ridge_pred = model.predict(X_test)
In [258]:
mae = mean_absolute_error(y_test, ridge_pred)
mse = mean_squared_error(y_test, ridge_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, ridge_pred)
In [259]:
print("Performance Metrics:")
print(f"Baseline MAE: {baseline_mae}")
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R-squared: {r2}')
Performance Metrics:
Baseline MAE: 0.5563692349799404
Mean Absolute Error: 0.3463794700498891
Mean Squared Error: 0.2322500436804255
Root Mean Squared Error: 0.48192327571972027
R-squared: 0.5253740675116862
In [ ]:
 

Evaluate¶

Communicate Results¶

Create a Series named feat_imp. The index should contain the names of all the features your model considers when making predictions; the values should be the coefficient values associated with each feature. The Series should be sorted ascending by absolute value.

In [263]:
coefficients = model.named_steps["ridge"].coef_
features = model.named_steps["onehotencoder"].get_feature_names()
feat_imp = pd.Series(coefficients, index=features)
feat_imp.head()
Out[263]:
neighbourhood_group_Queens           0.017574
neighbourhood_group_Brooklyn        -0.041865
neighbourhood_group_Bronx           -0.001412
neighbourhood_group_Manhattan        0.039135
neighbourhood_group_Staten Island   -0.049619
dtype: float64
In [265]:
feat_imp.sort_values(key=abs)
Out[265]:
neighbourhood_Pelham Gardens    -0.000143
neighbourhood_Middle Village    -0.000246
neighbourhood_North Riverdale    0.000439
neighbourhood_Wakefield          0.000440
neighbourhood_Willowbrook       -0.000447
                                   ...   
availability_365                 0.103653
room_type_Shared room           -0.107806
room_type_Private room          -0.160956
room_type_Entire home/apt        0.193366
longitude                       -0.219073
Length: 233, dtype: float64
In [266]:
feat_imp.sort_values(key=abs).tail(10).plot(kind='barh')

# Label axes (coefficients are on the log-price scale, not USD)
plt.xlabel("Coefficient (log price)")
plt.ylabel("Feature")

# Add title
plt.title("Feature Importances for Apartment Price")
Out[266]:
Text(0.5, 1.0, 'Feature Importances for Apartment Price')
[Figure: horizontal bar chart of the ten largest Ridge coefficients by absolute value]

Random Forest¶

Build Model

In [ ]:
# Build model
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    SimpleImputer(strategy='mean'),  # reviews_per_month has missing values, as in the Ridge pipeline
    RandomForestRegressor(n_estimators=100, random_state=42)
)
# Fit model to training data
model.fit(X_train, y_train)
In [ ]:
rf_pred = model.predict(X_test)
In [ ]:
mae = mean_absolute_error(y_test,rf_pred)
mse = mean_squared_error(y_test, rf_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, rf_pred)
In [ ]:
print("Performance Metrics:")
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R-squared: {r2}')
In [ ]:
# Access the RandomForestRegressor within the pipeline
feature_importance = pd.DataFrame({'feature': model.named_steps['onehotencoder'].get_feature_names_out(X.columns), # Get feature names after OneHotEncoding
                                  'importance': model.named_steps['randomforestregressor'].feature_importances_})

# Rest of the code remains the same
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Most Important Features for Price Prediction')
plt.show()

COMPARE MODELS: RIDGE REGRESSION AND RANDOM FOREST REGRESSOR¶

In [ ]:
# Plot predictions vs actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

ax1.scatter(y_test, ridge_pred, alpha=0.5)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
ax1.set_title('Ridge: Predicted vs Actual')
ax1.set_xlabel('Actual Values')
ax1.set_ylabel('Predicted Values')

ax2.scatter(y_test, rf_pred, alpha=0.5)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
ax2.set_title('Random Forest: Predicted vs Actual')
ax2.set_xlabel('Actual Values')
ax2.set_ylabel('Predicted Values')

plt.tight_layout()
plt.show()
In [ ]:
metrics = {}
for name, pred in [("Ridge", ridge_pred), ("Random Forest", rf_pred)]:
  metrics[name] = {"MAE": mean_absolute_error(y_test, pred),
            "MSE": mean_squared_error(y_test, pred),
            "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
            "R2": r2_score(y_test, pred)}
In [ ]:
 # Print detailed metrics comparison
print("\nDetailed Metrics Comparison:")
for metric in ["MAE", "MSE", "RMSE", "R2"]:
  print(f"\n{metric}:")
  for model in metrics:
    print(f"{model}: {metrics[model][metric]:.4f}")

Performance Analysis:¶

Overall Performance¶

Random Forest performs better across all metrics:

  • 8.4% improvement in MAE
  • 6.6% improvement in RMSE
  • 11.2% improvement in R² score

Specific Improvements¶

  • MAE reduced from 0.346 to 0.317. Since we are working with log prices, this means Random Forest's predictions are typically off by a multiplicative factor of exp(0.317) ≈ 1.37, versus exp(0.346) ≈ 1.41 for Ridge.

  • R² increased from 0.526 to 0.585. Random Forest explains 58.5% of price variance versus Ridge's 52.6%, which suggests it better captures non-linear relationships in the data.
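The back-transformation from the log-scale MAE to a multiplicative price-error factor can be verified directly:

```python
import math

# An MAE of m on log(price) corresponds to predictions that are off
# by a multiplicative factor of about exp(m) on the original price scale
rf_factor = math.exp(0.317)     # Random Forest
ridge_factor = math.exp(0.346)  # Ridge

print(f"Random Forest typical error factor: {rf_factor:.2f}")  # ≈ 1.37
print(f"Ridge typical error factor:         {ridge_factor:.2f}")  # ≈ 1.41
```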

Why Random Forest Performs Better¶

  • Better handles non-linear relationships between features
  • Automatically captures feature interactions
  • More robust to outliers in the dataset
  • Can model complex patterns in neighborhood and location data
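These points can be illustrated with a toy example (synthetic one-dimensional data, not the Airbnb set): a purely non-linear target that a linear model cannot fit well but a Random Forest can:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# A non-linear relationship: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=2_000)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

ridge = Ridge().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

print(f"Ridge R²:         {ridge.score(X_test, y_test):.2f}")
print(f"Random Forest R²: {rf.score(X_test, y_test):.2f}")
```

The linear model can only fit a straight line through the sine curve, while the forest recovers the curve itself, mirroring the gap we see on the Airbnb data.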
In [ ]:
# Create a DataFrame for the metrics
metrics_data = {
    'Metric': ['MAE', 'MSE', 'RMSE', 'R2'],
    'Ridge': [0.3464, 0.2323, 0.4819, 0.5254],
    'Random Forest': [0.3169, 0.2028, 0.4504, 0.5855]
}

metrics_df = pd.DataFrame(metrics_data)

# Melt the DataFrame for Plotly
metrics_melted = metrics_df.melt(id_vars='Metric',
                                  value_vars=['Ridge', 'Random Forest'],
                                  var_name='Model',
                                  value_name='Value')

# Create a grouped bar chart
fig = px.bar(
    metrics_melted,
    x='Metric',          # Metrics on the x-axis (e.g., MAE, MSE, etc.)
    y='Value',           # Corresponding values on the y-axis
    color='Model',       # Different colors for Ridge and Random Forest
    barmode='group',     # Grouped bars for easy comparison
    title='Detailed Metrics Comparison: Ridge vs Random Forest',
    labels={'Value': 'Metric Value', 'Metric': 'Metric'},
    text='Value'         # Display values on top of the bars
)

# Customize the layout
fig.update_layout(
    xaxis_title='Metric',
    yaxis_title='Value',
    legend_title='Model',
    width=800,
    height=500
)

# Show the plot
fig.show()

The graph clearly shows that Random Forest (red bars) outperforms Ridge Regression (purple bars) across all metrics:

  1. Lower error metrics (MAE, MSE, RMSE)
  2. Higher R² score

Save Random Forest Model¶

In [ ]:
import pickle
with open('model.pkl' , 'wb') as file : #model.pkl is my pickle file in binary write mode('wb')
    pickle.dump(model, file)
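For completeness, a self-contained round-trip sketch (using a small stand-in object rather than the fitted pipeline) showing how the pickled file would be loaded back:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; the real code would open 'model.pkl'
payload = {"model": "random_forest", "r2": 0.5855}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as file:
    pickle.dump(payload, file)
with open(path, "rb") as file:  # 'rb' = binary read mode
    restored = pickle.load(file)

print(restored == payload)  # True
```

Loading the real pipeline with pickle.load and calling its predict method would reproduce the Random Forest predictions without retraining.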

Summary and Conclusion:¶

Exploratory Data Analysis (EDA)¶

The EDA process was instrumental in uncovering patterns and relationships in the dataset. Key observations include:

  1. Neighbourhood Group Distribution:¶

The dataset revealed that most listings were concentrated in Manhattan and Brooklyn, with fewer listings in Queens, Bronx, and Staten Island. Manhattan listings exhibited higher average prices, reflecting its premium market status.

  2. Room Type Insights:¶

Listings were categorised into Entire home/apt, Private room, and Shared room. Entire homes commanded the highest average prices, while shared rooms were the least expensive. Private rooms had significant variability in pricing, influenced by location and other factors.

  3. Price Distribution:¶

Price data exhibited a right-skewed distribution, with most listings priced below $200 per night. Outliers included luxury properties with prices exceeding $1,000 per night. These outliers were addressed by doing a log transformation of price during data preprocessing.

  4. Availability:¶

The availability_365 feature showed that many properties were available for fewer than 100 days annually, indicating the presence of part-time rentals.

  5. A heatmap of correlations revealed that:¶

availability_365 and number_of_reviews had weak correlations with price. Categorical features like neighbourhood group and room type appeared to play a more significant role.

  6. Feature Importance:¶

Random Forest's feature importance analysis highlighted neighbourhood group, room type, and minimum nights as the most influential factors in predicting price.

Machine Learning¶

This project demonstrated the comparative strengths of Ridge Regression and Random Forest models in predicting Airbnb prices. The analysis revealed that Random Forest outperformed Ridge Regression across all evaluation metrics. Random Forest achieved a lower MAE (0.3169 vs. 0.3464) and RMSE (0.4504 vs. 0.4819), indicating smaller prediction errors. Additionally, its higher R² (0.5855 vs. 0.5254) showed that it explained more variance in the target variable.

The results suggest that Random Forest is better suited for this dataset because it can capture complex, non-linear relationships between features and the target variable. In contrast, Ridge Regression, while effective for linear relationships and mitigating multicollinearity, struggled to achieve comparable accuracy in this context.

Future work could explore further hyperparameter tuning, feature engineering, and the use of advanced boosting models like XGBoost or LightGBM to enhance prediction accuracy. These findings reinforce the importance of selecting models based on the nature of the data and the problem at hand, particularly in domains such as real estate where diverse and complex factors influence price prediction.
