Repository with a case study of the properties listed in Toronto, Ontario, Canada as of February 8, 2021. The model uses data from Airbnb, as well as geolocation data (MapQuest) and natural language processing through the VADER lexicon. The repository is fully open-sourced under the MIT License.

josmarcristello/Airbnb-Toronto-Analysis

Prediction of Airbnb prices with Machine Learning in the Region of Toronto, Ontario, Canada

Introduction

The notebooks in this repository are a case study of the properties listed in Toronto, Ontario, Canada as of February 8, 2021. The model uses data from Airbnb, as well as geolocation data (MapQuest) and natural language processing through the VADER lexicon. The repository is fully open-sourced under the MIT License.

This study was originally submitted by the author as a capstone project to obtain the title of Data Scientist at PUC/MG (Pontifícia Universidade Católica de Minas Gerais, Brazil).

Table of Contents

  1. Installation
  2. Usage
  3. File Descriptions
  4. Results
  5. Limitations
  6. Acknowledgements

Installation

All necessary libraries are listed at the beginning of each notebook. The code was tested with Python 3.9.1 (64-bit), but should work with other 3.x versions.

The recommended way to install the libraries is from PyPI, using pip:

> pip install <libraryname>

Before opening an issue, make sure your libraries are updated:

> pip install --upgrade <libraryname>
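
When reporting an issue, it can also help to state which versions are installed. The following is a minimal sketch using only the standard library; the package names in the example call are assumptions for illustration, so replace them with the libraries imported at the top of the notebook you are running.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical package list; use the imports from the notebook instead.
    for name, version in installed_versions(["pandas", "xgboost", "vaderSentiment"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into an issue makes version mismatches easier to spot.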

Usage

Either:

  • Clone the repository, which gives you all the data sources used in the original study. See Acknowledgements for the licensing information of each data source.

  • Use the notebooks, which download and process all the data and also allow you to select different dates or regions.

File Descriptions

  1. download_airbnb.ipynb - The first notebook to run. Downloads data from InsideAirbnb.com. If necessary, configure a different region and/or dates here.

  2. GEO_process_coordinates.ipynb - Queries the MapQuest API to obtain geographic data for each property's longitude-latitude pair. Note that free accounts are limited to 15,000 requests per month.

  3. NLP_process_descriptions.ipynb and NLP_process_reviews.ipynb - Apply sentiment analysis using the VADER lexicon to the description and review text, respectively.

  4. analysis.ipynb - Requires all the other notebooks to have run successfully (or the repository to have been cloned). It handles all the data cleaning, merging of the previously generated data sources, exploratory data analysis (EDA), feature engineering, model creation, training and results.

  5. TCC_PUCMG_Report.pdf and TCC_PUCMG_Presentation.pptx - The report and the presentation of this study, respectively. Both are in Portuguese (PT-BR); the rest of the repository is in English.
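
Because the free MapQuest tier is capped at 15,000 requests per month, it may be worth de-duplicating coordinates and planning requests before calling the API. The sketch below is not the notebook's code: the endpoint and the `key` and `location` parameters reflect the MapQuest reverse geocoding service as an assumption to verify against its current documentation, and the helper names are hypothetical. It only builds URLs and makes no network calls.

```python
from urllib.parse import urlencode

# Assumed reverse geocoding endpoint; verify against the MapQuest documentation.
REVERSE_URL = "https://www.mapquestapi.com/geocoding/v1/reverse"
MONTHLY_QUOTA = 15_000  # free-tier limit mentioned in the file descriptions


def unique_coordinates(listings, precision=5):
    """Round and de-duplicate (lat, lng) pairs, keeping first-seen order."""
    seen = {}
    for lat, lng in listings:
        seen.setdefault((round(lat, precision), round(lng, precision)), None)
    return list(seen)


def build_reverse_url(lat, lng, api_key):
    """Build one reverse geocoding URL for a latitude-longitude pair."""
    query = urlencode({"key": api_key, "location": f"{lat},{lng}"})
    return f"{REVERSE_URL}?{query}"


def plan_requests(listings, api_key, already_used=0):
    """Return the URLs that fit in the remaining quota and the deferred pairs."""
    coords = unique_coordinates(listings)
    remaining = max(MONTHLY_QUOTA - already_used, 0)
    urls = [build_reverse_url(lat, lng, api_key) for lat, lng in coords[:remaining]]
    return urls, coords[remaining:]
```

The deferred pairs can be saved and processed in a later month, or with a second key, so that a partially completed run is not lost.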

Results

The selected model uses gradient boosting (XGBoost library) and achieved an R² of 0.691 with an MSE of 0.125. The analysis notebook contains an extensive data analysis, with recommendations for a host who wants to maximize profit or a guest who wants to maximize cost-benefit when renting a place. For example, for a guest in Toronto:

  • Always rent a place sized for the exact number of people traveling with you.
    • E.g., in general, it is cheaper (in price per person) to travel with 3 people and rent a place sized for 3 than to travel alone and rent a smaller place for one person.
  • Properties in the south of the City of Toronto are more expensive (particularly in Old Toronto).
  • Houses are cheaper than apartments. Hotel rooms seem to fall in between, but they are not numerous enough for this conclusion to be statistically significant.
  • Renting a shared room is cheaper than renting a private room, which in turn is cheaper than renting an entire property.
  • If possible, always choose listings from Superhosts. Surprisingly, Superhost status does not have a significant impact on price, and it is a good way to get more reliable quality of service.
  • Exclude amenities that you are not going to use.
    • E.g., if you do not intend to use the building's gym, try to find a property without one. Other amenities with a significant price impact: dishwasher, TV, dryer, pool.
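
For readers unfamiliar with the two metrics quoted for the selected model, the sketch below shows how R² and MSE are conventionally defined. It is a plain-Python illustration with made-up numbers, not the code or data from analysis.ipynb.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mean_squared_error(y_true, y_pred), r_squared(y_true, y_pred))
```

An R² of 0.691 therefore means the model accounts for roughly 69% of the variation in the target around its mean, while MSE is expressed in the squared units of whatever target the model was trained on.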

Limitations

The 'price' variable available in the dataset is exclusively the listing price at the moment of collection. This is a problem because Airbnb uses dynamic pricing, and because it disregards all historical data. For example, a host may have started with a lower price to accumulate many good reviews and then gradually raised it to the current value. The model cannot account for such cases, since historical prices are not available. This is, in the opinion of the author, the biggest obstacle to achieving higher precision with the model.

Acknowledgements
