Redfin scraper using filters and proxies

clarsen/redfin-scraper

 
 

redfin-scraper

redfin-scraper is a proxy-based scraper that extracts properties from Redfin using filters. It is especially useful when you want to crawl all recently sold properties (e.g., properties sold in the past 3 years) in a given state or city.

Scraping Algorithm

Please refer to algorithm_sketch.md.

Prerequisites

  1. SQLite is installed. macOS ships with it, so Mac users do not need to install anything.
  2. Python 3.6 is installed on your system.
  3. You have a CSV file of proxies. You can buy proxies online, or use a free service like proxybroker. The repo assumes proxies that require user and password authorization. If your proxies do not need authorization, the CSV file only needs the ip and port columns:
ip,port
a.b.c.d,2345
e.f.g.h,1234
...

Otherwise, add the user and password columns to your proxy CSV file:

ip,port,user,password
a.b.c.d,2345,user1,pass1
e.f.g.h,1234,user2,pass2
...
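
As an illustration of the two CSV layouts above, here is a minimal sketch of reading either one into proxy URLs. This is not the repo's own loader: the function name `load_proxies`, the `http://` scheme, and the sample addresses are assumptions made for the example.

```python
import csv
import io
from urllib.parse import quote


def load_proxies(csv_file):
    """Read proxy rows and return a list of proxy URLs.

    Hypothetical helper: accepts both layouts described above,
    ip,port and ip,port,user,password.
    """
    proxies = []
    for row in csv.DictReader(csv_file):
        host = f"{row['ip']}:{row['port']}"
        user = row.get("user")
        password = row.get("password")
        if user and password:
            # Percent-encode credentials so characters like '@' stay unambiguous.
            host = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
        # Assumption: plain HTTP proxies; adjust the scheme for SOCKS proxies.
        proxies.append(f"http://{host}")
    return proxies


# Made-up sample data: one proxy with credentials, one without.
sample = io.StringIO(
    "ip,port,user,password\n"
    "10.0.0.1,2345,user1,pass1\n"
    "10.0.0.2,1234,,\n"
)
proxy_urls = load_proxies(sample)
print(proxy_urls)
```

Rows with empty or missing credentials fall back to the unauthenticated form, so one loader can cover both file layouts.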

Environment Setup

  1. Create a Python virtual environment with Python 3.
python3.6 -m venv /path/to/venv
  2. Activate the virtual environment.
source /path/to/venv/bin/activate
  3. Install the dependencies.
pip install -r requirements.txt

How to use

Once the prerequisites are in place and the Python environment is set up, you can scrape Redfin data based on your needs. The examples below demonstrate redfin-scraper usage on a single city, Vancouver, WA (https://www.redfin.com/city/18823/WA/Vancouver).

Property Summary URLs Only

If you want to get all Redfin summary URLs in a given city, you can just run

python redfin_crawler.py https://www.redfin.com/city/18823/WA/Vancouver --proxy_csv residential_proxy.csv --property_prefix https://www.redfin.com/city/18823/WA/Vancouver --type pages

Scraping Property Short Details

If you need minimal property details, including price, address, property type, and number of rooms, run with --type properties. This not only generates the summary URLs containing the properties, but also extracts the property metadata from those URLs.

python redfin_crawler.py https://www.redfin.com/city/18823/WA/Vancouver --proxy_csv residential_proxy.csv --property_prefix https://www.redfin.com/city/18823/WA/Vancouver --type properties

Scraping Property Full Details

If you need the full property details, including price, number of rooms, number of bathrooms, square feet, lot size, Redfin estimate, price/sqft, and year, run with --type property_details. This not only generates the summary URLs containing the properties, but also extracts the property metadata from those URLs.

python redfin_crawler.py https://www.redfin.com/city/18823/WA/Vancouver --proxy_csv residential_proxy.csv --property_prefix https://www.redfin.com/city/18823/WA/Vancouver --type property_details
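
The three invocations above differ only in the value passed to --type. If you want to drive the crawler from another Python script, a small helper along these lines could assemble the command. This is a sketch, not part of the repo: `build_command` is a hypothetical name, and it assumes (as in the examples above) that --property_prefix is the same as the city URL.

```python
import sys

# The three --type values shown in this README.
VALID_TYPES = ("pages", "properties", "property_details")


def build_command(city_url, proxy_csv, scrape_type, python=sys.executable):
    """Return the argument list for one redfin_crawler.py run (hypothetical helper)."""
    if scrape_type not in VALID_TYPES:
        raise ValueError(f"unknown type {scrape_type!r}, expected one of {VALID_TYPES}")
    return [
        python,
        "redfin_crawler.py",
        city_url,
        "--proxy_csv", proxy_csv,
        # Assumption: the prefix matches the city URL, as in the examples above.
        "--property_prefix", city_url,
        "--type", scrape_type,
    ]


cmd = build_command(
    "https://www.redfin.com/city/18823/WA/Vancouver",
    "residential_proxy.csv",
    "properties",
    python="python",
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```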

Known Issues and Bugs

Fork safety issue on Mac

If you are on a Mac and see errors like

may have been in progress in another thread when fork() was called.
We cannot safely call it or ignore it in the fork() child process. Crashing instead

try setting the following environment variable before running the program:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Scraping with proxies returns 403 error code.

Most likely the proxy has been blocked by the website's bot detection. You can temporarily remove the proxy from your proxy pool.
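
One way to picture "temporarily removing" a proxy is a pool that sidelines any proxy that reports a 403 and lets you restore it later. The sketch below is an assumption about how such a pool could look, not the repo's implementation; the class name `ProxyPool` and its methods are hypothetical.

```python
import random


class ProxyPool:
    """Hypothetical proxy pool that sidelines proxies after a 403 response."""

    def __init__(self, proxies):
        self._active = list(proxies)
        self._blocked = []

    def pick(self):
        """Return a random proxy that is still considered usable."""
        if not self._active:
            raise RuntimeError("no usable proxies left")
        return random.choice(self._active)

    def report(self, proxy, status_code):
        """Record the status code of a request made through `proxy`."""
        if status_code == 403 and proxy in self._active:
            self._active.remove(proxy)
            self._blocked.append(proxy)

    def restore(self, proxy):
        """Put a sidelined proxy back into rotation, e.g. after a cool-down."""
        if proxy in self._blocked:
            self._blocked.remove(proxy)
            self._active.append(proxy)

    @property
    def active(self):
        return list(self._active)

    @property
    def blocked(self):
        return list(self._blocked)


pool = ProxyPool(["http://10.0.0.1:2345", "http://10.0.0.2:1234"])
pool.report("http://10.0.0.1:2345", 403)  # blocked: leaves the rotation
pool.report("http://10.0.0.2:1234", 200)  # fine: stays in the rotation
print(pool.active, pool.blocked)
```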

But how do I know whether a proxy is good or not?

There is a proxy_checker.py script in the tools directory. You can use it to eliminate the proxies that are currently blocked by the target website. To use it, run

python tools/proxy_checker.py --proxy_csv_path proxy.csv
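
For intuition, a proxy check generally amounts to sending one request through each proxy and keeping those that answer normally. The following is a rough standard-library sketch of that idea. It does not reproduce tools/proxy_checker.py; the function names, the test URL, and the timeout are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def proxy_is_usable(proxy_url, test_url="https://www.redfin.com", timeout=10):
    """Return True if a request through `proxy_url` gets an HTTP 200 (hypothetical check)."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP errors such as 403, connection failures, and timeouts.
        return False


def filter_proxies(proxy_urls, check=proxy_is_usable):
    """Keep only the proxies for which `check` returns True."""
    return [proxy for proxy in proxy_urls if check(proxy)]
```

The `check` parameter is injectable so the filtering logic can be tried without network access, for example `filter_proxies(urls, check=lambda p: p.endswith(":1234"))`.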

Disclaimer

Scraping websites can violate a website's terms of service. Use at your own risk.

TODO

  1. Add free proxy integration so no external proxy file is needed.
  2. Make it a package so users can easily install it with pip.
  3. Add Docker environment.
