Fabien's WebScraping Anti-Ban Workshop

Introduction

Our goal is to understand how anti-bot protections work and how to bypass them.

I created a dedicated website for this workshop https://trekky-reviews.com. This website provides a list of hotels for every city, including reviews.

We will collect name, email and reviews for each hotel.

During this workshop, we will use the following open-source software:

Framework	Description
Scrapy	the leading framework for web scraping
Scrapoxy	the super proxies aggregator
Playwright	the latest headless browser framework that integrates seamlessly with Scrapy
Babel.js	a transpiler used for deobfuscation purposes

The scraper can be found at scrapers/spiders/trekky.py.

All solutions are located in solutions. If you have any difficulties implementing a solution, feel free to copy and paste it. However, I recommend taking some time to search and explore to get the most out of the workshop, rather than rushing through it in 10 minutes.

Preflight Checklist

VirtualBox (Linux and Windows)

To simplify the installation process, I've pre-configured an Ubuntu virtual machine for you with all the necessary dependencies for this workshop.

This virtual machine is compatible only with AMD64 architecture (Linux, Windows, and Intel-based macOS).

For macOS M1 (ARM64), please manually install the dependencies.

On Windows, avoid using WSL2 (it doesn't work with Playwright)

You can download it from this link.

The virtual machine is in OVA format and can be easily imported into VirtualBox.

It requires 8 Go RAM and 2 vCPU.

Click on Import Appliance and choose the OVA file you downloaded.

Credentials are: vboxguest / changeme.

I recommend switching the network setting from NAT to Bridge Adapter for improved performance.

Note: If the network is too slow, I have USB drives available with the VM.

Full Installation (Linux, Windows, and macOS)

You can manually install the required dependencies, which include:

Python (version 3 or higher) with virtualenv
Node.js (version 20 or higher)
Docker

Python

If you need Python, I recommend using Anaconda.

To install a Virtual Environment, run the following command:

python3 -m venv venv
source venv/bin/activate

Node.js

To install Node.js, follow the instructions on: https://nodejs.org/en/download

Docker

To install Docker, follow the instructions for Mac, Windows or Linux.

Setting up

This step is necessary even if you are using the VM.

Step 1: Clone the Repository

Clone this repository:

git clone https://github.com/scrapoxy/scraping-workshop.git
cd scraping-workshop

Step 2: Install Python libraries

Open a shell and install libraries:

pip install -r requirements.txt

Step 3: Install Playwright

After installing the Python libraries, run the follow command:

playwright install --with-deps chromium

Step 4: Install Node.js

Install Node.js from the official website or through the version management NVM

Step 5: Install Node.js libraries

Open a shell and install libraries from package.json:

npm install

Step 6: Scrapoxy

Run the following command to download Scrapoxy:

sudo docker pull scrapoxy/scrapoxy

Challenge 1: Run your first Scraper

The URL to scrape is: https://trekky-reviews.com/level1

Our goal is to collect names, emails, and reviews for each hotel listed.

Open the file scrapers/spiders/trekky.py.

In Scrapy, a spider is a Python class with a unique name property. Here, the name is trekky.

The spider class includes a method called start_requests, which defines the initial URLs to scrape. When a URL is fetched, the Scrapy engine triggers a callback function. This callback function handles the parsing of the data. It's also possible to generate new requests from within the callback function, allowing for chaining of requests and callbacks.

The structure of items is defined in the file scrapers/items.py. Each item type is represented by a dataclass containing fields and a loader:

HotelItem: name, email, review with the loader HotelItemLoader
ReviewItem: rating with the loader ReviewItemLoader

To run the spider, open a terminal at the project's root directory and run the following command:

scrapy crawl trekky

Scrapy will collect data from 50 hotels:

2024-07-05 23:11:43 [trekky] INFO: 

We got: 50 items

Check the results.csv file to confirm that all items were collected.

Challenge 2: First Anti-Bot protection

The URL to scrape is: https://trekky-reviews.com/level2

Update the URL in your scraper to target the new page and execute the spider:

scrapy crawl trekky

Data collection may fail due to an anti-bot system.

Pay attention to the messages explaining why access is blocked and use this information to correct the scraper.

Hint: It relates to HTTP request headers 😉

Soluce is here

Open the soluce

Challenge 3: Rate limit

The URL to scrape is: https://trekky-reviews.com/level4 (we will skip level3)

Update the URL in your scraper to target the new page and execute the spider:

scrapy crawl trekky

Data collection might fail due to rate limiting on our IP address.

Please don't adjust the delay between requests or the number of concurrent requests; that is not our goal. Imagine we need to collect millions of items within a few hours, and delaying our scraping session is not an option. Instead, we will use proxies to distribute requests across multiple IP addresses.

Use Scrapoxy to bypass the rate limit with a cloud provider (not a proxy service).

Step 1: Install Scrapoxy

Follow this guide or run the following command in the project directory:

sudo docker run -p 8888:8888 -p 8890:8890 -e AUTH_LOCAL_USERNAME=admin -e AUTH_LOCAL_PASSWORD=password -e BACKEND_JWT_SECRET=secret1 -e FRONTEND_JWT_SECRET=secret2 -e STORAGE_FILE_FILENAME=/scrapoxy.json -v ./scrapoxy.json:/scrapoxy.json scrapoxy/scrapoxy:latest

Step 2: Create a new project

In the new project, keep the default settings and click the Create button:

Step 3: Add a Proxy Provider

See the slides to set up the proxies provider account.

Use 10 proxies from the United States of America:

If you don't have an account with these cloud providers, you can create one.

They typically require a credit card, and you may need to pay a nominal fee of $1 or $2 for this workshop. Such charges are common when using proxies. Don't worry; in the next challenge, I'll provide you with free credit.

Step 4: Run the connector

Step 5: Connect Scrapoxy to the spider

Follow this guide.

Step 6: Execute the spider

Run your spider and check that Scrapoxy is handling the requests:

You should observe an increase in the count of received and sent requests.

Soluce is here

Open the soluce

Challenge 4: Fingerprint

The URL to scrape is: https://trekky-reviews.com/level6 (we will skip level5).

Update the URL in your scraper to target the new page and execute the spider:

scrapy crawl trekky

Data collection may fail due to fingerprinting.

Use the Network Inspector in your browser to view all requests. You will notice a new request appearing, which is a POST request instead of a GET request.

Inspect the website's code to identify the JavaScript that triggers this request.

It's clear that we need to execute JavaScript. Simply using Scrapy to send HTTP requests is not enough.

Also, it's important to maintain the same IP address throughout the session. Scrapoxy offers a sticky session feature for this purpose when using a browser.

Navigate to the Project options and enable both Intercept HTTPS requests with MITM and Keep the same proxy with cookie injection:

We will use the headless framework Playwright along with Scrapy's plugin scrapy-playwright.

scrapy-playwright should already be installed.

Our goal is to adapt the spider to integrate Playwright.

You can now completely replace the code in solutions/challenge-4.py due to extensive modifications needed.

The updates include:

Settings: Updated to add a custom DOWNLOAD_HANDLERS.
Playwright Configuration: PLAYWRIGHT_LAUNCH_OPTIONS now:
- Disables headless mode, allowing you to view Playwright’s actions.
- Configures a proxy to route traffic through Scrapoxy.
Request Metadata: Each request now includes metadata to enable Playwright and ignore HTTPS errors (using ignore_https_errors).

Challenge 5: Consistency

The URL to scrape is: https://trekky-reviews.com/level7

Update the URL in your scraper to target the new page and execute the spider:

scrapy crawl trekky

You will notice that data collection may fail due to inconsistency errors.

Anti-bot checks consistency across various layers of the browser stack.

Try to solve these errors!

Hint: It involves adjusting timezones 😉

Soluce is here

Open the soluce

Challenge 6: Deobfuscation

The URL to scrape is: https://trekky-reviews.com/level8

Update the URL in your scraper to target the new page and execute the spider:

scrapy crawl trekky

Step 1: Find the Anti-Bot Javascript

Use the Network Inspector to review all requests. Among them, you'll spot some unusual ones. By inspecting the payload, you'll notice that the content is encrypted:

Inspect the website's code to find the JavaScript responsible for sending this requests. In this case, the source code is obfuscated.

Obfuscated code appears to be:

var _qkrt1f=window,_uqvmii="tdN",_u5zh1i="UNM",_p949z3="on",_eu2jji="en",_vnsd5q="bto",_bi4e9="a",_f1e79r="e",_w13dld="ode",_vbg0l7="RSA-",_6uh486="ki"...

To understand which information is being sent and how to emulate it, we need to deobfuscate the code.

Step 2: Understand the code structure

To understand the structure of the code, copy/paste some code into the website AST Explorer

Don't forget to select @babel/parser and enable Transform:

AST Explorer parses the source code and generates a visual tree:

For the record, I only obfuscated strings, not the code flow.

Step 3: Deobfuscate the Javascript

Copy/paste the whole obfuscated code to tools/obfuscated.js.

And run the deobfucator script:

node tools/deobfuscator.js

This script will deobfuscate specificaly this code.

You can use online tools to deobfuscate this script, given that it's a straightforward obfuscated script. Also, GitHub Copilot can be incredibly helpful in writing AST operations, just as Claude Sonnet 3.5 is valuable for deciphering complex functions.

Step 4: Payload generation

Here’s a summary of the script’s behavior:

It collects WebGL information;
It encrypts the data using RSA encryption with an obfuscated public key;
It sends the payload to the webserver via a POST request.

We need to implement the same approach in our spider.

Since we will be crafting the payload ourselves, there is no need to use Playwright anymore. We will simply send the payload before initiating any requests.

You can now completely replace the code in solutions/challenge-6-1-partial.py and fill in the missing parts.

Soluce is here

Open the soluce

Challenge 7: Playwright Detection

The URL to scrape is: https://trekky-reviews.com/level9

For this challenge, directly use a Pure Playwright scraper from playwright_spider.py (don't use Scrapy).

Run it:

python playwright_spider.py

You will notice that data collection may fail due to playwright detection.

Anti-bot checks if CDP protocol or network inspector is opened.

Try to replace Playwright by another framework!

Hint: Use Camoufoux 😉

Soluce is here

Open the soluce

Conclusion

Thank you so much for participating in this workshop.

Your feedback is incredibly valuable to me. Please take a moment to complete this survey; your insights will greatly assist in enhancing future workshops:

👉 GO TO SURVEY 👈

Licence

Credit must be given to the creator;
Only noncommercial use of your work is permitted;
No derivatives or adaptations of your work are permitted.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.idea		.idea
.vscode		.vscode
images		images
scrapers		scrapers
solutions		solutions
tools		tools
.editorconfig		.editorconfig
.gitignore		.gitignore
README.md		README.md
header.jpg		header.jpg
package-lock.json		package-lock.json
package.json		package.json
playwright_spider.py		playwright_spider.py
requirements.txt		requirements.txt
scrapoxy.json		scrapoxy.json
scrapy.cfg		scrapy.cfg

scrapoxy/scraping-workshop

Folders and files

Latest commit

History

Repository files navigation

Fabien's WebScraping Anti-Ban Workshop

Introduction

Preflight Checklist

VirtualBox (Linux and Windows)

Full Installation (Linux, Windows, and macOS)

Python

Node.js

Docker

Setting up

Step 1: Clone the Repository

Step 2: Install Python libraries

Step 3: Install Playwright

Step 4: Install Node.js

Step 5: Install Node.js libraries

Step 6: Scrapoxy

Challenge 1: Run your first Scraper

Challenge 2: First Anti-Bot protection

Challenge 3: Rate limit

Step 1: Install Scrapoxy

Step 2: Create a new project

Step 3: Add a Proxy Provider

Step 4: Run the connector

Step 5: Connect Scrapoxy to the spider

Step 6: Execute the spider

Challenge 4: Fingerprint

Challenge 5: Consistency

Challenge 6: Deobfuscation

Step 1: Find the Anti-Bot Javascript

Step 2: Understand the code structure

Step 3: Deobfuscate the Javascript

Step 4: Payload generation

Challenge 7: Playwright Detection

Conclusion

Licence

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Languages

Packages