Our goal is to understand how anti-bot protections work and how to bypass them.
I created a dedicated website for this workshop: https://trekky-reviews.com. This website provides a list of hotels for every city, including reviews.
We will collect name, email and reviews for each hotel.
During this workshop, we will use the following open-source software:
| Framework | Description |
|---|---|
| Scrapy | the leading framework for web scraping |
| Scrapoxy | the super proxies aggregator |
| Playwright | the latest headless browser framework that integrates seamlessly with Scrapy |
| Babel.js | a transpiler used for deobfuscation purposes |
The scraper can be found at scrapers/spiders/trekky.py.
All solutions are located in solutions. If you have any difficulties implementing a solution, feel free to copy and paste it. However, I recommend taking some time to search and explore to get the most out of the workshop, rather than rushing through it in 10 minutes.
To simplify the installation process, I've pre-configured an Ubuntu virtual machine for you with all the necessary dependencies for this workshop.
You can download it from this link.
The virtual machine is in OVA format and can be easily imported into VirtualBox.
It requires 8 GB of RAM and 2 vCPUs.
Click on Import Appliance and choose the OVA file you downloaded.

Credentials are: vboxguest / changeme.
I recommend switching the network setting from NAT to Bridge Adapter for improved performance.
Note: If the network is too slow, I have USB drives available with the VM.
You can manually install the required dependencies, which include:
- Python (version 3 or higher) with virtualenv
- Node.js (version 20 or higher)
- Docker
If you need Python, I recommend using Anaconda.
To create and activate a virtual environment, run the following commands:
python3 -m venv venv
source venv/bin/activate
To install Node.js, follow the instructions on: https://nodejs.org/en/download
To install Docker, follow the instructions for Mac, Windows or Linux.
This step is necessary even if you are using the VM.
Clone this repository:
git clone https://github.com/scrapoxy/scraping-workshop.git
cd scraping-workshop
Open a shell and install libraries:
pip install -r requirements.txt
After installing the Python libraries, run the following command:
playwright install --with-deps chromium
Install Node.js from the official website or through the version manager NVM.
Open a shell and install the libraries from package.json:
npm install
Run the following command to download Scrapoxy:
sudo docker pull scrapoxy/scrapoxy
The URL to scrape is: https://trekky-reviews.com/level1
Our goal is to collect names, emails, and reviews for each hotel listed.
Open the file scrapers/spiders/trekky.py.
In Scrapy, a spider is a Python class with a unique name property. Here, the name is trekky.
The spider class includes a method called start_requests, which defines the initial URLs to scrape.
When a URL is fetched, the Scrapy engine triggers a callback function.
This callback function handles the parsing of the data.
It's also possible to generate new requests from within the callback function, allowing for chaining of requests and callbacks.
The structure of items is defined in the file scrapers/items.py.
Each item type is represented by a dataclass containing fields and a loader:
- HotelItem: name, email, review, with the loader HotelItemLoader
- ReviewItem: rating, with the loader ReviewItemLoader
To run the spider, open a terminal at the project's root directory and run the following command:
scrapy crawl trekky
Scrapy will collect data from 50 hotels:
2024-07-05 23:11:43 [trekky] INFO:
We got: 50 items
Check the results.csv file to confirm that all items were collected.
The URL to scrape is: https://trekky-reviews.com/level2
Update the URL in your scraper to target the new page and execute the spider:
scrapy crawl trekky
Data collection may fail due to an anti-bot system.
Pay attention to the messages explaining why access is blocked and use this information to correct the scraper.
Hint: It relates to HTTP request headers 😉
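One common fix is to send headers that match a real browser. In Scrapy, these can go into DEFAULT_REQUEST_HEADERS in settings.py. The values below are illustrative placeholders: copy the exact headers your own browser sends from the Network Inspector.

```python
# Illustrative browser-like headers for Scrapy's settings.py.
# Replace these with the real values captured from your browser.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;"
        "q=0.9,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}
```

Anti-bot systems often flag the default `Scrapy/x.y` User-Agent and missing Accept headers, so aligning these with a real browser is usually the first step.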
The solution is here.

The URL to scrape is: https://trekky-reviews.com/level4 (we will skip level 3).
Update the URL in your scraper to target the new page and execute the spider:
scrapy crawl trekky
Data collection might fail due to rate limiting on our IP address.
Use Scrapoxy to bypass the rate limit with a cloud provider (not a proxy service).
Follow this guide or run the following command in the project directory:
sudo docker run -p 8888:8888 -p 8890:8890 -e AUTH_LOCAL_USERNAME=admin -e AUTH_LOCAL_PASSWORD=password -e BACKEND_JWT_SECRET=secret1 -e FRONTEND_JWT_SECRET=secret2 -e STORAGE_FILE_FILENAME=/scrapoxy.json -v ./scrapoxy.json:/scrapoxy.json scrapoxy/scrapoxy:latest
In the new project, keep the default settings and click the Create button:
See the slides to set up the proxies provider account.
Use 10 proxies from the United States of America:
If you don't have an account with these cloud providers, you can create one.
Follow this guide.
Run your spider and check that Scrapoxy is handling the requests:
You should observe an increase in the count of received and sent requests.
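One hedged way to route Scrapy traffic through Scrapoxy is Scrapy's generic proxy support: point each request at the master endpoint (port 8888, as in the Docker command above) and authenticate with the project credentials shown in the Scrapoxy interface. The username and password below are placeholders.

```python
import base64

# Placeholder project credentials; copy the real token from the
# Scrapoxy user interface.
USERNAME = "project-username"
PASSWORD = "project-password"

token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()

# Attach these to every Scrapy request (e.g. in start_requests):
meta = {"proxy": "http://localhost:8888"}
headers = {"Proxy-Authorization": f"Basic {token}"}
```

With this in place, each request leaves through one of the cloud-provider proxies managed by Scrapoxy, which is what makes the sent/received counters climb.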
The solution is here.

The URL to scrape is: https://trekky-reviews.com/level6 (we will skip level 5).
Update the URL in your scraper to target the new page and execute the spider:
scrapy crawl trekky
Data collection may fail due to fingerprinting.
Use the Network Inspector in your browser to view all requests. You will notice a new request appearing, which is a POST request instead of a GET request.
Inspect the website's code to identify the JavaScript that triggers this request.
It's clear that we need to execute JavaScript. Simply using Scrapy to send HTTP requests is not enough.
Also, it's important to maintain the same IP address throughout the session. Scrapoxy offers a sticky session feature for this purpose when using a browser.
Navigate to the Project options and enable both Intercept HTTPS requests with MITM and Keep the same proxy with cookie injection:
We will use the headless framework Playwright along with Scrapy's plugin scrapy-playwright.
Note: scrapy-playwright should already be installed.
Our goal is to adapt the spider to integrate Playwright.
Since the modifications needed are extensive, you can replace your spider's code entirely with solutions/challenge-4.py.
The updates include:
- Settings: updated to add a custom DOWNLOAD_HANDLERS.
- Playwright configuration: PLAYWRIGHT_LAUNCH_OPTIONS now:
  - disables headless mode, allowing you to view Playwright's actions;
  - configures a proxy to route traffic through Scrapoxy.
- Request metadata: each request now includes metadata to enable Playwright and ignore HTTPS errors (using ignore_https_errors).
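A sketch of those settings, assuming Scrapoxy runs locally on port 8888; the proxy credentials are placeholders for your project's token.

```python
# settings.py sketch: hand HTTP(S) downloads to scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # watch the browser work
    "proxy": {
        "server": "http://localhost:8888",  # Scrapoxy master
        "username": "project-username",     # placeholder
        "password": "project-password",     # placeholder
    },
}

# Per-request metadata (passed as Request(..., meta=meta)):
meta = {
    "playwright": True,
    "playwright_context_kwargs": {"ignore_https_errors": True},
}
```

`ignore_https_errors` is needed because Scrapoxy's MITM interception re-signs the TLS traffic with its own certificate.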
The URL to scrape is: https://trekky-reviews.com/level7
Update the URL in your scraper to target the new page and execute the spider:
scrapy crawl trekky
You will notice that data collection may fail due to inconsistency errors.
The anti-bot system checks consistency across various layers of the browser stack.
Try to solve these errors!
Hint: It involves adjusting timezones 😉
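For instance, if your proxies exit in the United States but the browser reports your local timezone, those layers disagree. One hedged fix is pinning the Playwright context options to match the proxy country; the values below are illustrative.

```python
# Per-request metadata for scrapy-playwright: create the browser
# context with a timezone and locale consistent with the proxy's
# exit country (here: US proxies, as configured earlier).
meta = {
    "playwright": True,
    "playwright_context_kwargs": {
        "ignore_https_errors": True,
        "timezone_id": "America/New_York",  # match the US proxies
        "locale": "en-US",
    },
}
```

`timezone_id` and `locale` are standard Playwright browser-context options, so they apply to every page opened in that context.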
The solution is here.

The URL to scrape is: https://trekky-reviews.com/level8
Update the URL in your scraper to target the new page and execute the spider:
scrapy crawl trekky
Use the Network Inspector to review all requests. Among them, you'll spot some unusual ones. By inspecting the payload, you'll notice that the content is encrypted:
Inspect the website's code to find the JavaScript responsible for sending these requests. In this case, the source code is obfuscated.
The obfuscated code looks like this:
var _qkrt1f=window,_uqvmii="tdN",_u5zh1i="UNM",_p949z3="on",_eu2jji="en",_vnsd5q="bto",_bi4e9="a",_f1e79r="e",_w13dld="ode",_vbg0l7="RSA-",_6uh486="ki"...
To understand which information is being sent and how to emulate it, we need to deobfuscate the code.
To understand the structure of the code, copy and paste some of it into the AST Explorer website. Don't forget to select @babel/parser and enable Transform:
AST Explorer parses the source code and generates a visual tree:
Note: For the record, I only obfuscated strings, not the code flow.
Copy and paste the whole obfuscated code into tools/obfuscated.js, then run the deobfuscator script:
node tools/deobfuscator.js
This script deobfuscates this specific code.
Note: You can use online tools to deobfuscate this script, given that it's a straightforward obfuscation. GitHub Copilot can also be incredibly helpful for writing AST operations, just as Claude 3.5 Sonnet is valuable for deciphering complex functions.
Here’s a summary of the script’s behavior:
- It collects WebGL information;
- It encrypts the data using RSA encryption with an obfuscated public key;
- It sends the payload to the webserver via a POST request.
We need to implement the same approach in our spider.
Since we will be crafting the payload ourselves, there is no need to use Playwright anymore. We will simply send the payload before initiating any requests.
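A hedged sketch of that approach: build the fingerprint as JSON, encrypt it with the site's RSA public key, and POST the result before the first page request. The key pair, field names, and encryption padding below are illustrative assumptions; use the real public key and field list recovered from the deobfuscated script.

```python
import base64
import json

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Placeholder key pair for the demo; in practice, load the public key
# extracted from the deobfuscated script. We keep the private half here
# only so the round trip can be verified locally.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Illustrative WebGL fields; take the real ones from the script.
fingerprint = {
    "webgl_vendor": "Google Inc.",
    "webgl_renderer": "ANGLE (Intel, Mesa, OpenGL 4.6)",
}

# OAEP padding is an assumption; match whatever the script actually uses.
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)
ciphertext = public_key.encrypt(json.dumps(fingerprint).encode(), oaep)
payload = base64.b64encode(ciphertext).decode()
# POST `payload` to the fingerprint endpoint before scraping begins.
```

Once the server accepts the payload (and sets whatever session cookie it uses), plain Scrapy requests can follow on the same session.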
You can replace your code with solutions/challenge-6-1-partial.py and fill in the missing parts.
The solution is here.

The URL to scrape is: https://trekky-reviews.com/level9
For this challenge, use a pure Playwright scraper directly from playwright_spider.py (don't use Scrapy).
Run it:
python playwright_spider.py
You will notice that data collection may fail due to Playwright detection.

The anti-bot system checks whether the CDP protocol is in use or the network inspector is open.
Try replacing Playwright with another framework!

Hint: Use Camoufox 😉
The solution is here.

Thank you so much for participating in this workshop.
Your feedback is incredibly valuable to me. Please take a moment to complete this survey; your insights will greatly assist in enhancing future workshops:
👉 GO TO SURVEY 👈
WebScraping Anti-Ban Workshop © 2024 by Fabien Vauchelles is licensed under CC BY-NC-ND 4.0:
- Credit must be given to the creator;
- Only noncommercial use of your work is permitted;
- No derivatives or adaptations of your work are permitted.