forked from laiso/site2pdf

Generate comprehensive PDFs of entire websites, ideal for RAG.


SidSata/site2pdf

 
 


site2pdf

This tool generates a PDF file containing the main page and all sub-pages of a website that match a provided URL pattern.

📗The PDF generated by this tool is particularly well-suited for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.📗


Motivation

  • 🧳 Portability: Combining multiple pages of a website into a single file makes the information easier to share and reuse.
  • 🤖 AI Integration: In some use cases, such as Google NotebookLM and ChatGPT GPTs, providing a master dataset in PDF format helps in creating more efficient bots.
  • 🖼️ Visual Information Preservation: PDF output preserves visual information such as images, which helps multimodal models recognize it.

Prerequisites

To run this software, you need to have Node.js installed on your machine. You can download and install the latest version of Node.js from the official Node.js website.

Dependencies (Linux)

On Linux, install the following system libraries, which headless Chrome needs in order to run:

sudo apt-get update
sudo apt-get install -y libxkbcommon0
sudo apt-get install -y libnss3 libxss1 libasound2
sudo apt-get install -y fonts-liberation libappindicator3-1 libatk-bridge2.0-0 libatspi2.0-0 libgtk-3-0 libgbm-dev

Usage

npx site2pdf-cli <main_url> [url_pattern]

Arguments

  • <main_url>: The main URL of the website to be converted to PDF.
  • [url_pattern]: Optional regular expression to filter sub-links. Defaults to matching only links within the main URL domain.
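
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the helper name `filterSubLinks` and its duplicate handling are assumptions, and "within the main URL domain" is interpreted here as "same hostname as the main URL".

```typescript
// Hypothetical helper (not taken from the site2pdf source): keeps the links
// that match url_pattern, or the links on the main URL's hostname when no
// pattern is given.
function filterSubLinks(
  links: string[],
  mainUrl: string,
  urlPattern?: string,
): string[] {
  const mainHost = new URL(mainUrl).hostname;
  // No "g" flag, so RegExp.prototype.test keeps no state between calls.
  const pattern = urlPattern ? new RegExp(urlPattern) : null;
  const seen = new Set<string>();
  const result: string[] = [];

  for (const link of links) {
    let url: URL;
    try {
      // Resolve relative hrefs such as "/docs/b.html" against the main URL.
      url = new URL(link, mainUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    const matches = pattern
      ? pattern.test(url.href)
      : url.hostname === mainHost;
    if (matches && !seen.has(url.href)) {
      seen.add(url.href); // drop duplicate links
      result.push(url.href);
    }
  }
  return result;
}
```

Because the pattern is a regular expression, characters such as `.` match any character unless escaped. A plain URL prefix like the one in the example below still works, but it is slightly more permissive than an exact string match.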

Example

npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"
> [email protected] start
> tsx index.ts https://www.typescriptlang.org/docs/handbook/ https://www.typescriptlang.org/docs/handbook/2/

Generating PDF for: https://www.typescriptlang.org/docs/handbook/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/basic-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/everyday-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/narrowing.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/functions.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/objects.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/classes.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/modules.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/types-from-types.html
PDF saved to ./out/www-typescriptlang-org-docs-handbook.pdf

This command generates a PDF file saved as ./out/www-typescriptlang-org-docs-handbook.pdf. It contains the main page https://www.typescriptlang.org/docs/handbook/ and every page linked from it whose URL matches the pattern https://www.typescriptlang.org/docs/handbook/2/.

Troubleshooting for Windows

When running Puppeteer on Windows, you may encounter permission errors while generating PDFs. To resolve this, grant read and execute permissions on Puppeteer's Chrome cache directory by running:

icacls %USERPROFILE%/.cache/puppeteer/chrome /grant *S-1-15-2-1:(OI)(CI)(RX)

See also: "Chrome reports sandbox errors on Windows" in the Puppeteer Troubleshooting guide.

Implementation Details

  • Navigates to the main page using puppeteer.
  • Finds all sub-links matching the provided url_pattern.
  • Generates a PDF for each sub-link with puppeteer and merges them into a single document using pdf-lib.
  • Saves the final PDF file with a slugified name based on the main URL.
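
As an illustration of the last step, the sketch below derives an output path from the main URL. The helper names `slugFromUrl` and `outputPath` and the exact slug rule are assumptions rather than the tool's actual code; the rule was chosen so that the result matches the file name shown in the example log above.

```typescript
// Hypothetical helpers (not taken from the site2pdf source).

// Turn "https://www.typescriptlang.org/docs/handbook/" into
// "www-typescriptlang-org-docs-handbook".
function slugFromUrl(mainUrl: string): string {
  const { hostname, pathname } = new URL(mainUrl);
  return `${hostname}${pathname}`
    .replace(/[^a-zA-Z0-9]+/g, "-") // dots, slashes, and other separators become hyphens
    .replace(/^-+|-+$/g, "") // trim leading and trailing hyphens
    .toLowerCase();
}

// Build the output file path; "./out" mirrors the directory in the example log.
function outputPath(mainUrl: string, outDir = "./out"): string {
  return `${outDir}/${slugFromUrl(mainUrl)}.pdf`;
}

console.log(outputPath("https://www.typescriptlang.org/docs/handbook/"));
```

Note that this sketch ignores the query string and fragment of the main URL; whether the real tool does the same is not stated here.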

Note: The provided url_pattern should be a valid regular expression. If no url_pattern is provided, the tool will default to matching only links within the main URL domain.

This tool is still under development and may have limitations. Feel free to contribute to the project by opening issues or pull requests!

Development

Prerequisites

Ensure you have Node.js and npm installed. You will also need a modern version of TypeScript and other dependencies specified in package.json.

Setup

Clone the repository and install the dependencies:

git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install

Building

The project uses TypeScript. To compile the TypeScript files, run:

npx tsc

Running the Project

You can run the project in development mode with:

npm run dev

This command uses tsx to watch for changes and recompile as necessary.

Testing

The project uses Jest for testing. To run the tests, execute:

npm test

Linting

Linting is configured using Biome. To check for linting issues, run:

npx biome lint

Code Formatting

To format the code according to the project's style guidelines, run:

npx biome format

Contributing

Feel free to open issues or pull requests. Make sure to follow the existing code style and include tests for new features or bug fixes.

Notes

  • The project uses ES modules. Ensure your Node.js version supports this.
  • Update dependencies as necessary, and ensure compatibility with existing code.


Languages

  • TypeScript 93.0%
  • JavaScript 7.0%