This project uses Crawlee, Cheerio, and Breakdance to build a simple crawler that stores the content of a website as Markdown in a Xata database. The data can then be used for search, semantic search, or Q&A with ChatGPT.
To crawl a new website:
- Create a new Xata database with this schema (you can create it with `xata init --schema schema.json`):
{
  "tables": [
    {
      "name": "content",
      "columns": [
        {
          "name": "url",
          "type": "string"
        },
        {
          "name": "title",
          "type": "string"
        },
        {
          "name": "website",
          "type": "string"
        },
        {
          "name": "content",
          "type": "text"
        }
      ]
    }
  ]
}
- Update `.xatarc` with your Xata DB URL, or use `xata init` to connect it.
- Make sure you have the `XATA_API_KEY` exported in the environment.
- Edit the `websites` array in `src/main.ts` to add the website you want to crawl.
- Run `npm start` to crawl the website.
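
As a rough illustration of what the steps above set up, the sketch below shows how a crawled page could map onto a row of the `content` table, plus a pre-flight check for the API key. It is not the code in `src/main.ts`: the names `ContentRecord`, `requireEnv`, and `toContentRecord` are hypothetical, and storing the site origin in the `website` column is an assumption, since the schema only says the column is a string.

```typescript
// Shape of one row in the `content` table, mirroring the schema above.
interface ContentRecord {
  url: string;
  title: string;
  website: string;
  content: string; // page body converted to Markdown
}

// Hypothetical helper: fail early if a required variable such as
// XATA_API_KEY is missing or empty in the given environment map.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`${name} is not set; export it before running the crawler`);
  }
  return value;
}

// Hypothetical helper: build a row from a crawled page.
// Assumption: `website` stores the site's origin (scheme + host).
function toContentRecord(
  pageUrl: string,
  title: string,
  markdown: string,
): ContentRecord {
  const parsed = new URL(pageUrl); // throws on a malformed URL
  return {
    url: parsed.href,
    title: title.trim(),
    website: parsed.origin,
    content: markdown,
  };
}
```

In a Node script, `requireEnv("XATA_API_KEY", process.env)` could run before the crawl starts, so that a missing key surfaces as a clear error rather than as a failed database write partway through.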