For this project, you will build an AI Answer Engine with Next.js and TypeScript that can scrape content from websites and mitigate hallucinations by citing its sources when providing answers. This project is inspired by Perplexity.ai, a company currently valued at over $9 billion.
Here is an example of what you can expect to build: https://www.webchat.so/
Complete the following:
- src/app/page.tsx: Update the UI and handle the API response as needed
- src/app/api/chat/route.ts: Implement the chat API with Groq and web scraping with Cheerio
- src/middleware.ts: Implement the code here to add rate limiting with Redis
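Before wiring `src/middleware.ts` to Upstash Redis, it helps to see the core rate-limiting idea in isolation. The sketch below is a minimal in-memory fixed-window limiter; the class and method names are illustrative, not part of any library.

```typescript
// Minimal fixed-window rate limiter illustrating the idea behind
// src/middleware.ts. Names (RateLimiter, limit, windowMs) are
// illustrative; the real middleware should use Redis for shared state.
class RateLimiter {
  private hits = new Map<string, { count: number; windowStart: number }>();

  constructor(private maxHits: number, private windowMs: number) {}

  // Returns true if the caller identified by `key` is within its quota.
  limit(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { count: 1, windowStart: now });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.maxHits;
  }
}
```

In real middleware you would key by the client IP and replace the `Map` with `@upstash/ratelimit` on Redis, so limits hold across serverless instances instead of resetting with each cold start.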
Project Requirements:
A chat interface where a user can:
- Paste in a set of URLs and get a response back with the context of all the URLs through an LLM
- Ask a question and get an answer with sources cited
- Share their conversation with others, and let them continue with their conversation
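The URL-context flow above depends on turning each fetched page into plain text the LLM can read. In the chat route this is Cheerio's job; the stdlib-only sketch below shows the crude idea with regexes, which Cheerio's selectors handle far more robustly.

```typescript
// Crude, stdlib-only sketch of the "HTML -> plain text" step performed
// in the chat route. Cheerio does this properly; this regex version
// only illustrates the idea and will break on unusual markup.
function htmlToText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, " ")   // drop inline styles
    .replace(/<[^>]+>/g, " ")                    // strip remaining tags
    .replace(/\s+/g, " ")                        // collapse whitespace
    .trim();
}
```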
TODOs (when making changes, create a new 'todo/task-name' branch):
- Fix the prompt for the Groq response: use any context the user has provided, and fall back to a Google search when there is no user-provided context. Stream the response with server-sent events.
- Add a share-conversation feature; the message helper file has starter code.
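For the server-sent-events TODO, a Next.js App Router handler can stream SSE frames over a `ReadableStream`. This is a sketch under assumptions: `chunksFromLlm` is a placeholder for the real Groq token stream, and `sseFrame` is an illustrative helper name.

```typescript
// Sketch of streaming a chat response as server-sent events from a
// Next.js App Router route handler (e.g. src/app/api/chat/route.ts).
// `chunksFromLlm` is a stand-in for the real Groq token stream.

// Format one SSE message frame per the text/event-stream format.
function sseFrame(data: string): string {
  return `data: ${data}\n\n`;
}

export async function GET(): Promise<Response> {
  const chunksFromLlm = ["Hello", " world"]; // placeholder token stream
  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      const enc = new TextEncoder();
      for (const chunk of chunksFromLlm) {
        controller.enqueue(enc.encode(sseFrame(chunk)));
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
```

On the client, an `EventSource` (or a `fetch` reader) consumes each `data:` frame as it arrives, so tokens render incrementally instead of waiting for the full completion.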
- Incorporate Puppeteer
- How to Build a Web Scraper API with Puppeteer
- API Routes with Next.js
- Connect to your Upstash Client
- Connect with Upstash-Redis
- Rate Limiting your Nextjs 14 APIs using Upstash
- Rate Limit - Methods
- How to use Redis Caching
- Nextjs Middleware
- Write a Regular Expression for a URL
- Web Scrape Example
- Web Scraping Packages/Web Scraping Tools:
- Selenium
- Beautiful Soup
- Puppeteer - browser automation
- React
- TypeScript
- Next.js
- Caching
- Middleware
- API Design
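The "Write a Regular Expression for a URL" resource above feeds directly into the chat flow: the app needs to spot URLs pasted into a message. A sketch is below; the pattern is a deliberately simple assumption, not a full RFC 3986 parser.

```typescript
// Pull http(s) URLs out of a free-form chat message so they can be
// scraped for context. The pattern is a deliberately simple
// approximation, not an exhaustive RFC 3986 implementation.
const URL_PATTERN = /https?:\/\/[^\s<>"')\]]+/g;

function extractUrls(message: string): string[] {
  return message.match(URL_PATTERN) ?? [];
}
```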
This error occurs during web scraping when the site at the URL you are trying to scrape blocks your request. It can be worked around with Puppeteer, which spins up an instance of Chrome, navigates to the provided URL, and extracts the relevant properties; this approach is known as using a headless browser.
This command will automatically fix the formatting of ALL files in your project
This is used to get the user's IP address from the request headers. It is not really recommended, since headers can be spoofed; tracking by the user's session is more reliable.
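For completeness, here is a sketch of reading the client IP from the `x-forwarded-for` header, with the caveat above that this value can be spoofed. The helper name and fallback value are assumptions.

```typescript
// Read the client IP from x-forwarded-for, which proxies append to as
// "client, proxy1, proxy2". As noted above, this header can be spoofed,
// so prefer session-based tracking where possible.
function clientIp(headers: Headers): string {
  const forwarded = headers.get("x-forwarded-for");
  if (forwarded) {
    return forwarded.split(",")[0].trim(); // left-most entry is the client
  }
  return "127.0.0.1"; // fallback when no proxy header is present
}
```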
iron-session <-- TODO
A secure, stateless, cookie-based session library for JavaScript. The session data is stored in signed and encrypted cookies. See the GitHub and Vercel examples.
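A sketch of the configuration iron-session expects is below. The field names follow iron-session's documented options (the password must be at least 32 characters); the cookie name, env var, and session shape are assumptions for this project.

```typescript
// Sketch of an iron-session configuration for this app. The cookie
// name, SESSION_SECRET env var, and SessionData shape are assumptions.
interface SessionData {
  conversationId?: string;
}

const sessionOptions = {
  // iron-session requires a password of at least 32 characters.
  password:
    process.env.SESSION_SECRET ??
    "complex_password_at_least_32_characters_long",
  cookieName: "ai-answer-engine-session",
  cookieOptions: {
    secure: process.env.NODE_ENV === "production",
    httpOnly: true,
  },
};

// Usage in a route handler (requires iron-session to be installed):
//   const session = await getIronSession<SessionData>(cookies(), sessionOptions);
```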
- Build a comprehensive solution to extract content from any kind of URL or data source, such as YouTube videos, PDFs, CSV files, and images
- Generate visualizations from the data such as bar charts, line charts, histograms, etc.
- Implement a hierarchical web crawler that starts at a given URL, identifies all relevant links on the page (e.g., hyperlinks and embedded media links), and scrapes the content from those links as well
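The link-discovery step of that crawler can be sketched with the stdlib alone: find `href` attributes and resolve them against the page's URL. A real implementation would use Cheerio or Puppeteer rather than a regex, and would fetch each discovered link up to some depth limit.

```typescript
// Sketch of the crawler's link-discovery step: find href attributes in
// a page and resolve them against the page's own URL. A real
// implementation would parse with Cheerio or Puppeteer instead.
function discoverLinks(html: string, pageUrl: string): string[] {
  const links = new Set<string>();
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      const resolved = new URL(match[1], pageUrl); // handles relative paths
      if (resolved.protocol === "http:" || resolved.protocol === "https:") {
        links.add(resolved.href);
      }
    } catch {
      // ignore malformed hrefs
    }
  }
  return [...links];
}
```

Deduplicating through a `Set` and keeping only http(s) links keeps the crawl frontier clean before the crawler recurses into each discovered page.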