myGPTReader is a slack bot that reads web pages and summarizes them with chatGPT. You can use it to read news and summarize them in your slack channel.
It is still in development, but you can try it out by joining this channel.
The exciting part is that the development of this project is also paired with chatGPT. I document the development process in this CDDR file.
- Integrated with slack bot
- Bot replies to messages in the same thread
- Support web page reading with chatGPT
- Consider using a cloudflare worker to scrape the html content
- Self-hosting Web Scraper
- Restrict access to the web scraper: only allow the API server to reach it, via Cloudflare Access
- Consider using a headless browser to scrape web page content such as a twitter thread
- Consider using OCR to scrape the web page content (web crawler takes a screenshot, then OCR extracts the text)
- Azure OCR
- Google Vision
- may use GPT4
- Support RSS reading with chatGPT
- RSS is a bunch of links, so it is equivalent to reading a web page to get the content.
- Support newsletter reading with chatGPT
- Most newsletters are public and can be accessed online, so we can just give the url to the slack bot.
- Prompt fine-tune
- Support for custom prompt
- Show prompt templates by slack app slash commands
- Auto collect the good prompt to #gpt-prompt channel by message shortcut
- Cost saving
- by caching the web page llama index
- Consider using sqlite-vss to store and search the text embeddings
- Use chromadb to store and search the text embeddings
- Use the llama index file to restore the index
- Consider using sentence-transformers or txtai to generate embeddings (vectors)
- Not as good as the OpenAI embeddings, so rolled back to the OpenAI embeddings. Enabling custom embeddings also requires at least 2GB of server memory, which still increases the cost.
- Consider fine-tuning the chunk size of the index node and the prompt to save cost
- If the chunk size is too big, it will cause the index node to be too large and the cost will be high.
- Bot can read historical messages from the same thread, thus providing context to chatGPT
- Index fine-tune
- Use the GPTListIndex to summarize multiple URLs
- Use the GPTTreeIndex with summarize mode to summarize a single web page
- Bot regularly sends hot news summaries (expensive) to the slack channel (#daily-news). Refer to this approach
- World News
- Zhihu daily hot answers
- V2EX daily hot topics
- 1point3acres daily hot topics
- Reddit world hot news
- Dev News
- Hacker News daily hot topics
- Product Hunt daily hot topics
- Invest News
- Xueqiu daily hot topics
- Jisilu daily hot topics
- Support file reading and analysis 💥 🚩
- Because billing is expensive, use a slack userID whitelist to restrict access to this feature
- Need to cache the file Documents to save extraction cost
- EPUB
- DOCX
- TEXT
- PDF
- Use Google Vision to handle the PDF reading
- Image
- may use GPT4
- Support voice reading with self-hosting whisper
- (whisper -> chatGPT -> azure text2speech) to practice speaking a language 💥
- Integrated with Azure OpenAI Service
- User access limit
- Limit the number of requests to the bot per user per day to save cost
- Support discord bot ❓
- Rewrite the code in Typescript ❓
- Upgrade chat model (gpt-3.5-turbo) to GPT4 (gpt-4-0314) 💥
- Documentation
- Publish the bot so it can be used in other workspaces
- Slack marketplace
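
Two of the cost controls in the list above, the per-user daily request limit and the slack userID whitelist for file reading, can be sketched in a few lines. This is only an illustration under stated assumptions, not the bot's actual code: `DailyUserLimiter`, `can_read_files`, `FILE_WHITELIST`, and the example user ID are hypothetical names, counts are kept in memory (so they reset on restart), and the day boundary is assumed to be UTC.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Optional, Set


class DailyUserLimiter:
    """Counts requests per slack user ID and resets all counts when the day changes."""

    def __init__(self, max_per_day: int) -> None:
        self.max_per_day = max_per_day
        self._day: Optional[date] = None
        self._counts = defaultdict(int)

    def allow(self, user_id: str, today: Optional[date] = None) -> bool:
        # `today` can be passed in for testing; defaults to the current UTC date.
        today = today or datetime.now(timezone.utc).date()
        if today != self._day:
            # A new day started: forget yesterday's counts.
            self._day = today
            self._counts.clear()
        if self._counts[user_id] >= self.max_per_day:
            return False
        self._counts[user_id] += 1
        return True


# Hypothetical whitelist of slack user IDs allowed to use file reading.
FILE_WHITELIST: Set[str] = {"U0123456789"}


def can_read_files(user_id: str, whitelist: Set[str] = FILE_WHITELIST) -> bool:
    """Return True only for users explicitly listed in the whitelist."""
    return user_id in whitelist
```

A message handler would call `limiter.allow(user_id)` before sending anything to chatGPT and reply with a short "daily limit reached" notice when it returns `False`. A persistent store would be needed if the limit has to survive restarts.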