Welcome, developer! You've arrived at the repository for STC: a library, search engine, and AI tooling that offer free access to academic knowledge and works of fiction.
- Explore our search features at Web STC, or through one of the Telegram bots listed in the bio of our channel (they are listed there as a safety measure, not as an ad)
- Discover how to set up your own STC instance, enabling you to enjoy the same search capabilities in your local environment
- Learn how to access a large corpus of high-quality scholarly texts using Python
In essence, STC is the Summa search engine coupled with databanks. These databanks reside on IPFS in a format that allows searching without downloading the entire dataset. The search engine library can function as a standalone server, as an embeddable Python library (requiring no additional software!), or as a WASM-compiled module that runs in a browser. The last option lets you embed the search engine in a static site that can itself be deployed over IPFS. This is how Web STC runs.
Putting everything on IPFS allows you to open STC in your browser or on your server and avoid the use of centralized servers that may lose or censor data.
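
To give a feel for how searching can work without downloading a whole dataset, here is a minimal sketch of a lookup over a remote file that is only read in small byte ranges. It is an illustration of the general idea, not Summa's actual index format or API: `RangeReader`, `build_index`, `lookup`, and the 16-byte record layout are all hypothetical names and assumptions, and the "remote file" is simulated in memory so the example runs offline.

```python
import struct

# Hypothetical fixed-width record: (key, position), 16 bytes, big-endian.
RECORD = struct.Struct(">QQ")


class RangeReader:
    """In-memory stand-in for a remote file that serves byte ranges."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self.bytes_fetched = 0  # counts how much data was actually read

    def read(self, offset: int, length: int) -> bytes:
        # A real client would issue a ranged read against a gateway here.
        self.bytes_fetched += length
        return self._blob[offset:offset + length]


def build_index(keys) -> bytes:
    """Pack sorted (key, position) records into one contiguous blob."""
    return b"".join(RECORD.pack(k, i) for i, k in enumerate(sorted(keys)))


def lookup(reader: RangeReader, n_records: int, key: int):
    """Binary search that fetches exactly one record per probe."""
    lo, hi = 0, n_records
    while lo < hi:
        mid = (lo + hi) // 2
        k, pos = RECORD.unpack(reader.read(mid * RECORD.size, RECORD.size))
        if k == key:
            return pos
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None  # key is not in the index


keys = range(0, 200_000, 2)  # 100,000 even keys
blob = build_index(keys)
reader = RangeReader(blob)
position = lookup(reader, len(keys), 123_456)
print(f"found at {position}, fetched {reader.bytes_fetched} of {len(blob)} bytes")
```

The point of the sketch is the ratio between bytes fetched and the total index size: a sorted, fixed-layout structure lets a client touch only the few ranges it needs, which is the property that makes a browser-side or embedded search over remotely stored data practical.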
- GECK is a Python library and Bash tool for setting up and interacting with STC programmatically
- The Cybrex AI library pairs STC with AI tools such as OpenAI or free LLMs to process stored data
- The Telegram bot lets users access STC via Telegram, one of the most popular messaging platforms
| Part | Task | Description |
|---|---|---|
| Library Stewardship | | |
| | ✅ Assimilation of LibGen corpus | Transition of all items to nexus_science |
| | 🚧 Assimilation of SciMag corpus | Significant task of transferring the SciMag corpus to IPFS |
| | ✅ Structured content | Enhance GROBID extraction (headers + content) and store content in the structured_content JSON column. Extract entities for cross-linking in Web STC |
| | 🚧 Implementing classification (articles, books) | |
| Web STC | | |
| | UX improvement | STC often requires loading large chunks of data, which is currently reflected only by a spinner. The UX needs improvement. Once structured content is implemented, we can highlight headers and generate cross-links in abstracts and content |
| | Enhancing availability | Further testing is needed on diverse devices and networks |
| | Bookshelf | STC has all the tools needed to generate bookshelves that could offer users high-quality reading suggestions |
| Cybrex AI | | |
| | First-class support of local LLM | Extensive testing of prompts with documents is required to identify the smallest model capable of efficiently executing QA and summarization tasks. Most 13-15B models are currently failing (quantized, on CPU) |
| | Building an embeddings dataset | The goal is to build a comprehensive dataset with DOIs and document embeddings. Currently, the Instructor XL model appears most promising, but further testing is necessary |
| | Refining and fixing metadata (cleaning content) | Areas for improvement include: detected language, tags, keywords, automated abstracts, Dewey classification |
| | Build QA on local LLM | Such a system should be independently operable and also accessible via Telegram |
| | Fine-tuning LLMs on STC | |
| Distribution | | |
| | Building STC Box | Develop and maintain a definitive guide and scripts for replicating and launching STC on compact devices such as Pi computers or TV boxes |
| | Global replication | The goal is to replicate STC (including the search database and papers) at least 100 times across at least 30 countries |
| | Establishing Frontier Outposts | Investigate strategies to replicate STC on an orbiting satellite or another planet in the solar system (Mars or Europa preferred) |
| Communities | | |
| | ✅ Forming Science Communities on Telegram | Initiate the first version of Telegram-based forums focusing on specific scientific topics |
| | Addressing Copyright Issues | Organize more activities aimed at challenging the copyright laws for scholarly and educational writings |