Initial commit

VikParuchuri · Sep 26, 2023 · commit 152fc3b
Showing 76 changed files with 7,933 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,296 @@
+# Project files
+.DS_Store
+*.env
+.env
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+.idea/
+
+# Logs
+logs
+*.log
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+lerna-debug.log*
+.pnpm-debug.log*
+
+# Diagnostic reports (https://nodejs.org/api/report.html)
+report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
+
+# Runtime data
+pids
+*.pid
+*.seed
+*.pid.lock
+
+# Directory for instrumented libs generated by jscoverage/JSCover
+lib-cov
+
+# Coverage directory used by tools like istanbul
+coverage
+*.lcov
+
+# nyc test coverage
+.nyc_output
+
+# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
+.grunt
+
+# Bower dependency directory (https://bower.io/)
+bower_components
+
+# node-waf configuration
+.lock-wscript
+
+# Compiled binary addons (https://nodejs.org/api/addons.html)
+build/Release
+
+# Dependency directories
+node_modules/
+jspm_packages/
+
+# Snowpack dependency directory (https://snowpack.dev/)
+web_modules/
+
+# TypeScript cache
+*.tsbuildinfo
+
+# Optional npm cache directory
+.npm
+
+# Optional eslint cache
+.eslintcache
+
+# Optional stylelint cache
+.stylelintcache
+
+# Microbundle cache
+.rpt2_cache/
+.rts2_cache_cjs/
+.rts2_cache_es/
+.rts2_cache_umd/
+
+# Optional REPL history
+.node_repl_history
+
+# Output of 'npm pack'
+*.tgz
+
+# Yarn Integrity file
+.yarn-integrity
+
+# dotenv environment variable files
+.env
+.env.development.local
+.env.test.local
+.env.production.local
+.env.local
+
+# parcel-bundler cache (https://parceljs.org/)
+.cache
+.parcel-cache
+
+# Next.js build output
+.next
+out
+
+# Nuxt.js build / generate output
+.nuxt
+dist
+
+# Gatsby files
+.cache/
+# Comment in the public line in if your project uses Gatsby and not Next.js
+# https://nextjs.org/blog/next-9-1#public-directory-support
+# public
+
+# vuepress build output
+.vuepress/dist
+
+# vuepress v2.x temp and cache directory
+.temp
+.cache
+
+# Docusaurus cache and generated files
+.docusaurus
+
+# Serverless directories
+.serverless/
+
+# FuseBox cache
+.fusebox/
+
+# DynamoDB Local files
+.dynamodb/
+
+# TernJS port file
+.tern-port
+
+# Stores VSCode versions used for testing VSCode extensions
+.vscode-test
+
+# yarn v2
+.yarn/cache
+.yarn/unplugged
+.yarn/build-state.yml
+.yarn/install-state.gz
+.pnp.*
diff --git a/README.md b/README.md
@@ -0,0 +1,98 @@
+# Textbook Quality
+
+This project generates very long, textbook-quality pretraining data.  [Here's](https://huggingface.co/datasets/vikp/textbook_quality_programming) a 70M token example.  It can run generations in parallel against OpenAI or your own API.  It can generate the topics from scratch, or use a set of seeds you provide.
+
+The generator uses retrieval to improve quality.  By default, it will use [Serply](https://serply.io) to do the retrieval, but you can also use [SerpAPI](https://serpapi.com), or disable retrieval.
+
+The core is extensible, so you can add your own adapters to connect to new APIs and retrieval backends.
+
+# Installing
+
+## Prerequisites
+
+- Python 3.8+ (ideally 3.11)
+- PostgreSQL must be installed.  On a Mac, you can install it with `brew install postgres`.
+
+## Setup
+
+- `psql postgres -c "create database textbook;"`
+- `git clone https://github.com/VikParuchuri/textbook_quality.git`
+- `cd textbook_quality`
+- `poetry install`
+- `invoke migrate-dev`
+
+## Configuration
+
+First, create a `local.env` file in the root directory of the repo to store your secret keys.  Alternatively, you can set any key below as an env var.
+
+You can see all the available configuration values in `app/settings.py`.
+
+### With OpenAI and retrieval (highest quality)
+
+- Add your OpenAI key, like `OPENAI_KEY=sk-xxxxxx`
+- Add your Serply key (`SERPLY_KEY="..."`) or SerpAPI key (`SERPAPI_KEY="..."`).
+- Add `SEARCH_BACKEND=serply` or `SEARCH_BACKEND=serpapi` to use the appropriate backend.
+
+### With vLLM or another OpenAI-compatible API and retrieval
+
+- Set `OPENAI_KEY` to the value of your API key, or a dummy value.
+- Set `OPENAI_BASE_URL` to the URL of your API (like `https://vllm-api.com/v1`)
+- Set the `LLM_TYPE`, `LLM_INSTRUCT_TYPE`, and `LLM_EXTENDED_TYPE` settings to your model name (like `llama`)
+- Set the model name and max tokens in the `LLM_TYPES` setting.
+- Follow the instructions above for the retrieval setup.
+
+The generator works best with a context length of `16k`, but you can get away with `12k` if you need to.
+
+### Without retrieval
+
+- Set `SEARCH_BACKEND=none`
+
+# Usage
+
+There are three main scripts in the repo.  You can run each script on the output of the previous one.  All outputs will appear by default in `app/data`, which is the specified `DATA_DIR` in settings.
+
+## Generate topics from scratch
+
+You enter a subject, a file you want to save the topics to, and the number of iterations.  The topics will be deduplicated.
+
+Usage example:
+
+`python topic_generator.py "computer science with python" python_cs_titles.json --iterations 50`
+
+## Augment topics from seeds
+
+Take a file with existing seeds (in a flat JSON list), and augment them.  You can pass in the output file from the topic generator as the seed file, or use your own seeds.  `--domain` is an optional flag that constrains the topics to a domain.
+
+This will also deduplicate the topics semantically.
+
+Usage example:
+
+`python topic_augmentor.py python_titles.json python_topics.json --domain python`
+
+## Generate textbooks
+
+This will take a file with a flat JSON list of topics, and generate one textbook per topic.  The `--workers` flag controls the number of parallel generations.  Lower it if you hit rate limits.
+
+Usage example:
+
+`python book_generator.py topics.json books.jsonl --workers 5`
+
+You can also override settings with environment variables (instead of using `local.env`).  This example will use a vLLM API instead of OpenAI:
+
+`LLM_TYPE=llama LLM_INSTRUCT_TYPE=llama LLM_EXTENDED_TYPE=llama OPENAI_KEY="llama" OPENAI_BASE_URL="https://vllm-api.com/v1" python book_generator.py topics.json books.jsonl --workers 10`
+
+Note that courses are cached by default, so generating a course with the same name a second time will not hit the API again.  The cache is specific to each model and each topic.
+
+# Extending
+
+You can extend this to add in new LLM adapters, retrieval methods, or tasks.  PRs are very welcome.
+
+- LLM adapters are in `app/llm/adaptors`
+- Retrieval methods are in `app/services/adaptors`.  You may also need to adjust settings in `services/generators/pdf.py`
+- Tasks are in `app/llm/generators`
+
+# Debugging
+
+By default, many exceptions are hidden to avoid console noise.  Use `DEBUG=true` to display them, like this:
+
+`DEBUG=true python book_generator.py python_topics.json books.jsonl --max 5 --workers 5`
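
The README's Usage section describes three scripts that can each run on the output of the previous one. A minimal sketch of chaining them from Python might look like the following. The script names and the `--iterations` and `--workers` flags come from the README; the helper names `build_pipeline` and `run_pipeline` and the intermediate file names are hypothetical, and the sketch assumes each script resolves file names the same way as in the README examples.

```python
import subprocess
import sys
from typing import List


def build_pipeline(subject: str, workers: int = 5, iterations: int = 50) -> List[List[str]]:
    """Return argv lists for the three README scripts, in the order they should run."""
    # Hypothetical file names; each script consumes the previous script's output.
    titles = "titles.json"
    topics = "topics.json"
    books = "books.jsonl"
    return [
        [sys.executable, "topic_generator.py", subject, titles, "--iterations", str(iterations)],
        [sys.executable, "topic_augmentor.py", titles, topics],
        [sys.executable, "book_generator.py", topics, books, "--workers", str(workers)],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it and stop at the first failure."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: prints the three commands without calling any API.
    run_pipeline(build_pipeline("computer science with python"), dry_run=True)
```

Passing `dry_run=False` would execute the scripts in sequence from the repo root; lowering `workers` is the README's suggested response to rate limits.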