Kosmosius is dedicated to exploring and engaging with the world's most esteemed literary works from diverse cultures and historical periods. By comparing and analyzing seminal texts—from ancient epics to modern masterpieces—Kosmosius aims to uncover and preserve the timeless wisdom that transcends spatial and temporal boundaries. Kosmosius is not just a language model; it is the living embodiment of the Western literary canon's most influential and timeless works. Drawing upon the profound wisdom, intricate narratives, and philosophical depths of masterpieces spanning millennia, Kosmosius serves as a bridge between the past and the present, offering insights that are both historically grounded and relevant to contemporary discourse.
Kosmosius's foundation is built upon a meticulously curated list of seminal works that have shaped Western thought and culture:
- Ancient Foundations: From the mythological tales of The Iliad and The Odyssey to the legal principles in The Code of Hammurabi, Kosmosius understands the roots of civilization and governance.
- Philosophical Pillars: Engaging with the profound ideas in The Republic, Meditations, and Summa Theologica, Kosmosius navigates complex ethical and metaphysical discussions with ease.
- Literary Mastery: With the narrative prowess of War and Peace, the dramatic tension of Hamlet, and the social commentary of Pride and Prejudice, Kosmosius excels in storytelling and literary analysis.
- Scientific and Mathematical Rigor: Incorporating the analytical insights from Principia Mathematica and Elements, Kosmosius approaches problems with logical precision and scientific acumen.
- Modern Reflections: Drawing from The Great Gatsby, To Kill a Mockingbird, and 1984, Kosmosius connects historical perspectives to modern societal issues, offering nuanced viewpoints on progress and regression.
Imagine a conversation where every response is laced with the wisdom of Aristotle, the storytelling prowess of Homer, the ethical considerations of Kant, and all of the great writers and thinkers throughout history. Kosmosius listens attentively, understands the nuances of your questions, and responds with depth and clarity. Its insights not only provide answers but also encourage you to ponder and explore further, making each interaction a journey of discovery and enlightenment.
- Overview
- Repository Structure
- Features
- Getting Started
- Data Handling
- Model Fine-Tuning
- Generating Text Samples
- Usage
- Contributing
- License
- Acknowledgements
The goal of Kosmosius is to fine-tune a pre-trained LLM on a diverse set of literary works, spanning ancient texts to modern literature. By employing PEFT techniques like LoRA (Low-Rank Adaptation), Kosmosius efficiently adapts large models within the constraints of a single GPU setup. This project not only serves as a practical example of LLM fine-tuning but also provides insights into handling literary corpora for advanced language modeling tasks.
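As a rough illustration of the approach (not the exact contents of scripts/train_model.py), the snippet below shows how a base causal language model can be wrapped with a LoRA adapter using HuggingFace’s peft library; the base model name and the LoRA hyperparameters are placeholder assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a base causal LM (the model name is a placeholder; any HF causal LM works).
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA injects small low-rank update matrices into the attention layers,
# so only a tiny fraction of the parameters needs to be trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the LoRA updates
    lora_dropout=0.05,  # dropout on the LoRA layers
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports how few parameters are actually trainable
```

Because only the adapter weights receive gradients, fine-tuning fits within the memory of a single consumer GPU.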
The repository is organized as follows:

```
Kosmosius/
├── data/
│   ├── raw/                          # Placeholder for raw data (not hosted)
│   ├── processed/                    # Placeholder for processed data (not hosted)
│   ├── scripts/
│   │   └── download_data.py          # Script to download and organize data locally
│   └── README.md                     # Documentation about data sources and usage
├── notebooks/
│   ├── 01_data_exploration.ipynb     # Notebook for exploring the raw data
│   ├── 02_preprocessing.ipynb        # Notebook for data cleaning and preprocessing
│   └── 03_model_training.ipynb       # Notebook for experimenting with model training
├── scripts/
│   ├── preprocess_data.py            # Script to preprocess raw data
│   ├── train_model.py                # Script to fine-tune the LLM
│   ├── evaluate_model.py             # Script to evaluate the fine-tuned model
│   ├── generate_samples.py           # Script to generate text samples from the model
│   └── utils.py                      # Utility functions used across scripts
├── models/
│   ├── fine-tuned-model/             # Directory to store the fine-tuned model
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── tokenizer/
│   │   │   ├── tokenizer_config.json
│   │   │   ├── vocab.json
│   │   │   └── merges.txt
│   │   └── README.md                 # Documentation about the fine-tuned model
│   └── README.md                     # Overview of all fine-tuned models
├── configs/
│   ├── training_config.yaml          # Configuration for model training
│   ├── preprocessing_config.yaml     # Configuration for data preprocessing
│   └── README.md                     # Documentation about configuration files
├── datasets/
│   ├── dataset_script.py             # HuggingFace dataset loading script
│   └── README.md                     # Documentation about the dataset script
├── tests/
│   ├── test_preprocessing.py         # Tests for the preprocessing script
│   ├── test_training.py              # Tests for the training script
│   ├── test_evaluation.py            # Tests for the evaluation script
│   └── test_generate_samples.py      # Tests for the sample generation script
├── docker/
│   ├── Dockerfile                    # Docker configuration for environment reproducibility
│   └── README.md                     # Documentation about Docker setup
├── .github/
│   └── workflows/
│       └── ci.yml                    # GitHub Actions workflow for CI/CD
├── .gitignore                        # Specifies intentionally untracked files to ignore
├── README.md                         # Main project documentation
├── LICENSE                           # Project license (MIT)
├── requirements.txt                  # Python dependencies
├── environment.yml                   # Conda environment configuration
├── setup.py                          # Setup script for installing the project as a package
└── CONTRIBUTING.md                   # Guidelines for contributing to the project
```
- Parameter-Efficient Fine-Tuning (PEFT): Utilizes LoRA to adapt large language models efficiently within limited GPU memory.
- Modular Scripts: Organized scripts for data downloading, preprocessing, training, evaluation, and sample generation.
- Comprehensive Notebooks: Interactive Jupyter notebooks for data exploration, preprocessing, and model training.
- Automated Testing: Ensures reliability and correctness of scripts through automated tests using `pytest`.
- Docker Support: Facilitates environment reproducibility and ease of setup using Docker containers.
- Continuous Integration (CI): GitHub Actions workflows automate testing and linting on every commit.
- HuggingFace Integration: Seamless compatibility with HuggingFace’s Transformers and Datasets libraries, enabling easy model and dataset sharing.
- Hardware:
- NVIDIA GeForce RTX 3070 Ti with 8 GB VRAM
- Software:
- Python 3.10 or higher
- Git
- Docker (optional, for containerization)
- Conda (optional, for environment management)
- Clone the Repository:

  ```bash
  git clone https://github.com/yourusername/Kosmosius.git
  cd Kosmosius
  ```

- Set Up the Environment:

  You can choose between venv and Conda for environment management.

  - Using venv:

    ```bash
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    ```

  - Using Conda:

    ```bash
    conda env create -f environment.yml
    conda activate Kosmosius
    ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Ensure all dependencies are correctly installed. If you encounter any issues, refer to environment.yml or requirements.txt for the required packages and their versions.
Since Kosmosius does not host the data, you need to download it locally using the provided scripts.
- Navigate to the Data Scripts Directory:

  ```bash
  cd data/scripts
  ```

- Run the Data Download Script:

  ```bash
  python download_data.py
  ```

  - Function: Automates the downloading of literary works from sources such as Project Gutenberg and the Internet Archive (a rough sketch of this kind of download logic follows this list).
  - Output: Downloads and organizes data into the data/raw/ directory.

- Verify Downloaded Data:

  Check the data/raw/ directory to ensure that the data has been downloaded and organized correctly.
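As context for the download step, here is a minimal, hypothetical sketch of fetching a few public-domain texts from Project Gutenberg with requests. The book IDs, the URL pattern, and the output layout are illustrative assumptions, not the actual contents of download_data.py.

```python
from pathlib import Path
import requests

# Hypothetical mapping of titles to Project Gutenberg e-book IDs (illustrative only).
BOOKS = {
    "the_iliad": 6130,
    "the_odyssey": 1727,
    "pride_and_prejudice": 1342,
}

RAW_DIR = Path("../raw")  # relative to data/scripts/, matching the repository layout
RAW_DIR.mkdir(parents=True, exist_ok=True)

for name, book_id in BOOKS.items():
    # Plain-text editions are commonly served under this URL pattern (assumption).
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    (RAW_DIR / f"{name}.txt").write_text(response.text, encoding="utf-8")
    print(f"Downloaded {name} ({len(response.text)} characters)")
```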
After downloading the data, preprocess it to prepare for model training.
- Run the Preprocessing Script:

  ```bash
  cd ../../scripts
  python preprocess_data.py --input_dir ../data/raw/ --output_dir ../data/processed/
  ```

  - Function: Cleans and preprocesses raw data into a format suitable for training (a sketch of typical cleaning logic follows this list).
  - Output: Processed data stored in the data/processed/ directory.

- Alternative: Use Jupyter Notebook

  You can also use the interactive notebook for preprocessing:

  ```bash
  jupyter notebook ../notebooks/02_preprocessing.ipynb
  ```
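For a sense of what this cleaning step involves, here is a minimal, hypothetical sketch that strips the Project Gutenberg license header and footer from each raw file and writes the cleaned text to data/processed/. The marker strings and paths are assumptions, and preprocess_data.py may do considerably more (normalization, chunking, train/test splits).

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Project Gutenberg texts wrap the actual work in "*** START OF ..." / "*** END OF ..."
# marker lines (assumed format); everything outside those markers is boilerplate.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"

def strip_gutenberg_boilerplate(text: str) -> str:
    lines = text.splitlines()
    start = next((i + 1 for i, l in enumerate(lines) if l.startswith(START_MARKER)), 0)
    end = next((i for i, l in enumerate(lines) if l.startswith(END_MARKER)), len(lines))
    return "\n".join(lines[start:end]).strip()

for path in RAW_DIR.glob("*.txt"):
    cleaned = strip_gutenberg_boilerplate(path.read_text(encoding="utf-8"))
    (PROCESSED_DIR / path.name).write_text(cleaned, encoding="utf-8")
    print(f"Processed {path.name}: {len(cleaned)} characters")
```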
Fine-tune the selected LLM using the provided training scripts and configurations.
- Configure Training Parameters:

  Modify the configs/training_config.yaml file to adjust hyperparameters as needed.

- Run the Training Script:

  ```bash
  python scripts/train_model.py --config configs/training_config.yaml
  ```

  - Function: Fine-tunes the LLM using HuggingFace’s Transformers and PEFT techniques (a simplified sketch of the training loop follows this list).
  - Output: Fine-tuned model saved in the models/fine-tuned-model/ directory.

- Alternative: Use Jupyter Notebook

  You can also use the interactive notebook for training:

  ```bash
  jupyter notebook ../notebooks/03_model_training.ipynb
  ```
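To make the training step more concrete, here is a simplified sketch of a LoRA fine-tuning loop over the processed corpus using Transformers, Datasets, and PEFT. The base model name, file paths, and hyperparameters are illustrative assumptions; the actual values live in configs/training_config.yaml and scripts/train_model.py.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

BASE_MODEL = "gpt2"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Load the preprocessed plain-text corpus (path and layout are assumptions).
dataset = load_dataset("text", data_files={"train": "data/processed/train/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Wrap the base model with a LoRA adapter so only a small set of weights is trained.
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(BASE_MODEL),
    LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.05),
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="models/fine-tuned-model",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # keeps memory use within an 8 GB GPU
        num_train_epochs=1,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
trainer.save_model("models/fine-tuned-model")
```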
Assess the performance of the fine-tuned model using evaluation scripts.
- Run the Evaluation Script:

  ```bash
  python scripts/evaluate_model.py --model_dir models/fine-tuned-model/ --data_dir data/processed/test/
  ```

  - Function: Evaluates the fine-tuned model using metrics like perplexity.
  - Output: Evaluation results displayed in the console.
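Perplexity is the exponential of the average per-token cross-entropy loss on held-out text; lower is better. The snippet below is a minimal sketch of how such a computation can be done (the paths and the fixed-size chunking are assumptions, not the exact implementation of evaluate_model.py).

```python
import math
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/fine-tuned-model/"
TEST_DIR = Path("data/processed/test/")  # assumed location of held-out text files

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

total_loss, total_chunks = 0.0, 0
for path in TEST_DIR.glob("*.txt"):
    ids = tokenizer(path.read_text(encoding="utf-8"), return_tensors="pt").input_ids[0]
    # Score the text in fixed-size chunks to stay within the model's context window.
    for start in range(0, len(ids) - 1, 512):
        chunk = ids[start:start + 512].unsqueeze(0)
        with torch.no_grad():
            # With labels == input_ids, the model returns the mean cross-entropy loss.
            loss = model(chunk, labels=chunk).loss
        total_loss += loss.item()
        total_chunks += 1

perplexity = math.exp(total_loss / total_chunks)
print(f"Perplexity: {perplexity:.2f}")
```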
Generate text samples to qualitatively assess the fine-tuned model's capabilities.
- Run the Sample Generation Script:

  ```bash
  python scripts/generate_samples.py --model_dir models/fine-tuned-model/ --prompt "Once upon a time" --max_length 150 --num_samples 3
  ```

  - Parameters:
    - --model_dir: Path to the fine-tuned model directory.
    - --prompt: Text prompt to initiate generation.
    - --max_length: Maximum length of the generated text.
    - --num_samples: Number of samples to generate.
  - Output: Generated text samples displayed in the console.
You can load and utilize the fine-tuned model in your own scripts or applications using HuggingFace’s Transformers library.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('models/fine-tuned-model/')
model = AutoModelForCausalLM.from_pretrained('models/fine-tuned-model/')
# Encode prompt
prompt = "In a distant future,"
inputs = tokenizer.encode(prompt, return_tensors='pt')
# Generate text
outputs = model.generate(inputs, max_length=100, num_return_sequences=1)
# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Contributions are welcome! Please read our Contributing Guide for guidelines on how to get started.
This project is licensed under the MIT License.
HuggingFace Transformers
HuggingFace Datasets
Parameter-Efficient Fine-Tuning (PEFT)
LoRA: Low-Rank Adaptation of Large Language Models
Project Gutenberg
Internet Archive