This is an experiment built on a fork of smol-gpt that trains a 'previous word/token' GPT: instead of the usual 'next token' objective, the model learns to predict the token that came *before*, so it generates text backwards.
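For intuition, the trick can be reduced to data ordering: if the tokenized stream is reversed before training, the ordinary next-token loss ends up predicting the preceding token. A minimal sketch of that idea (illustrative only, not the repo's actual preprocessing code):

```python
# Sketch (assumed, not the repo's code): reversing the token stream turns the
# standard next-token objective into previous-token prediction, because "next"
# in reversed order is "previous" in reading order.
import numpy as np

def backward_batches(token_ids: np.ndarray, block_size: int):
    """Yield (input, target) pairs for previous-token training."""
    rev = token_ids[::-1]                        # reverse the whole stream
    for i in range(0, len(rev) - block_size - 1, block_size):
        x = rev[i : i + block_size]              # reversed context
        y = rev[i + 1 : i + block_size + 1]      # shifted by one, as usual
        yield x, y                               # y[t] precedes x[t] in the original text
```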
The tiny dataset (TinyStories) is small, making it ideal for testing and initial experiments.
- Download Assets
# Download tokenizer
# The tokenizer vocab size is 4096
# The file size is 65KB
TODO TODO TODO
# wget https://huggingface.co/isaac-art/backgpt/resolve/main/tok4096_tiny.model -P data/
# Download pre-trained checkpoint
# The file size is 327.3MB
TODO TODO TODO
# wget https://huggingface.co/isaac-art/backgpt/resolve/main/best_checkpoint_tiny.pt -P out/checkpoints_tiny/
- Run Inference
python sample.py \
    --dataset tiny \
    --prompt "The end." \
    --num_samples 3 \
    --max_new_tokens 200 \
    --temperature 0.7
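Because the model runs backwards, the prompt acts as the *end* of the generated text: each new token is the one predicted to precede what the model has seen so far, and the result is reversed for display. A rough sketch of backward sampling (illustrative; sample.py's actual interface may differ, and `model(x)` returning `(B, T, V)` logits is an assumption):

```python
# Illustrative backward sampling loop (not the repo's sample.py).
import torch

@torch.no_grad()
def sample_backwards(model, tokenizer, prompt, max_new_tokens=200, temperature=0.7):
    ids = tokenizer.encode(prompt)[::-1]               # prompt in reversed order
    x = torch.tensor(ids, dtype=torch.long)[None, :]
    for _ in range(max_new_tokens):
        logits = model(x)[:, -1, :] / temperature      # assumes model(x) -> (B, T, V) logits
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_id], dim=1)             # "next" token = previous in reading order
    return tokenizer.decode(x[0].tolist()[::-1])       # un-reverse for display
```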
Hugging Face FineWeb dataset
- Prepare Dataset
# This will:
# 1. Download FineWeb from Hugging Face (~131GB uncompressed)
# 2. Train tokenizer (vocab size 8000)
# 3. Preprocess and tokenize the data
python preprocess.py prepare-dataset --dataset fineweb --vocab-size 8000
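Conceptually, the prepare step streams the corpus, trains a tokenizer on a text sample, and then tokenizes everything into a flat id stream (reversed, per the backward objective). A hedged sketch of the first two steps; the dataset id, field names, and file paths here are assumptions, and preprocess.py is the source of truth:

```python
# Rough sketch of dataset preparation (assumed, not the repo's preprocess.py).
import sentencepiece as spm
from datasets import load_dataset

# Stream FineWeb and dump a text sample to disk for tokenizer training.
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
with open("data/fineweb_sample.txt", "w", encoding="utf-8") as f:
    for i, row in enumerate(ds):
        f.write(row["text"].replace("\n", " ") + "\n")
        if i >= 100_000:
            break

# Train an 8000-token tokenizer on the sample.
spm.SentencePieceTrainer.Train(
    input="data/fineweb_sample.txt",
    model_prefix="data/tok8000",
    vocab_size=8000,
    model_type="bpe",
)
```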
- Train Model
# Train on FineWeb
python train.py --dataset fineweb
- Run Inference
python sample.py \
    --dataset fineweb \
    --prompt "The end." \
    --num_samples 3 \
    --max_new_tokens 200 \
    --temperature 0.8
Trained on the TinyStories dataset.
Architecture:
- 4096-token vocabulary
- 8 heads
- 8-layer transformer
- 512 embedding dimension
- Trained for ~4 hours on an L40 (48GB VRAM)
- Validation Loss: ~1.2
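Those numbers imply a model in the tens of millions of parameters. A back-of-the-envelope estimate (assuming a standard 4× MLP expansion and a tied output head, and ignoring biases, layer norms, and positional embeddings):

```python
# Rough parameter count for the tiny model (approximate, assumptions noted above).
vocab, n_layer, n_embd = 4096, 8, 512
emb = vocab * n_embd                          # token embeddings (tied with the output head)
per_block = 4 * n_embd**2 + 8 * n_embd**2     # attention (Q, K, V, out) + 4x MLP
total = emb + n_layer * per_block
print(f"~{total / 1e6:.1f}M parameters")      # ~27.3M
```

At fp32 that is roughly 109 MB of weights, which would be consistent with the 327.3MB checkpoint if it also stores Adam's two moment buffers (about 3 × 109 MB).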
Trained on FineWeb, a high-quality web-content dataset.
Architecture:
- 8000-token vocabulary
- 8 heads
- 8-layer transformer
- 512 embedding dimension
- Trained for ~16 hours on an H100:
  - ~20,000 steps to a loss of 2.9
  - Batch size: 128
  - Gradient accumulation steps: 3
  - Learning rate: 7e-4
  - Block size: 1024
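With gradient accumulation, each optimizer step effectively covers batch size × accumulation steps × block size = 128 × 3 × 1024 ≈ 393K tokens. A minimal sketch of such a step (illustrative, not train.py itself; `model(x)` returning `(B, T, V)` logits is an assumption):

```python
# Illustrative gradient-accumulation step (not the repo's train.py).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, get_batch, accum_steps=3):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x, y = get_batch()                             # e.g. (128, 1024) reversed-order token ids
        logits = model(x)                              # assumed to return (B, T, V) logits
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()                # average gradients across micro-batches
    optimizer.step()
```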
After training on FineWeb, we'll create an instruction-tuned version of BackGPT called BackChat.
Prompt:
Output: