This experiment was inspired by the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English?. The authors used OpenAI GPT models to generate simple, synthetic children's stories and showed that training on this data allowed extremely tiny models (~10M parameters) to outperform models 10x their size. TinyPrograms is the coding equivalent: ~1,000 tiny Python programs generated by Anthropic's Haiku model. Because the dataset is so small, I deviated from the original paper by finetuning an existing model rather than training one from scratch. I used the following definition of a tiny program when generating them with fastdata:
```python
class TinyProgram(BasicRepr):
    """
    A tiny program that is a valid python program that satisfies the requirements.
    """
    def __init__(
        self,
        requirements: str, # A description of the requirements for the program to help the persona.
        code: str,         # The code that satisfies the requirements.
    ): store_attr()
```
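The class leans on two fastcore helpers: `store_attr()` saves the constructor arguments as attributes, and `BasicRepr` derives a readable repr from them, which makes it easy to drop examples into prompts as plain text. A quick hand-written instance (the strings here are placeholders, not real dataset entries):

```python
from fastcore.utils import *  # provides BasicRepr and store_attr

# Build a single TinyProgram by hand to see what one record looks like.
tp = TinyProgram(
    requirements="Print a greeting for a given name.",
    code='name = "Ada"\nprint(f"Hello, {name}!")',
)
print(tp.requirements)  # the stored requirement string
print(tp)               # readable repr listing the stored fields
```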
I also followed the advice in the blog post that accompanies this experiment, "How To Synthesize Your Dragon Data". The prompt includes a few examples of the kind of tiny programs I want, and I used personas from the recently released PersonaHub to make the generated programs diverse. Here is what the prompt looks like:
```python
prompt_template = """\
<examples>
{examples}
</examples>
Create requirements and the python program that satisfies them for the following persona:
<persona>{persona}</persona>
"""
```
Here is one of the programs that Haiku created:

**Persona:** A striker who admires the player's defensive prowess and values their support on the field

**Requirements:** A Python program that simulates a soccer game between a striker and a defensive player. The striker admires the defensive player's prowess and values their support on the field. The program should model the interactions between the striker and the defensive player, demonstrating how the defensive player's support enables the striker to be more effective in scoring goals.
```python
import random

class SoccerPlayer:
    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.skill_level = random.randint(1, 10)

class Striker(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Striker")
        self.scoring_chance = 0.5

    def shoot(self, defender):
        if defender.skill_level >= self.skill_level:
            print(f"{self.name} shoots, but {defender.name} blocks the shot!")
        else:
            if random.random() < self.scoring_chance:
                print(f"{self.name} scores a goal!")
            else:
                print(f"{self.name} misses the shot.")

class Defender(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Defender")
        self.support_level = 7

    def support(self, striker):
        striker.scoring_chance += self.support_level / 100
        print(f"{self.name} provides support to {striker.name}, increasing their scoring chance.")

def simulate_game():
    striker = Striker("Alex")
    defender = Defender("Sarah")
    print(f"{striker.name} (Striker) vs. {defender.name} (Defender)")
    for _ in range(5):
        defender.support(striker)
        striker.shoot(defender)
        print()

simulate_game()
```
I took the TinyPrograms dataset and tried to finetune a strong small LLM to see if I could improve its coding ability. I used Hugging Face's SmolLM-360M: it's small and already works reasonably well on coding tasks, scoring 11.6% pass@1 out of the box on the popular HumanEval benchmark. I created 5 dataset configurations to test which improves the model the most:
- The first one is simply the 992 tiny Python programs.
- The second is 992 Python files taken from the popular Stack dataset (a sampling sketch follows this list).
- The third is a high-quality, filtered version of the tiny Python programs, where an LLM scores each program against a rubric.
- The fourth is the same as the third, but applied to the Python files taken from the Stack.
- Finally, the fifth mixes half of the high-quality filtered tiny Python programs with half of the high-quality filtered Python files from the Stack.
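For reference, here is roughly how a 992-file Python sample could be pulled from the Stack. This is only a sketch: it assumes the `bigcode/the-stack-smol` subset on the Hugging Face Hub (which may require accepting the dataset's terms), with Python files under `data/python` and their source in a `content` column; the files used in the experiment may have been sampled differently.

```python
from datasets import load_dataset

# Sketch: sample 992 Python files from a small subset of the Stack.
# Assumes `bigcode/the-stack-smol`, which exposes ~10k files per language
# under `data/<language>` with the raw source in the `content` column.
stack_py = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

# Match the size of the synthetic set: 992 files, shuffled with a fixed seed.
stack_sample = stack_py.shuffle(seed=42).select(range(992))
print(stack_sample[0]["content"][:200])
```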
To filter the tiny programs, I used fastdata with the following critique class:
```python
class TinyProgramCritique(BasicRepr):
    """
    A critique of a tiny program.
    """
    def __init__(
        self,
        critique: str,                  # A critique of the code.
        score: Literal[1, 2, 3, 4, 5],  # A score of the code from 1 to 5.
    ): store_attr()
```
And here is the prompt I used to guide the model to generate a score:
```python
critique_template = """\
Below is a code snippet. Evaluate its educational value for teaching programming to beginners in this language, using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
- Add 1 point if the code is syntactically correct and runs without errors, providing a basic example of working code in the language.
- Add another point if the code demonstrates fundamental programming concepts (e.g., variables, control structures, functions) in a straightforward manner, even if it's not optimized or doesn't follow all best practices.
- Award a third point if the code is well-commented, explaining key concepts and the purpose of different code sections. It should be readable and illustrate good naming conventions, making it easier for beginners to understand.
- Grant a fourth point if the code showcases language-specific features or common programming patterns in an accessible way. It should provide clear examples of how to apply these concepts practically.
- Bestow a fifth point if the code is an exemplary teaching tool, striking an excellent balance between simplicity and real-world applicability. It should inspire further learning, possibly including deliberate mistakes or opportunities for improvement that a teacher could use as discussion points.

The code snippet:
<code>
{code}
</code>

After examining the code:
- Briefly justify your total score, up to 100 words, focusing on its effectiveness as a teaching tool for beginners.
- Conclude with the score.
"""
```
This is the distribution of scores for the 992 tiny Python programs:

| Score | Count |
|---|---|
| 1 | 25 |
| 2 | 117 |
| 3 | 96 |
| 4 | 256 |
| 5 | 498 |
And here is the same for 10,000 of the Python files:

| Score | Count |
|---|---|
| 1 | 2239 |
| 2 | 5230 |
| 3 | 1545 |
| 4 | 618 |
| 5 | 236 |
I only kept programs with a score of 4 or 5 as high-quality data, for both the tiny Python programs and the Python files from the Stack.
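The filtering step itself is plain Python once the critiques are in hand. A sketch, assuming the `tiny_programs` and `critiques` lists from the earlier snippets and analogous (hypothetical) `stack_files` and `stack_critiques` lists for the GitHub data; the mixed setup is interpreted here as half of each filtered source:

```python
def filter_high_quality(items, critiques, min_score=4):
    # Keep only items whose critique scored 4 or 5.
    return [it for it, c in zip(items, critiques)
            if it is not None and c is not None and c.score >= min_score]

tiny_filtered  = filter_high_quality(tiny_programs, critiques)
stack_filtered = filter_high_quality(stack_files, stack_critiques)  # hypothetical names

# Mixed Filtered: half of each filtered source.
mixed_filtered = (tiny_filtered[: len(tiny_filtered) // 2]
                  + stack_filtered[: len(stack_filtered) // 2])
```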
Here are the HumanEval pass@1 results for each setup:

| Setup | pass@1 |
|---|---|
| Baseline | 11.6% |
| TinyPrograms | 9.1% |
| The Stack | 11.0% |
| TinyPrograms Filtered | 12.2% |
| The Stack Filtered | 8.5% |
| Mixed Filtered | 9.8% |

A few takeaways:
- Training on synthetic data is better than training on random GitHub programs when quality filtering is applied, i.e., TinyPrograms Filtered vs. The Stack Filtered.
- Only high-quality synthetic data (TinyPrograms Filtered) improves performance over the baseline.
- All other setups degrade performance, with the high-quality Python files from the Stack showing the biggest drop. This warrants further investigation. Possible explanations include:
  - The scoring system may not be as effective for GitHub programs as it is for synthetic ones.
  - There might be a lack of diversity in the GitHub programs.
For further exploration, I encourage you to:
- Replicate this experiment with your own task.
- Experiment with larger datasets to see how they affect model performance.
- Share your findings with the community and reach out if you need help!
To do this yourself, follow the rest of this README. It shows how to reproduce my results and serves as a starting point for your project.
Make sure you have installed fastdata with the following command from the root of the repo:

```bash
pip install -e .
```
If you want to train a model, install the following dependencies from the `examples` folder:

```bash
pip install -r requirements.txt
```
Then run the following if you will use flash attention:

```bash
pip install flash-attn --no-build-isolation
```
We have a script to generate the tiny programs dataset. It can be run with this command:

```bash
python tiny_programs.py
```
You can see all the command-line arguments by running:

```bash
python tiny_programs.py --help
```
To train a model, you can use the following command:

```bash
python train.py
```

You can view all the command-line arguments by running:

```bash
python train.py --help
```