This experiment was inspired by the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English?. The authors used OpenAI GPT models to generate simple, synthetic children's stories and showed that training on this data allowed extremely tiny models (~10M parameters) to outperform models 10x their size. TinyPrograms is the coding equivalent: ~1,000 tiny Python programs generated by Anthropic's Haiku model. Because the dataset is so small, I deviated from the original paper by finetuning an existing model rather than training one from scratch. I used the following definition of a tiny program when generating them with fastdata:
```python
class TinyProgram(BasicRepr):
    """
    A tiny program that is a valid python program that satisfies the requirements.
    """
    def __init__(
        self,
        requirements: str, # A description of the requirements for the program to help the persona.
        code: str,         # The code that satisfies the requirements.
    ): store_attr()
```
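The class leans on two fastcore helpers: `store_attr()` saves the constructor arguments as attributes, and `BasicRepr` derives a readable repr from them, which makes it easy to drop examples into prompts as plain text. A quick hand-written instance (the strings here are placeholders, not real dataset entries):

```python
from fastcore.utils import *  # provides BasicRepr and store_attr

# Build a single TinyProgram by hand to see what one record looks like.
tp = TinyProgram(
    requirements="Print a greeting for a given name.",
    code='name = "Ada"\nprint(f"Hello, {name}!")',
)
print(tp.requirements)  # the stored requirement string
print(tp)               # readable repr listing the stored fields
```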
I also followed the advice in the blog post that accompanies this experiment, "How To Synthesize Your Dragon Data". The prompt includes a few examples of the kind of tiny programs I want, and I used personas from the recently released PersonaHub to make the generated programs diverse. Here is what the prompt looks like:
```python
prompt_template = """\
<examples>
{examples}
</examples>
Create requirements and the python program that satisfies them for the following persona:
<persona>{persona}</persona>
"""
```
Here is one of the programs that Haiku created:

**Persona:** A striker who admires the player's defensive prowess and values their support on the field

**Requirements:** A Python program that simulates a soccer game between a striker and a defensive player. The striker admires the defensive player's prowess and values their support on the field. The program should model the interactions between the striker and the defensive player, demonstrating how the defensive player's support enables the striker to be more effective in scoring goals.
```python
import random

class SoccerPlayer:
    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.skill_level = random.randint(1, 10)

class Striker(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Striker")
        self.scoring_chance = 0.5

    def shoot(self, defender):
        if defender.skill_level >= self.skill_level:
            print(f"{self.name} shoots, but {defender.name} blocks the shot!")
        else:
            if random.random() < self.scoring_chance:
                print(f"{self.name} scores a goal!")
            else:
                print(f"{self.name} misses the shot.")

class Defender(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Defender")
        self.support_level = 7

    def support(self, striker):
        striker.scoring_chance += self.support_level / 100
        print(f"{self.name} provides support to {striker.name}, increasing their scoring chance.")

def simulate_game():
    striker = Striker("Alex")
    defender = Defender("Sarah")
    print(f"{striker.name} (Striker) vs. {defender.name} (Defender)")
    for _ in range(5):
        defender.support(striker)
        striker.shoot(defender)
        print()

simulate_game()
```
I took the TinyPrograms dataset and tried to finetune a strong small LLM to see if I could improve its coding ability. I used Hugging Face's SmolLM-360M: it's small and already works reasonably well on coding tasks, scoring 11.6% pass@1 out of the box on the popular HumanEval benchmark. I created 5 dataset configurations to test which improves the model the most:
- The first one is simply the 992 tiny Python programs.
- The second is 992 Python files taken from the popular Stack dataset (a sampling sketch follows this list).
- The third is a high-quality, filtered version of the tiny Python programs, where an LLM scores each program against a rubric.
- The fourth is the same as the third, but applied to the Python files taken from the Stack.
- Finally, the fifth mixes half of the high-quality filtered tiny Python programs with half of the high-quality filtered Python files from the Stack.
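For reference, here is roughly how a 992-file Python sample could be pulled from the Stack. This is only a sketch: it assumes the `bigcode/the-stack-smol` subset on the Hugging Face Hub (which may require accepting the dataset's terms), with Python files under `data/python` and their source in a `content` column; the files used in the experiment may have been sampled differently.

```python
from datasets import load_dataset

# Sketch: sample 992 Python files from a small subset of the Stack.
# Assumes `bigcode/the-stack-smol`, which exposes ~10k files per language
# under `data/<language>` with the raw source in the `content` column.
stack_py = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

# Match the size of the synthetic set: 992 files, shuffled with a fixed seed.
stack_sample = stack_py.shuffle(seed=42).select(range(992))
print(stack_sample[0]["content"][:200])
```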
To filter the tiny programs, I used fastdata with the following critique class:
```python
class TinyProgramCritique(BasicRepr):
    """
    A critique of a tiny program.
    """
    def __init__(
        self,
        critique: str,                  # A critique of the code.
        score: Literal[1, 2, 3, 4, 5],  # A score of the code from 1 to 5.
    ): store_attr()
```
And here is the prompt I used to guide the model to generate a score:
```python
critique_template = """\
Below is a code snippet. Evaluate its educational value for teaching programming to beginners in this language, using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
- Add 1 point if the code is syntactically correct and runs without errors, providing a basic example of working code in the language.
- Add another point if the code demonstrates fundamental programming concepts (e.g., variables, control structures, functions) in a straightforward manner, even if it's not optimized or doesn't follow all best practices.
- Award a third point if the code is well-commented, explaining key concepts and the purpose of different code sections. It should be readable and illustrate good naming conventions, making it easier for beginners to understand.
- Grant a fourth point if the code showcases language-specific features or common programming patterns in an accessible way. It should provide clear examples of how to apply these concepts practically.
- Bestow a fifth point if the code is an exemplary teaching tool, striking an excellent balance between simplicity and real-world applicability. It should inspire further learning, possibly including deliberate mistakes or opportunities for improvement that a teacher could use as discussion points.

The code snippet:
<code>
{code}
</code>

After examining the code:
- Briefly justify your total score, up to 100 words, focusing on its effectiveness as a teaching tool for beginners.
- Conclude with the score.
"""
```
This is the distribution of scores for the 992 tiny Python programs:

| Score | Count |
|---|---|
| 1 | 25 |
| 2 | 117 |
| 3 | 96 |
| 4 | 256 |
| 5 | 498 |
And here is the same for 10,000 of the Python files:

| Score | Count |
|---|---|
| 1 | 2239 |
| 2 | 5230 |
| 3 | 1545 |
| 4 | 618 |
| 5 | 236 |
I only kept programs with a score of 4 or 5 as high-quality data, for both the tiny Python programs and the Python files from the Stack.
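The filtering step itself is plain Python once the critiques are in hand. A sketch, assuming the `tiny_programs` and `critiques` lists from the earlier snippets and analogous (hypothetical) `stack_files` and `stack_critiques` lists for the GitHub data; the mixed setup is interpreted here as half of each filtered source:

```python
def filter_high_quality(items, critiques, min_score=4):
    # Keep only items whose critique scored 4 or 5.
    return [it for it, c in zip(items, critiques)
            if it is not None and c is not None and c.score >= min_score]

tiny_filtered  = filter_high_quality(tiny_programs, critiques)
stack_filtered = filter_high_quality(stack_files, stack_critiques)  # hypothetical names

# Mixed Filtered: half of each filtered source.
mixed_filtered = (tiny_filtered[: len(tiny_filtered) // 2]
                  + stack_filtered[: len(stack_filtered) // 2])
```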
Here are the HumanEval pass@1 results for each setup:

| Setup | pass@1 |
|---|---|
| Baseline | 11.6% |
| TinyPrograms | 9.1% |
| The Stack | 11.0% |
| TinyPrograms Filtered | 12.2% |
| The Stack Filtered | 8.5% |
| Mixed Filtered | 9.8% |

A few takeaways:
- Training on synthetic data is better than training on random GitHub programs when quality filtering is applied, i.e., TinyPrograms Filtered vs. The Stack Filtered.
- Only high-quality synthetic data (TinyPrograms Filtered) improves performance over the baseline.
- All other setups degrade performance, with the high-quality Python files from the Stack showing the biggest drop. This warrants further investigation. Possible explanations include:
  - The scoring system may not be as effective for GitHub programs as it is for synthetic ones.
  - There might be a lack of diversity in the GitHub programs.
For further exploration, I encourage you to:
- Replicate this experiment with your own task.
- Experiment with larger datasets to see how they affect model performance.
- Share your findings with the community and reach out if you need help!
To do this yourself, follow the rest of this README. It shows how to reproduce my results and serves as a starting point for your project.
Make sure you have installed fastdata with the following command from the root of the repo:

```bash
pip install -e .
```
If you want to train a model, install the following dependencies from the `examples` folder:

```bash
pip install -r requirements.txt
```
Then run the following if you will use flash attention:

```bash
pip install flash-attn --no-build-isolation
```
We have a script to generate the tiny programs dataset. It can be run with this command:

```bash
python tiny_programs.py
```
You can see all the command-line arguments by running:

```bash
python tiny_programs.py --help
```
To train a model, you can use the following command:

```bash
python train.py
```

You can view all the command-line arguments by running:

```bash
python train.py --help
```