Implementing a decoder-only GPT-style transformer in C
Note: unfinished, still under development
The computational graph can also be plotted using Graphviz (since it all lives in the slots array).
The dataset is included in the repo itself.
Build and run: gcc gpt.c; ./a.out
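
Because the whole graph lives in a flat slots array, dumping it to Graphviz DOT format is just a loop over the slots. Below is a minimal sketch of the idea; the `Slot` struct, its fields, and `dump_dot` are hypothetical names for illustration and will differ from the actual layout in gpt.c.

```c
/* Sketch: dump a slot-based computational graph to Graphviz DOT.
 * The Slot struct and field names here are assumptions, not the gpt.c layout. */
#include <stdio.h>

typedef struct {
    const char *op;   /* e.g. "matmul", "add", "relu" */
    int inputs[2];    /* indices of parent slots, -1 if unused */
} Slot;

static void dump_dot(const Slot *slots, int n, FILE *out) {
    fprintf(out, "digraph G {\n");
    for (int i = 0; i < n; i++) {
        fprintf(out, "  n%d [label=\"%d: %s\"];\n", i, i, slots[i].op);
        for (int j = 0; j < 2; j++)
            if (slots[i].inputs[j] >= 0)
                fprintf(out, "  n%d -> n%d;\n", slots[i].inputs[j], i);
    }
    fprintf(out, "}\n");
}

int main(void) {
    /* tiny example graph: out = relu(input @ weight) */
    Slot slots[] = {
        {"input",  {-1, -1}},
        {"weight", {-1, -1}},
        {"matmul", { 0,  1}},
        {"relu",   { 2, -1}},
    };
    dump_dot(slots, (int)(sizeof slots / sizeof slots[0]), stdout);
    /* render with: ./a.out > graph.dot && dot -Tpng graph.dot -o graph.png */
    return 0;
}
```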
- Currently it's very slow; the codebase needs to be updated with CUDA. The last training run is in assets/train.log
- Loss graph visualisation: loss curve plot
- Implement matrix operations
- Build a basic feed-forward neural network
- Develop backpropagation
- Gradient descent
- Implement ReLU and Softmax
- Loss function MSE
- XOR Test
- Add memory management (object tracking, cleanup; slot system where objects occupy limited slots)
- Construct forward and backward pass logic
- MNIST Test
- Implement Batching (major speedups)
- Implemented GELU and Leaky ReLU (all done as part of testing; see the activation/loss sketch after this list)
- Implement iterative stack-based backward pass (didn't provide much benefit, so removed)
- Test the MLP with character prediction (issue encountered: network stability)
- Tinystories Test
- Implement n-dimensional tensors
- Implement Self-Attention Mechanism
- Build a tokenization system (BPE)
- Stack Transformer blocks (works by repetition of layers)
- Multi-Head Attention
- Positional encoding
- Learnable embeddings (one-hot vector × embedding matrix = embedding lookup; see the embedding sketch after this list)
- Adam optimizer (see the Adam sketch after this list)
- add dropout
- LEARN CUDA
- add back seq_len param to attention and ffn
- Add causal masking in attention (see the attention sketch after this list)
- Residual connections
- Layer norms (see the layer-norm sketch after this list)
- handle resume/restart of training
- allow inference from saved file
- The build-model function is messy and can be simplified with a matrix abstraction; otherwise the rest of the features would be hard to implement correctly. This is also a good point to learn CUDA and implement the matmuls.
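
A few of the building blocks ticked off above (ReLU, Leaky ReLU, the tanh approximation of GELU, softmax, MSE) are small enough to sketch standalone. This is an illustrative version, not the exact code in gpt.c; compile with -lm.

```c
/* Standalone sketch of common activations and the MSE loss. Illustrative only. */
#include <math.h>
#include <stdio.h>

static float relu(float x)       { return x > 0.0f ? x : 0.0f; }
static float leaky_relu(float x) { return x > 0.0f ? x : 0.01f * x; }

/* GELU, tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))) */
static float gelu(float x) {
    const float c = 0.7978845608f; /* sqrt(2/pi) */
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

/* Numerically stable softmax over a vector of logits. */
static void softmax(const float *z, float *p, int n) {
    float maxz = z[0], sum = 0.0f;
    for (int i = 1; i < n; i++) if (z[i] > maxz) maxz = z[i];
    for (int i = 0; i < n; i++) { p[i] = expf(z[i] - maxz); sum += p[i]; }
    for (int i = 0; i < n; i++) p[i] /= sum;
}

/* Mean squared error and its gradient w.r.t. the predictions. */
static float mse(const float *pred, const float *target, float *grad, int n) {
    float loss = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = pred[i] - target[i];
        loss += d * d;
        grad[i] = 2.0f * d / (float)n;
    }
    return loss / (float)n;
}

int main(void) {
    float logits[3] = {1.0f, 2.0f, 0.5f}, probs[3], grad[3];
    float pred[3] = {0.2f, -1.0f, 0.7f}, target[3] = {0.0f, -1.0f, 1.0f};
    softmax(logits, probs, 3);
    printf("relu(-2)=%.2f leaky(-2)=%.3f gelu(1)=%.3f\n",
           relu(-2.0f), leaky_relu(-2.0f), gelu(1.0f));
    printf("softmax[1]=%.3f mse=%.3f dpred0=%.3f\n",
           probs[1], mse(pred, target, grad, 3), grad[0]);
    return 0;
}
```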
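
On the embedding item: multiplying a one-hot row vector by the embedding matrix is the same as picking out one row, so the lookup can be written directly, with a learned positional embedding added per position. A sketch under assumed shapes and names, not the actual gpt.c layout:

```c
/* Token embedding as a row lookup (equivalent to one-hot x matrix), plus a
 * learned positional embedding added per position. Names/shapes are assumptions. */
#include <stdio.h>

/* tok_emb: [vocab x d], pos_emb: [max_seq x d], out: [seq_len x d] */
static void embed(const int *tokens, int seq_len, int d,
                  const float *tok_emb, const float *pos_emb, float *out) {
    for (int t = 0; t < seq_len; t++)
        for (int i = 0; i < d; i++)
            out[t*d + i] = tok_emb[tokens[t]*d + i] + pos_emb[t*d + i];
}

int main(void) {
    /* vocab = 3, d = 2, seq_len = 2 */
    float tok_emb[6] = {0.1f, 0.2f,  0.3f, 0.4f,  0.5f, 0.6f};
    float pos_emb[4] = {0.01f, 0.02f, 0.03f, 0.04f};
    int tokens[2] = {2, 0};
    float out[4];
    embed(tokens, 2, 2, tok_emb, pos_emb, out);
    printf("%.2f %.2f | %.2f %.2f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```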
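
The attention and masking items combine into scaled dot-product attention where each position only attends to itself and earlier positions. Below is a single-head, unbatched sketch with illustrative shapes; the multi-head, batched version in the repo is necessarily larger. Compile with -lm.

```c
/* Single-head scaled dot-product attention with a causal mask over
 * row-major [seq_len x d] Q, K, V. Illustrative sketch only. */
#include <math.h>
#include <stdio.h>

static void causal_attention(const float *Q, const float *K, const float *V,
                             float *out, int seq_len, int d) {
    float scale = 1.0f / sqrtf((float)d);
    float scores[seq_len];                      /* C99 VLA; row i uses entries 0..i */
    for (int i = 0; i < seq_len; i++) {
        float maxs = -1e30f, sum = 0.0f;
        /* scores of query i against keys 0..i; future positions are masked out */
        for (int j = 0; j <= i; j++) {
            float s = 0.0f;
            for (int k = 0; k < d; k++) s += Q[i*d + k] * K[j*d + k];
            scores[j] = s * scale;
            if (scores[j] > maxs) maxs = scores[j];
        }
        /* softmax over the unmasked scores */
        for (int j = 0; j <= i; j++) { scores[j] = expf(scores[j] - maxs); sum += scores[j]; }
        /* weighted sum of the value vectors */
        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j <= i; j++) acc += (scores[j] / sum) * V[j*d + k];
            out[i*d + k] = acc;
        }
    }
}

int main(void) {
    float Q[4] = {1, 0, 0, 1}, K[4] = {1, 0, 0, 1}, V[4] = {1, 2, 3, 4}, out[4];
    causal_attention(Q, K, V, out, 2, 2);
    printf("row0: %.3f %.3f  row1: %.3f %.3f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```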
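
The Adam optimizer item boils down to keeping two moment buffers per parameter tensor and applying bias-corrected updates. A sketch with the usual default hyperparameters; the buffer names and calling convention are assumptions, not the gpt.c API.

```c
/* Adam update for one parameter tensor. m and v are persistent, zero-initialised
 * moment buffers; t is the 1-based step count. Sketch only. */
#include <math.h>
#include <stdio.h>

static void adam_step(float *w, const float *grad, float *m, float *v,
                      int n, int t, float lr) {
    const float b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;
    float c1 = 1.0f - powf(b1, (float)t);   /* bias corrections */
    float c2 = 1.0f - powf(b2, (float)t);
    for (int i = 0; i < n; i++) {
        m[i] = b1 * m[i] + (1.0f - b1) * grad[i];
        v[i] = b2 * v[i] + (1.0f - b2) * grad[i] * grad[i];
        w[i] -= lr * (m[i] / c1) / (sqrtf(v[i] / c2) + eps);
    }
}

int main(void) {
    float w[2] = {1.0f, -1.0f}, g[2] = {0.5f, -0.25f}, m[2] = {0}, v[2] = {0};
    adam_step(w, g, m, v, 2, 1, 1e-3f);
    printf("w = %.6f %.6f\n", w[0], w[1]);
    return 0;
}
```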
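
LayerNorm normalises each row of an activation matrix to zero mean and unit variance, then applies a learnable gain and bias. A minimal sketch, assuming row-major [rows x d] storage:

```c
/* LayerNorm over the last dimension of a [rows x d] activation. Sketch only. */
#include <math.h>
#include <stdio.h>

static void layernorm(const float *x, float *y, const float *gamma,
                      const float *beta, int rows, int d) {
    const float eps = 1e-5f;
    for (int r = 0; r < rows; r++) {
        const float *row = x + r * d;
        float mean = 0.0f, var = 0.0f;
        for (int i = 0; i < d; i++) mean += row[i];
        mean /= d;
        for (int i = 0; i < d; i++) { float c = row[i] - mean; var += c * c; }
        var /= d;
        float inv = 1.0f / sqrtf(var + eps);
        for (int i = 0; i < d; i++)
            y[r*d + i] = gamma[i] * (row[i] - mean) * inv + beta[i];
    }
}

int main(void) {
    float x[4] = {1, 2, 3, 4}, y[4], g[2] = {1, 1}, b[2] = {0, 0};
    layernorm(x, y, g, b, 2, 2);
    printf("%.3f %.3f | %.3f %.3f\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```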
Issues encountered during development:
- Dropout at 0 is not behaving correctly, which means something is wrong in its implementation
- Too much object reallocation; the design needs to change
- Gradients are not converging properly
- MNIST test failed because of memory leaks
- Slow network convergence for a large MLP
- Network facing a vanishing-gradient issue
- Vanishing gradients after adding attention
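
On the first issue above: with inverted dropout, surviving activations are scaled by 1/(1-p) at train time, so p = 0 must degenerate to the identity and inference needs no rescaling. A reference sketch of that behaviour (not the gpt.c code) that may help while debugging:

```c
/* Inverted dropout sketch: p == 0 is a no-op, survivors are scaled by 1/(1-p). */
#include <stdio.h>
#include <stdlib.h>

static void dropout(float *x, int n, float p, int training) {
    if (!training || p <= 0.0f) return;        /* p == 0 must leave x untouched */
    float scale = 1.0f / (1.0f - p);
    for (int i = 0; i < n; i++) {
        float u = (float)rand() / (float)RAND_MAX;   /* unseeded here for brevity */
        x[i] = (u < p) ? 0.0f : x[i] * scale;
    }
}

int main(void) {
    float a[4] = {1, 2, 3, 4};
    dropout(a, 4, 0.0f, 1);                    /* identity */
    printf("%.1f %.1f %.1f %.1f\n", a[0], a[1], a[2], a[3]);
    dropout(a, 4, 0.5f, 1);                    /* randomly zeroes and rescales */
    printf("%.1f %.1f %.1f %.1f\n", a[0], a[1], a[2], a[3]);
    return 0;
}
```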