
LAB3: CUDA out of memory on the Google Colab free plan #201

@NakedRaccoon

Description

I first tried running the lab on the free tier of Google Colab and hit a CUDA out-of-memory error while training the first model; the output is below. I eventually worked around it by purchasing compute units and switching to a better GPU, but for next year it might be worth picking a smaller, less hardware-demanding model so that students don't have to spend money on this.

I enjoyed all 3 labs though. Props to the lecturers and the TAs!
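For context, a back-of-the-envelope sketch of why full fine-tuning doesn't fit on the free tier's T4: with Adam in full precision you hold roughly four copies of the parameters (weights, gradients, and two moment buffers). The 2.6e9 parameter count below is an assumed figure for Gemma-2 2B (the model in the traceback); the lab's actual model and precision settings may differ.

```python
# Rough GPU memory estimate for full fine-tuning with Adam.
# weights + gradients + Adam m and v buffers = 4 copies of the params.
def finetune_memory_gib(n_params: float, bytes_per_param: int = 4) -> float:
    copies = 4  # weights, grads, Adam first and second moments
    return copies * n_params * bytes_per_param / 2**30

t4_capacity_gib = 14.74  # total capacity reported in the traceback
print(f"{finetune_memory_gib(2.6e9):.1f} GiB needed vs {t4_capacity_gib} GiB on the T4")
```

Even with optimistic assumptions this lands well above the T4's capacity, which is why a smaller model (or half precision, gradient checkpointing, or a parameter-efficient method like LoRA) would make the lab feasible on the free tier.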

The capital of France is **Paris**. 🇫🇷 

step 0 loss: 2.3113996982574463
/usr/local/lib/python3.12/dist-packages/jupyter_client/session.py:151: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.utcnow().replace(tzinfo=utc)
The capital of France is **Paris**. 🇫🇷 

step 10 loss: 2.1250462532043457
The capital of France is **Paris**. 🇫🇷 

step 20 loss: 1.6766815185546875
Top o' the mornin' now, me hearty! Ye want to know about the capital o' the grand old nation o' France, do ye?
step 30 loss: 1.399606466293335
Top o' the mornin' to ye now! Ye want to know about the capital o' France, do ye? Well, listen up, me hearty
step 40 loss: 1.5028210878372192
Top o' the mornin' to ye! Now, why, the capital o' France, ye ask? Why, it's Paris, that'
step 50 loss: 1.5027029514312744
Top o' the mornin' to ye! Now, if ye're askin' about the capital o' France, well, that's Paris
step 60 loss: 1.7211472988128662
Ah, me hearty! Ye want to know about the capital of France, do ye? Well, listen up, me lad! The capital of France is Paris
step 70 loss: 1.5601969957351685
Ah, ye want to know about the capital of France, do ye? Well, listen here, the capital of France is Paris, you hear? So there
step 80 loss: 1.6766023635864258
Top o' the mornin' to ye! Now, the capital o' France, ye ask? Well, listen up, me hearty. It's
step 90 loss: 1.5294233560562134
Top o' the mornin' to ye now, me hearty! Ye want to know about the capital of ol' France, do ye? Why, it
step 100 loss: 1.4099100828170776
Top o' the mornin' to ye now! The capital o' France, ye ask? Why, it be Paris, me hearty! Isn't
step 110 loss: 1.4719858169555664
Top o' the mornin' to ye, me hearty! The capital o' France, ye ask? Why, it's Paris, sure as the
step 120 loss: 1.362978219985962
Top o' the mornin' to ye, me hearty! Ye askin' about the capital o' France, well, let me tell ye, it
step 130 loss: 1.4489622116088867
Top o' the mornin' to ye, me hearty! The capital o' France, ye ask? Why, it's Paris, that's
step 140 loss: 1.4968031644821167
Top o' the mornin' to ye, me hearty! Now, the capital o' the fine Republic o' France as ye asked, why it'
step 150 loss: 1.456976294517517
Top o' the mornin' to ye, me hearty! The grand ol' capital of France is Paris, now where's she be now? Ah
step 160 loss: 1.6469882726669312
Top o' the mornin' to ye, me hearty! Ye want to know what the capital of France is? Why, why then, I'll
step 170 loss: 1.4203708171844482
Top o' the mornin' to ye, me hearty! Ye want to know about the capital of France, do ye? Well, listen up, me
step 180 loss: 1.6720683574676514
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
[/tmp/ipython-input-3093342917.py](https://localhost:8080/#) in <cell line: 0>()
      1 # Call the train function to fine-tune the model! Hint: you'll start to see results after a few dozen steps.
----> 2 model = train(model, train_loader, tokenizer) # TODO

17 frames
[/usr/local/lib/python3.12/dist-packages/transformers/models/gemma2/modeling_gemma2.py](https://localhost:8080/#) in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, cache_position, logits_to_keep, **kwargs)
    564         logits = self.lm_head(hidden_states[:, slice_indices, :])
    565         if self.config.final_logit_softcapping is not None:
--> 566             logits = logits / self.config.final_logit_softcapping
    567             logits = torch.tanh(logits)
    568             logits = logits * self.config.final_logit_softcapping

OutOfMemoryError: CUDA out of memory. Tried to allocate 480.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 438.12 MiB is free. Process 9264 has 14.31 GiB memory in use. Of the allocated memory 13.62 GiB is allocated by PyTorch, with 28.00 MiB allocated in private pools (e.g., CUDA Graphs), and 514.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
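As a cheap first mitigation, the allocator hint from the error message can be set before any CUDA allocation happens, e.g. in the first notebook cell before importing torch. A sketch, noting this only reduces fragmentation and cannot make a model fit that is simply too large for the GPU:

```python
import os

# Must run before torch makes its first CUDA allocation to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

After an OOM in a notebook, it can also help to `del` the model and call `torch.cuda.empty_cache()` (or restart the runtime) before retrying, since the dead cell's tensors otherwise stay allocated.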
