Xcloud #27

Hanjun-Dai · 2023-01-27T05:56:44Z

stabilize jobs

xm train/load
fix grad explosion issue

kensens

Thanks Hanjun! I left a few comments

kensens · 2023-01-27T06:27:32Z

crossbeam/xm_train.py

@@ -48,8 +48,8 @@ def main(argv) -> None:
  uname = getpass.getuser()

  with xm_abc.create_experiment(experiment_title=_EXP_NAME.value) as experiment:
-    job_requirements = xm.JobRequirements(ram=26 * FLAGS.num_gpus * xm.GiB,
-                                          cpu=7 * FLAGS.num_gpus,
+    job_requirements = xm.JobRequirements(ram=8 * FLAGS.num_gpus * xm.GiB,


Do we need to reduce our resource consumption here? We can request the amount we were using before: https://g3doc.corp.google.com/company/teams/brain-frameworks/xcloud/guide.md?cl=head

kensens · 2023-01-27T06:34:01Z

crossbeam/experiment/train_eval.py

@@ -191,7 +191,8 @@ def _gather_eval_info(rank, device, local_acc, local_num):


 def train_eval_loop(args, device, model, train_files, eval_tasks,
-                    task_gen, trace_gen):
+                    task_gen, trace_gen, checkpoint):
+  random.shuffle(train_files)


I think we shouldn't shuffle train_files so that training runs are more consistent/reproducible
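
If some shuffling of the file order is still wanted, one possible middle ground is a seeded shuffle, which gives the same order on every run. The following is only a sketch: the helper name `shuffled_copy`, the `seed` argument, and the shard names are hypothetical and do not come from this PR.

```python
import random


def shuffled_copy(files, seed):
    """Return a shuffled copy of `files` using a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    result = list(files)       # copy, so the caller's list keeps its order
    rng.shuffle(result)
    return result


# Hypothetical file names, for illustration only.
train_files = [f"shard-{i:03d}" for i in range(8)]

first = shuffled_copy(train_files, seed=0)
second = shuffled_copy(train_files, seed=0)

# Same seed, same order: two runs see the files in an identical sequence.
assert first == second
```

Unlike `random.shuffle(train_files)`, this does not modify the list in place and does not depend on whatever else has consumed the global random state before the call.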

kensens · 2023-01-27T06:35:59Z

crossbeam/experiment/train_eval.py

+  if checkpoint is not None:
+    model.load_state_dict(checkpoint['model'])
+    optimizer.load_state_dict(checkpoint['optimizer'])
+    starting_step = checkpoint['step']


Can we skip starting_step batches from the training dataset, so that we don't train on the same data again upon restart? I think this is just:

for _ in range(starting_step): next(train_gen)
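
The same skip can be written with `itertools.islice`, which also tolerates a generator that runs out before `starting_step` items have been consumed. This is a sketch under assumptions: `skip_batches` is a hypothetical helper, `train_gen` is taken to be an ordinary Python iterator that yields one batch per step, and a `range` stands in for it here.

```python
import itertools


def skip_batches(batch_iter, num_to_skip):
    """Advance `batch_iter` past up to `num_to_skip` items and return it."""
    # islice(it, n, n) consumes n items (fewer if the iterator is exhausted)
    # and yields nothing, so next(..., None) simply returns None.
    next(itertools.islice(batch_iter, num_to_skip, num_to_skip), None)
    return batch_iter


# Stand-in for the real batch generator (assumption for illustration).
train_gen = iter(range(10))
starting_step = 4

skip_batches(train_gen, starting_step)
resumed_first = next(train_gen)  # first batch seen after the restart

# A generator shorter than the skip count is drained without raising.
short_gen = skip_batches(iter(range(3)), 5)
leftover = next(short_gen, None)
```

Skipping only lands on unseen data if the batch order is the same on every run, which is one more reason to keep the order of `train_files` deterministic.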

Hanjun-Dai added 4 commits January 12, 2023 23:17

allow ckpt restore

ead2b98

update logging

409eb15

fix grad explotion issue

714a694

Merge branch 'main' into xcloud

3b0af2d

Hanjun-Dai requested a review from kensens January 27, 2023 05:56

kensens approved these changes Jan 27, 2023

curriculum scheduled data loader

67396f8

Hanjun-Dai merged commit 3d6d921 into main Feb 2, 2023

kensens deleted the xcloud branch December 8, 2023 08:30
