Xcloud #27
Conversation
Thanks Hanjun! I left a few comments
@@ -48,8 +48,8 @@ def main(argv) -> None:
   uname = getpass.getuser()

   with xm_abc.create_experiment(experiment_title=_EXP_NAME.value) as experiment:
-    job_requirements = xm.JobRequirements(ram=26 * FLAGS.num_gpus * xm.GiB,
-                                          cpu=7 * FLAGS.num_gpus,
+    job_requirements = xm.JobRequirements(ram=8 * FLAGS.num_gpus * xm.GiB,
Do we need to reduce our resource consumption here? We can request the amount we were using before: https://g3doc.corp.google.com/company/teams/brain-frameworks/xcloud/guide.md?cl=head
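For reference, a rough sketch of restoring the earlier request (values copied from the removed line in the hunk above; the accelerator arguments aren't visible in this hunk, so they're omitted):

job_requirements = xm.JobRequirements(
    ram=26 * FLAGS.num_gpus * xm.GiB,  # previous per-GPU RAM request
    cpu=7 * FLAGS.num_gpus)            # previous per-GPU CPU request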
crossbeam/experiment/train_eval.py
@@ -191,7 +191,8 @@ def _gather_eval_info(rank, device, local_acc, local_num):


 def train_eval_loop(args, device, model, train_files, eval_tasks,
-                    task_gen, trace_gen):
+                    task_gen, trace_gen, checkpoint):
   random.shuffle(train_files)
I think we shouldn't shuffle train_files so that training runs are more consistent/reproducible
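A minimal sketch of what I mean (assuming train_files is a plain list; the names follow the hunk above):

# Deterministic order across runs instead of in-place shuffling:
train_files = sorted(train_files)
# Or, if some shuffling is still wanted, seed it so every run sees the same order:
# random.Random(0).shuffle(train_files)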
if checkpoint is not None:
  model.load_state_dict(checkpoint['model'])
  optimizer.load_state_dict(checkpoint['optimizer'])
  starting_step = checkpoint['step']
Can we skip starting_step batches from the training dataset, so that we don't train on the same data again upon restart? I think this is just:

for _ in range(starting_step):
  next(train_gen)
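Putting the two pieces together, a hedged sketch of the restart path (train_gen, optimizer, and the checkpoint keys follow the snippets above; everything else is illustrative, not the PR's actual code):

starting_step = 0
if checkpoint is not None:
  model.load_state_dict(checkpoint['model'])
  optimizer.load_state_dict(checkpoint['optimizer'])
  starting_step = checkpoint['step']
  # Fast-forward the data generator so the first starting_step batches,
  # which were already consumed before the restart, are not trained on again.
  for _ in range(starting_step):
    next(train_gen)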
stabilize jobs