The job should stop restarting workers and exit if the traceback indicates a code bug. #1068
Labels: enhancement, question, todo
If training fails because of a code bug, a restarted worker will fail again in the same way. In that case the job should exit as soon as possible, rather than keep restarting workers, so that it releases its resources on the cluster.
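A minimal sketch of the requested behavior, assuming the failure reaches the supervisor as a Python exception. The names `CODE_BUG_EXCEPTIONS`, `is_code_bug`, and `run_with_restarts` are hypothetical and not part of this project, and the list of exception types is only an illustrative guess at what counts as a code bug. A real implementation would more likely classify traceback text reported by a worker process.

```python
import traceback

# Assumption: exception types that usually point to a bug in the training
# code rather than a transient infrastructure failure. Illustrative only.
CODE_BUG_EXCEPTIONS = (
    SyntaxError, NameError, TypeError, AttributeError, ImportError,
    IndexError, KeyError, ZeroDivisionError, AssertionError,
)


def is_code_bug(exc):
    """Return True if the exception looks like a deterministic code bug."""
    return isinstance(exc, CODE_BUG_EXCEPTIONS)


def run_with_restarts(train_fn, max_restarts=3):
    """Run train_fn, restarting only after failures that may be transient.

    Returns a (status, attempts) tuple, where status is "succeeded",
    "failed_code_bug", or "failed_max_restarts".
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            train_fn()
            return "succeeded", attempts
        except Exception as exc:
            traceback.print_exc()  # keep the traceback visible in the logs
            if is_code_bug(exc):
                # A restart would hit the same bug, so exit immediately
                # and let the cluster reclaim the resources.
                return "failed_code_bug", attempts
            if attempts > max_restarts:
                return "failed_max_restarts", attempts
            # Otherwise fall through and restart the worker.
```

With this sketch, a `train_fn` that raises `NameError` ends the job after a single attempt, while one that raises `ConnectionError` is retried up to `max_restarts` times.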