mirror of
https://github.com/checkpoint-restore/criu.git
synced 2026-01-23 02:14:37 +00:00
cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

    cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
    Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
    Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
    Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
    net: Unlock network
    cuda_plugin: finished cuda_plugin stage 0 err -1
    cuda_plugin: resuming devices on pid 84145
    cuda_plugin: Restore thread pid 84202 found for real pid 84145
    Unfreezing tasks into 1
    Unseizing 84145 into 1
    Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking the
`checkpoint` action to ensure that the CUDA task is resumed even if
CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
parent f83931542a
commit 02056bf41a
1 changed file with 2 additions and 2 deletions
@@ -391,14 +391,14 @@ int cuda_plugin_checkpoint_devices(int pid)
 	if (resume_restore_thread(restore_tid, &save_sigset)) {
 		return -1;
 	}
+
+	task_info->checkpointed = 1;
 	status = cuda_process_checkpoint_action(pid, ACTION_CHECKPOINT, 0, msg_buf, sizeof(msg_buf));
 	if (status) {
 		pr_err("CHECKPOINT_DEVICES failed with %s\n", msg_buf);
 		goto interrupt;
 	}
 
-	task_info->checkpointed = 1;
-
 interrupt:
 	int_ret = interrupt_restore_thread(restore_tid, &save_sigset);
 
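The effect of moving the flag assignment can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the plugin's code: `struct task_state`, `checkpoint_action()`, `resume_task()` and the other helpers are hypothetical stand-ins, and the model assumes (as the commit message describes) that the resume path acts only on tasks whose `checkpointed` flag is set and that an interrupted checkpoint action may already have locked the task.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the plugin's per-task bookkeeping. */
struct task_state {
	bool locked;       /* simulated lock taken on the CUDA task */
	bool checkpointed; /* tells the resume path that cleanup is needed */
};

/*
 * Simulated checkpoint action: it locks the task first and may then be
 * interrupted (for example by a timeout) before it can report success.
 */
static int checkpoint_action(struct task_state *t, bool interrupted)
{
	t->locked = true;
	return interrupted ? -1 : 0;
}

/* Simulated resume stage: it only touches tasks marked as checkpointed. */
static void resume_task(struct task_state *t)
{
	if (!t->checkpointed)
		return;
	t->locked = false;
	t->checkpointed = false;
}

/* Old ordering: the flag is set only after the action succeeded. */
static int checkpoint_flag_after(struct task_state *t, bool interrupted)
{
	int status = checkpoint_action(t, interrupted);

	if (status)
		return status; /* flag never set, so resume_task() skips the task */
	t->checkpointed = true;
	return 0;
}

/* New ordering: the flag is set before the action is invoked. */
static int checkpoint_flag_before(struct task_state *t, bool interrupted)
{
	t->checkpointed = true;
	return checkpoint_action(t, interrupted);
}

/*
 * Runs one interrupted checkpoint followed by the resume stage and
 * reports whether the task is still locked afterwards.
 */
static bool still_locked_after_failure(int (*checkpoint)(struct task_state *, bool))
{
	struct task_state t = { false, false };

	if (checkpoint(&t, true) == 0)
		return false; /* not expected: the action was told to fail */
	resume_task(&t);
	return t.locked;
}
```

In this model, `still_locked_after_failure(checkpoint_flag_after)` leaves the task locked, while `still_locked_after_failure(checkpoint_flag_before)` does not, which mirrors the lockup the commit is meant to prevent.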