criu/plugins at 28c2cb3fd6121f3280484665915d1ef5d8b9df14 - Mirrors/criu

mirror of https://github.com/checkpoint-restore/criu.git synced 2026-01-23 02:14:37 +00:00


..
amdgpu	images/inventory: add field for enabled plugins	2025-03-21 12:40:31 -07:00
cuda	cuda: enable checkpoint support for paused tasks	2025-03-21 12:40:31 -07:00

Radostin Stoyanov 28c2cb3fd6 cuda: enable checkpoint support for paused tasks

If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.

This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.
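
The decision described above can be sketched in C as a small state-to-plan mapping. This is an illustration only, not the plugin's actual code: the names `cuda_task_state`, `dump_plan`, `parse_state`, and `plan_dump` are hypothetical, and the "running" state string is an assumption (the message itself names only "locked" and "checkpointed").

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the state a CUDA task is in when dump starts. */
enum cuda_task_state {
	CUDA_TASK_RUNNING,      /* assumed name: task was not paused beforehand */
	CUDA_TASK_LOCKED,       /* "locked" before CRIU was invoked */
	CUDA_TASK_CHECKPOINTED, /* "checkpointed" before CRIU was invoked */
	CUDA_TASK_UNKNOWN,      /* anything this sketch does not recognize */
};

/* Which actions the dump path still has to perform for this task. */
struct dump_plan {
	bool need_lock;
	bool need_checkpoint;
};

/* Map a state string (as a state query might report it) to the enum. */
static enum cuda_task_state parse_state(const char *s)
{
	if (!strcmp(s, "running"))
		return CUDA_TASK_RUNNING;
	if (!strcmp(s, "locked"))
		return CUDA_TASK_LOCKED;
	if (!strcmp(s, "checkpointed"))
		return CUDA_TASK_CHECKPOINTED;
	return CUDA_TASK_UNKNOWN;
}

/*
 * Decide which actions to skip based on the initial state.
 * Returns 0 on success, -1 if the state is not recognized.
 */
static int plan_dump(enum cuda_task_state initial, struct dump_plan *plan)
{
	switch (initial) {
	case CUDA_TASK_RUNNING:
		plan->need_lock = true;
		plan->need_checkpoint = true;
		return 0;
	case CUDA_TASK_LOCKED:
		/* Already locked: skip "lock", still run "checkpoint". */
		plan->need_lock = false;
		plan->need_checkpoint = true;
		return 0;
	case CUDA_TASK_CHECKPOINTED:
		/* Already checkpointed: skip both actions. */
		plan->need_lock = false;
		plan->need_checkpoint = false;
		return 0;
	default:
		return -1;
	}
}
```

Under these assumptions, a caller would query the task's state once at the start of dump, call `plan_dump(parse_state(state), &plan)`, and run only the actions the plan still marks as needed.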

In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.

Another use case for this functionality is optimizing resource
utilization: low-priority CUDA tasks are preempted immediately
to release GPU resources needed by high-priority tasks, and the
paused workloads are later resumed or migrated to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

2025-03-21 12:40:31 -07:00
