Mirrors/criu

mirror of https://github.com/checkpoint-restore/criu.git synced 2026-07-21 01:06:58 +00:00

Author	SHA1	Message	Date
David Francis	ff35a9126e	plugins/amdgpu: remove excessive debug messages These pr_info lines, which begin with "CC3" and "TWI", were not meant to be included in the patch. Co-authored-by: Andrei Vagin <avagin@google.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	9e404e2083	plugin/amdgpu: Support for checkpoint of dmabuf fds amdgpu libraries that use dmabuf fd to share GPU memory between processes close the dmabuf fds immediately after using them. However, it is possible that checkpoint of a process catches one of the dmabuf fds open. In that case, the amdgpu plugin needs to handle it. The checkpoint of the dmabuf fd does require the device file it was exported from to have already been dumped. To identify which device this dmabuf fd was exported from, attempt to import it on each device, then record the dmabuf handle it imports as. This handle can be used to restore it. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:37 +00:00
David Francis	d43217dadb	plugin: Add DUMP_DEVICES_LATE callback The amdgpu plugin was counting how many files were checkpointed to determine when it should close the device files. The number of device files is not consistent; a process may have multiple copies of the drm device files open. Instead of doing this counting, add a new callback after all files are checkpointed, so plugins can clean up their resources at an appropriate time. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:37 +00:00
David Francis	db0ec806d1	plugin/amdgpu: Add handling for amdgpu drm buffer objects Buffer objects held by the amdgpu drm driver are checkpointed with the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling is in amdgpu_plugin_drm.h. Handling of imported buffer objects may require dmabuf fds to be transferred between processes. These occur over fdstore, with the handle-fdstore id relationships kept in shared memory. There is a new plugin callback: RESTORE_INIT to create the shared memory. During checkpoint, track shared buffer objects, so that buffer objects that are shared across processes can be identified. During restore, track which buffer objects have been restored. Retry restore of a drm file if a buffer object is imported and the original has not been exported yet. Skip buffer objects that have already been completed or cannot be completed in the current restore. So drm code can use sdma_copy_bo, that function no longer requires kfd bo structs. Update the protobuf messages with new amdgpu drm information. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	5eb61e1b14	plugin/amdgpu: Add drm header The amdgpu plugin usually calls drm ioctls through the libdrm wrappers. However, amdgpu restore requires dealing with dmabufs and gem handles directly, which means drm ioctls must be called directly. Add the drm.h header (from the kernel's uapi). Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	0b7ca29c19	plugin/amdgpu: Add amdgpu drm header For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	fb02dbf685	files-ext: Allow plugin files to retry amdgpu dmabuf CRIU requires the ability of the amdgpu plugin to retry. Change files_ext.c to read a response of 1 from a plugin restore function to mean retry. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
Yanning Yang	920437205c	plugins/amdgpu: Update `README.md` and `criu-amdgpu-plugin.txt` Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	4a3a695dfb	plugins/amdgpu: Implement parallel restore This patch implements the entire logic to enable the offloading of buffer object content restoration. The goal of this patch is to offload the buffer object content restoration to the main CRIU process so that this restoration can occur in parallel with other restoration logic (mainly the restoration of memory state in the restore blob, which is time-consuming) to speed up the restore phase. The restoration of buffer object content usually takes a significant amount of time for GPU applications, so parallelizing it with other operations can reduce the overall restore time. It has three parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the POST_FORKING hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process; the third stops the parallel thread in the RESUME_DEVICES_LATE hook. This optimization only focuses on the single-process situation (common case). In other scenarios, it will turn to the original method. This is achieved with the new `parallel_disabled` flag. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	33ed774c8d	plugins/amdgpu: Add parallel restore command Currently the restore of buffer objects consumes a significant amount of time. However, this part has no logical dependencies with other restore operations. This patch introduces some structures and some helper functions for the target process to offload this task to the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	6386140754	plugins/amdgpu: Add socket operations When enabling parallel restore, the target process and the main CRIU process need an IPC interface to communicate and transfer restore commands. This patch adds a Unix domain TCP socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Andrei Vagin	ce680fc6c7	Revert "plugins/amdgpu: Implement parallel restore" This functionality (#2527) is being reverted and excluded from this release due to issue #2812. It will be included in a subsequent release once all associated issues are resolved. Signed-off-by: Andrei Vagin <avagin@google.com>	2025-11-13 08:40:46 -08:00
Andrei Vagin	2b8951a9cf	image: use `protoc` instead of `protoc-c` The new protoc 1.5.2 reports warnings: `protoc-c` is deprecated. Please use `protoc` instead! Signed-off-by: Andrei Vagin <avagin@gmail.com>	2025-11-02 07:48:22 -08:00
Yanning Yang	7a5b3d1f41	plugins/amdgpu: Update `README.md` and `criu-amdgpu-plugin.txt` Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-02 07:48:22 -08:00
Yanning Yang	a61116fd93	plugins/amdgpu: Implement parallel restore This patch implements the entire logic to enable the offloading of buffer object content restoration. The goal of this patch is to offload the buffer object content restoration to the main CRIU process so that this restoration can occur in parallel with other restoration logic (mainly the restoration of memory state in the restore blob, which is time-consuming) to speed up the restore phase. The restoration of buffer object content usually takes a significant amount of time for GPU applications, so parallelizing it with other operations can reduce the overall restore time. It has three parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the POST_FORKING hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process; the third stops the parallel thread in the RESUME_DEVICES_LATE hook. This optimization only focuses on the single-process situation (common case). In other scenarios, it will turn to the original method. This is achieved with the new `parallel_disabled` flag. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-02 07:48:22 -08:00
Yanning Yang	e8ba7c103a	plugins/amdgpu: Add parallel restore command Currently the restore of buffer objects consumes a significant amount of time. However, this part has no logical dependencies with other restore operations. This patch introduces some structures and some helper functions for the target process to offload this task to the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-02 07:48:22 -08:00
Yanning Yang	1fd1b670c4	plugins/amdgpu: Add socket operations When enabling parallel restore, the target process and the main CRIU process need an IPC interface to communicate and transfer restore commands. This patch adds a Unix domain TCP socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-02 07:48:22 -08:00
Radostin Stoyanov	6805841660	cuda: remove redundant goto label The `goto interrupt` label is unnecessary as the code directly returns after `cuda_process_checkpoint_action()`. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-02 07:42:55 -08:00
Radostin Stoyanov	e7aee3c5c7	cuda: use pr_perror for libc function errors When handling errors for functions such as `ptrace()`, `pipe()`, and `fork()` it would be better to use `pr_perror` instead of `pr_err` as it would include a message describing the encountered error. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-02 07:42:55 -08:00
Radostin Stoyanov	82b03429b7	cuda: disable CUDA plugin for pre-dump Temporarily disable CUDA plugin for `criu pre-dump`. pre-dump currently fails with the following error: Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0 [heap] Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0 Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895 /dev/nvidia0 Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000 Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1 We plan to enable support for pre-dump by skipping nvidia mappings in a separate patch. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	02056bf41a	cuda: prevent task lockup on timeout error When creating a checkpoint of large models, the `checkpoint` action of `cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail with the following error, leaving the CUDA task in a locked state: cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202 Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0 Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with net: Unlock network cuda_plugin: finished cuda_plugin stage 0 err -1 cuda_plugin: resuming devices on pid 84145 cuda_plugin: Restore thread pid 84202 found for real pid 84145 Unfreezing tasks into 1 Unseizing 84145 into 1 Error (criu/cr-dump.c:2111): Dumping FAILED. To fix this, we set `task_info->checkpointed` before invoking the `checkpoint` action to ensure that the CUDA task is resumed even if CRIU times out. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Jesus Ramos	dc6cef0b4c	cuda: Fix return value from CHECKPOINT_DEVICES hook so that dumps fail properly cuda-checkpoint returns the positive CUDA error code when it runs into an issue, and passing that along as the return value would cause errors to get ignored. Signed-off-by: Jesus Ramos <jeramos@nvidia.com>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	28c2cb3fd6	cuda: enable checkpoint support for paused tasks If a CUDA process is already in a "locked" or "checkpointed" state during criu dump, the CUDA plugin currently fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This patch extends the CUDA plugin to handle such cases by first verifying the initial state of the CUDA processes and skipping unnecessary "lock" and "checkpoint" actions when a process has been locked or checkpointed before CRIU is invoked. In particular, CUDA tasks may already be in a "locked" or "checkpointed" state to ensure consistent checkpoint/restore for distributed workloads, such as model training, where multiple containers run across different cluster nodes. Another use case for this functionality is optimizing resource utilization, where CUDA tasks with low-priority are preempted immediately to release GPU resources needed by high-priority tasks, and the paused workloads are later resumed or migrated to another node. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	b1cac7a8e5	cuda: fix check for GPU device availability The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore. [1] https://github.com/NVIDIA/gpu-operator Fixes: #2509 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	4196268eef	seize: enable support for frozen containers Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete. To reflect this new behavior, the following changes are applied: - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)` - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode` - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)` Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`. Fixes: #2508 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	5335b35f72	images/inventory: add field for enabled plugins This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This allows disabling unnecessary plugins during restore, showing appropriate error messages if required CRIU plugins are missing, and migrating a process that does not use GPU from a GPU-enabled system to a CPU-only environment. We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between unset and missing field: - When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be enabled during restore. - When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be disabled during restore. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	4f8f6f2883	Makefile.config: set CR_PLUGIN_DEFAULT variable By default, CRIU uses the path "/usr/lib/criu" to install and load plugins at runtime. This path is defined by the `PLUGINDIR` variable in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`. However, some distribution packages might install the CRIU plugins at "/usr/lib64/criu" instead. This patch updates the makefile to align the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
Radostin Stoyanov	f1d465448f	amdgpu: remove exec permissions on source files This patch fixes the following warnings that appear when building an RPM package: + /usr/lib/rpm/redhat/brp-mangle-shebangs * WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit * WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-03-21 12:40:31 -07:00
David Francis	096c1f7a4d	plugins/amdgpu - Increase maximum parameter length The topology parsing assumed that all parameter names were 30 characters or fewer, but recommended_sdma_engine_id_mask is 31 characters. Make the maximum length a macro, and set it to 64. Signed-off-by: David Francis <David.Francis@amd.com>	2024-09-19 15:23:42 -07:00
David Francis	60ee5ebd9d	plugins/amdgpu: Zero ib_info on initialization This struct was being used un-initialized, meaning it was filled with random garbage. Mea culpa. Signed-off-by: David Francis <David.Francis@amd.com>	2024-09-19 15:23:42 -07:00
Andrei Vagin	6918998897	plugin/cuda: disable CUDA plugin if /dev/nvidiactl isn't present The presence of /dev/nvidiactl indicates that the system has a compatible NVIDIA GPU driver installed and that the GPU is accessible to the operating system. Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-19 15:23:42 -07:00
Andrei Vagin	651df375bd	criu: Allow disabling freeze cgroups Some plugins (e.g., CUDA) may not function correctly when processes are frozen using cgroups. This change introduces a mechanism to disable the use of freeze cgroups during process seizing, even if explicitly requested via the --freeze-cgroup option. The CUDA plugin is updated to utilize this new mechanism to ensure compatibility. Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	b1b3c14b17	cuda: unlock on timeout error When attempting to checkpoint a container with CUDA processes, CRIU could fail with the following error: Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1 Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with In this situation, the target process is locked, but CRIU fails due to a timeout and exits with an error. We need to make sure that the target PID is unlocked in such case. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	21ea718f9f	plugins/amdgpu: fix printf format specifiers Errors on aarch64: In file included from amdgpu_plugin_drm.h:10, from amdgpu_plugin.c:33: amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file': amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info' 1236 \| pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects, \| ^~~~~~~ cc1: all warnings being treated as errors Errors on ppc64: In file included from amdgpu_plugin_drm.h:10, from amdgpu_plugin.c:33: amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file': amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info' 1236 \| pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n", \| ^~~~~~~ cc1: all warnings being treated as errors In file included from amdgpu_plugin_util.c:38: amdgpu_plugin_util.c: In function 'print_kfd_bo_stat': amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info' 196 \| pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr); \| ^~~~~~~ amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info' 197 \| pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size); \| ^~~~~~~ amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info' 198 \| pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset); \| ^~~~~~~ amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=] 24 \| #define LOG_PREFIX "amdgpu_plugin: " \| ^~~~~~~~~~~~~~~~~ ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX' 47 \| #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__) \| ^~~~~~~~~~ amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info' 199 \| pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset); \| ^~~~~~~ cc1: all warnings being treated as errors Co-developed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	3e2ed18790	plugins/amdgpu: use C99-standard types Co-developed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	2ee5844411	plugins/amdgpu: fix cross-compilation To enable cross-compile we need to use the CC definition from criu/scripts/nmk/scripts/tools.mk: CC := $(CROSS_COMPILE)$(HOSTCC) Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	ad66c27a11	cuda: fix launch cuda-checkpoint When the cuda-checkpoint tool is not installed, execvp() is expected to fail and return -1. In this case, we need to call exit() to terminate the child process that was created earlier with fork(). Since CRIU can be used with applications that do not use CUDA, even when the CUDA plugin is installed, this patch also updates the log messages to show debug and warning (instead of error) when the cuda-checkpoint tool is not found in $PATH. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org> Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	fde0b7ac69	cuda: don't leak fds to cuda-checkpoint Leaking open file descriptors to third-party tools can lead to security risks. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	c42b58f4fb	plugin: enable multiple plugins for the same hook CRIU provides two plugins for checkpoint/restore of GPU applications: amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to enable restore: CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late) CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late) However, CRIU currently does not support running more than one plugin for the same hook. As a result, when both plugins are installed, the resume function for CUDA applications is not executed. To fix this, we need to make sure that both `plugin_resume_devices_late()` functions return `-ENOTSUP` when restore is not supported. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	fcbadfbdbf	plugins: set executable bit on .so files For historical reasons, some tools like rpm [1] or ldd [2,3] may expect the executable bit to be present for the correct identification of shared libraries. The executable bit on .so files is set by default by compilers (e.g., GCC). It is not strictly necessary but primarily a convention. [1] https://docs.fedoraproject.org/en-US/package-maintainers/CommonRpmlintIssues/#unstripped_binary_or_object [2] https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/ldd.bash.in;h=d6b640df;hb=HEAD#l154 [3] $ sudo ldd /usr/lib/criu/*.so /usr/lib/criu/amdgpu_plugin.so: ldd: warning: you do not have execution permission for `/usr/lib/criu/amdgpu_plugin.so' linux-vdso.so.1 (0x00007fd0a2a3e000) libdrm.so.2 => /lib64/libdrm.so.2 (0x00007fd0a29eb000) libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007fd0a29de000) libc.so.6 => /lib64/libc.so.6 (0x00007fd0a27fc000) /lib64/ld-linux-x86-64.so.2 (0x00007fd0a2a40000) /usr/lib/criu/cuda_plugin.so: ldd: warning: you do not have execution permission for `/usr/lib/criu/cuda_plugin.so' linux-vdso.so.1 (0x00007f1806e13000) libc.so.6 => /lib64/libc.so.6 (0x00007f1806c08000) /lib64/ld-linux-x86-64.so.2 (0x00007f1806e15000) Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Andrei Vagin	b169e3b63d	plugins/cuda: fix crosscompilation Signed-off-by: Andrei Vagin <avagin@gmail.com>	2024-09-11 16:02:11 -07:00
Jesus Ramos	bf417dd050	criu/plugin: Add NVIDIA CUDA plugin Add support for the NVIDIA cuda-checkpoint utility. This requires the use of an r555 or higher driver along with the cuda-checkpoint binary. Signed-off-by: Jesus Ramos <jeramos@nvidia.com>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	a808f09bea	amdgpu_plugin: fix lint errors $ make lint ... # Do not append \n to pr_perror, pr_pwarn or fail ! git --no-pager grep -E '^\s\<(pr_perror\|pr_pwarn\|fail)\>.\\n"' plugins/amdgpu/amdgpu_plugin.c: pr_perror("%s(), Can't handle VMAs of input device\n", __func__); ! git --no-pager grep -En '^\s\<pr_(err\|warn\|msg\|info\|debug)\>.);$' \| grep -v '\\n' plugins/amdgpu/amdgpu_plugin_drm.c:45: pr_err("Error in getting stat for: %s", path); plugins/amdgpu/amdgpu_plugin_util.c:77: pr_err("Unable to read file (read:%ld buf_len:%ld)", len_read, buf_len); plugins/amdgpu/amdgpu_plugin_util.c:89: pr_err("Unable to write file (wrote:%ld buf_len:%ld)", len_write, buf_len); plugins/amdgpu/amdgpu_plugin_util.c:120: pr_err("%s: Failed to open for %s", path, write ? "write" : "read"); plugins/amdgpu/amdgpu_plugin_util.c:126: pr_err("%s: Failed get pointer for %s", path, write ? "write" : "read"); plugins/amdgpu/amdgpu_plugin_util.c:136: pr_err("%s:Failed to access file size", path); plugins/amdgpu/amdgpu_plugin_util.c:152: pr_err("Cannot fopen %s", file_path); make: *** [Makefile:470: lint] Error 1 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Ramesh Errabolu	0d5923c95e	amdgpu_plugin: Refactor code used to implement Checkpoint Refactor code used to Checkpoint DRM devices. Code is moved into amdgpu_plugin_drm.c file which hosts various methods to checkpoint and restore a workload. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>	2024-09-11 16:02:11 -07:00
Ramesh Errabolu	733ef96315	amdgpu_plugin: Refactor code in preparation to support C&R for DRM devices Add a new compilation unit to host symbols and methods that will be needed to C&R DRM devices. Refactor code that indicates support for C&R and checkpoints KFD and DRM devices Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>	2024-09-11 16:02:11 -07:00
Pavel Tikhomirov	b689a6710c	plugin/amdgpu: Also don't print 'plugin failed' in criu We already don't treat it as error in the plugin itself, but after returning -1 from RESUME_DEVICES_LATE hook we print debug message in criu about failed plugin, let's return 0 instead. While on it let's replace ret to exit_code. Fixes: `a9cbdad76` ("plugin/amdgpu: Don't print error for "No such process" during resume") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>	2024-09-11 16:02:11 -07:00
David Francis	59599dacdd	plugin/amdgpu: Don't print error for "No such process" during resume During the late stages of restore, each process being resumed gets an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd process info, this call will fail with -ESRCH. This is normal behaviour, so we shouldn't print an error message for it. Signed-off-by: David Francis <David.Francis@amd.com>	2024-09-11 16:02:11 -07:00
Andrei Vagin	e076c11e22	ci: fix codespell errors Signed-off-by: Andrei Vagin <avagin@gmail.com>	2023-11-27 16:47:16 -08:00
Radostin Stoyanov	28e854d662	amdgpu: fix clang warnings amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized] if (ret) { ^~~ amdgpu_plugin.c:988:8: note: uninitialized use occurs here xfree(buffer); Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2023-10-22 13:29:25 -07:00
Radostin Stoyanov	ba168ab78c	ci: enable build with amdgpu plugin This patch adds the `libdrm-dev` package to the list of CRIU dependencies installed in CI to build CRIU with amdgpu plugin. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2023-10-22 13:29:25 -07:00
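
Note on c42b58f4fb (multiple plugins for the same hook): the entry above says each plugin's `RESUME_DEVICES_LATE` function should return `-ENOTSUP` when it does not apply, so that another plugin registered for the same hook can still run. The following is a minimal sketch of that convention, not CRIU's actual dispatch code. The names `resume_hook_t`, `amdgpu_like_hook`, `cuda_like_hook`, and `run_resume_hooks` are hypothetical, and the assumption that the first result other than `-ENOTSUP` ends the chain is ours.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: 0 on success, -ENOTSUP when the plugin
 * does not handle this task, any other negative value on error. */
typedef int (*resume_hook_t)(int pid);

/* Illustrative stand-in for a plugin that has nothing to resume. */
static int amdgpu_like_hook(int pid)
{
	(void)pid;
	return -ENOTSUP;
}

/* Illustrative stand-in for a plugin that handles the task. */
static int cuda_like_hook(int pid)
{
	return pid > 0 ? 0 : -EINVAL;
}

/*
 * Call the hooks in registration order. A hook that returns -ENOTSUP
 * is skipped; the first hook that returns anything else decides the
 * result (assumption made for this sketch). If no hook applies, the
 * caller sees -ENOTSUP.
 */
static int run_resume_hooks(const resume_hook_t *hooks, size_t n, int pid)
{
	int ret = -ENOTSUP;

	assert(hooks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		ret = hooks[i](pid);
		if (ret != -ENOTSUP)
			break;
	}
	return ret;
}
```

With both stand-ins registered in the order shown, the first hook declines and the second one still runs, which is the behaviour the commit message asks for. If the first hook returned `-1` or `0` unconditionally, the second would never be reached.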

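Note on 21ea718f9f (printf format specifiers): the compiler output quoted in that entry shows `%lld`, `%llu`, and `%llx` being rejected on aarch64 and ppc64 because `__u64` is `long unsigned int` there. One common portable approach, sketched below, is to pass a `uint64_t` and use the `PRIx64` family of macros from `<inttypes.h>`. This is an illustration only and may differ from what the commit actually changed. The helper `format_bo_addr` is a hypothetical name.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit buffer-object address in hex.
 * PRIx64 expands to the correct length modifier for uint64_t on the
 * target ("lx" or "llx"), so the same source builds cleanly with
 * -Werror=format on every architecture.
 */
static int format_bo_addr(char *buf, size_t len, int idx, uint64_t addr)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%d. KFD BO Addr: %" PRIx64, idx, addr);
}
```

A value declared as `__u64` can be passed with an explicit cast, for example `(uint64_t)bo->addr`. An alternative with the same effect is to cast to `unsigned long long` and keep `%llx`.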

77 commits