Mirrors/criu

mirror of https://github.com/checkpoint-restore/criu.git synced 2026-01-22 18:05:10 +00:00

Author	SHA1	Message	Date
unichronic	9e5fbcd668	pycriu: Fix self-dump failure with explicit PID When `opts.pid` is explicitly set to `os.getpid()`, `pycriu` fails to daemonize the `criu` process. This causes `criu` to run as a child of the dumped process, leading to the error "The criu itself is within dumped tree". This can be fixed by modifying `_send_req_and_recv_resp` to check if the target PID matches the current process PID. If so, it enables daemon mode, ensuring `criu` is detached and the dump succeeds. Signed-off-by: unichronic <ishuvam.pal@gmail.com>	2026-01-21 00:25:29 +00:00
Pavel Tikhomirov	21a6758268	cr-restore/shstk: Make arch_shstk_unlock use correct pid In a simple case where the parent process and the child one are in one pid namespace we can safely use vpid(item) to ptrace the child. But, for the cases where the child is a pid namespace init, or the child is put into external pid namespace, the parent and the child have different pid namespaces and using pid vpid(item) (which e.g. for init will always be 1 here) to ptrace the child process is incorrect. Let's use the pid reported to us from clone as it's always the right pid of the child from the parent's point of view. Fixes: `7dd583002` ("restore: add infrastructure to enable shadow stack") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>	2026-01-20 00:08:19 +00:00
liqiang2020	07af3304fd	restore/pie: check return value of sys_rseq on unregister The return value of sys_rseq was previously ignored during unregistration, under the assumption that it would not fail if the rseq structure was properly registered. However, if sys_rseq fails, the kernel retains the registration. If the memory containing the rseq structure is subsequently unmapped or reused, kernel updates to the rseq area can cause the process to crash (e.g., via SIGSEGV). Check the return value of sys_rseq. If it fails, log the error code and abort the restoration process. This makes rseq unregistration failures fatal and explicit, aiding in debugging and preventing later obscure crashes. Signed-off-by: liqiang2020 <liqiang64@huawei.com>	2026-01-12 19:07:39 -08:00
Adrian Reber	fb59ae504e	test: fix GCC 16 compile error Fedora rawhide ships a pre-release of GCC 16 which produces following error: uprobes.c:34:22: error: variable ‘dummy’ set but not used [-Werror=unused-but-set-variable=] 34 \| volatile int dummy = 0; \| ^~~~~ Marking this variable as "__maybe_unused" to fix the error. Signed-off-by: Adrian Reber <areber@redhat.com>	2026-01-12 19:06:43 -08:00
Radostin Stoyanov	b208bec12d	crit: show dead task_state In some cases, CRIU can observe tasks that exit during checkpointing, and sets the state of these tasks to COMPEL_TASK_DEAD. This patch adds a string representation of this value that can be used by CRIT when decoding the images. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-12 18:49:12 -08:00
Radostin Stoyanov	9885fb3c75	crit: fix incorrect task state decoding CRIU defines the following constants for task state in compel/include/uapi/task-state.h COMPEL_TASK_ALIVE = 0x01 COMPEL_TASK_STOPPED = 0x03 COMPEL_TASK_ZOMBIE = 0x06 Thus, we need to swap the values for "zombie" and "stopped" used in CRIT. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-12 18:49:12 -08:00
ImranullahKhann	71fe85ec90	ci: add iproute2 to the list of packages in apt-packages.sh When running the command 'make docker-test', almost all zdtm tests fail, logging 'ip: not found'. 'ip' command of the iproute2 package was missing. So added the package to the list of dependencies in 'apt-packages.sh'. Now tests run Signed-off-by: ImranullahKhann <imranullahkhann2004@gmail.com>	2026-01-08 15:35:49 -08:00
Radostin Stoyanov	36f1e9d38c	amdgpu: use fseeko with large-file support instead of fseeko64 As of Alpine Linux 3.19, musl libc no longer contains separate fopen64(), fseeko64(), or ftello64() functions. This causes building CRIU with amdgpu plugin to fail with the following error: amdgpu_plugin.c: In function 'parallel_restore_bo_contents': amdgpu_plugin.c:2286:17: error: implicit declaration of function 'fseeko64'; did you mean 'fseeko'? [-Wimplicit-function-declaration] 2286 \| fseeko64(bo_contents_fp, entry->read_offset + offset, SEEK_SET); \| ^~~~~~~~ \| fseeko make[2]: * [Makefile:31: amdgpu_plugin.so] Error 1 make[1]: * [Makefile:363: amdgpu_plugin] Error 2 To fix this, add the missing $(DEFINES) to plugin builds, and since we always compile with _FILE_OFFSET_BITS=64, we don't need the 64 suffix. Fixes: #2826 Suggested-by: Andrei Vagin <avagin@google.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-08 07:48:23 -08:00
Radostin Stoyanov	ddf7a170ff	infect-types: fix user_gcs redefine error In file included from compel/arch/aarch64/src/lib/infect.c:10: compel/include/uapi/compel/asm/infect-types.h:24:8: error: redefinition of 'user_gcs' 24 \| struct user_gcs { \| ^ /usr/include/asm/ptrace.h:329:8: note: previous definition is here 329 \| struct user_gcs { \| ^ 1 error generated. make[1]: *** [/criu/scripts/nmk/scripts/build.mk:215: compel/arch/aarch64/src/lib/infect.o] Error 1 Suggested-by: Andrei Vagin <avagin@google.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-08 07:48:23 -08:00
Radostin Stoyanov	2dd66866e3	zdtm/cgroup_stray: fix uninitialized variable 51.04 DEP cgroup_stray.d 51.07 CC cgroup_stray.o 51.11 cgroup_stray.c:164:18: error: variable 'c' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer] 51.11 164 \| if (write(sk, &c, 1) != 1) { 51.11 \| ^ 51.11 1 error generated. 51.12 make[1]: * [../Makefile.inc:96: cgroup_stray.o] Error 1 51.12 make[1]: Leaving directory '/criu/test/zdtm/static' 51.12 make: * [Makefile:7: static] Error 2 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-08 07:48:23 -08:00
Radostin Stoyanov	974c1bc898	zdtm/tempfs_subns: fix uninitialized variable DEP tempfs_subns.d CC tempfs_subns.o tempfs_subns.c:50:23: error: variable 'fd' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer] 50 \| if (write(fds[1], &fd, sizeof(fd)) != sizeof(fd)) { \| ^~ 1 error generated. make[1]: *** [../Makefile.inc:96: tempfs_subns.o] Error 1 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-08 07:48:23 -08:00
Radostin Stoyanov	b1a51489dd	compel: fix sys_clock_gettime function signature The initialization of the struct timespec used as clockid input parameter was removed in commit: `b4441d1bd8` ("restorer.c: rm unneded struct init") This causes the build to fail on Alpine with clang version 21.1.2: GEN criu/pie/parasite-blob.h criu/pie/restorer.c:1230:39: error: variable 'ts' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer] 1230 \| if (sys_clock_gettime(t->clockid, &ts)) { \| ^~ 1 error generated. make[2]: * [/criu/scripts/nmk/scripts/build.mk:118: criu/pie/restorer.o] Error 1 make[1]: * [criu/Makefile:59: pie] Error 2 make: *** [Makefile:278: criu] Error 2 To fix this, we remove the "const" from the declaration of clock_gettime. Since the kernel writes the current time into the struct timespec provided by the caller, the pointer must be writable. Suggested-by: Andrei Vagin <avagin@google.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2026-01-08 07:48:23 -08:00
Pavel Tikhomirov	fc1867c44d	kerndat: Fix error handling for kerndat_has_timer_cr_ids() fail After commit [1] we accidentally stopped reporting the errors from kerndat_has_timer_cr_ids(), let's fix that. Fixes: `1eaa870cc` ("kerndat: check that hardware breakpoints work") [1] Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>	2026-01-02 12:51:53 -08:00
Pavel Tikhomirov	2e5f9facf9	util: Make close_safe() reset fd to -1 even on close() failure The "man 2 close":"Dealing with error returns from close()" says: "Retrying the close() after a failure return is the wrong thing to do" We should not leave the fd there, attempting to close it again on next close()/close_safe() may lead to accidentally closing something else. It confirms with the kernel code where sys_close() removes fd from fdtable in this stack: +-> sys_close +-> file_close_fd +-> file_close_fd_locked +-> rcu_assign_pointer(fdt->fd[fd], NULL) If there was an fd this stack is always reached and fd is always removed. Let's replace the fd with -1 after close no matter what. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>	2025-12-29 10:00:35 +00:00
Radostin Stoyanov	d4e8114130	readme: use a local copy of the CRIU logo The README currently uses an external link to criu.org for the embedded CRIU logo. Loading this URL when viewing the README on GitHub sometimes fails with "Error Fetching Resource". Using a local copy of the logo fixes this issue. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-12-17 08:43:50 -08:00
Adrian Reber	30acbabcdd	ci: also exclude docker version 29 Docker version 28 broke container restore in combination with network namespaces. The workaround in the CI script was excluding Docker version 28. Now that there is also Docker version 29, which is still broken, this also excludes Docker version 29. Signed-off-by: Adrian Reber <areber@redhat.com>	2025-12-14 17:28:58 +09:00
Radostin Stoyanov	f66e59ee5c	cr-dump: fix error handling Commit "plugin: Add DUMP_DEVICES_LATE callback" introduced a new plugin callback that is invoked in cr_dump_tasks(). The return value of this callback was assigned to the variable ret. However, this variable is later used as the return value when goto err is triggered in subsequent conditions. As a result, CRIU exits with "Dumping finished successfully" even when some actions have failed and inventory.img has not been created. To fix this, we replace ret with exit_code and use it only when it is actually needed. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-12-09 12:23:23 -08:00
Igor Svilenkov Bozic	f78bea8d34	zdtm: gcs: add opt-in GCS test support for AArch64 Introduce an opt-in mode for building and running ZDTM static tests with Guarded Control Stack (GCS) enabled on AArch64. Changes: - Support `GCS_ENABLE=1` builds, adding `-mbranch-protection=standard` and `-z experimental-gcs=check` to CFLAGS/LDFLAGS. - Export required GLIBC_TUNABLES at runtime via `TEST_ENV`. - %.pid rules to prefix test binaries with `$(TEST_ENV)` so the tunables are set when running tests. - Makefile rules for selectively enabling GCS in tests Usage: # Build and run with GCS enabled make -C zdtm/static GCS_ENABLE=1 posix_timers GCS_ENABLE=1 ./zdtm.py run --keep-img=always \ -t zdtm/static/posix_timers By default (`GCS_ENABLE` unset or 0), test builds and runs are unchanged. NOTE: This assumes that the test victim was compiled also using GCS_ENABLE=1 so that the proper GCS AArch64 ELF headers are present Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	d591e320e0	criu/restore: gcs: adds restore implementation for Guarded Control Stack This commit finalizes AArch64 Guarded Control Stack (GCS) support by wiring the full dump and restore flow. The restore path adds the following steps: - Define shared AArch64 GCS types and constants in a dedicated header for both compel and CRIU inclusion - compel: add get/set NT_ARM_GCS via ptrace, enabling user-space GCS state save and restore. - During restore switch to the new GCS (via GCSSTR) to place capability token sa_restorer address - arch_shstk_trampoline(): we enable GCS in a trampoline using prctl(PR_SET_SHADOW_STACK_STATUS, ...) via inline SVC. The trampoline is needed because we can’t RET without a valid GCS. - restorer: map the recorded GCS VMA, populate contents top-down with GCSSTR, write the signal capability at GCSPR_EL0 and the valid token at GCSPR_EL0-8, then switch to the rebuilt GCS (GCSSS1) - Save and restore registers via ptrace - Extend restorer argument structures to carry GCS state into post-restore execution - Add shstk_set_restorer_stack(): sets tmp_gcs to temporary restorer shadow stack start - Add gcs_vma_restore implementation (required for mremap of the GCS VMA) Tested with: GCS_ENABLE=1 ./zdtm.py run -t zdtm/static/env00 Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	2429d49e67	criu/dump: gcs: save GCS state during dump Add debug and info messages to log Guarded Control Stack state when dumping AArch64 threads. This includes the following values: - gcspr_el0 - features_enabled Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> [ alex: cleanup fixes ] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Mike Rapoport <rppt@kernel.org>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	41ecb7ac71	images: aarch64: add user_aarch64_gcs_entry - Define user_aarch64_gcs_entry in core-aarch64.proto to store Guarded Control Stack state (gcspr_el0, features_enabled). - Extend thread_info_aarch64 with an optional gcs field Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	92e6e523b5	compel: gcs: add opt-in GCS test support for AArch64 Introduce an opt-in mode for building and running compel tests with Guarded Control Stack (GCS) enabled on AArch64. Changes: - Extend compel/test/infect to support `GCS_ENABLE=1` builds, adding `-mbranch-protection=standard` and `-z experimental-gcs=check` to CFLAGS/LDFLAGS. - Export required GLIBC_TUNABLES at runtime via `TEST_ENV`. Usage: make -C compel/test/infect GCS_ENABLE=1 make -C compel/test/infect GCS_ENABLE=1 run By default (`GCS_ENABLE` unset or 0), builds and runs are unchanged. Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	2f676d20e4	compel: gcs: set up GCS token/restorer for rt_sigreturn When GCS is enabled, the kernel expects a capability token at GCSPR_EL0-8 and sa_restorer at GCSPR_EL0-16 on rt_sigreturn. The sigframe must be consistent with the kernel’s expectations, with GCSPR_EL0 advanced by -8 having it point to the token on signal entry. On rt_sigreturn, the kernel verifies the cap at GCSPR_EL0, invalidates it and increments GCSPR_EL0 by 8 at the end of gcs_restore_signal() . Implement parasite_setup_gcs() to: - read NT_ARM_GCS via ptrace(PTRACE_GETREGSET) - write (via ptrace) the computed capability token and restorer address - update GCSPR_EL0 to point to the token's location Call parasite_setup_gcs() into parasite_start_daemon() so the sigreturn frame satisfies kernel's expectation Tests with GCS remain opt‑in: make -C compel/test/infect GCS_ENABLE=1 && make -C compel/test/infect run Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> [ alex: cleanup fixes ] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Mike Rapoport <rppt@kernel.org>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	6bb856b0af	compel: gcs: initial GCS support for signal frames Add basic prerequisites for Guarded Control Stack (GCS) state on AArch64. This adds a gcs_context to the signal frame and extends user_fpregs_struct_t to carry GCS metadata, preparing the groundwork for GCS in the parasite. For now, the GCS fields are zeroed during compel_get_task_regs(), technically ignoring GCS since it does not reach the control logic yet; that will be introduced in the next commit. The code path is gated and does not affect normal tests. Can be explicitly enabled and tested via: make -C infect GCS_ENABLE=1 && make -C infect run Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> [ alex: clean up fixes ] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Mike Rapoport <rppt@kernel.org>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	73ca071483	gcs: add GCS constants and helper macros Introduce ARM64 Guarded Control Stack (GCS) constants and macros in a new uapi header for use in both CRIU and compel. Includes: - NT_ARM_GCS type - prctl(2) constants for GCS enable/write/push modes - Capability token helpers (GCS_CAP, GCS_SIGNAL_CAP) - HWCAP_GCS definition These are based on upstream Linux definitions Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Mike Rapoport <rppt@kernel.org>	2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic	501b714f76	compel/aarch64: refactor fpregs handling Refactor user_fpregs_struct_t to wrap user_fpsimd_state in a dedicated struct, preparing for future extending by just adding new members Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com> [ alex: fixes ] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Mike Rapoport <rppt@kernel.org>	2025-12-07 19:20:00 +01:00
Adrian Reber	90300748ef	tty: fix compiler error At least on tests running on Fedora rawhide following error could be seen: ``` criu/tty.c: In function 'pts_fd_get_index': criu/tty.c:262:21: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers] 262 \| char *pos = strrchr(link->name, '/'); \| ``` This fixes it. Signed-off-by: Adrian Reber <areber@redhat.com>	2025-11-28 09:18:59 +00:00
Adrian Reber	09bb362664	restore: fix "Defect type: UNINIT" Static code analysis reported: 1. criu/cr-restore.c:2438:2: var_decl: Declaring variable "end_vma" without initializer. 4. criu/cr-restore.c:2451:5: assign: Assigning: "s_vma" = "&end_vma", which points to uninitialized data. 7. criu/cr-restore.c:2449:4: uninit_use: Using uninitialized value "s_vma->list.next". This tries to fix it by initializing the variable. Signed-off-by: Adrian Reber <areber@redhat.com>	2025-11-28 09:18:15 +00:00
Adrian Reber	bf82389de3	dump: fix "Defect type: IDENTICAL_BRANCHES" Static code analysis reported: criu/cr-dump.c:2328:2: identical_branches: The same code is executed when the condition "ret" is true or false, because the code in the if-then branch and after the if statement is identical. Should the if statement be removed? This is a fix for the warning. Signed-off-by: Adrian Reber <areber@redhat.com>	2025-11-28 09:18:15 +00:00
Mark Polyakov	2cf8f13ca1	doc: update pipe/socket examples for --inherit-fd The syntax of the inherit-fd functionality for unix socket and pipe includes a colon. Fixes: `0df3f79fc0` ("criu(8): fix --inherit-fd description") Fixes: `c37324b6d0` ("crtools: describe the inherit-fd option") Signed-off-by: Mark Polyakov <mark@thundercompute.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-16 15:56:26 +00:00
Yanning Yang	62aadb22ab	amdgpu: use 64-bit offsets for parallel restore On AMD Instinct MI300 systems, restoring a large GPU application can fail because the checkpoint size is too large and the maximum value of an offset (with integer type) is insufficient. This problem occurs when the total size of all buffer objects exceeds int max, not because any single buffer is too large, but it can also happen with a large number of small buffers. Fixes: #2812 Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-16 07:44:37 -08:00
Radostin Stoyanov	1db7eed69f	amdgpu: use local kernel headers instead of libdrm Use local copies of amdgpu and DRM headers for consistency. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
Radostin Stoyanov	29525f8cb3	codespell: skip amdgpu kernel headers These header files are copied directly from the Linux kernel and contain typos. We skip these files in codespell to simplify maintenance. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
Radostin Stoyanov	e4a5e164b4	plugins/amdgpu: update kernel headers This patch updates drm.h and amdgpu_drm.h kernel headers, and adds drm_mode.h (included by drm.h) from the rocm-7.1.0 release tag. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
Radostin Stoyanov	f56ccfd2d6	plugins/amdgpu: remove unused variable amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable] 167 \| int num_bos = 0; \| Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	6ed49894c5	plugins/amdgpu: add a comment for retry_needed Add a comment that explains the purpose of `retry_needed`. Co-authored-by: Andrei Vagin <avagin@google.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	77e6558ddb	plugins/amdgpu: apply code-style fixes Co-authored-by: Andrei Vagin <avagin@google.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	690b610432	plugins/amdgpu: return 0 in post_dump_dmabuf_check Use `return 0` on success in `post_dump_dmabuf_check()` for consistency with other functions. Co-authored-by: Andrei Vagin <avagin@google.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	ff35a9126e	plugins/amdgpu: remove excessive debug messages These pr_info lines that begin with "CC3" and "TWI" were not meant to be included in the patch. Co-authored-by: Andrei Vagin <avagin@google.com> Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2025-11-14 18:31:37 +00:00
David Francis	9e404e2083	plugin/amdgpu: Support for checkpoint of dmabuf fds amdgpu libraries that use dmabuf fd to share GPU memory between processes close the dmabuf fds immediately after using them. However, it is possible that checkpoint of a process catches one of the dmabuf fds open. In that case, the amdgpu plugin needs to handle it. The checkpoint of the dmabuf fd does require the device file it was exported from to have already been dumped. To identify which device this dmabuf fd was exported from, attempt to import it on each device, then record the dmabuf handle it imports as. This handle can be used to restore it. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:37 +00:00
David Francis	d43217dadb	plugin: Add DUMP_DEVICES_LATE callback The amdgpu plugin was counting how many files were checkpointed to determine when it should close the device files. The number of device files is not consistent; a process may have multiple copies of the drm device files open. Instead of doing this counting, add a new callback after all files are checkpointed, so plugins can clean up their resources at an appropriate time. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:37 +00:00
David Francis	db0ec806d1	plugin/amdgpu: Add handling for amdgpu drm buffer objects Buffer objects held by the amdgpu drm driver are checkpointed with the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling is in amdgpu_plugin_drm.h. Handling of imported buffer objects may require dmabuf fds to be transferred between processes. These occur over fdstore, with the handle-fstore id relationships kept in shared memory. There is a new plugin callback: RESTORE_INIT to create the shared memory. During checkpoint, track shared buffer objects, so that buffer objects that are shared across processes can be identified. During restore, track which buffer objects have been restored. Retry restore of a drm file if a buffer object is imported and the original has not been exported yet. Skip buffer objects that have already been completed or cannot be completed in the current restore. So drm code can use sdma_copy_bo, that function no longer requires kfd bo structs. Update the protobuf messages with new amdgpu drm information. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	5eb61e1b14	plugin/amdgpu: Add drm header The amdgpu plugin usually calls drm ioctls through the libdrm wrappers. However, amdgpu restore requires dealing with dmabufs and gem handles directly, which means drm ioctls must be called directly. Add the drm.h header (from the kernel's uapi). Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	0b7ca29c19	plugin/amdgpu: Add amdgpu drm header For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	fb02dbf685	files-ext: Allow plugin files to retry amdgpu dmabuf CRIU requires the ability of the amdgpu plugin to retry. Change files_ext.c to read a response of 1 from a plugin restore function to mean retry. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
David Francis	7a4ee0ae8e	restorer: Skip non-regular VMAs amdgpu represents allocated device memory as a memory mapping of the device file. This is a non-standard VMA that must be handled by the plugin, not the normal VMA code. Ignore all VMAs on device files. Signed-off-by: David Francis <David.Francis@amd.com>	2025-11-14 18:31:36 +00:00
Yanning Yang	920437205c	plugins/amdgpu: Update `README.md` and `criu-amdgpu-plugin.txt` Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	4a3a695dfb	plugins/amdgpu: Implement parallel restore This patch implements the entire logic to enable the offloading of buffer object content restoration. The goal of this patch is to offload the buffer object content restoration to the main CRIU process so that this restoration can occur in parallel with other restoration logic (mainly the restoration of memory state in the restore blob, which is time-consuming) to speed up the restore phase. The restoration of buffer object content usually takes a significant amount of time for GPU applications, so parallelizing it with other operations can reduce the overall restore time. It has three parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the POST_FORKING hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process; the third stops the parallel thread in the RESUME_DEVICES_LATE hook. This optimization only focuses on the single-process situation (common case). In other scenarios, it will turn to the original method. This is achieved with the new `parallel_disabled` flag. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	33ed774c8d	plugins/amdgpu: Add parallel restore command Currently the restore of buffer object consumes a significant amount of time. However, this part has no logical dependencies with other restore operations. This patch introduces some structures and some helper functions for the target process to offload this task to the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00
Yanning Yang	6386140754	plugins/amdgpu: Add socket operations When enabling parallel restore, the target process and the main CRIU process need an IPC interface to communicate and transfer restore commands. This patch adds a Unix domain TCP socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>	2025-11-14 18:27:31 +00:00

11759 commits
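
The task-state fix in commit 9885fb3c75 can be pictured with a small lookup table. This is only a sketch under stated assumptions: the three constants are the values quoted in that commit message, while `TASK_STATE_NAMES` and `decode_task_state` are hypothetical names for illustration and are not taken from the CRIT source. The value of COMPEL_TASK_DEAD (commit b208bec12d) is not given in the log, so it is left out here.

```python
# Values quoted in commit 9885fb3c75 from compel/include/uapi/task-state.h.
COMPEL_TASK_ALIVE = 0x01
COMPEL_TASK_STOPPED = 0x03
COMPEL_TASK_ZOMBIE = 0x06

# Hypothetical lookup table; the real CRIT code may be organised differently.
TASK_STATE_NAMES = {
    COMPEL_TASK_ALIVE: "alive",
    COMPEL_TASK_STOPPED: "stopped",
    COMPEL_TASK_ZOMBIE: "zombie",
}


def decode_task_state(value):
    """Return a readable state name, or the raw value in hex if it is unknown."""
    return TASK_STATE_NAMES.get(value, hex(value))


if __name__ == "__main__":
    for raw in (0x01, 0x03, 0x06):
        print(raw, decode_task_state(raw))
```

Keying the table on the named constants rather than on bare numbers makes a swapped "zombie"/"stopped" pair, the bug described in that commit, easier to spot in review.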