Compare commits

...

417 commits

Author SHA1 Message Date
unichronic
9e5fbcd668 pycriu: Fix self-dump failure with explicit PID
When `opts.pid` is explicitly set to `os.getpid()`, `pycriu` fails to
daemonize the `criu` process. This causes `criu` to run as a child of
the dumped process, leading to the error "The criu itself is within
dumped tree".

This can be fixed by modifying `_send_req_and_recv_resp` to check if the
target PID matches the current process PID. If so, it enables daemon
mode, ensuring `criu` is detached and the dump succeeds.

Signed-off-by: unichronic <ishuvam.pal@gmail.com>
2026-01-21 00:25:29 +00:00
Pavel Tikhomirov
21a6758268 cr-restore/shstk: Make arch_shstk_unlock use correct pid
In a simple case where the parent process and the child one are in one
pid namespace we can safely use vpid(item) to prace the child. But, for
the cases where the child is a pid namespace init, or the child is put
into external pid namespace, the parent and the child have different pid
namespaces and using pid vpid(item) (which e.g. for init will always be
1 here) to ptrace the child process is inorrect.

Let's use the pid reported to us from clone as it's always the right pid
of the child from the parent's point of view.

Fixes: 7dd583002 ("restore: add infrastructure to enable shadow stack")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2026-01-20 00:08:19 +00:00
liqiang2020
07af3304fd restore/pie: check return value of sys_rseq on unregister
The return value of sys_rseq was previously ignored during
unregistration, under the assumption that it would not fail if the rseq
structure was properly registered.

However, if sys_rseq fails, the kernel retains the registration. If the
memory containing the rseq structure is subsequently unmapped or reused,
kernel updates to the rseq area can cause the process to crash (e.g.,
via SIGSEGV).

Check the return value of sys_rseq. If it fails, log the error code and
abort the restoration process. This makes rseq unregistration failures
fatal and explicit, aiding in debugging and preventing later obscure
crashes.

Signed-off-by: liqiang2020 <liqiang64@huawei.com>
2026-01-12 19:07:39 -08:00
Adrian Reber
fb59ae504e test: fix GCC 16 compile error
Fedora rawhide ships a pre-release of GCC 16 which produces following
error:

  uprobes.c:34:22: error: variable ‘dummy’ set but not used [-Werror=unused-but-set-variable=]
  34 |         volatile int dummy = 0;
  |                      ^~~~~

Marking this variable as "__maybe_unused" to fix the error.

Signed-off-by: Adrian Reber <areber@redhat.com>
2026-01-12 19:06:43 -08:00
Radostin Stoyanov
b208bec12d crit: show dead task_state
In some cases, CRIU can observe tasks that exit during checkpointing,
and sets the state of these tasks to COMPEL_TASK_DEAD.
This patch adds a string representation of this value that can be used
by CRIT when decoding the images.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-12 18:49:12 -08:00
Radostin Stoyanov
9885fb3c75 crit: fix incorrect task state decoding
CRIU defines the following constants for task state in compel/include/uapi/task-state.h

COMPEL_TASK_ALIVE = 0x01
COMPEL_TASK_STOPPED = 0x03
COMPEL_TASK_ZOMBIE = 0x06

Thus, we need to swap the values for "zombie" and "stopped" used in CRIT.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-12 18:49:12 -08:00
ImranullahKhann
71fe85ec90 ci: add iproute2 to the list of packages in apt-packages.sh
When running the command 'make docker-test', almost all zdtm tests fail,
logging 'ip: not found'. 'ip' command of the iproute2 package was missing.
So added the package to the list of dependencies in 'apt-packages.sh'. Now
tests run

Signed-off-by: ImranullahKhann <imranullahkhann2004@gmail.com>
2026-01-08 15:35:49 -08:00
Radostin Stoyanov
36f1e9d38c amdgpu: use fseeko with large-file support instead of fseeko64
As of Alpine Linux 3.19, musl libc no longer contains separate
fopen64(), fseeko64(), or ftello64() functions. This causes building
CRIU with amdgpu plugin to fail with the following error:

amdgpu_plugin.c: In function 'parallel_restore_bo_contents':
amdgpu_plugin.c:2286:17: error: implicit declaration of function 'fseeko64'; did you mean 'fseeko'? [-Wimplicit-function-declaration]
 2286 |                 fseeko64(bo_contents_fp, entry->read_offset + offset, SEEK_SET);
      |                 ^~~~~~~~
      |                 fseeko
make[2]: *** [Makefile:31: amdgpu_plugin.so] Error 1
make[1]: *** [Makefile:363: amdgpu_plugin] Error 2

To fix this, add the missing $(DEFINES) to plugin builds, and since we
always compile with _FILE_OFFSET_BITS=64, we don't need the 64 suffix.

Fixes: #2826

Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-08 07:48:23 -08:00
Radostin Stoyanov
ddf7a170ff infect-types: fix user_gcs redefine error
In file included from compel/arch/aarch64/src/lib/infect.c:10:
compel/include/uapi/compel/asm/infect-types.h:24:8: error: redefinition of 'user_gcs'
   24 | struct user_gcs {
      |        ^
/usr/include/asm/ptrace.h:329:8: note: previous definition is here
  329 | struct user_gcs {
      |        ^
1 error generated.
make[1]: *** [/criu/scripts/nmk/scripts/build.mk:215: compel/arch/aarch64/src/lib/infect.o] Error 1

Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-08 07:48:23 -08:00
Radostin Stoyanov
2dd66866e3 zdtm/cgroup_stray: fix uninitialized variable
51.04  DEP       cgroup_stray.d
51.07  CC        cgroup_stray.o
51.11 cgroup_stray.c:164:18: error: variable 'c' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
51.11   164 |                 if (write(sk, &c, 1) != 1) {
51.11       |                                ^
51.11 1 error generated.
51.12 make[1]: *** [../Makefile.inc:96: cgroup_stray.o] Error 1
51.12 make[1]: Leaving directory '/criu/test/zdtm/static'
51.12 make: *** [Makefile:7: static] Error 2

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-08 07:48:23 -08:00
Radostin Stoyanov
974c1bc898 zdtm/tempfs_subns: fix uninitialized variable
DEP       tempfs_subns.d
 CC        tempfs_subns.o
tempfs_subns.c:50:23: error: variable 'fd' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
   50 |                         if (write(fds[1], &fd, sizeof(fd)) != sizeof(fd)) {
      |                                            ^~
1 error generated.
make[1]: *** [../Makefile.inc:96: tempfs_subns.o] Error 1

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-08 07:48:23 -08:00
Radostin Stoyanov
b1a51489dd compel: fix sys_clock_gettime function signature
The initialization of the struct timespec used as clockid input
parameter was removed in commit:
b4441d1bd8 ("restorer.c: rm unneded struct init")

This causes the build to fail on Alpine with clang version 21.1.2:

  GEN      criu/pie/parasite-blob.h
  criu/pie/restorer.c:1230:39: error: variable 'ts' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
   1230 |                         if (sys_clock_gettime(t->clockid, &ts)) {
        |                                                            ^~
  1 error generated.
  make[2]: *** [/criu/scripts/nmk/scripts/build.mk:118: criu/pie/restorer.o] Error 1
  make[1]: *** [criu/Makefile:59: pie] Error 2
  make: *** [Makefile:278: criu] Error 2

To fix this, we remove the "const" from the declaration of
clock_gettime. Since the kernel writes the current time into
the struct timespec provided by the caller, the pointer must
be writable.

Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2026-01-08 07:48:23 -08:00
Pavel Tikhomirov
fc1867c44d kerndat: Fix error handling for kerndat_has_timer_cr_ids() fail
After commit [1] we accidentally stopped reporting the errors from
kerndat_has_timer_cr_ids(), let's fix that.

Fixes: 1eaa870cc ("kerndat: check that hardware breakpoints work") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2026-01-02 12:51:53 -08:00
Pavel Tikhomirov
2e5f9facf9 util: Make close_safe() reset fd to -1 even on close() failure
The "man 2 close":"Dealing with error returns from close()" says:

  "Retrying the close() after a failure return is the wrong thing to do"

We should not leave the fd there, attempting to close it again on next
close()/close_safe() may lead to accidentally closing something else.

It confirms with the kernel code where sys_close() removes fd from
fdtable in this stack:

  +-> sys_close
    +-> file_close_fd
      +-> file_close_fd_locked
        +-> rcu_assign_pointer(fdt->fd[fd], NULL)

If there was an fd this stack is always reached and fd is always
removed.

Let's replace the fd with -1 after close no matter what.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-12-29 10:00:35 +00:00
Radostin Stoyanov
d4e8114130 readme: use a local copy of the CRIU logo
The README currently uses an external link to criu.org for the embedded
CRIU logo. Loading this URL when viewing the README on GitHub sometimes
fails with "Error Fetching Resource". Using a local copy of the logo
fixes this issue.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-12-17 08:43:50 -08:00
Adrian Reber
30acbabcdd ci: also exclude docker version 29
Docker version 28 broke container restore in combination with network
namespaces. The workaround in the CI script was excluding Docker version
28. Now that there is also Docker version 29, which is still broken,
this also excludes Docker version 29.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-12-14 17:28:58 +09:00
Radostin Stoyanov
f66e59ee5c cr-dump: fix error handling
Commit "plugin: Add DUMP_DEVICES_LATE callback" introduced a new plugin
callback that is invoked in cr_dump_tasks(). The return value of this
callback was assigned to the variable ret. However, this variable is later
used as the return value when goto err is triggered in subsequent
conditions. As a result, CRIU exits with "Dumping finished successfully" even
when some actions have failed and inventory.img has not been created.

To fix this, we replace ret with exit_code and use it only when it is
actually needed.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-12-09 12:23:23 -08:00
Igor Svilenkov Bozic
f78bea8d34 zdtm: gcs: add opt-in GCS test support for AArch64
Introduce an opt-in mode for building and running ZDTM static tests
with Guarded Control Stack (GCS) enabled on AArch64.

Changes:
 - Support `GCS_ENABLE=1` builds, adding `-mbranch-protection=standard`
   and `-z experimental-gcs=check` to CFLAGS/LDFLAGS.
 - Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.
 - %.pid rules to prefix test binaries with `$(TEST_ENV)`
   so the tunables are set when running tests.
 - Makefile rules for selectively enabling GCS in tests

Usage:
    # Build and run with GCS enabled
    make -C zdtm/static GCS_ENABLE=1 posix_timers
    GCS_ENABLE=1 ./zdtm.py run --keep-img=always \
        -t zdtm/static/posix_timers

By default (`GCS_ENABLE` unset or 0), test builds and runs are
unchanged.

NOTE: This assumes that the test victim was compiled also using
      GCS_ENABLE=1 so that the proper GCS AArch64 ELF headers are present

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn aleksandr.mikhalitsyn@canonical.com
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
d591e320e0 criu/restore: gcs: adds restore implementation for Guarded Control Stack
This commit finalizes AArch64 Guarded Control Stack (GCS)
support by wiring the full dump and restore flow.

The restore path adds the following steps:

 - Define shared AArch64 GCS types and constants in a dedicated header
   for both compel and CRIU inclusion
 - compel: add get/set NT_ARM_GCS via ptrace, enabling user-space
   GCS state save and restore.
 - During restore switch to the new GCS (via GCSSTR) to place capability
   token sa_restorer address
 - arch_shstk_trampoline() — We enable GCS in a trampoline that using
   prctl(PR_SET_SHADOW_STACK_STATUS, ...) via inline SVC. The trampoline
   ineeded because we can’t RET without a valid GCS.
 - restorer: map the recorded GCS VMA, populate contents top-down with
   GCSSTR, write the signal capability at GCSPR_EL0 and the valid token at
   GCSPR_EL0-8, then switch to the rebuilt GCS (GCSSS1)
 - Save and restore registers via ptrace
 - Extend restorer argument structures to carry GCS state
   into post-restore execution
 - Add shstk_set_restorer_stack(): sets tmp_gcs to temporary restorer
   shadow stack start
 - Add gcs_vma_restore implementation (required for mremap of the GCS VMA)

Tested with:
    GCS_ENABLE=1 ./zdtm.py run -t zdtm/static/env00

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
2429d49e67 criu/dump: gcs: save GCS state during dump
Add debug and info messages to log Guarded Control Stack state when
dumping AArch64 threads. This includes the following values:
  - gcspr_el0
  - features_enabled

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
41ecb7ac71 images: aarch64: add user_aarch64_gcs_entry
- Define user_aarch64_gcs_entry in core-aarch64.proto to store
    Guarded Control Stack state (gcspr_el0, features_enabled).
  - Extend thread_info_aarch64 with an optional gcs field

Also extend thread_info_aarch64 with an optional gcs field

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
92e6e523b5 compel: gcs: add opt-in GCS test support for AArch64
Introduce an opt-in mode for building and running compel tests
with Guarded Control Stack (GCS) enabled on AArch64.

Changes:
 - Extend compel/test/infect to support `GCS_ENABLE=1` builds,
   adding `-mbranch-protection=standard` and
   `-z experimental-gcs=check` to CFLAGS/LDFLAGS.
 - Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.

Usage:
    make -C compel/test/infect GCS_ENABLE=1
    make -C compel/test/infect GCS_ENABLE=1 run

By default (`GCS_ENABLE` unset or 0), builds and runs are unchanged.

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
2f676d20e4 compel: gcs: set up GCS token/restorer for rt_sigreturn
When GCS is enabled, the kernel expects a capability token at GCSPR_EL0-8
and sa_restorer at GCSPR_EL0-16 on rt_sigreturn. The sigframe must be
consistent with the kernel’s expectations, with GCSPR_EL0 advanced by -8
having it point to the token on signal entry. On rt_sigreturn, the kernel
verifies the cap at GCSPR_EL0, invalidates it and increments GCSPR_EL0 by 8
at the end of gcs_restore_signal() .

Implement parasite_setup_gcs() to:
- read NT_ARM_GCS via ptrace(PTRACE_GETREGSET)
- write (via ptrace) the computed capability token and restorer address
- update GCSPR_EL0 to point to the token's location

Call parasite_setup_gcs() into parasite_start_daemon() so the sigreturn
frame satisfies kernel's expectation

Tests with GCS remain opt‑in:
	make -C compel/test/infect GCS_ENABLE=1 && make -C compel/test/infect run

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
6bb856b0af compel: gcs: initial GCS support for signal frames
Add basic prerequisites for Guarded Control Stack (GCS) state on AArch64.

This adds a gcs_context to the signal frame and extends user_fpregs_struct_t to
carry GCS metadata, preparing the groundwork for GCS in the parasite.

For now, the GCS fields are zeroed during compel_get_task_regs(), technically
ignoring GCS since it does not reach the control logic yet; that will be
introduced in the next commit.

The code path is gated and does not affect normal tests. Can be explicitly
enabled and tested via:

    make -C infect GCS_ENABLE=1 && make -C infect run

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: clean up fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
73ca071483 gcs: add GCS constants and helper macros
Introduce ARM64 Guarded Control Stack (GCS) constants and macros
in a new uapi header for use in both CRIU and compel.

Includes:
 - NT_ARM_GCS type
 - prctl(2) constants for GCS enable/write/push modes
 - Capability token helpers (GCS_CAP, GCS_SIGNAL_CAP)
 - HWCAP_GCS definition

These are based on upstream Linux definitions

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
2025-12-07 19:20:00 +01:00
Igor Svilenkov Bozic
501b714f76 compel/aarch64: refactor fpregs handling
Refactor user_fpregs_struct_t to wrap user_fpsimd_state in a
dedicated struct, preparing for future extending by just
adding new members

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
2025-12-07 19:20:00 +01:00
Adrian Reber
90300748ef tty: fix compiler error
At least on tests running on Fedora rawhide following error could be
seen:

```
  criu/tty.c: In function 'pts_fd_get_index':
  criu/tty.c:262:21: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
    262 |         char *pos = strrchr(link->name, '/');
        |
```

This fixes it.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-11-28 09:18:59 +00:00
Adrian Reber
09bb362664 restore: fix "Defect type: UNINIT"
Static code analysis reported:

  1. criu/cr-restore.c:2438:2: var_decl: Declaring variable "end_vma"
     without initializer.
  4. criu/cr-restore.c:2451:5: assign: Assigning: "s_vma" = "&end_vma",
     which points to uninitialized data.
  7. criu/cr-restore.c:2449:4: uninit_use: Using uninitialized value
     "s_vma->list.next".

This tries to fix it by initializing the variable.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-11-28 09:18:15 +00:00
Adrian Reber
bf82389de3 dump: fix "Defect type: IDENTICAL_BRANCHES"
Static code analysis reported:

 criu/cr-dump.c:2328:2: identical_branches: The same code is executed
 when the condition "ret" is true or false, because the code in the
 if-then branch and after the if statement is identical. Should the if
 statement be removed?

This is a fix for the warning.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-11-28 09:18:15 +00:00
Mark Polyakov
2cf8f13ca1 doc: update pipe/socket examples for --inherit-fd
The syntax of the inherit-fd functionality for unix socket and pipe
includes a colon.

Fixes: 0df3f79fc0 ("criu(8): fix --inherit-fd description")
Fixes: c37324b6d0 ("crtools: describe the inherit-fd option")
Signed-off-by: Mark Polyakov <mark@thundercompute.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-16 15:56:26 +00:00
Yanning Yang
62aadb22ab amdgpu: use 64-bit offsets for parallel restore
On AMD Instinct MI300 systems, restoring a large GPU application can
fail because the checkpoint size is too large and the maximum value of
an offset (with integer type) is insufficient. This problem occurs when
the total size of all buffer objects exceeds int max, not because any
single buffer is too large, but it can also happen with a large number
of small buffers.

Fixes: #2812

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-16 07:44:37 -08:00
Radostin Stoyanov
1db7eed69f amdgpu: use local kernel headers instead of libdrm
Use local copies of amdgpu and DRM headers for consistency.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
Radostin Stoyanov
29525f8cb3 codespell: skip amdgpu kernel headers
These header files are copied directly from the Linux kernel and contain
typos. We skip these files in codespell to simplify maintenance.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
Radostin Stoyanov
e4a5e164b4 plugins/amdgpu: update kernel headers
This patch updates drm.h and amdgpu_drm.h kernel headers,
and adds drm_mode.h (included by drm.h) from the rocm-7.1.0
release tag.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
Radostin Stoyanov
f56ccfd2d6 plugins/amdgpu: remove unused variable
amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable]
  167 |         int num_bos = 0;
      |

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
6ed49894c5 plugins/amdgpu: add a comment for retry_needed
Add a comment that explains the purpose of `retry_needed`.

Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
77e6558ddb plugins/amdgpu: apply code-style fixes
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
690b610432 plugins/amdgpu: return 0 in post_dump_dmabuf_check
Use `return 0` on success in `post_dump_dmabuf_check()` for consistency
with other functions.

Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
ff35a9126e plugins/amdgpu: remove excessive debug messages
These pr_info lines begin with "CC3" and "TWI" were not meant to be
included in the patch.

Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
9e404e2083 plugin/amdgpu: Support for checkpoint of dmabuf fds
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.

The checkpoint of the dmabuf fd does require the device file
it was exported from to have already been dumped

To identify which device this dmabuf fd was exprted from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:37 +00:00
David Francis
d43217dadb plugin: Add DUMP_DEVICES_LATE callback
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.

The number of device files is not consistent; a process may
have multiple copies of the drm device files open.

Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:37 +00:00
David Francis
db0ec806d1 plugin/amdgpu: Add handling for amdgpu drm buffer objects
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h

Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These occur over fdstore, with the
handle-fstore id relationships kept in shread memory. There is a
new plugin callback: RESTORE_INIT to create the shared memory.

During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.

During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.

So drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs

Update the protobuf messages with new amdgpu drm information.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
5eb61e1b14 plugin/amdgpu: Add drm header
The amdgpu plugin usually calls drm ioctls through the libdrm
wrappers. However, amdgpu restore requires dealing with dmabufs
and gem handles directly, which means drm ioctls must be
called directly.

Add the drm.h header (from the kernel's uapi).

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
0b7ca29c19 plugin/amdgpu: Add amdgpu drm header
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
fb02dbf685 files-ext: Allow plugin files to retry
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.

Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
7a4ee0ae8e restorer: Skip non-regular VMAs
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.

Ignore all VMAs on device files.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
Yanning Yang
920437205c plugins/amdgpu: Update README.md and criu-amdgpu-plugin.txt
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
4a3a695dfb plugins/amdgpu: Implement parallel restore
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
33ed774c8d plugins/amdgpu: Add parallel restore command
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
6386140754 plugins/amdgpu: Add socket operations
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Pengda Yang
ddbb3dbd8d limit the field width of 'scanf'
Fixes: #2121

Signed-off-by: Pengda Yang <daz-3ux@proton.me>
2025-11-14 18:26:27 +00:00
Andrei Vagin
3c7d4fa013 criu: Version 4.2 (CRIUTIBILITY)
Major changes:
* plugins/amdgpu: Implement parallel restore
* Handle processes with uprobes vma
* Fix: getsockopt usage for SO_PASSCRED/SO_PASSSEC on Linux 6.16
* Relax ELF magic check to support MIPS libraries
* pagemap: prevent integer overflow in pagemap_len

This release's name is a nod to the growing challenge we face in
maintaining compatibility across the rapidly evolving Linux kernel
ecosystem.

The full changelog can be found here: https://criu.org/Download/criu/4.2.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-13 08:40:46 -08:00
Andrei Vagin
0a7e7d09dd log: use sizeof(*hdr) instead of sizeof(hdr)
Using sizeof(hdr) where hdr is a pointer gives the size of the pointer,
not the size of the structure it points to.

Reported-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-13 08:40:46 -08:00
Andrei Vagin
e689d902b3 criu/log: properly handle truncated length from vsnprintf
vsnprintf does not always return the number of bytes actually written to
the buffer.

If the output was truncated due to the buffer limit, the return value is
the total number of bytes which WOULD have been written to the final
string if enough space had been available.

This means we must cap the return value to the buffer size excluding the
terminating null byte to correctly calculate the log entry size.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
Radostin Stoyanov
6344e8d71c cr-servce: move kerndat_init after log_init
kerndat_init() can generate a significant volume of logs. If called
before log_init(), all these messages will be saved in the
early_log_buffer, which has a limited capacity. Additionally, saving to
the early_log_buffer can introduce a performance penalty, especially
when verbose mode is not enabled.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
Andrei Vagin
a525b3c32e test/vdso-proxy: handle merged vma-s
When we compare two list of vma-s, we need to take into account that
some of them could be merged.

Fixes #12286

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
Andrei Vagin
ce680fc6c7 Revert "plugins/amdgpu: Implement parallel restore"
This functionality (#2527) is being reverted and excluded from this
release due to issue #2812.

It will be included in a subsequent release once all associated issues
are resolved.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
Radostin Stoyanov
1d08ff8ca7 coredump: fix handling of num_pages
This patch fixes the following error:

$ sudo make -C test/others/criu-coredump run
...
Traceback (most recent call last):
  File "/home/circleci/criu/coredump/coredump", line 55, in <module>
    main()
  File "/home/circleci/criu/coredump/coredump", line 47, in main
    coredump(opts)
  File "/home/circleci/criu/coredump/coredump", line 14, in coredump
    cores = generator(os.path.realpath(opts['in']))
  File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 192, in __call__
    self.coredumps[pid] = self._gen_coredump(pid)
  File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 214, in _gen_coredump
    cd.vmas = self._gen_vmas(pid)
  File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 992, in _gen_vmas
    v.data = self._gen_mem_chunk(pid, vma, v.filesz)
  File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 879, in _gen_mem_chunk
    page_mem = self._get_page(pid, page_no)
  File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 797, in _get_page
    num_pages = m.get("nr_pages", m.compat_nr_pages)
AttributeError: 'dict' object has no attribute 'compat_nr_pages'
+ exit 1
make[1]: *** [Makefile:3: run] Error 1

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
alam0rt
cb8e1da3f4 coredump: use compat_nr_pages as fallback
Use nr_pages when available, falling back to compat_nr_pages
for compatibility.

Signed-off-by: alam0rt <sam@samlockart.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:42:32 -08:00
Radostin Stoyanov
0fa6ff3d18 test/others: add tests for check() with pycriu
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
567f70ce19 test/others: add test for check() with libcriu
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
a1dc885027 test/rpc: update errno check
The --mntns-compat-mode option is no longer parsed with CHECK.
Use --log-file instead to test the error message.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
3c841af2cf pycriu: use explicit imports for __init__
_init__.py defines the public API for pycriu. It is important to use
explicit imports to avoid leaking every symbol from criu.py into the
pycriu namespace. This avoids import-time side effects, prevents name
collisions, and circular-import traps.

Fixes the following lint error:
  F403 `from .criu import *` used; unable to detect undefined names

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
f7ccb63bdd pycriu: set RPC opts for CHECK
This allows users to specify RPC options when
using the check() functionality.

Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
9371c4a789 cr-service: refactor RPC opts parsing for check()
The check() functionality is very different from dump, pre-dump,
and restore. It is used only to check if the kernel supports required
features, and does not need the majority of options set via RPC.

In particular, we don't need to open `image_dir` when running `check()`
because this functionality doesn't create or process image files. In
this case, `image_dir` is used as `work_dir`, only when the latter is
not specified and a log file is used.

This patch updates the RPC options parser so that it only handles the
logging options when check() is used. Logging to a file is required when
log_file is explicitly set or no log_to_stderr is used. In such case, we
also resolve images_dir and work_dir where the log file will be created.

Fixes: #2758

Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
72ca94db4d cr-service: refactor logging setup
Move the logging initialization into a helper function that
can be reused.

No functional change intended.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:35 -08:00
Radostin Stoyanov
5966ffe8a7 cr-service: refactor images_dir path resolution
Move the images_dir selection logic from setup_opts_from_req() into a
new function: resolve_images_dir_path(). This improves readability and
allows the code to be reused. While at it, use snprintf() instead of
sprintf() for the /proc path and ensure NULL termination after strncpy().

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
60a731ab38 cr-service: drop images_dir from setproctitle
Commit 9089ce8 ("service: use setproctitle") extended cr-service to
get the full path of images_dir using readlink(). However, the RPC
API was later extended to allow setting a custom path (folder) to
be set instead of passing a file descriptor, which causes readlink()
to fail as the path is not a symbolic link.

It would be better to drop the code setting the images-dir path as a
string in the proctitle.

Fixes: #2794

Suggested-by: Andrei Vagin <avagin@google.com>
Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
ee4100c09f cr-service: refactor images/workdir setup
Move the code that opens the images directory, resolves its absolute
path via readlink(), selects the work_dir, and chdir()s into it into a
new function: setup_images_and_workdir(). This reduces the size of
`setup_opts_from_req()`, improves its readability, and allows this
functionality to be reused.

While at it, change open_image_dir() to take a const char *dir
parameter, reflecting that the path is not modified by the function and
allowing callers to pass string literals without casts.

No functional changes are intended.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
71a637923f pycriu: set default value for sk_name
This change allows users to call criu.use_sk() without any
parameters to use the default socket name.

Co-authored-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
d2c46b92b0 pycriu: better socket error handling
[Errno 2] No such file or directory -> Socket file not found.
[Errno 111] Connection refused -> Service not running.

Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
7aad7317b4 lib/pycriu: changing the default behavior to use the system binary
Use system-installed CRIU binary instead of a local file

Thanks to @avagin for suggesting this solution.

Co-authored-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
3f97cfe876 test/libcriu: check setting of RPC config file
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
2878faa74c libcriu: enable setting of RPC config file
Container runtimes that use libcriu (e.g., crun) need to specify a CRIU
configuration file that allows to overwrite default options set via RPC.
This is particularly useful to set options such as `--tcp-established`
via `/etc/criu/runc.conf` in Kubernetes.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Kir Kolyshkin
07ad2473f2 Use command -v instead of which
Unlike "which", which is a separate executable not always installed by
default, "command -v" is a shell built-in available at least for bash,
dash, and busybox shell.

Unlike "which", "command -v" is also easier to grep for, and it is
already used in a few places here.

Inspired by commit 57251d811.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
afcfcd3bf6 ci: add which dependency in dnf packages
which is used in Makefiles to check for dependencies:

Example:
export USE_ASCIIDOCTOR ?= $(shell which asciidoctor 2>/dev/null)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
6860181474 ci: add wheel and setuptools in dnf packages
These dependencies are required to for `pip install`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
d3dfb663b1 make: don't install external dependencies
Don't install external pip dependencies when running `make install`.
As we are not really into developing a Python project, we should
not install additional packages. CRIU does that nowhere else.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
f74e68daf9 ci: verify call order of action-script hooks
The existing test collects all action-script hooks triggered during
`h`, `ns`, and `uns` runs with ZDTM into `actions_called.txt`, then
verifies that each hook appears at least once. However, the test does
not verify that hooks are invoked *exactly once* or in *correct order*.

This change updates the test to run ZDTM only with ns flavour as this
seems to cover all action-script hooks, and checks that all hooks are
called correctly.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
f824dc735b ci: consolidate action-script tests
This patch consolidates the action-script tests into
`test/others/action-script` to ensure all tests are executed
consistently and reduce duplication. Since we had two tests that appear
to do the same thing, we can remove the one that doesn't use zdtm.py.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
d5c81f8108 pycriu: prevent always appending "Unknown" to error messages
Regardless of the actual error message, "Unknown" was always appended
to the end of the string, resulting in messages like:
"DUMP failed: Error(3): No process with such pidUnknown".

Fixed by changing standalone if statements to else-if blocks so
"Unknown" is only added when no specific error condition matches.

Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
540c631dd0 pycriu: add missing protobuf dependency
pycriu depends on protobuf to function correctly. Currently,
it raises an error if protobuf is not installed. Adding
protobuf to the dependencies ensures it is available after
installing pycriu.

Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
2025-11-05 15:41:34 -08:00
Andrii Herheliuk
a5ae3c184b pycriu: set licence to LGPLv2.1
We use LGPL-v2.1 license for the libcriu and pycriu as they are
intended to be usable by both proprietary and open-source applications.

Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
697c31abe4 zdtm: shstk: add SHSTK_ENABLE test build option
* add SHSTK_ENABLE=1 toggle
* passes -mshstk to compiler and -z shstk to linker

Example:
  $ make -C test/zdtm/static clean
  $ make -C test/zdtm/static V=1 SHSTK_ENABLE=1 env00

  $ readelf --notes test/zdtm/static/env00 | grep SHSTK
      Properties: x86 feature: SHSTK

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
6fd71b9ee9 x86/criu: shstk: restore SHSTK via premap loops
* call shstk_vma_restore() for VMA_AREA_SHSTK in vma_remap()
* delete map/copy/unmap from shstk_restore() and keep token setup + finalize
* before the loop naturally stopped at cet->ssp-8, so a -8 nudge is required here

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
abf4a71d99 x86/criu: shstk: add shstk_vma_restore()
1. create shadow stack vma during vma_remap cycle
2. copy contents from a premapped non-shstk VMA into it
3. unmap premapped non-shstk VMA
4. Mark shstk VMA for remap into the final destination

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Co-Authored-By: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
[ alex: debugging, rework together with Andrei and code cleanup ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
02462c19c4 restorer: shstk: allocate restorer shadow stack
* reserve space for restorer shadow stack
* set tmp_shstk at mem, advance mem by PAGE_SIZE
* forget the extra PAGE_SIZE (shstk) for premapped VMAs

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Alexander Mikhalitsyn
b18c07d8a8 restorer: shstk: add shstk_min_mmap_addr()
* default: return whatever passed in
  eg. to be used as
     shtk_min_mmap_addr(kdat.mmap_min_addr)
* x86: ignore def and return 4G

On x86, CET shadow stack is required to be mapped above 4GiB
On the other hand forcing 4GiB globally would break 32-bit restores.

Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
f29cb750db x86/criu: shstk restorer memory accounting functions
* shstk_restorer_stack_size(): PAGE_SIZE
* shstk_set_restorer_stack(): set restorer temporary shadow stack start

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Igor Svilenkov Bozic
3365c7c025 restorer: shstk: add restorer shadow stack stubs
* shstk_restorer_stack_size() – restorer shadow stack size
* shstk_set_restorer_stack() – set restorer shadow stack start

Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
bb9a7202a7 test/others/rpc: show logs on error
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
9d072222ef test/others/rpc: parse action-script via config
Extend the test for overwriting config options via RPC with
repeatable option (--action-script) and verify that the value
will not be silently duplicated.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
c03c08d1bc cr-service: refactor rpc config parsing
When an additional configuration file is specified via RPC, this file is
parsed twice: first at an early stage to load options such as --log-file,
--work-dir, and --images-dir; and again after all RPC options and
configuration files have been evaluated.

This allows users to overwrite options specified via RPC by the
container runtime (e.g., --tcp-established). However, processing
the RPC config file twice leads to silently duplicating the values
of repeatable options such as `--action-script`.

To address this problem, we adjust the order of options parsing so
that the RPC config file is evaluated only once. This change should
not introduce any functional changes. Note that this change does
not affect the logging functionality, as early log messages are
temporarily buffered and only written to the log file once it has
been initialized (see commit 1ff2333 "Printout early log messages").

Fixes #2727

Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Shashank Balaji
dcce9bd0e2 zdtm: add a test for --allow-uprobes option
Program flow:
- Parse the test's own executable to calculate the file offset of the uprobe
target function symbol
- Enable the uprobe at the target function
- Call the target function to trigger the uprobe, and hence the uprobes vma
creation
- C/R
- Call the target function again to check that no SIGTRAP is sent, since the
uprobe is still active

At least v1.7 of libtracefs is required because that's when
tracefs_instance_reset was introduced. The uprobes API was introduced in v1.4,
and the dynamic events API was introduced in v1.3.

Ubuntu Focal doesn't have libtracefs. Jammy has v1.2.5, and Noble has v1.7.

Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Shashank Balaji
f548d3af4a crtools: remove "consult documentation"
Most people know this, don't they? :)

Suggested-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Mahadasyam, Shashank (SGC)
aeec40bf02 docs: add documentation for --allow-uprobes
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Mahadasyam, Shashank (SGC)
bab72af9a5 vma: introduce --allow-uprobes option
This commit teaches criu to deal with processes which have a "[uprobes]" vma.

This vma is mapped by the kernel when execution hits a uprobe location. This
is done so as to execute the uprobe'd instruciton out-of-line in the special
vma. The uprobe'd location is replaced by a software breakpoint instruction,
which is int3 on x86. When execution reaches that location, control is
transferred over to the kernel, which then executes whatever handler code
it has to, for the uprobe, and then executed the replaced instruction out-of-line
in the special vma. For more details, refer to this commit:
d4b3b6384f

Reason for adding a new option
------------------------------

A new option is added instead of making the uprobes vma handling transparent
to the user, so that when a dump is attempted on a process tree in which a
process has the uprobes vma, criu will error, asking the user to use this option.
This gives the user a chance to check what uprobes are attached to the processes
being dumped, and try to ensure that those uprobes are active on restore as well.

Again, the same reason for requiring this option on restore as well. Because
if a process is dumped with an active uprobe, and on restore if the uprobe
is not active, then if execution reaches the uprobe location, then the process
will be sent a SIGTRAP, whose default behaviour will terminate and core dump
the process. This is because the code pages are dumped with the software
breakpoint instruction replacement at the uprobe'd locations. On restore, if
execution reaches these locations and the kernel sees no associated active
uprobes, then it'll send a SIGTRAP.

So, using this option is on dump and restore is an implicit guarantee on the
user's behalf that they'll take care of the active uprobes and that any future
SIGTRAPs because of this are not on us! :)

Handling uprobes vma on dump
----------------------------

We don't need to store any information about the uprobes vma because it's
completely handled by the kernel, transparent to userspace. So, when a uprobes
vma is detected, we check if the --allow-uprobes option was specified or not.
If so, then the allow_uprobes boolean in the inventory image is set (this is
used on restore). The uprobes vma is skipped from being added to the vma list.

Handling uprobes vma on restore
-------------------------------

If allow_uprobes is set in the inventory image, then check if --allow-uprobes
is specified or not. Restoring the vma is not required.

Fixes: checkpoint-restore#1961
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Shashank Balaji
74bf40feeb crit: add VMA_AREA_UPROBES flag
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Shashank Balaji
0ff2e0a66e criu-coredump: add VMA_AREA_UPROBES flag
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Shashank Balaji
7bf402f6b3 vma: introduce VMA_AREA_UPROBES flag
This flag will be used for a "[uprobes]" vma.

Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-05 15:41:34 -08:00
Radostin Stoyanov
520266d895 zdtm: add sk-unix-restore-fs-share test
Add a ZDTM test case where CRIU uses a helper process to restore
a non-empty process group with a terminated leader and a Unix
domain socket. This reproduces a corner case in which mount
namespace switching can fail during restore:

https://github.com/checkpoint-restore/criu/issues/2687

Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Kir Kolyshkin
790b3cf425 ci: run alpine tests on arm64
These tests reveal the following build error:

In file included from compel/include/uapi/compel/asm/sigframe.h:4,
                 from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
   28 | struct sigcontext {
      |        ^~~~~~~~~~

In file included from criu/arch/aarch64/include/asm/restorer.h:4,
                 from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
   28 | struct sigcontext {
      |        ^~~~~~~~~~

Inspired by #2766 / #2767.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:41:34 -08:00
Pepper Gray
77553f07d3 make: prevent redefinition of 'struct sigcontext'
Compilation on gentoo/arm64 (llvm+musl) fails with:

In file included from compel/include/uapi/compel/asm/sigframe.h:4,
                 from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
   28 | struct sigcontext {
      |        ^~~~~~~~~~

In file included from criu/arch/aarch64/include/asm/restorer.h:4,
                 from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
   28 | struct sigcontext {
      |        ^~~~~~~~~~

This is happening because <asm/sigcontext.h> and <signal.h> are
mutually incompatible on Linux.

To fix, use  <signal.h> instead of <asm/sigcontext.h> for arm64
(like all others arches do).

Fixes: #2766
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
2025-11-05 15:40:55 -08:00
Radostin Stoyanov
3379c122e5 page-xfer: fix incompatible pointer type on armv7
page_pipe_read() expects an 'unsigned long *', but pi->nr_pages is u64.
On 32-bit platforms (e.g., armv7), passing &pi->nr_pages directly causes
a compiler error. To fix this we introduce a temporary variable and copy
the result back to pi->nr_pages.

Fixes: #2756

Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:24 -08:00
Radostin Stoyanov
7a4b35a910 contributing: update links to mailing list
Our previous mailing list had some technical issues and we created
a new one that is hopefully more reliable.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:24 -08:00
Radostin Stoyanov
76394e93a8 ci: consolidate aarch64 tests on GitHub runners
Currently we run aarch64 tests on both Cirrus CI and GitHub runners.
However, Cirrus CI fails with "Monthly compute limit exceeded!". This
change removes the redundant tests to streamline our CI process.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:24 -08:00
Shashank Balaji
0a81dc8bbe ci/java: update base image from focal to jammy
Ubuntu Focal Fossa (20.04) reached its end-of-life on 31 May 2025. So, move
over to using Ubuntu Jammy (22.04) base images.

Also, focal repos do not have libtracefs, which the uprobes zdtm test needs.

Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
2025-11-02 07:48:24 -08:00
Radostin Stoyanov
b25ff1d336 Remove travis-ci leftovers
Travis CI stopped providing CI minutes for open-source projects
some time ago and we have migrated to GitHub actions.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Shashank Balaji
25f8be0f60 ci: use package-manager dependency install scripts
Currently, adding a package which is required either for development or testing
requires it to be added in multiple places due to many duplicated Dockerfiles
and installation scripts. This makes it difficult to ensure that all scripts
are updated appropriately and can lead to some places being missed.

This patch consolidates the list of dependencies and adds installation
scripts for each package-manager used in our CI (apk, apt, dnf, pacman).

This change also replaces the `debian/dev-packages.lst` as this subfolder
conflicts with the Ubuntu/Debian packing scripts used for CRIU:
https://github.com/rst0git/criu-deb-packages

This patch also removes the CentOS 8 build scripts as it is EOL
and the container registry is no longer available.

Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Andrei Vagin
67751bc11b docs: add developer overviews for AI assistants
This commit adds the document to provide high-level overviews of the
CRIU project for AI assistants like Claude and Gemini.

These documents are intended to be used as context for AI-powered
developer assistants to help them understand the project's goals,
architecture, and development process. This will allow them to provide
more accurate and helpful responses to developer questions.

The documents include:
- A brief introduction to CRIU
- A quick start guide for checkpointing and restoring a simple process
- An overview of the dump and restore process
- A description of the Compel subproject
- Information about the project's coding style, code layout, and tests

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
91758a68e9 zdtm: Remove junit_xml leftovers
The previous commit 4cd4a6b1ac ("zdtm: stop importing junit_xml")
removed the junit_xml library, but some variables related to it were
left in the code. This commit removes the unused `tc` variable and a
call to its `add_error_info` method.

Fixes: 4cd4a6b1ac ("zdtm: stop importing junit_xml")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
dong sunchao
2d2168fc9c vdso: relax EI_OSABI check to support linux in ELF header
On some ARM/aarch64 systems, the VDSO ELF header sets EI_OSABI to 3 (Linux),
while CRIU expects 0 (System V). This strict check causes restore to fail
with "ELF header magic mismatch"

This patch relaxes the check to accept both values, improving compatibility
with modern toolchains and kernels (e.g. Linux 6.12+)

Fixes: #2751
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
2e26b36d44 pagemap: print page regions in the format start - end
During investigations, it’s much easier to read logs when regions are
printed in the start - end format rather than `start/size`.

In addition, all page counters and memory sizes are now printed in
hexadecimal, as they are hard to read in decimal form.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
7e0da4d975 pagemap: use unsigned long for page counts
Variables storing page counts were previously `unsigned int`, limiting
them to a maximum of 2^32 pages. With a 4k page size, this corresponds
to a 16TB memory mapping, which is insufficient for larger mappings.

This commit changes the type for these variables to `unsigned long` to
support larger memory mappings.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
afb2e6c3f9 pagemap: change PagemapEntry.nr_pages to uint64 to support huge mappings
Update the nr_pages field in PagemapEntry to uint64 to prepare for
checkpointing and restoring huge memory mappings.

Backward compatibility with older pagemap images is preserved.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
c7395f4cbe files: fork helpers without CLONE_FILES | CLONE_FS
On restore, CRIU needs to change mount namespaces to properly restore
files and unix sockets. However, the kernel prevents this if a process
is sharing its file system information (fs) with other processes.

Fixes #2687

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-02 07:48:23 -08:00
Filip Hejsek
a8c5e11715 lsm: use attr/apparmor/current to get apparmor label
On some kernels, attr/current can be intercepted by BPF LSM, causing
errors (#2033). Using attr/apparmor/current is preferable, because it
is guaranteed to return the apparmor label. attr/current will still be
used as a fallback for older kernels.

Fixes: #2033

Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
2025-11-02 07:48:23 -08:00
dong sunchao
80c280610e compel/mips: Relax ELF magic check to support MIPS libraries
On MIPS platforms, shared libraries may use EI_ABIVERSION = 5 to indicate
support for .MIPS.xhash sections. The previous ELF header check in
handle_binary() strictly compared e_ident against a hardcoded value,
causing legitimate shared objects to be rejected.

This patch replaces the memcmp-based check with a structured validation
of ELF magic and class, and allows EI_ABIVERSION values beside 0.

fixes: #2745
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
2025-11-02 07:48:23 -08:00
Lorenzo Fontana
053a22a23b pagemap: prevent integer overflow in pagemap_len
Fixes #2738

Original-patch-by: Andrey Vagin <avagin@google.com>
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
a779417a3f zdtm: stop importing junit_xml
We are dropping support for generating JUnit XML reports in zdtm.py as we've
migrated testing infrastructure entirely to `GitHub Actions` and other
third-party test runners.

This package has been removed from some distribution repositories (e.g.,
Fedora), making it simpler to remove the dependency than to force installation
via pip.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
254ba3e8cc ci: avoid Docker 28 due to regression
This change modifies the CI script to avoid Docker version 28, which has
a known regression that breaks Checkpoint/Restore (C/R) functionality.
The issue is tracked in the moby/moby project as
https://github.com/moby/moby/issues/50750.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-02 07:48:23 -08:00
Dong Sunchao
4b73985955 criu/sockets: Restrict SO_PASSCRED and SO_PASSSEC to supported families
Linux 6.16+ restricts SO_PASSCRED and SO_PASSSEC to AF_UNIX, AF_NETLINK, and AF_BLUETOOTH
This patch updates CRIU to check the socket family before dumping these options

Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
2025-11-02 07:48:23 -08:00
Dong Sunchao
fa1b399064 zdtm/static/sock_opts00: use unix socket to test SO_PASSCRED and SO_PASSSEC
SO_PASSCRED and SO_PASSSEC are only valid for AF_UNIX and AF_NETLINK
This patch updates the test logic to use a unix socket for these options,
while preserving the original value consistency check

Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
2025-11-02 07:48:23 -08:00
Radostin Stoyanov
2ba3430106 test/zdtm/static/maps12: fix pointer-to-int cast
The `offset` argument to `mmap()` was computed with a direct cast from
pointer to `off_t`:

`(off_t)addr_hint - (off_t)map_base`

This causes a build failure when compiling since pointers and `off_t`
may differ in size on some platforms.

maps12.c: In function 'mmap_pages':
maps12.c:114:50: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
  114 |                    filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);
      |                                                  ^
maps12.c:114:69: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
  114 |                    filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);

The fix in this patch is to cast both pointers to `intptr_t`,
perform the subtraction in that type, and then cast the result
back to `off_t`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Andrei Vagin
dcee5bd6ff make: Disable branch-protection for PIE code on ARM64
Branch protection uses PAC. It cryptographically "signs" a function's
return address before it is stored on the stack. Upon return, the address
is authenticated using a secret key. If the signature is invalid, the
program will fault.

The PIE code is used for the parasite and the restorer. In both cases, it
runs in a foreign process. The case of the restorer is even trickier
because it needs to restore the original PAC keys, which invalidates
all previously "signed" pointers within the restorer itself.

Fixes #2709

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
98f2bd525a ci/vagrant: install vanilla kernel for Fedora Rawhide test
We need at least 6.16 to test MADV_GUARD_INSTALL support, but
our current Fedora Rawhide test uses only Rawhide's user space,
while using Fedora 42 kernel. Let's start using a vanilla kernel.

Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
01265cfc69 test/zdtm/static/maps12: add madv guards test
Test for madvise(MADV_GUARD_INSTALL).

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
9c0f725a62 criu/mem: dump: note MADV_GUARD pages as VMA_AREA_GUARD VMAs
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
59b4d662ae criu/pie/restorer: add madvise(MADV_GUARD_INSTALL) restore logic
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
63c7029686 criu/{mem, vdso, cr-restore}: introduce VMA_AREA_GUARD fake VMAs
Introduce a new kind of VMA - VMA_AREA_GUARD. In fact, it is not
a real VMA as it is not represented as struct vm_area_struct in
the kernel.

We want to reuse an existing vma infrastructure in CRIU to dump
an information about MADV_GUARD_INSTALL-covered address space
ranges as VMAs. Then, on restore, we need to carefully skip
those fake VMAs everywhere we expect a normal VMAs to be processed.
And only in restorer we use these VMAs to get an information about
where to call MADV_GUARD_INSTALL.

Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
cc047d595f criu/mem: dump: skip MADV_GUARD pages content dump
1. get info about MADV_GUARD_INSTALL-protected pages with
help of pagemap by looking for PME_GUARD_REGION flag if /proc/<pid>/pagemap
is used or by looking for PAGE_IS_GUARD flag if ioctl(PAGEMAP_SCAN) is used
2. skip those pages

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
5843cbf975 criu/mem: refactor should_dump_page helper
Make should_dump_page to return int to indicate failure, also
return useful data back through the struct page_info structure
passed as a pointer.

Also, correspondingly convert all call sites.

No functional changes intended, except fixing a bug in
should_dump_page() as it could return (-1) when pmc_fill()
fails, while caller didn't expect that before.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
42580fcb16 criu/pagemap-cache: pagescan: look for PAGE_IS_GUARD pages
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
1873e8f502 cr-dump: warn if MADV_GUARD is supported but isn't shown in pagemap
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
4fc07a8a41 kerndat: add pagemap_scan_guard_pages feature check logic
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
2bb77daa92 kerndat: add madvise(MADV_GUARD_INSTALL) feature-detection
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Alexander Mikhalitsyn
fce491113b criu/include/mman: define MADV_GUARD_INSTALL
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
5f94dd71e7 CI: Consolidate arm64 tests on GitHub runners
The arm64 tests are currently being executed on both actuated and GitHub
runners. This change removes the actuated runner to avoid redundancy and
streamline our CI process.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Andrei Vagin
c6c6f6f231 zdtm/socket-tcp-closing: fill socket buffers effectivly
Send large chunks to fill socket buffers.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Radostin Stoyanov
d586b30c6b vagrant: fix tar including archive in itself
The tar command was failing with the following message:

  $ tar cf criu.tar ../../../criu
  tar: Removing leading `../../../' from member names
  tar: ../../../criu/scripts/ci/criu.tar: archive cannot contain itself; not dumped

In addition, the /vagrant no-longer exist in the new Fedora images.

  bash: line 1: cd: /vagrant: No such file or directory

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Radostin Stoyanov
2762b21e4a vagrant: update image to fedora 42
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Radostin Stoyanov
0d1e280d09 vagrant: fix 'qemu' install
Installing this package currently fails with the following message:

  Package qemu is not available, but is referred to by another package.
  This may mean that the package is missing, has been obsoleted, or
  is only available from another source

  E: Package 'qemu' has no installation candidate

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:23 -08:00
Ignacio Moreno Gonzalez
64276874d8 restore: flush caches during restore
See the previous commit for rationale and architecture-specific details.

[ avagin: tweak code comment ]

Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Ignacio Moreno Gonzalez
95d5e2e59b compel: flush caches after parasite injection
After the CRIU process saves the parasite code for the target thread in
the shared mmap, it is necessary to call __clear_cache before the target
thread executes the code.

Without this step, the target thread may not see the correct code to
execute, which can result in a SIGILL signal.

For the specific arm64 case. this is important so that the newly copied
code is flushed from d-cache to RAM, so that the target thread sees the
new code.

The change is based on commit 6be10a2 by @fu.lin and on input received
from @adrianreber.

[ avagin: tweak code comment ]

Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:23 -08:00
Kir Kolyshkin
22c83e3eba images/Makefile: use msg-gen
In general, we use "$(E)" instead of "$(Q) echo", but we also have
a msg-gen macro which can be used here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-02 07:48:22 -08:00
Kir Kolyshkin
066bf7bf3c Keep images/google/protobuf directory
Commit 68f92b551 removed images/google/protobuf directory, so it is
re-created each time during the build process.

This resulted in a weird behavior change. Previously, one could do
something like this:

	git clone $CRURL criu
	(cd criu && sudo make install-criu)
	rm -rf criu

This worked fine, including running rm -rf as a non-root user, since no
new directories were created under criu -- all directories were still
owned by the original user.

Since commit 68f92b551 the same sequence fails:

	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.c': Permission denied
	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.d': Permission denied
	rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.h': Permission denied

A workaround is to keep empty images/google/protobuf directory,
which is what this commit does.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-02 07:48:22 -08:00
Kir Kolyshkin
21c3b9c005 images/Makefile: fix using $(Q)
Commit 68f92b551 used `$$(Q)` instead of `$(Q)` in the Makefile target,
which resulted in the following error:

$(Q) echo "Generating descriptor.pb-c.c"
/bin/sh: 1: Q: not found
Generating descriptor.pb-c.c
$(Q) protoc --proto_path=/usr/include --proto_path=images/ --c_out=images/ /usr/include/google/protobuf/descriptor.proto
/bin/sh: 1: Q: not found

as well as:

$(Q) rm -rf images/google
/bin/sh: line 1: Q: command not found

Fix it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-02 07:48:22 -08:00
Radostin Stoyanov
7fbf7b2be4 images: remove symlink for descriptor.proto
Currently the build scripts create the following symlink:

  criu-4.1/images/google/protobuf/descriptor.proto -> /usr/include/google/protobuf/descriptor.proto

This symlink points to a system-wide absolute-path target. Also,
this symlink ends up in the release tarball. The tarball may later be
downloaded and unpacked by e.g. OS distributions. If unpacking is
done using Python 3.14+, it will fail.

This happens because Python 3.14 will switch the default behavior of
extractall() from "fully trusting the content of archive" to
"disallow common attack vectors while extracting the archive".
With this new behavior, extractall() raises an exception when at
least one file in the archive extracts or points to outside of the
extraction directory (these are called path traversal attacks and
zip slip attacks).

Reported-by: Dmitrii Kuvaiskii <dimakuv@amazon.de>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
455c677399 zdtm: Add ztatic/mnt_ext_file_bind_auto test
The test creates a file bindmount in criu mntns and binds it into test
mntns, this external file bindmount is autodetected and restored via
"--external mnt[]" criu option.

Note: In previous patch we fix the problem on this code path where file
bindmount restore fails as there is excess "/" in source path.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Chuan Qiu
e31828ed8c mount: Fix trailing / when a file is bind-mounted
E.g. I have a /etc/hosts in workspace mounted from the host, and get the following message.

(00.141008)      1: mnt-v2: Create plain mountpoint /tmp/.criu.mntns.K1biY1/mnt-0000000938 for 938
(00.141546)      1: mnt-v2:     Mounting unsupported @938 (0)
(00.141887)      1: mnt-v2:     Bind /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/ to /tmp/.criu.mntns.K1biY1/mnt-0000000938
(00.142179)      1: Error (criu/mount-v2.c:319): mnt-v2: Failed to open_tree /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/: Not a directory
(00.143774) Error (criu/cr-restore.c:2320): Restoring FAILED.

Signed-off-by: Chuan Qiu <qiuc12@gmail.com>
2025-11-02 07:48:22 -08:00
समीर सिंह Sameer Singh
3dc865bc80 test: add static tests for ICMP socket
Add ZDTM static tests for IP4/ICMP and IP6/ICMP
socket feature.

Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-02 07:48:22 -08:00
समीर सिंह Sameer Singh
a80c544845 sk-inet: Add support for checkpoint/restore of ICMP sockets
Currently there is no option to checkpoint/restore programs that use
ICMP sockets, such as `ping`. This patch adds support for the same.

Fixes #2557

Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
2025-11-02 07:48:22 -08:00
Andrei Vagin
677a568919 zdtm/netns_sub_sysctl: skip unsupported sysctls
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
87bd09a0d1 net/sysctl: make ipv4/ping_group_range work in user namespaces
We dump sysctls from criu user namespace, but restore from restored user
namespace. So group id values should be mapped to the restored user
namespace gid space to restore correctly.

Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
45d09ae17e net/sysctl: fix broken ipv4_sysctls_op
We have ability to skip sysctl if there is no value, but we still give
n requests to sysctl_op, that is not correct and probably can segfault
on nullptr access. Fix it by adding ri to count non skipped requests.

To be on the safe side, let's add a check that ri == n on read, as we
should not do any skips there.

While on it lets fix bad error message prefix: s/unix/ipv4/.

Remove excess has_iarg set, and add sarg reset to NULL for the case
sysctl_op skipped it.

Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
4f057a6aeb net/sysctl: fix missprint in an error message
Fixes: f38e58836 ("net/sysctl: c/r ipv4/ping_group_range value")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
4c7d42f67a ipc/sysctl: fix CTL_FLAGS_IPC_EACCES_SKIP by making it a flag
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL |
CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it
a real distinct flag.

Fixes: 840735aa0 ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Ivan Pravdin
922754dffd rpc/log: return first error always
Use shared first error buffer to return correct
first error in rpc.

Fixes: #338

Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com>
2025-11-02 07:48:22 -08:00
Radostin Stoyanov
a79b33d0c5 cpuinfo: show error when image is missing
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which
attempts to open the cpuinfo.img file using `open_image()`. If the
image file is not found, `open_image()` returns an "empty image"
object. As a result, `cpu_validate_cpuinfo()` tries to read from it
and fails with the following error:

(00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)

This patch adds a check for an empty image and appropriate error message.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:22 -08:00
Andrei Vagin
99ba6db89b crtools: do a few minor cleanups
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:22 -08:00
Liana Koleva
fcbaac0598 crtools: simplify check for cpuinfo subcommands
The cpuinfo command requires a "dump" or "check" subcommand. Thus, we
replace `CR_CPUINFO` with `CR_CPUINFO_DUMP` and `CR_CPUINFO_CHECK`.
This allows us to remove unnecessary subcommand check in
`image_dir_mode()` and perform all parsing in `parse_criu_mode()`.

With this change the check for validating the cpuinfo subcommand is
now done only once with `CR_CPUINFO_DUMP` or `CR_CPUINFO_CHECK` enum.

Signed-off-by: Liana Koleva <43767763+lianakoleva@users.noreply.github.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:22 -08:00
Prajwal S N
fbfed312e0 feat: introduce Nix flake
CRIU currently requires a number of dependencies in order to build from
source. The package names vary across distributions and package
managers. A Nix flake allows developers to spin up a dev environment
with `nix develop`, eliminating the hassle of manual dependency
management. It also prevents polluting the global package set on the
machine.

Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
5f18ca1bbe test/zdtm/static: add maps11 test for MAP_DROPPABLE/MADV_WIPEONFORK
In this test we want to ensure that contents of droppable mappings
and mappings with MADV_WIPEONFORK is properly restored in
parent/child processes.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
dfa0ce1808 test/zdtm/static/maps02: add MAP_DROPPABLE testcase
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
4f9dcfb9c8 pycriu/images/pb2dict: add MAP_DROPPABLE flag
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
b90cfc1a80 criu/proc_parse: support MAP_DROPPABLE mappings
Support MAP_DROPPABLE [1] by detecting it from /proc/<pid>/smaps
and restoring it as a normal private mapping flag on vma with only
difference that instead of MAP_PRIVATE we should use MAP_DROPPABLE.

[1] 9651fcedf7

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
6476488a51 test/zdtm/static/maps02: add MADV_WIPEONFORK testcase
In addition to that I did small non-functional corrections.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Alexander Mikhalitsyn
af5412a433 criu/proc_parse: support MADV_WIPEONFORK/VM_WIPEONFORK
Support VM_WIPEONFORK [1] by detecting it from /proc/<pid>/smaps
and setting a corresponding MADV_WIPEONFORK flag on vma.

[1] d2cd9ede6e

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-11-02 07:48:22 -08:00
Andrei Vagin
2b8951a9cf image: use protoc instead of protoc-c
The new protoc 1.5.2 reports warnings:
`protoc-c` is deprecated. Please use `protoc` instead!

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
1fdff7c7a6 zdtm: fix check for criu binary
The opts['action'] contains actor function and not the action name, so
we should compare it with a function.

While on it let's also add a comment about --criu-bin option if CRIU
binary is missing.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Pavel Tikhomirov
ae1395de18 zdtm.py: add an option to change pycriu import path
By default zdtm expects that criu is built from source first and only
then you can run zdtm tests against it. But what if you really want to
run tests against a criu version installed on the system? Yes there is
already a nice option for zdtm to change the criu binary it uses
"--criu-bin", but it would still end up using the pycriu module from
source and you would still have to build everything beforehand.

Let's add an option to change the path where zdtm searches for pycriu
module "--pycriu-search-path". This way we can run zdtm tests on the
criu installed on the system directly without building criu from source,
e.g. on Fedora it works like:

test/zdtm.py run --criu-bin /usr/sbin/criu \
  --pycriu-search-path /usr/lib/python3.13/site-packages \
  -t zdtm/static/env00

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:22 -08:00
Yanning Yang
7a5b3d1f41 plugins/amdgpu: Update README.md and criu-amdgpu-plugin.txt
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
a61116fd93 plugins/amdgpu: Implement parallel restore
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
e8ba7c103a plugins/amdgpu: Add parallel restore command
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
1fd1b670c4 plugins/amdgpu: Add socket operations
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
e257d04974 pstree: Add has_children function
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know if there is only one
process to restore. This patch adds a `has_children` function in
`pstree.h` and replaces some existing implementations with this
function.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
497109eb4e cr-restore: Move cr_plugin_init after fdstore_init
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
427c0dc27b criu: Introduce a new device plugin hook for restore
Currently, in the target process, device-related restore operations and
other restore operations almost run sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations.  However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be parallelized with other operations to speed up the
process.

Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.

- POST_FORKING

*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Radostin Stoyanov
d57d40a5ad sk-inet: add MPTCP definition
Building CRIU on Ubuntu 20.04 fails with the following error:

criu/sk-inet.c: In function 'can_dump_ipproto':
criu/sk-inet.c:131:16: error: 'IPPROTO_MPTCP' undeclared (first use in this function); did you mean 'IPPROTO_MTP'?
  131 |   if (proto == IPPROTO_MPTCP)
      |                ^~~~~~~~~~~~~
      |                IPPROTO_MTP

Add definition for MPTCP to fix this error.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:22 -08:00
Radostin Stoyanov
fddca67cc6 seize: fix pause devices for frozen containers
The container checkpointing procedure in Kubernetes freezes running
containers to create a consistent snapshot of both the runtime state
and the rootfs of the container. However, when checkpointing a GPU
container, the container must be unfrozen before invoking the
cuda-checkpoint tool.

This is achieved in prepare_freezer_for_interrupt_only_mode(), which
needs to be called before the PAUSE_DEVICES hook. The patch introducing
this functionality fixes this problem for containers with multiple
processes. However, if the container has a single process,
prepare_freezer_for_interrupt_only_mode() must be invoked immediately
before the PAUSE_DEVICES hook.

Fixes: #2514

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:21 -08:00
Lorenzo Fontana
366d73a4c2 make: remove checks and warnings for bsd strlcat and strlcpy
In 0a7c5fd1bd we swapped the BSD
implementation of strlcat and strlcpy in favor of our own replacement.

The checks and the predefined macros are not needed anymore.

Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
2025-11-02 07:48:21 -08:00
Andrei Vagin
1eaa870cce kerndat: check that hardware breakpoints work
In some cases, they might not work in virtual machines if the hypervisor
doesn't virtualize them. For example, they don't work in AMD SEV virtual
machines if the Debug Virtualization extension isn't supported or isn't
enabled in SEV_FEATURES.

Fixes #2658

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:21 -08:00
Radostin Stoyanov
b458a5c1ad sk-inet: add message how to disable MPTCP in Go
With Go version 1.24, ListenConfig now uses MPTCP by default [1].
Checkpoint/restore for this protocol is not currently supported
and adding support requires kernel changes that are not trivial
to implement. As a result, checkpointing of many containers that
run Go programs is likely to fail with the following error [2]:

(00.026522) Error (criu/sk-inet.c:130): inet: Unsupported proto 262 for socket 2f9bc5

This patch adds a message with suggested workaround for this problem.

[1] https://go.dev/doc/go1.24#netpkgnet
[2] https://github.com/checkpoint-restore/criu/issues/2655

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:48:21 -08:00
Pavel Tikhomirov
5a725266ac zdtm: add mnt_ro_root test
It makes root mount readonly and checks that it is still readonly after
migration.

Make zdtm/static writable for logs via "bind" desc option.

v2: explain why we don't have explicit rw/ro flag check
v3: use new zdtm "bind" desc option

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:21 -08:00
Pavel Tikhomirov
6b3826a6fb zdtm/lib: add "bind" desc option
Add {'bind': 'path/to/bindmount'} zdtm descriptor option, so that in
test mount namespace a directory bindmount can be created before running
the test.

This is useful to leave test directory writable (e.g. for logs) while
the test makes root mount readonly. note: We create this bindmount early
so that all test files are opened on it initially and not on the below
mount. Will be used in mnt_ro_root test.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:21 -08:00
Pavel Tikhomirov
88cb552f69 mount: restore root mount flags
Mount flags belong to mount and mount namespace of the Container, so we
should preserve them, as Container user will not expect mounts switching
between ro and rw over c/r.

Fixes: #2632

v5: fix both mount-v1 and mount-v2

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-11-02 07:48:17 -08:00
Radostin Stoyanov
b6dca31162 aarch64/crtools: fix define for missing constants
Building CRIU package on Debian 11 aarch64 fails with

criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:32:31: error: storage size of 'paca' isn't known
  struct user_pac_address_keys paca;
                               ^~~~
criu/arch/aarch64/crtools.c:33:31: error: storage size of 'pacg' isn't known
  struct user_pac_generic_keys pacg;
                               ^~~~
criu/arch/aarch64/crtools.c:47:15: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
  if (hwcaps & HWCAP_PACA) {
               ^~~~~~~~~~
               HWCAP_FCMA
criu/arch/aarch64/crtools.c:47:15: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c:53:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
   if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~
                                            NT_ARM_SVE
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
   ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
                                       ^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:82:15: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
  if (hwcaps & HWCAP_PACG) {
               ^~~~~~~~~~
               HWCAP_AES
criu/arch/aarch64/crtools.c:88:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
   if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~
                                            NT_ARM_SVE
criu/arch/aarch64/crtools.c:33:31: error: unused variable 'pacg' [-Werror=unused-variable]
  struct user_pac_generic_keys pacg;
                               ^~~~
criu/arch/aarch64/crtools.c:32:31: error: unused variable 'paca' [-Werror=unused-variable]
  struct user_pac_address_keys paca;
                               ^~~~
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:227:31: error: storage size of 'upaca' isn't known
  struct user_pac_address_keys upaca;
                               ^~~~~
criu/arch/aarch64/crtools.c:228:31: error: storage size of 'upacg' isn't known
  struct user_pac_generic_keys upacg;
                               ^~~~~
criu/arch/aarch64/crtools.c:241:18: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
   if (!(hwcaps & HWCAP_PACA)) {
                  ^~~~~~~~~~
                  HWCAP_FCMA
criu/arch/aarch64/crtools.c:255:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
   if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~
                                            NT_ARM_SVE
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
   if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:268:18: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
   if (!(hwcaps & HWCAP_PACG)) {
                  ^~~~~~~~~~
                  HWCAP_AES
criu/arch/aarch64/crtools.c:275:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
   if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~
                                            NT_ARM_SVE
criu/arch/aarch64/crtools.c:233:6: error: variable 'ret' set but not used [-Werror=unused-but-set-variable]
  int ret;
      ^~~
criu/arch/aarch64/crtools.c:228:31: error: unused variable 'upacg' [-Werror=unused-variable]
  struct user_pac_generic_keys upacg;
                               ^~~~~
criu/arch/aarch64/crtools.c:227:31: error: unused variable 'upaca' [-Werror=unused-variable]
  struct user_pac_address_keys upaca;
                               ^~~~~
This patch adds the missing constants and structs if undefined.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:42:55 -08:00
Andrei Vagin
5de61a721f net: nftables: avoid restore failure if the CRIU nft table already exist
CRIU locks the network during restore in an "empty" network namespace.
However, "empty" in this context means CRIU isn't restoring the
namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.

Fixes #2650

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:42:55 -08:00
Younes Manton
b9da95b0b2 s390: Fix FP reg restore after parasite code runs
Currently we save FP regs before parasite code runs, and restore after
for --leave-running, --check-only, and in case of errors. In case of
errors the error may have happened before FP regs were saved, so we
should only restore them if they were actually saved.

Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
2025-11-02 07:42:55 -08:00
Adrian Reber
74799ae023 aarch64: fix build with missing NT_ARM_PAC_ENABLED_KEYS
On a RHEL 8 based system building CRIU fails with:

criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
   ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
                                       ^~~~~~~~~~~~~~~~~~~~~~~
                                       NT_ARM_PACA_KEYS
criu/arch/aarch64/crtools.c:73:39: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
   if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
                                            ^~~~~~~~~~~~~~~~~~~~~~~
                                            NT_ARM_PACA_KEYS

This adds the missing define if it is undefined.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-11-02 07:42:55 -08:00
Radostin Stoyanov
6805841660 cuda: remove redundant goto label
The `goto interrupt` label is unnecessary as the code directly
returns after `cuda_process_checkpoint_action()`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:42:55 -08:00
Radostin Stoyanov
e7aee3c5c7 cuda: use pr_perror for libc function errors
When handing errors for functions such as `ptrace()`, `pipe()`, and
`fork()` it would be better to use `pr_perror` instead of `pr_err`
as it would include a message describing the encountered error.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-02 07:42:55 -08:00
Andrei Vagin
5ff52326e1 restore: use the new kernel interface to restore timers
Thomas Gleixner introduced the new interface to create posix timers
with specifed timer IDs:
ec2d0c0462

Previously, CRIU recreated timers by repeatedly creating and deleting
them until the desired ID was reached. This approach isn't fast,
especially for timers with large IDs. For example, restoring two timers
with IDs 1000000 and 2000000 took approximately 1.5 seconds.

The new `prctl()` based interface allows direct creation of timers with
specified IDs, reducing the restoration time to around 3 microseconds
for the same example.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:42:55 -08:00
Andrei Vagin
9a1e979666 compel: fix the stack test
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.

This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:42:55 -08:00
Yuanhong Peng
daa548bbfb criu: Do not print failed message when there is no late stage hook
This is highly confusing, and it seems that the ret variable
is not handled in the subsequent process.

Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
2025-11-02 07:42:55 -08:00
Adrian Reber
34226fd243 ci: try GitHub arm runners
Signed-off-by: Adrian Reber <areber@redhat.com>
2025-11-02 07:42:55 -08:00
Andrei Vagin
a44aa6d985 criu: Version 4.1.1
This release of CRIU (4.1.1) addresses a critical compatibility issue
introduced in the Linux kernel and back-ported to all stable releases.

The kernel commit (12f147ddd6de "do_change_type(): refuse to operate on
unmounted/not ours mounts") addressed the security issue introduced
almost 20 years ago. Unfortunately, this change inadvertently broke the
restore functionality of mount namespaces within CRIU. Users attempting
to restore a container on updated kernels would encounter the error:
"mnt-v2: Failed to make mount 476 slave: Invalid argument."

This release contains the necessary adjustments to CRIU, allowing it to
work seamlessly with kernels incorporating this security change.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-07-29 09:10:08 -07:00
Andrei Vagin
ced15c302b test/zdtm: remove unused compiler argument
Fixes a clang compile-time error:
"argument unused during compilation: '-c'".

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-07-29 09:10:08 -07:00
Andrei Vagin
570621a48a mount-v2: enter the mount namesapce to propagation properties
A kernel change (commit 12f147ddd6de, "do_change_type(): refuse to
operate on unmounted/not ours mounts") modified how mount propagation
properties can be changed. Previously, these properties could be changed
from any mount namespace. Now, they can only be modified from the
specific mount namespace where the target mount is actually mounted

This commit addresses this new restriction by ensuring that CRIU enters the
correct mount namespace before attempting to restore mount propagation
properties (MS_SLAVE or MS_SHARED) for a mount.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-07-29 09:10:08 -07:00
Andrei Vagin
b6059ff193 criu: Version 4.1 (CRISC-V)
Major changes:
* RISC-V Support
* PIDFD Support
* CUDA Enhancements
* Fixes here and there

The full changelog can be found here: https://criu.org/Download/criu/4.1.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-25 14:31:33 -07:00
Ivan Pravdin
bc14153173 criu: fix log_keep_err signal deadlock
When using pr_err in signal handler, locking is used
in an unsafe manner. If another signal happens while holding the
lock, deadlock can happen.

To fix this, we can introduce mutex_trylock similar to
pthread_mutex_trylock that returns immediately. Due to the fact
that lock is used only for writing first_err, this change garantees
that deadlock cannot happen.

Fixes: #358

Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com>
2025-03-25 14:31:33 -07:00
Bui Quang Minh
0f64709442 namespace: skip cleaning up the uid/gid map in error cases
free_userns_maps is called to clean up uid/gid map when the dump
finishes. If we try to clean up these maps in error cases, it can lead
to double free panic. So just skip cleaning up these maps and let
free_userns_maps do its job.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2025-03-25 14:31:33 -07:00
Adrian Reber
6826ac58ce ci: run tests on a nftables only system
Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
700a8c4b5e ci: do not run tests requiring iptables if it is missing
There are a couple of tests that require the iptables binary.

Instead of adding a checkskip script, which could also handle this,
this change now uses CRIU's feature detection to see if the CRIU
feature 'has_ipt_legacy' exists.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
f22330ff07 test: print out logs if tests fail
If the tests in others/rpc are failing no information about that error
can be seen in a CI run. This change displays the log files if the test
fails.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
29ccb5b625 test: others/rpc do not use nftables locking backend
The tests in others/rpc are running as non-root and
fail silently if the nftables network locking backend is used.

This switches those tests to skip the network locking.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
95729ec328 docs: mark make commands with same format as elsewhere
This uses the same formatting for the make command examples as seen in
README.md.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
2cd9d5ded8 docs: update INSTALL.md with a section about building CRIU
The building section also contains the information how to change the
network locking backend without source code changes.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
867c773031 make: allow setting the default network locking backend
As different Linux distributions are switching away from iptables
to nftables, this makes it easier to compile CRIU with a different
default network locking backend. Instead of changing the source
code it is now possible to select the nft backend like this:

    make NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
720bf67e06 zdtm/vdso02: unmap vvar_vclock mappings
It is a part of vvar and this test intends to unmap vdso and all vvar
mappings.

Fixes #2622

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
62a4a5874b vdso: correct data types for ELF hash table sizes
Let's change the data types of `nbucket` and `nchain` to uint32.

This should fix the following compile-time error on arm32:
/criu/criu/pie/util-vdso.c:336: undefined reference to `__aeabi_uldivmod'

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
AV
b8553d19ed test/zdtm: check that PAC keys are C/R-ed
Add another variation of ptrhead00 compiled with enabled branch-protection.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
AV
8ae5db37bb arm64: C/R PAC keys
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All this properties have to be C/R-ed.

As they are per-process protperties, we can save/restore them just for
one thread.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Han-Wen Nienhuys
c5d46d86a8 restorer: Add a lock around cgroupd communication.
Threads are put into cgroups through the cgroupd thread, which
communicates with other threads using a socketpair.

Previously, each thread received a dup'd copy of the socket, and did
the following

    sendmsg(socket_dup_fd, my_cgroup_set);

    // wait for ack.
    while (1) {
        recvmsg(socket_dup_fd, &h, MSG_PEEK);
        if (h.pid != my_pid) continue;
        recvmsg(socket_dup_fd, &h, 0);
    }
    close(socket_dup_fd);

When restoring many threads, many threads would be spinning in the
above loop waiting for their PID to appear.

In my test-case, restoring a process with a 11.5G heap and 491 threads
could take anywhere between 10 seconds and 60 seconds to complete.

To avoid the spinning, we drop the loop and MSG_PEEK, and add a lock
around the above code. This does not decrease parallelism, as the
cgroupd daemon uses a single thread anyway.

With the lock in place, the same restore consistently takes around 10
seconds on my machine (Thinkpad P14s, AMD Ryzen 8840HS).

There is a similar "daemon" thread for user namespaces. That already
is protected with a similar userns_sync_lock in __userns_call().

Fixes #2614

Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
2025-03-21 12:40:31 -07:00
Han-Wen Nienhuys
7748b3fe73 pstree: print clone flags in error message
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
d855501575 vdso: Fixes in DT_GNU_HASH handling
* Hash buckets is an array of 32-bit words. While DT_HASH is 32-bit on
  most platforms except s390 (where it's 64-bit).
* The bloom filter word size differs between 32-bit and 64-bit ELF
  files. This commit adjusts the code to handle both cases.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
ed6374b48c lsm: use the user provided lsm label
Currently CRIU has the possibility to specify a LSM label during
restore. Unfortunately the information is completely ignored in the case
of SELinux.

This change selects the lsm label from the user if it is provided and
else the label from the checkpoint image is used.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
d35808f5ee ci: update to latest actions for codeql CI job
Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
c298b51a69 scripts/uninstall_module: import signal module
With Python 3.13, the `subprocess` module now uses the
`posix_spawn()` function [1], which requires the `signal`
module to be imported.

Fixes: #2607

[1] https://docs.python.org/3/whatsnew/3.13.html#subprocess

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
समीर सिंह Sameer Singh
38b9807cd5 coredump: enable coredump generation on arm
Add relevant elf header constants and notes for the arm platform
to enable coredump generation.

Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
2025-03-21 12:40:31 -07:00
समीर सिंह Sameer Singh
da90b33a42 coredump: enable coredump generation on aarch64
Add relevant elf header constants and notes for the aarch64 platform
to enable coredump generation.

Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
2025-03-21 12:40:31 -07:00
dschervov
030fa4affd criu: fix internal representation of cgroups hierarchical structure
strstartswith() function is incorrect choice for finding parent
directory so i change it to issubpath() function

Signed-off-by: Dmitrii Chervov <dschervov1@yandex.ru>
2025-03-21 12:40:31 -07:00
Andrei Vagin
b7fa7d304c kerndat: run iptables with -n to not resolve service names
Resolving service names can be slow and it isn't needed here.

Fixes #2032

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
528c94c48b ci: install gawk for Fedora based tests
Currently Fedora rawhide based CI runs fail with:

 /bin/sh: line 1: awk: command not found

Let's install it.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Kir Kolyshkin
d66bc34995 Makefile: move codespell options to .codespellrc
This way,
 - Makefile is less cluttered;
 - one can run codespell from the command line.

Fixes: fd7e97fcf ("lint: exclude tags file from codespell")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
8a06ca27cc vdso: switch from DT_HASH to DT_GNU_HASH (aarch64)
Trying to run latest CRIU on CentOS Stream 10 or Ubuntu 24.04 (aarch64)
fails like this:

    # criu/criu check -v4
    [...]
    (00.096460) vdso: Parsing at ffffb2e2a000 ffffb2e2c000
    (00.096539) vdso: PT_LOAD p_vaddr: 0
    (00.096567) vdso: DT_STRTAB: 1d0
    (00.096592) vdso: DT_SYMTAB: 128
    (00.096616) vdso: DT_STRSZ: 8a
    (00.096640) vdso: DT_SYMENT: 18
    (00.096663) Error (criu/pie-util-vdso.c:193): vdso: Not all dynamic entries are present
    (00.096688) Error (criu/vdso.c:627): vdso: Failed to fill self vdso symtable
    (00.096713) Error (criu/kerndat.c:1906): kerndat_vdso_fill_symtable failed when initializing kerndat.
    (00.096812) Found mmap_min_addr 0x10000
    (00.096881) files stat: fs/nr_open 1073741816
    (00.096908) Error (criu/crtools.c:267): Could not initialize kernel features detection.

This seems to be related to the kernel (6.12.0-41.el10.aarch64). The
Ubuntu user-space is running in a container on the same kernel.

Looking at the kernel this seems to be related to:

    commit 48f6430505c0b0498ee9020ce3cf9558b1caaaeb
    Author: Fangrui Song <i@maskray.me>
    Date:   Thu Jul 18 10:34:23 2024 -0700

        arm64/vdso: Remove --hash-style=sysv

        glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
        for more than one decade in many Linux distributions.  Using
        --hash-style=sysv might imply unaddressed issues and confuse readers.

        Just drop the option and rely on the linker default, which is likely
        "both", or "gnu" when the distribution really wants to eliminate sysv
        hash overhead.

        Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").

The commit basically does:

    -ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv \
    +ldflags-y := -shared -soname=linux-vdso.so.1 \

Which results in only a GNU hash being added to the ELF header. This
change has been merged with 6.11.

Looking at the referenced x86 commit:

    commit 6b7e26547fad7ace3dcb27a5babd2317fb9d1e12
    Author: Andy Lutomirski <luto@amacapital.net>
    Date:   Thu Aug 6 14:45:45 2015 -0700

        x86/vdso: Emit a GNU hash

        Some dynamic loaders may be slightly faster if a GNU hash is
        available.  Strangely, this seems to have no effect at all on
        the vdso size.

        This is unlikely to have any measurable effect on the time it
        takes to resolve vdso symbols (since there are so few of them).
        In some contexts, it can be a win for a different reason: if
        every DSO has a GNU hash section, then libc can avoid
        calculating SysV hashes at all.  Both musl and glibc appear to
        have this optimization.

        It's plausible that this breaks some ancient glibc version.  If
        so, then, depending on what glibc versions break, we could
        either require COMPAT_VDSO for them or consider reverting.

Which is also a really simple change:

    -VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) \
    +VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \

The big difference here is that for x86 both hash sections are
generated. For aarch64 only the newer GNU hash is generated. That is why
we only see this error on kernel >= 6.11 and aarch64.

Changing from DT_HASH to DT_GNU_HASH seems to work on aarch64.  The test
suite runs without any errors.

Unfortunately I am not aware of all implication of this change and if a
successful test suite run means that it still works.

Looking at the kernel I see following hash styles for the VDSO:

aarch64: not specified (only GNU hash style)
arm: --hash-style=sysv
loongarch: --hash-style=sysv
mips: --hash-style=sysv
powerpc: --hash-style=both
riscv: --hash-style=both
s390: --hash-style=both
x86: --hash-style=both

Only aarch64 on kernels >= 6.11 is a problem right now, because all
other platforms provide the old style hashing.

Signed-off-by: Adrian Reber <areber@redhat.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Co-authored-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
2025-03-21 12:40:31 -07:00
Pavel Tikhomirov
6710cfce10 zdtm/netns_sub_sysctl: add ipv4/ping_group_range sysctl check
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-03-21 12:40:31 -07:00
Pavel Tikhomirov
4ca74b9aff net/sysctl: c/r ipv4/ping_group_range value
It is per net namespace, we need it to allow creation of unprivileged
ICMP sockets.

Note: in case this sysctl was disabled after unprivileged ICMP
socket was created we still need to somehow handle it on restore.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-03-21 12:40:31 -07:00
Pavel Tikhomirov
9c40781c26 net/sysctl: put common multiplier outside the brackets
Also add an explanation of the logic behind this calculation.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
d226bd4f67 ci: handle results from latest codespell
CI pulls in a newer version of codespell. This fixes complaints from
that codespell version.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
e2dffcbc8e lib: do not set protobuf has_* field too early
For two cases libcriu was setting the RPC protobuf field `has_*` before
checking if the given parameter is valid. This can lead to situations,
if the caller doesn't check the return value, that we pass as RPC struct
to CRIU which has the `has_*` protobuf field set to true, but does not
have a verified value (or non at all) set for the actual RPC entry.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
82b03429b7 cuda: disable CUDA plugin for pre-dump
Temporarily disable CUDA plugin for `criu pre-dump`.

pre-dump currently fails with the following error:

Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0                                  [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895                              /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1

We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
7f0d107fe5 seize: use separate checkpoint_devices function
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to
ensure that the function's sole responsibility is to use the cgroup
freezer for the process tree. This allows us to avoid a time-out
error when checkpointing applications with large GPU state.

v2: This patch calls `checkpoint_devices()` only for `criu dump`.
Support for GPU checkpointing with `pre-dump` will be introduced in
a separate patch.

Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
02056bf41a cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Adrian Reber
f83931542a net: remember the name of the lock chain (nftables)
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain be
called 'CRIU-1' most of the time which is also not really unique.

With this commit the change is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
54795f174b criu: use libuuid for criu_run_id generation
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.

libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
815ef68848 ci: two check-commits.yml changes
* Switch to v4 actions/checkout (from v3)
 * Use our apt wrapper to gracefully handle temporary repository errors

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Austin Kuo
061f4266e8 test/zdtm: add a new test to check non-periodic timers
It creates a few timers with log expiration intervals, waites for C/R
and check that timers are armed and their intervals have been restored.

Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
2025-03-21 12:40:31 -07:00
Austin Kuo
09dc2e9584 timer: Refine itimer_armed logic and improve timer value handling
Right now, CRIU skips timers non-periodic timers. This change addresses
this issue.

Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
aad66a4f7c test: fix cmdlinenv00 on aarch64
On aarch64 the test cmdlinenv00 was failing with:

  FAIL: cmdlinenv00.c:120: auxv corrupted on restore (errno = 11 (Resource temporarily unavailable))

Starting with Linux kernel version 6.3 the size of AUXV was changed:

    commit 28c8e088427ad30b4260953f3b6f908972b77c2d
    Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Date:   Wed Jan 4 14:20:54 2023 -0500

        rseq: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries

        Two new auxiliary vector entries are introduced for rseq without
        matching increment of the AT_VECTOR_SIZE_BASE, which causes failures
        with CONFIG_HARDENED_USERCOPY=y.

        Fixes: 317c8194e6ae ("rseq: Introduce feature size and alignment ELF auxiliary vector entries")

With this change AT_VECTOR_SIZE increases from 40 to 50 on aarch64. CRIU
uses AT_VECTOR_SIZE to read the content of /proc/PID/auxv

        auxv_t mm_saved_auxv[AT_VECTOR_SIZE];
        ret = read(fd, mm_saved_auxv, sizeof(mm_saved_auxv));

Now the tests works again on aarch64.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
2b74924805 files-reg: fix buffer overflow on aarch64
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in following error:

    # ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
    userns is supported
    === Run 1/1 ================ zdtm/static/unlink_regular00
    ==================== Run zdtm/static/unlink_regular00 in ns ====================
    Skipping rtc at root
    Start test
    Test is SUID
    ./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
    Run criu dump
    *** buffer overflow detected ***: terminated
    ############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
    Test output: ================================

     <<< ================================
    Send the 9 signal to  47
    Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
    ##################################### FAIL #####################################

According to the backtrace:

    #0  __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
    #1  0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
    #2  0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #3  0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
    #4  0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
    #5  0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
    #6  0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
    #7  0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
        format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
    #8  0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
        at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
    #9  create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
        idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
        fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
    #10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
        parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
        at criu/files-reg.c:1198
    #11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
    #12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
    #13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
        e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
    #14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
        at criu/files.c:657
    #15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
    #16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
    #17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293

This line is the problem:

    snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);

The problem was that the `-1` was on the inside of the braces and not on
the outside. This way the destination size was increase by 1 instead of
being decreased by 1 which triggered the buffer overflow detection.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Yuanhong Peng
6fdac50818 seize: Adjust the position of the log message
Based on the code, the `ret` variable at this point does not
represent the task state, so this log message should be
moved to a position after the `compel_wait_task()` function.

Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
97398068b1 net: redirect nftables stdout and stderr to CRIU's log file
When using the nftables network locking backend and restoring a process
a second time the network locking has already been deleted by the first
restore. The second restore will print out to the console text like:

Error: Could not process rule: No such file or directory
delete table inet CRIU-202621

With this change CRIU's log FD is used by libnftables stdout and stderr.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Adrian Reber
6dce80c533 util: added cleanup_file attribute.
Signed-off-by: Adrian Reber <areber@redhat.com>
2025-03-21 12:40:31 -07:00
Liu Chao
260c08418b zdtm: Check CapAmb is restored correctly after C/R
This test sets CapAmb according to CapPrm and CapInh and check CapAmb
after C/R.

Signed-off-by: Liu Chao <liuchao173@huawei.com>
2025-03-21 12:40:31 -07:00
Liu Chao
6f8efad304 cr: Task CapAmb support
Signed-off-by: Liu Chao <liuchao173@huawei.com>
2025-03-21 12:40:31 -07:00
Kir Kolyshkin
94b9b3c5da freeze_processes: implement kludges for cgroup v1
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: https://github.com/opencontainers/runc/issues/4273
[2]: https://github.com/opencontainers/runc/issues/4457

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-21 12:40:31 -07:00
Kir Kolyshkin
82f4ecda69 freeze_processes: fix logic
There are a few issues with the freeze_processes logic:

1. Commit 9fae23fbe2 grossly (by 1000x) miscalculated the number of
   attempts required, as a result, we are seeing something like this:

> (00.000340) freezing processes: 100000 attempts with 100 ms steps
> (00.000351) freezer.state=THAWED
> (00.000358) freezer.state=FREEZING
> (00.100446) freezer.state=FREEZING
> ...close to 100 lines skipped...
> (09.915110) freezer.state=FREEZING
> (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
> (10.000563) freezer.state=FREEZING

   For 10s with 100ms steps we only need 100 attempts, not 100000.

2. When the timeout is hit, the "failed to freeze cgroup" error is not
   printed, and the log_unfrozen_stacks is not called either.

3. The nanosleep at the last iteration is useless (this was hidden by
   issue 1 above, as the timeout was hit first).

Fix all these.

While at it,

4. Amend the error message with the number of attempts, sleep duration,
   and timeout.

5. Modify the "freezing cgroup" debug message to be in sync with the
   above error.

   Was:

   > freezing processes: 100000 attempts with 100 ms steps

   Now:

   > freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-21 12:40:31 -07:00
Kir Kolyshkin
99e1fbd8a2 criu/seize.c: clang-format it
Done using clang-format 19.1.5 with .clang-format obtained via
scripts/fetch-clang-format.sh.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
a8754905c0 test: run scm06 in the ns and uns flavors
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.

Fixes #2537

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
15c81c1262 test/java: increate the ghost file limit
Right now, this test fails with this error:
Error (criu/files-reg.c:1031): Can't dump ghost file
  /criu/test/javaTests/omrvmem_000000626_Mlm48x of 2097152 size,
  increase limit

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Jesus Ramos
dc6cef0b4c cuda: Fix return value from CHECKPOINT_DEVICES hook so that dump's fail properly
cuda-checkpoint returns the positive CUDA error code when it runs into an issue
and passing that along as the return value would cause errors to get ignored

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
8ee2eba47c vdso: handle vvar_vclock vma-s
The vvar_vclock was introduced by [1]. Basically, the old vvar vma has
been splited on two parts. In term of C/R, these two vma-s can be still
treated as one.

[1] e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
ed560a3491 pidfd: add missing include
Fix for the following error when building CRIU on Rocky Linux 8

criu/pidfd.c: In function ‘pidfd_open’:
criu/pidfd.c:119:17: error: ‘__NR_pidfd_open’ undeclared (first use in this function); did you mean ‘pidfd_open’?
  return syscall(__NR_pidfd_open, pid, flags);
                 ^~~~~~~~~~~~~~~
                 pidfd_open
criu/pidfd.c:119:17: note: each undeclared identifier is reported only once for each function it appears in
criu/pidfd.c:120:1: error: control reaches end of non-void function [-Werror=return-type]
 }
 ^
criu/pidfd.c: At top level:
cc1: error: unrecognized command line option ‘-Wno-unknown-warning-option’ [-Werror]
cc1: error: unrecognized command line option ‘-Wno-dangling-pointer’ [-Werror]
cc1: all warnings being treated as errors

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Alexander Mikhalitsyn
40b7f04b7c compel/arch/riscv64: properly implement compel_task_size()
We need to dynamically calculate TASK_SIZE depending
on the MMU on RISC-V system. [We are using analogical
approach on aarch64/ppc64le.]

This change was tested on physical machine:
StarFive VisionFive 2
isa		: rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
mmu		: sv39
uarch		: sifive,u74-mc
mvendorid	: 0x489
marchid		: 0x8000000000000007
mimpid		: 0x4210427
hart isa	: rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-03-21 12:40:31 -07:00
Alexander Mikhalitsyn
399d7bdcbb compel: fix gitignore and remove autogenerated code
We don't need to have compel/arch/riscv64/plugins/std/syscalls/syscalls.S
tracked in git. It is autogenerated. We also need to update our .gitignore
to ignore autogenerated files with syscall tables.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
21e5f4cfd5 test: add get-state to mocked cuda-checkpoint tool
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
28c2cb3fd6 cuda: enable checkpoint support for paused tasks
If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.

This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.

Another use case for this functionality is optimizing resource
utilization, where CUDA tasks with low-priority are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
498bcf2806 zdtm: Check many processes with common dead pidfd
We have multiple processes open a pidfd to a common dead process.
After C/R we check that the inode numbers for these pidfds are equal or
not.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Andrei Vagin
7125bfc695 pidfd: one process creates a helper and opens all fds to it
Currently, the `waitpid()` call on the tmp process can be made by a
process which is not its parent. This causes restore to fail.

This patch instead selects one process to create the tmp process and
open all the fds that point to it. These fds are sent to the correct
process(es).

Fixes: #2496

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
b1cac7a8e5 cuda: fix check for GPU device availability
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.

The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: #2509

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
36a53fe23c ci: test interrupt-only mode with frozen cgroup
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
4196268eef seize: enable support for frozen containers
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.

However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads.  To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:
 - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
 - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
 - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.

Fixes: #2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
ff9dbef902 seize: fix error handling for check_freezer_cgroup
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls
`return ret`. However, the value of `ret` has been set to `0` in the lines
above and CRIU does not handle the error properly.

This problem is related to https://github.com/checkpoint-restore/criu/issues/2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Lorenzo Fontana
622b43392f criu: Initialize util before service worker starts
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes #2499

[ avagin: minore code changes ]

Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
2025-03-21 12:40:31 -07:00
Liu Hua
9052ef93c7 uffd: Disable image deduplication after fork
After a fork, both the child and parent processes may trigger a page fault (#PF)
at the same virtual address, referencing the same position in the page image.
If deduplication is enabled, the last process to trigger the page fault will fail.

Therefore, deduplication should be disabled after a fork to prevent this issue.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2025-03-21 12:40:31 -07:00
Cryolitia PukNgae
2be958d22e include: don't use GCC's __builtin_ffs on riscv64
Link: e300da4db4

Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
---
- cherry-picked
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
da6b1807ef ci: add workflow for riscv64
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
bb29067de9 zdtm: add riscv64 support
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
6d970ed047 criu: add riscv64 support to parasite and restorer
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
1d028ef44e images: add riscv64 core image
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
95359a62aa compel: add riscv64 support
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
---
- rebased
- added a membarrier() to syscall table (fix authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-03-21 12:40:31 -07:00
Haorong Lu
d8f93e7bac include: add common header files for riscv64
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
---
- rebased
- imported a page_size() type fix (authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
c49eb18f9f pidfd: block SIGCHLD during tmp process creation
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.

Fixes: #2490

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
5ca4400699 zdtm: add inventory test plugins
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
5335b35f72 images/inventory: add field for enabled plugins
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.

We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:

- When the field is missing, it indicates that the checkpoint was
  created with a previous version of CRIU, and all plugins should be
  *enabled* during restore.

- When the field is empty, it indicates that no plugins were used during
  checkpointing. Thus, all plugins can be *disabled* during restore.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
b524dab32f pycriu: fix lint errors
This patch fixes the following errors reported by ruff:

lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
305 |     elif field.type in _basic_cast:
306 |         cast = _basic_cast[field.type]
307 |         if pretty and (cast == int):
    |                        ^^^^^^^^^^^ E721
308 |             if is_hex:
309 |                 # Fields that have (criu).hex = true option set
    |

lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
377 |     elif field.type in _basic_cast:
378 |         cast = _basic_cast[field.type]
379 |         if (cast == int) and is_string(value):
    |             ^^^^^^^^^^^ E721
380 |             if _marked_as_dev(field):
381 |                 return encode_dev(field, value)
    |

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
88aa7e2c10 make/lint: use 'ruff check <path>'
The command `ruff <path>` has been deprecated and removed:
https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
f29e655df9 zdtm: Check pidfd for thread is valid after C/R
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.

signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
7a64004dc8 zdtm: Check fd from pidfd_getfd is C/Red correctly
We get the read end of a pipe using `pidfd_getfd` and check if we can
read from it after C/R.

signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
2e6f348458 zdtm: Check dead pidfd is restored correctly
After, C/R of pidfds that point to dead processes their inodes might
change. But if two pidfds point to same dead process they should
continue to do so after C/R.

This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.

Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.

signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
3f30ec0eda zdtm: Check pidfd can kill descendant processes
Validate that pidfds can been used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
2899d46000 zdtm: Check pidfd can send signal after C/R
Ensure `pidfd_send_signal()` syscall works as expected after C/R.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
3096df9ea3 zdtm: Check pidfd fdinfo entry is consistent
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
1ce408ffa4 criu: Support C/R of pidfds
Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.

For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd

Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similiar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9
Hence, check for pidfds occurs before the check for regular files.

For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.

This commit does not include support for pidfds that point to a specific
thread, i.e pidfds opened with `PIDFD_THREAD` flag.

Fixes: #2258

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Bhavik Sachdev
3322d1e94c images: Add protobuf definition for pidfd
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.

After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
4f8f6f2883 Makefile.config: set CR_PLUGIN_DEFAULT variable
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
f1d465448f amdgpu: remove exec permissions on source files
This patch fixes the following warnings that appear
when building an RPM package:

+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Andrei Vagin
c2b48ff423 criu: Version 4.0 (CRIUDA)
Major changes:
* CUDA plugin to support checkpointing and restoring NVIDIA CUDA applications.
* Shadow stack support
* Pagemap cache: Added support for PAGEMAP_SCAN ioctl

The full changelog can be found here: https://criu.org/Download/criu/4.0.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-21 16:34:14 -07:00
Andrei Vagin
a8cbe76d4f util: dump fsfd log messages
It should help to investigate errors of fsconfig, fsmount and etc.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
David Francis
096c1f7a4d plugins/amdgpu - Increase maximum parameter length
The topology parsing assumed that all parameter names were
30 characters or fewer, but

recommended_sdma_engine_id_mask

is 31 characters.

Make the maximum length a macro, and set it to 64.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-19 15:23:42 -07:00
David Francis
60ee5ebd9d plugins/amdgpu: Zero ib_info on initialization
This struct was being used un-initialized, meaning it
was filled with random garbage.

Mea culpa.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-19 15:23:42 -07:00
Andrei Vagin
6918998897 plugin/cuda: disable CUDA plugin if /dev/nvidiactl isn't present
The presence of /dev/nvidiactl indicates that the system has a
compatible NVIDIA GPU driver installed and that the GPU is accessible to
the operating system.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
Andrei Vagin
e1331a4b60 fault: allow to check dont_use_freeze_cgroup
Adds a new "fault" to call dont_use_freeze_cgroup.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
Andrei Vagin
651df375bd criu: Allow disabling freeze cgroups
Some plugins (e.g., CUDA) may not function correctly when processes are
frozen using cgroups. This change introduces a mechanism to disable the
use of freeze cgroups during process seizing, even if explicitly
requested via the --freeze-cgroup option.

The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
59f49c6276 codespell: fix typos
This patch fixes the following typos reported by codespell:

./test/others/bers/bers.c:394: dependin ==> depending, depend in
./criu/kerndat.c:837: hitted ==> hit

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
edb6fbb820 scripts/uninstall_module: fix package discovery
The `uninstall_module.py` script is a wrapper for the `pip uninstall`
command that enables support for specifying installation prefix
(i.e., `--prefix`). When this functionality is used, we intentionally
set `sys.path` to include only search paths for the specified prefix
to avoid unintentional uninstallation of packages in system paths.

Since `importlib_metadata` version 8.1.0, the `Distribution.from_name()`
method has been modified [1] to perform additional pre-processing of
Distribution objects [2] that requires loading distribution metadata
and results in the following error:

  File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 422, in <lambda>
    buckets = bucket(dists, lambda dist: bool(dist.metadata))
                                              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 454, in metadata
    from . import _adapters
  File "/usr/local/lib/python3.12/site-packages/importlib_metadata/_adapters.py", line 3, in <module>
    import email.message
  File "/usr/lib64/python3.12/email/message.py", line 11, in <module>
    import quopri
  ModuleNotFoundError: No module named 'quopri'

This error occurs because we have excluded system paths from the list
of search paths (`sys.path`).

However, this pre-processing is not required for our use case, as we
only use the discovery mechanism of importlib_metadata to resolve the
metadata directory path of the module being uninstalled.

To fix this problem, this patch updates `uninstall_module` to avoid the
`from_name()` method and use `discover(name=package_name)` directly.

[1] a65c29adc0
[2] a65c29ad/importlib_metadata/__init__.py (L391)

Fixes: #2468

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
b1b3c14b17 cuda: unlock on timeout error
When attempting to checkpoint a container with CUDA processes,
CRIU could fail with the following error:

	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1
	Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with

In this situation, the target process is locked, but CRIU fails due to
a timeout and exits with an error. We need to make sure that the target
PID is unlocked in such case.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Adrian Reber
dbfa450246 ci: run aarch64 tests native via actuated
Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-19 15:23:42 -07:00
Adrian Reber
8beac656fc coredump: fail on unsupported architectures early
Currently coredump only works on x86_64. Fail early on any other
architecture.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-19 15:23:42 -07:00
Adrian Reber
d44fc0de5a test: only run macvlan tests if macvlan devices can be created
Some test environments (Actuated runners for example) do not support
maclvan devices. Skip tests depending on it automatically.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-19 15:23:42 -07:00
Adrian Reber
01c65732b6 test: better test for SELinux tools
Previously the check was just if /sys/fs/selinux is mounted. This
extends the check to see if all necessary tools are installed.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-19 15:23:42 -07:00
Adrian Reber
615ccf98cf crit: do not crash on aarch64 doing 'crit x ./ rss'
Running 'crit x ./ rss' on aarch64 crashes with:

    File "/home/criu/crit/crit/__main__.py", line 331, in explore_rss
      while vmas[vmi]['start'] < pme:
            ~~~~^^^^^
  IndexError: list index out of range

This adds an additional check to the while loop to do access indexes out
of range.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
21ea718f9f plugins/amdgpu: fix printf format specifiers
Errors on aarch64:

	In file included from amdgpu_plugin_drm.h:10,
			 from amdgpu_plugin.c:33:
	amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
	amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
	 1236 |         pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
	      |         ^~~~~~~
	cc1: all warnings being treated as errors

Errors on ppc64:

	In file included from amdgpu_plugin_drm.h:10,
			 from amdgpu_plugin.c:33:
	amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
	amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
	 1236 |         pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
	      |         ^~~~~~~
	cc1: all warnings being treated as errors
	In file included from amdgpu_plugin_util.c:38:
	amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
	  196 |                 pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
	  197 |                 pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
	  198 |                 pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
	  199 |                 pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
	      |                 ^~~~~~~
	cc1: all warnings being treated as errors

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
3e2ed18790 plugins/amdgpu: use C99-standard types
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
d68205e919 ci: enable cross compile testing for amdgpu-plugin
Skip cross-compilation on armv7 because, among many other errors,
it fails with the following:

	In file included from ../../include/common/lock.h:9,
			 from ../../criu/include/files.h:9,
			 from amdgpu_plugin.c:30:
	../../include/common/asm/atomic.h:60:2: error: #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
	   60 | #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
	      |  ^~~~~
	../../include/common/asm/atomic.h: In function 'atomic_add_return':
	../../include/common/asm/atomic.h:81:9: error: implicit declaration of function 'smp_mb' [-Werror=implicit-function-declaration]
	   81 |         smp_mb();
	      |         ^~~~~~

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
2ee5844411 plugins/amdgpu: fix cross-compilation
To enable cross-compile we need to use the CC definition from
criu/scripts/nmk/scripts/tools.mk:

CC := $(CROSS_COMPILE)$(HOSTCC)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Andrei Vagin
9a19cf34de scripts/ci: run tests with the mocked cuda-checkpoint tool
Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
de31abb970 criu/plugin: don't call plugin device hooks for non-alive tasks
Dead tasks don't hold any resources.

Fixes: 2465
Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
dea6305914 test/zdtm: allow to run tests with the mocked cuda-checkpoint tool
Here is an example how to run one test:
$ python test/zdtm.py run -t zdtm/static/env00 --ignore-taint --mocked-cuda-checkpoint

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
haozi007
67fe44e981 support user set remote mmap vma address
1. os auto assignment vma addr maybe conflict with vma in gpu living migrate scene;
2. so, we should give choice to user;

Signed-off-by: haozi007 <liuhao27@huawei.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
551cd92447 timer: fix printf specifiers for __suseconds64_t
New internal glibc types __timeval64 [1] and __suseconds64_t [2] have
been introduced as a solution for the Y2038 problem [3]. These 64-bit
types are used across all architectures. However, this change causes
the following build errors when cross-compiling on ARMv7 (armhf):

criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 5 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
   49 |         pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~
   50 |                 (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
      |                                                ~~~~~~~~~~~~~~~~~~~~~
      |                                                             |
      |                                                             __suseconds64_t {aka long long int}

criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 7 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
   49 |         pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~
   50 |                 (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
   51 |                 (int64_t)val->it_interval.tv_sec, val->it_interval.tv_usec);
      |                                                   ~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                   |
      |                                                                   __suseconds64_t {aka long long int}

ns.c:234:48: error: format '%ld' expects argument of type 'long int', but argument 5 has type 'time_t' {aka 'long long int'} [-Werror=format=]
  234 |         len = snprintf(buf, sizeof(buf), "%d %ld 0", clk_id, offset);
      |                                              ~~^             ~~~~~~
      |                                                |             |
      |                                                long int      time_t {aka long long int}
      |                                              %lld

msg.c:58:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
   58 |         off += sprintf(buf + off, ".%.3ld: ", tv.tv_usec / 1000);
      |                                     ~~~~^     ~~~~~~~~~~~~~~~~~
      |                                         |                |
      |                                         long int         __suseconds64_t {aka long long int}
      |                                     %.3lld

../lib/zdtmtst.h:137:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  137 |                 test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
      |                          ^~~~~~~~~~~~~~
pthread_timers_h.c:72:17: note: in expansion of macro 'pr_perror'
   72 |                 pr_perror("wrong interval: %ld:%ld", itimerspec.it_interval.tv_sec, itimerspec.it_interval.tv_nsec);
      |                 ^~~~~~~~~

vdso00.c:22:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
   22 |         test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
      |                            ~~~~^               ~~~~~~~~~
      |                                |                 |
      |                                long int          __time64_t {aka long long int}
      |                            %10lli

vdso00.c:29:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
   29 |         test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
      |                            ~~~~^               ~~~~~~~~~
      |                                |                 |
      |                                long int          __time64_t {aka long long int}
      |                            %10lli

vdso01.c:357:42: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  357 |         test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
      |                                        ~~^                                   ~~~~~~~~~~
      |                                          |                                      |
      |                                          long int                               __time64_t {aka long long int}
      |                                        %lli

vdso01.c:357:72: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  357 |         test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
      |                                                                      ~~^                 ~~~~~~~~~~
      |                                                                        |                    |
      |                                                                        long int             __time64_t {aka long long int}
      |

vdso01.c:328:43: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  328 |         test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
      |                                         ~~^                                    ~~~~~~~~~~
      |                                           |                                       |
      |                                           long int                                __time64_t {aka long long int}
      |                                         %lli

vdso01.c:328:74: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  328 |         test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
      |                                                                        ~~^                 ~~~~~~~~~~
      |                                                                          |                    |
      |                                                                          long int             __time64_t {aka long long int}
      |

../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'time_t' {aka 'long long int'} [-Werror=format=]
  144 |                 test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
      |                          ^~~~~~~~~~~~~~~
mtime_mmap.c:80:17: note: in expansion of macro 'fail'
   80 |                 fail("mtime %ld wasn't updated on mmapped %s file", mtime_new, filename);
      |                 ^~~~

../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
  144 |                 test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
      |                          ^~~~~~~~~~~~~~~
mtime_mmap.c:101:17: note: in expansion of macro 'fail'
  101 |                 fail("After migration, mtime changed to %ld", fst.st_mtime);
      |                 ^~~~

[1] https://sourceware.org/git/?p=glibc.git;h=504c98717062cb9bcbd4b3e59e932d04331ddca5
[2] https://sourceware.org/git/?p=glibc.git;h=3fced064f23562ec24f8312ffbc14950993969e6
[3] https://en.wikipedia.org/wiki/Year_2038_problem

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
a045c874cb ci: run tests with amdgpu and cuda plugins
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
2453ed69a2 zdtm: add option to run tests with criu plugins
By default, if the "CRIU_LIBS_DIR" environment variable is not set,
CRIU will load all plugins installed in `/usr/lib/criu`. This may
result in running the ZDTM tests with plugins for a different version
of CRIU (e.g., installed from a package).

This patch updates ZDTM to always set the "CRIU_LIBS_DIR" environment
variable and use a local "plugins" directory. This directory contains
copies of the plugin files built from source. In addition, this patch
adds the `--criu-plugin` option to the `zdtm.py run` command, allowing
tests to be run with specified CRIU plugins.

Example:

  - Run test only with AMDGPU plugin
    ./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu

  - Run test only with CUDA plugin
    ./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin cuda

  - Run test with both AMDGPU and CUDA plugins
    ./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu cuda

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
ad66c27a11 cuda: fix launch cuda-checkpoint
When the cuda-checkpoint tool is not installed, execvp() is expected to
fail and return -1. In this case, we need to call exit() to terminate
the child process that was created earlier with fork().

Since CRIU can be used with applications that do not use CUDA, even
when the CUDA plugin is installed, this patch also updates the log
messages to show debug and warning (instead of error) when the
cuda-checkpoint tool is not found in $PATH.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
fde0b7ac69 cuda: don't leak fds to cuda-checkpoint
Leaking open file descriptors to third-party tools can lead
to security risks.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
4dde52a308 ci/podman: show mounts
Show information about mounts available on the host filesystem.
This is useful for debugging.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
9a85fb6382 ci/podman: show criu logs in case of error
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
liuchao173
8437663cc6 delete redundant include header files
restorer.h has been included in line 43.

Fixes: 22963d2827 ("Hide asm/restorer.h from sources")

Signed-off-by: liuchao173 <liuchao173@huawei.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
c42b58f4fb plugin: enable multiple plugins for the same hook
CRIU provides two plugins for checkpoint/restore of GPU applications:
amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to
enable restore:

    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late)
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)

However, CRIU currently does not support running more than one plugin
for the same hook. As a result, when both plugins are installed, the
resume function for CUDA applications is not executed. To fix this,
we need to make sure that both `plugin_resume_devices_late()` functions
return `-ENOTSUP` when restore is not supported.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
85050be66b seize: fix pause-devices plugin hook
The plugin hook "PAUSE_DEVICES" was recently introduced in the following
commit. This hook was intended to execute the cuda-checkpoint tool
before the process tree is frozen. However, the run_plugins() call has
been placed immediately *after* freeze_processes(). This causes the
cuda-checkpoint tool to hang indefinitely during the checkpointing
of CUDA applications running in containers, eventually leading to its
termination by the timeout alarm.

a85f488595
criu/plugin: Introduce new plugin hooks PAUSE_DEVICES and CHECKPOINT_DEVICES to be used during pstree collection

This problem can be reproduced with the following example:

sudo podman run -d --rm \
        --device nvidia.com/gpu=all --security-opt=label=disable \
        quay.io/radostin/cuda-counter

sudo podman container checkpoint -l -e /tmp/checkpoint.tar

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Andrei Vagin
21108b40de test/zdtm: mount a new tmpfs to the zdtm root /dev
The current file system can be mounted with nodev.

Fixes #2441

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
fcbadfbdbf plugins: set executable bit on .so files
For historical reasons, some tools like rpm [1] or ldd [2,3]
may expect the executable bit to be present for the correct
identification of shared libraries. The executable bit on .so
files is set by default by compilers (e.g., GCC). It is not
strictly necessary but primarily a convention.

[1] https://docs.fedoraproject.org/en-US/package-maintainers/CommonRpmlintIssues/#unstripped_binary_or_object
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/ldd.bash.in;h=d6b640df;hb=HEAD#l154

[3] $ sudo ldd /usr/lib/criu/*.so
/usr/lib/criu/amdgpu_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/amdgpu_plugin.so'
	linux-vdso.so.1 (0x00007fd0a2a3e000)
	libdrm.so.2 => /lib64/libdrm.so.2 (0x00007fd0a29eb000)
	libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007fd0a29de000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fd0a27fc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fd0a2a40000)
/usr/lib/criu/cuda_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/cuda_plugin.so'
	linux-vdso.so.1 (0x00007f1806e13000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f1806c08000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f1806e15000)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
5783706d57 docs: update amdgpu-plugin man page
This patch updates the dependencies section of the AMDGPU plugin man
page to reflect that the plugin has been merged upstream and to fix a
formatting issue.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Florian Weimer
089345f77a Adjust to glibc __rseq_size semantic change
In commit 2e456ccf0c34a056e3ccafac4a0c7effef14d918 ("Linux: Make
__rseq_size useful for feature detection (bug 31965)") glibc 2.40
changed the meaning of __rseq_size slightly: it is now the size
of the active/feature area (20 bytes initially), and not the size
of the entire initially defined struct (32 bytes including padding).
The reason for the change is that the size including padding does not
allow detection of newly added features while previously unused
padding is consumed.

The prep_libc_rseq_info change in criu/cr-restore.c is not necessary
on kernels which have full ptrace support for obtaining rseq
information because the code is not used.  On older kernels, it is
a correctness fix because with size 20 (the new value), rseq
registeration would fail.

The two other changes are required to make rseq unregistration work
in tests.

Signed-off-by: Florian Weimer <fweimer@redhat.com>
2024-09-11 16:02:11 -07:00
Bui Quang Minh
b9081ca56b zdtm: make cgroup testcases run non-parallel
cgroup testcases live in the same cgroup root zdtmtst and
zdtmtst.defaultroot controller then create child subgroup for testing. This
can cause problems when cgroup testcases run in parallel. For example,
testcase A dumps the child subgroup of testcase B since it's in the cgroup
root but in the middle of restoring of testcase A, testcase B completes and
cleans up the subgroup directory. This causes error in testcase A restore.
This commit adds excl flag to all cgroup testcases description so that
these don't run parallel.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
4f45572fde util: use close_range when it's supported
close_range is faster than reading /proc/self/fd and closing descriptors
one by one.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
42b177da62 scripts/build: drop centos 7 targets
The CI tests with CentOS 7 have been disabled and removed [1,2].
This patch removes the obsolete Makefile targets for these tests.

[1] 24bc083653
[2] f8466ca798

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Andrei Vagin
1815838191 vdso: proxify the __vdso_clock_gettime64 function
It was added in v5.3-rc1~211^2~4^2~10.

Fixes #2390

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
ac22aaf576 apparmor: get_suspend_policy must return NULL in error cases
Before this fix, it could return MAP_FAILED which is ((void *) -1).

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
71999d8883 cgroupd: unblock SIGTERM to make stop_cgroupd actually work
Sometimes due to sigblockmask inheritance cgroupd can inherit SIGTERM
blocked. That will lead cgroupd ignoring SIGTERM from stop_cgroupd() and
CRIU will get stuck due to waiting for never-stopping cgroupd.

I see this happening in lxc-checkpoint, also saw this in OpenVZ jenkins
on cgroup_inotify00 test.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
Liu Hua
daed6c3535 irmap: duplicate string in irmap_scan_path_add
Duplicate string in irmap_scan_path_add, otherwise it will free before
parsing next configuration input.

[ avagin: handle errors of xstrdup ]

Signed-off-by: Liu Hua <weldonliu@tencent.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
b169e3b63d plugins/cuda: fix crosscompilation
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Pratyush Yadav
ca971b7f8b compel: fix build on Amazon Linux 2 due to missing PTRACE_ARCH_PRCTL
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:

    compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
    compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
    276 |   if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
        |              ^~~~~~~~~~~~~~~~~

While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:

    We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
    that most definitions come from either one or another. Alas, on
    Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
    there is no way they can be used together. Let's rely on libc one.

Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.

Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.

Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
2024-09-11 16:02:11 -07:00
Jesus Ramos
bf417dd050 criu/plugin: Add NVIDIA CUDA plugin
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an
r555 or higher driver along with the cuda-checkpoint binary.

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-09-11 16:02:11 -07:00
Jesus Ramos
5f486d5aee criu/plugin: Introduce new plugin hooks PAUSE_DEVICES and CHECKPOINT_DEVICES to be used during pstree collection
PAUSE_DEVICES is called before a process is frozen and is used by the CUDA
plugin to place the process in a state that's ready to be checkpointed and
quiesce any pending work

CHECKPOINT_DEVICES is called after all processes in the tree have been frozen
and PAUSE'd and performs the actual checkpointing operation for CUDA
applications

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-09-11 16:02:11 -07:00
Jesus Ramos
1012e542e5 criu: Restore rseq_cs state slightly earlier in the restore sequence and run the plugin finalizer later in the dump sequence
Restore rseq_cs state before calling RESUME_DEVICES_LATE as the CUDA plugin will
temporarily unfreeze a thread during the plugin hook to assist with device
restore

Run the plugin finalizer later in the dump sequence since the finalizer is used
by the CUDA plugin to handle some process cleanup

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
7ac4537069 readme: update link to FAQ page
The current link opens a page with the following text:

    The MediaWiki FAQ can be found at:
    https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
4f15fe8c59 make: improve check for externally managed Python
Move PYTHON_EXTERNALLY_MANAGED and PIP_BREAK_SYSTEM_PACKAGES
into Makefile.install to avoid code duplication. In addition, add
PIPFLAGS variable to enable specifying pip options during installation.
This is particularly useful for packaging, where it is common for `pip install`
to run in an environment with pre-installed dependencies and without internet
access. In such environment, we need to specify the following options:

    --no-build-isolation --no-index --no-deps

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Adrian Reber
fdf546dbd5 ci: upgrade to Fedora 40 Vagrant images (38 is EOL)
Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Bhavik Sachdev
f171649264 test/dump-crash: check code path when dump crashes
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2024-09-11 16:02:11 -07:00
Bhavik Sachdev
a252a240c3 zdtm: Distinguish between fail and crash of dump
Adds a exit_signal static method to criu_cli, criu_config and criu_rpc
used to detect a crash.

Fixes: #350

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
6feb57a840 ci: remove CentOS Stream 8 test (EOL)
Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
1da29f27f6 zdtm: add support for LD_PRELOAD tests
This commit adds a `--preload-libfault` option to ZDTM's run command.
This option runs CRIU with LD_PRELOAD to intercept libc functions
such as pread(). This method allows to simulate special cases,
for example, when a successful call to pread() transfers fewer
bytes than requested.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Andrei Vagin
e7276cf63b pagemap-cache: handle short reads
It is possible for pread() to return fewer number of bytes than
requested. In such case, we need to repeat the read operation
with appropriate offset.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Andrei Vagin
cc88b1e1ff net: Fix TOCTOU race condition in unix_conf_op
The unix_conf_op function reads the size of the sysctl entry array
twice. gcc thinks that it can lead to a time-of-check to time-of-use
(TOCTOU) race condition if the array size changes between the two reads.

Fixes #2398

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Alexander Mikhalitsyn
457bc6a8ff criu: use proper format-specified to accommodate time_t 64-bit change
See also:
https://wiki.debian.org/ReleaseGoals/64bit-time

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-09-11 16:02:11 -07:00
Arnav Bhatt
95f66d13db criu: move sigact dump/restore code into sigact.c
Seperate sigact dump/restore code from cr-restore.c and parasite-syscall.c into sigact.c

Signed-off-by: Arnav Bhatt <arnav@ghativega.in>
2024-09-11 16:02:11 -07:00
Adrian Reber
9c8a6927aa ci: update check for SELinux
The rawhide tests runs in a container. Containers always have SELinux
disabled from the inside. Somehow /sys/fs/selinux is now mounted. We
used the existence of that directory if SELinux is available. This seems
to be no longer true.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
b3c3422cd9 test/make: remove unused target
A fault-injection test was introduced in commit [1] and later removed in
commit [2]. This patch removes the obsolete Makefile target.

[1] b95407e264
    test: check, that parasite can rollback itself (v2)

[2] 2cb4532e26
    tests: remove zdtm.sh (v2)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
30aa8dbe4d mount: fix unbounded write
Replace sprintf() with snprintf() and specify maximum length of
characters to avoid potential overflow.

Reported-by: GitHub CodeQL (https://codeql.github.com/)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Juntong Deng
708f872a6d sk-tcp: Add test cases for TCP_CORK and TCP_NODELAY socket options
Currently there are no socket option test cases for TCP_CORK and
TCP_NODELAY, this commit adds related test cases.

The socket option test cases for TCP_KEEPCNT, TCP_KEEPIDLE, and
TCP_KEEPINTVL already exist in socket-tcp_keepalive.c, so they are
not included in this test case.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
2024-09-11 16:02:11 -07:00
Juntong Deng
9ba9aff77f sk-tcp: Move TCP socket options from SkOptsEntry to TcpOptsEntry
Currently some TCP socket option information is stored in SkOptsEntry,
which is a little confusing.

SkOptsEntry should only contain socket options that are common to
all sockets.

In this commit move the TCP-specific socket options from SkOptsEntry
to TcpOptsEntry.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
2024-09-11 16:02:11 -07:00
Juntong Deng
1cb75c0b1e sk-tcp: Move TCP socket options from TcpStreamEntry to TcpOptsEntry
Currently some of the TCP socket option information is stored in the
TcpStreamEntry, but the information in the TcpStreamEntry is only
restored after the TCP socket has established connection, which
results in these TCP socket options not being restored for
unconnected TCP sockets.

In this commit move the TCP socket options from TcpStreamEntry to
TcpOptsEntry and add dump_tcp_opts() and restore_tcp_opts() for TCP
socket options dump and restore.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
2024-09-11 16:02:11 -07:00
Kir Kolyshkin
13854a988c criu: fix a fatal failure if nft doesn't work
On some systems, nft binary might not be installed, or some kernel
options might be unconfigured, resulting in something like this:

	sudo unshare -n nft create table inet CRIU
	Error: Could not process rule: Operation not supported
	create table inet CRIU
	^^^^^^^^^^^^^^^^^^^^^^^

This is similar to what kerndat_has_nftables_concat() does, and if the
outcome is the same, it returns an error to kerndat_init(), and an error
from kerndat_init() is considered fatal.

Let's relax the check, returning mere "feature not working" instead of
a fatal error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
df178c7e53 sk-tcp: cleanup dump_tcp_conn_state error handling
1) In dump_tcp_conn_state, if return from libsoccr_save is >=0, we check
that sizeof(struct libsoccr_sk_data) returned from libsoccr_save is
equal to sizeof(struct libsoccr_sk_data) we see in dump_tcp_conn_state
(probably to check if we use the right library version). And if sizes
are different we go to err_r, which just returns ret, which can
teoretically be 0 (if size in library is zero) and that would lead
dump_one_tcp treat this as success though it is obvious error.

2) In case of dump_opt or open_image fails we don't explicitly set ret
and rely that sizeof(struct libsoccr_sk_data) previously set to ret is
not 0, I don't really like it, it makes reading code too complex.

3) We have a lot of err_* labels which do exactly the same thing, there
is no point in having all of them, also it is better to choose the name
of the label based on what it really does.

So let's refactor error handling to avoid these inconsistencies.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
4607b53566 mem: optimize debug logging of enqueued pages
During restore, CRIU prints "Enqueue page-read" messages for
each page-read request [1]. However, this message does not
provide useful information, increases performance overhead
during restore and the size of log file.

$ ./zdtm.py run -t zdtm/static/maps06 -f h -k always
$ grep 'Enqueue page-read' dump/zdtm/static/maps06/56/1/restore.log | wc -l
20493

This commit replaces these log messages with a single message
that shows the number of enqueued page-read requests.

$ grep 'enqueued' dump/zdtm/static/maps06/56/1/restore.log
(00.061449)     56: nr_enqueued:   20493

[1] 91388fc

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
f4290868bb ci/vdso01: fix typo
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
e68a06cfd1 ci: update actions/checkout to v4
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
5aaf450213 ci: update base OS to ubuntu 22.04
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Andrei Vagin
1c2a3d7faa check: verify ino and dev of overlayfs files in /proc/pid/maps
Check that the file device and inode shown in /proc/pid/maps match
values returned by stat(2).

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Kir Kolyshkin
e07ffa04b0 Makefile.config: fix/improve feature warnings.
1. Tell which RPMs or DEBs are required in all cases.

2. Use $(info ...) everywhere.

3. Drop extra nested $(info), instead use (a document) a simpler kludge.

4. Simplify and unify the language, add missing periods.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
af4058871e timer: fix wrapping allignment in function declaration
Currently we have tabs + spaces on the wrapped line but the wrapped part
is not alligned to the opening bracket.

Fixes: bbe26d1b7 ("timer: fix allignment in function definition")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
0fc83a79b1 ci: silence CircleCI warning about deprecated image
CircleCI currently prints out the following warning:

   This job is using a deprecated image 'ubuntu-2004:202010-01', please update to a newer image

According to https://discuss.circleci.com/t/linux-image-deprecations-and-eol-for-2024/
the recommended image name is: "image: default"

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
ccccrrrr
52623cca16 criu: move timers dump/restore code into separate file
Fixes: #335

Signed-off-by: ccccrrrr <zcr1006@gmail.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
231ba0cd29 zdtm/sched_policy00: use reset-on-fork flag
This patch extends the sched_policy00 test case to verify that
the SCHED_RESET_ON_FORK flag is restored correctly.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
75fed59ef6 Add support for reset-on-fork scheduling flag
This patch extends CRIU with support for SCHED_RESET_ON_FORK.
When the SCHED_RESET_ON_FORK flag is set, the following rules
apply for subsequently created children:

- If the calling thread has a scheduling policy of SCHED_FIFO or
SCHED_RR, the policy is reset to SCHED_OTHER in child processes.

- If the calling process has a negative nice value, the nice value
is reset to zero in child processes.

(See 'man 7 sched')

Fixes: #2359

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Artem Trushkin
8f0e200e66 mem: fix some VMAs being incorrectly mapped wtih PROT_WRITE
A memory interval is a half-open interval, so the condition
when pr->pe->vaddr == vma->e->end should not be interpreted
as an intersection and should cause vma to be marked with VMA_NO_PROT_WRITE.

Fixes: #2364

Signed-off-by: Artem Trushkin <at.120@ya.ru>
2024-09-11 16:02:11 -07:00
Adrian Reber
a2b018a188 ci: try to fix broken docker test
Upgrade to 22.04 base image and use the existing version of docker.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
a48aa33eaa restorer: shstk: implement shadow stack restore
The restore of a task with shadow stack enabled adds these steps:

* switch from the default shadow stack to a temporary shadow stack
  allocated in the premmaped area
* unmap CRIU mappings; nothing changed here, but it's important that
  CRIU mappings can be removed only after switching to a temporary
  shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
7dd5830023 restore: add infrastructure to enable shadow stack
There are several gotachs when restoring a task with shadow stack:
* depending on the compiler options, glibc version and glibc tunables
  CRIU can run with or without shadow stack.
* shadow stack VMAs are special, they must be created using a dedicated
  map_shadow_stack() system call and can be modified only by a special
  instruction (wrss) that is only available when shadow stack is
  enabled.
* once shadow stack is enabled, it is not writable even with wrss;
  writes to shadow stack can be only enabled with ptrace() and only when
  shadow stack is enabled in the tracee.
* if the shadow stack is enabled during restore rather than by glibc,
  calling retq after arch_prctl() that enables the shadow stack causes
  #CP, so the function that enables shadow stack can never return.

Add the infrastructure required to cope with all of those:

* modify the restore code to allow trampoline (arch_shstk_trampoline)
  that will enable shadow stack and call restore_task_with_children().
* add call to arch_shstk_unlock() right after the tasks are clone()ed;
  this will allow unlocking shadow stack features and making shadow
  stack writable.
* add stubs for architectures that do not support shadow stacks
* add implementation of arch_shstk_trampoline() and arch_shstk_unlock()
  for x86, but keep it disabled; it will be enabled along with addtion
  of the code that will restore shadow stack in the restorer blob

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
f47899c9ef criu: kerndat: add kdat_has_shstk()
Detect if CRIU runs with shadow stack enabled and store the result in
kerndat.

Unlike most kerndat knobs, kdat_has_shstk() does not check for
availability of the shadow stack in the kernel, but rather checks if
criu runs with shadow stack enabled.

This depends on hardware availabilty, kernel and glibc support, compiler
options and glibc tunables, so kdat_has_shstk() must be called every
time CRIU starts and its result cannot be cached.

The result will be used by the code that controls shadow stack
enablement in the next commit.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
2ebd1a4f0b criu: shstk: prepare shadow stack parameters for restorer blob
Shadow stacks must be populated using special WRSS instruction. This
instruction is only available when shadow stack is enabled, calling it
with disabled shadow stack causes #UD.

Moreover, shadow stack VMAs cannot be mremap()ed and they must be
created using map_shadow_stack() system call. This requires delaying the
restore of shadow stacks to restorer blob after the CRIU mappings are
cleared.

Introduce rst_shstk_info structure to hold shadow stack parameters
required in the restorer blob and populate this structure in
arch_prepare_shstk() method.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
4b6dda7ec0 criu: shstk: premap and prepopulate shadow stack VMAs
Shadow stack VMAs cannot be mmap()ed, they must be created using
map_shadow_stack() system call and populated using special wrss
instruction available only when shadow stack is enabled.

Premap them to reserve virtual address space and populate it to have
there contents available for later copying after enabling shadow stack.

Along with the space required by shadow stack VMAs also reserve an extra
page that will be later used as a temporary shadow stack.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
17eda3ce57 criu: shstk: add VMA_AREA_SHSTK flag
The shadow stack VMAs require special care because they can only be
created and populated using special system calls.

Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in
/proc/pid/smaps

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
0aba3dcfa1 compel: shstk: prepare shadow stack signal frame
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.

Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.

Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
63a45e1c8a compel: infect: prepare parasite_service() for addition of CET support
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.

In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.

For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.

In non-daemon mode parasite_trap_cmd() returns non-positive value
which allows to distinguish daemon and non-daemon mode and properly stop
at int3 in non-daemon mode.

Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
6e491a19a3 compel: shstk: save CET state when CPU supports it
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
17f4dd0959 compel: always pass user_fpregs_struct_t to compel_get_task_regs()
All architectures create on-stack structure for floating point save area
in compel_get_task_regs() if the caller passes NULL rather than a valid
pointer.

The only place that calls compel_get_task_regs() with NULL for floating
point save area is parasite_start_daemon() and it is simpler to define
this strucuture on stack of parasite_start_daemon().

The availability of floating point save data is required in
parasite_start_daemon() to detect shadow stack presence early during
parasite infection and will be used in later patches.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Mike Rapoport (IBM)
0b8c51eaad compiler: add ALIGN_DOWN macro
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-09-11 16:02:11 -07:00
Stepan Pieshkin
f590c2b638 zdtm/static: check that cgroup layout of threads is preserved
Co-developed-by: Stepan Pieshkin <stepanpieshkin@google.com>
Signed-off-by: Stepan Pieshkin <stepanpieshkin@google.com>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Stepan Pieshkin
a0a6ec3dc0 cgroup: Add support for restoring a thread in a correct v1 cgroup
Currently we have checkpoint/restore support only of cgroup v2 threaded
controllers. Threads originating in cgroup v1 environments will be
restored to the main thread's cgroup. This change extends the support
for a cgroups v1.

Signed-off-by: Stepan Pieshkin <stepanpieshkin@google.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
835afb1b88 criu-ns: fix lint error
This patch fixes the following lint error:
scripts/criu-ns:219:16: E713 [*] Test for membership should be `not in`

The change in this patch is auto-generated with `ruff --fix`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
e0b74f558b make: replace flake8 with ruff
Ruff (https://github.com/astral-sh/ruff) is a Python linter
written in Rust, designed to replace Flake8. It is significantly
faster and actively maintained.

In addition to replacing flake8 with ruff, this patch also
creates separate makefile targets for ruff, shellcheck and
codespell, so that they can be tested independently.

RUFF_FLAGS can be used to specify options such as '--fix'.
Example:
	make lint
	make ruff RUFF_FLAGS=--fix

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
7fd4a15e68 pb2dict: fix flake8 error
This patch fixes the following flake8 error:
python3 -m flake8 --config=scripts/flake8.cfg lib/pycriu/images/pb2dict.py
lib/pycriu/images/pb2dict.py:361:43: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
e0f91e66ee kerndat: check support for PAGE_IS_SOFT_DIRTY
The commit introducing PAGE_IS_SOFT_DIRTY has not been merged
in kernel v6.7.x.

fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
e6a9a2cbc1

As a result, CRIU fails with the following error:

Error (criu/pagemap-cache.c:199): pagemap-cache: PAGEMAP_SCAN: Invalid argument'
Error (criu/pagemap-cache.c:225): pagemap-cache: Failed to fill cache for 63 (400000-402000)'

This patch updates check_pagemap() in kerndat to check if PAGE_IS_SOFT_DIRTY is supported.
Fixes: #2334

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
a808f09bea amdgpu_plugin: fix lint errors
$ make lint
 ...
 # Do not append \n to pr_perror, pr_pwarn or fail
 ! git --no-pager grep -E '^\s*\<(pr_perror|pr_pwarn|fail)\>.*\\n"'
 plugins/amdgpu/amdgpu_plugin.c:		pr_perror("%s(), Can't handle VMAs of input device\n", __func__);

 ! git --no-pager grep -En '^\s*\<pr_(err|warn|msg|info|debug)\>.*);$' | grep -v '\\n'
 plugins/amdgpu/amdgpu_plugin_drm.c:45:		pr_err("Error in getting stat for: %s", path);
 plugins/amdgpu/amdgpu_plugin_util.c:77:		pr_err("Unable to read file (read:%ld buf_len:%ld)", len_read, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:89:		pr_err("Unable to write file (wrote:%ld buf_len:%ld)", len_write, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:120:		pr_err("%s: Failed to open for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:126:		pr_err("%s: Failed get pointer for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:136:		pr_err("%s:Failed to access file size", path);
 plugins/amdgpu/amdgpu_plugin_util.c:152:		pr_err("Cannot fopen %s", file_path);

 make: *** [Makefile:470: lint] Error 1

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
bd17bd43e3 sk-inet: fix codding style in restore_ip_opts
Commit [1] introduced codding-style breackage, let's fix it.

Fixes: 66cab1f49 ("sk-inet: Added IP_TTL socket option") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
rahulk789
895a16c13c zdtm: Added tests for IP_TTL restore
Signed-off-by: rahulk789 <rahul.u.india@gmail.com>
2024-09-11 16:02:11 -07:00
rahulk789
71102e7f73 sk-inet: Added IP_TTL socket option
Signed-off-by: rahulk789 <rahul.u.india@gmail.com>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
0d5923c95e amdgpu_plugin: Refactor code used to implement Checkpoint
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
733ef96315 amdgpu_plugin: Refactor code in preparation to support C&R for DRM devices
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
b689a6710c plugin/amdgpu: Also don't print 'plugin failed' in criu
We already don't treat it as error in the plugin itself, but after
returning -1 from RESUME_DEVICES_LATE hook we print debug message in
criu about failed plugin, let's return 0 instead.

While on it let's replace ret to exit_code.

Fixes: a9cbdad76 ("plugin/amdgpu: Don't print error for "No such process" during resume")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
David Francis
59599dacdd plugin/amdgpu: Don't print error for "No such process" during resume
During the late stages of restore, each process being resumed gets
an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd
process info, this call with fail with -ESRCH. This is normal
behaviour, so we shouldn't print an error message for it.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
92e8f9293b net: return bool with iptable_has_criu_jump_target
To improve readability, this patch changes the return type of
iptables_has_criu_jump_target() to a boolean, where 'true' indicates
that iptables has CRIU jump target and 'false' indicates otherwise.

Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
a62f827302 criu-log: remove unused declaration
This patch removes a leftover declaration for log_closedir()
which has been removed in the following commit:

dc80d6f125
log: get rid of LOG_DIR_FD_OFF and opening cwd in log_init()

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
d2511707fa zdtm: socket-tcp-nft-nfconntrack: add a hook to the chain in nft case
Let's use hooked nft chain which actually affects packets.

Fixes: e5f4d8c6f ("test/nfconntrack: use nft or iptables-legacy")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
afc0efcf78 pagemap-cache: add an ability to run tests without PAGEMAP_SCAN
This change adds a new injectable fault (135) to disable PAGEMAP_SCAN and fault
back to read pagemap files.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
cb64d73ada page-cache: use the PAGEMAP_SCAN ioctl when it is available
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
20628bc8a1 kerndat: check the PAGEMAP_SCAN ioctl
PAGEMAP_SCAN is a new ioctl that allows to get page attributes in a more
effeciant way than reading pagemap files.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
842289c7eb net: add error messages for restore of nftables
Show appropriate error messages when restore of nftables fails.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
d94251df75 test/nfconntrack: use nft or iptables-legacy
nft does not support xtables compat expressions
https://git.netfilter.org/nftables/commit/?id=79195a8cc9e9d9cf2d17165bf07ac4cc9d55539f

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
0ab2f9e976 net: fix network unlock with iptables-nft
When iptables-nft is used as backend for iptables, the rules for
network locking are translated into the following nft rules:

```
$ iptables-restore-translate -f lock.txt
add table ip filter
add chain ip filter CRIU
insert rule ip filter INPUT counter jump CRIU
insert rule ip filter OUTPUT counter jump CRIU
add rule ip filter CRIU mark 0xc114 counter accept
add rule ip filter CRIU counter drop
```

These rules create the following chains:

```
table ip filter { # handle 1
	chain CRIU { # handle 1
		meta mark 0x0000c114 counter packets 16 bytes 890 accept # handle 6
		counter packets 1 bytes 60 drop # handle 7
		meta mark 0x0000c114 counter packets 0 bytes 0 accept # handle 8
		counter packets 0 bytes 0 drop # handle 9
	}

	chain INPUT { # handle 2
		type filter hook input priority filter; policy accept;
		counter packets 8 bytes 445 jump CRIU # handle 3
		counter packets 0 bytes 0 jump CRIU # handle 10
	}

	chain OUTPUT { # handle 4
		type filter hook output priority filter; policy accept;
		counter packets 9 bytes 505 jump CRIU # handle 5
		counter packets 0 bytes 0 jump CRIU # handle 11
	}
}
```

In order to delete the CRIU chain, we need to first delete all four
jump targets. Otherwise, `-X CRIU` would fail with the following error:

iptables-restore v1.8.10 (nf_tables):
line 5: CHAIN_DEL failed (Resource busy): chain CRIU

Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
robert
d9c427d70c irmap: hardcode some more interesting paths
Signed-off-by: robert <robert.ayrapetyan@gmail.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
b419f3dfdc make: fix compilation on alpine
Starting with the musl v1.2.4~69, _GNU_SOURCE doesn't set _LARGEFILE64_SOURCE.

Fixes #2313
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
7b689b7a42 gitignore: remove historical left-over files
In commit [1] was introduced a mechanism to auto-generate the files:
sys-exec-tbl*.c, syscalls*.S, syscall-codes*.h, and syscall*.h.
This commit also updated the gitignore rules to ignore auto-generated
files. However, after commit [2], the path for these files has changed
and the patterns specified in gitignore are no longer needed.

[1] bbc2f133 (x86/build: generate syscalls-{64,32}.built-in.o)
[2] 19fadee9 (compel: plugins,std -- Implement syscalls in std plugin)

Reported-by: @felicitia
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Adrian Reber
2d1f4ec719 ci: disable non-root in user namespace test in container
Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
fe8f5130c2 ci: fix centos-stream 9 ci errors
The image has a too old version of nettle which does not work with gnutls.
Just upgrade to the latest to make the error go away.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
6679d60ffd ci: do not use 'tail' for skip-file-rwx-check test
Newer versions of 'tail' rely on inotify and after a restore 'tail' is
unhappy with the state of inotify and just stops.

This replaces 'tail' with a minimal shell based test (thanks Andrei).

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Alexander Mikhalitsyn
f86f1b8491 tty: skip ioctl(TIOCSLCKTRMIOS) if possible
If ioctl(TIOCSLCKTRMIOS) fails with EPERM it means that a CRIU
process lacks of CAP_SYS_ADMIN capability. But we can use
ioctl(TIOCGLCKTRMIOS) to *read* current ->termios_locked
value from the kernel and if it's the same as we already have
we can skip failing ioctl(TIOCSLCKTRMIOS) safely.

Adrian has recently posted [1] a very good patch to allow ioctl(TIOCSLCKTRMIOS)
for processes that have CAP_CHECKPOINT_RESTORE (right now it requires CAP_SYS_ADMIN).

[1] https://lore.kernel.org/all/20231206134340.7093-1-areber@redhat.com/

Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-09-11 16:02:11 -07:00
Ivan A. Melnikov
8a51639e3d Makefile: Use common warnings settings for loongarch64
WARNINGS variable should be amended, not redefined.
We still need, e.g.,  `-Wno-dangling-pointer` to build
criu on loongarch64 with gcc13.

Signed-off-by: Ivan A. Melnikov <iv@altlinux.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
37d62fa475 docker-test: downgrade docker to v24.0.7
Checkpoint/restore with version 25.0.0-beta.1 fails
with the following error:

$ docker start --checkpoint=c1 cr
Error response from daemon: failed to create task for container: content digest fdb1054b00a8c07f08574ce52198c5501d1f552b6a5fb46105c688c70a9acb45: not found: unknown

Release notes:
https://github.com/moby/moby/discussions/46816

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
1004625fac docker-test: fix condition for max tries
Replace a recursive call with a loop.

Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Adrian Reber
088390ea89 ci: switch to permissive selinux mode during test
Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
900909d95e test: check for btrfs in the current directory
The old test was checking if '/' is btrfs but we should check if the
current directory is btrfs.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
Adrian Reber
fc94b2d158 ci: fix rawhide netlink error
The rawhide netlink errors are fixed with a newer kernel than the
default 6.2 available in Fedora 38.

Signed-off-by: Adrian Reber <areber@redhat.com>
2024-09-11 16:02:11 -07:00
sally kang
9f9737c800 comple: correct the syscall number of bind on ARM64
In the compel/arch/arm/plugins/std/syscalls/syscall.def, the syscall number of bind on ARM64 should be 200 instead of 235

Signed-off-by: Sally Kang <snapekang@gmail.com>
2024-09-11 16:02:11 -07:00
500 changed files with 22927 additions and 3596 deletions

View file

@ -2,7 +2,7 @@ version: 2.1
jobs:
test-local-gcc:
machine:
image: ubuntu-2004:202010-01
image: default
working_directory: ~/criu
steps:
- checkout
@ -11,7 +11,7 @@ jobs:
command: sudo -E make -C scripts/ci local
test-local-clang:
machine:
image: ubuntu-2004:202010-01
image: default
working_directory: ~/criu
steps:
- checkout

View file

@ -13,9 +13,8 @@ task:
nested_virtualization: true
setup_script: |
scripts/ci/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
contrib/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
sudo kvm-ok
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
build_script: |
make -C scripts/ci vagrant-fedora-no-vdso
@ -33,10 +32,12 @@ task:
memory: 8G
setup_script: |
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
dnf config-manager --set-enabled crb # Same as CentOS 8 powertools
dnf -y install epel-release epel-next-release
dnf -y install --allowerasing asciidoc gcc git gnutls-devel libaio-devel libasan libcap-devel libnet-devel libnl3-devel libbsd-devel libselinux-devel make protobuf-c-devel protobuf-devel python-devel python-PyYAML python-protobuf python-junit_xml python3-importlib-metadata python-flake8 xmlto libdrm-devel
contrib/dependencies/dnf-packages.sh
# The image has a too old version of nettle which does not work with gnutls.
# Just upgrade to the latest to make the error go away.
dnf -y upgrade nettle nettle-devel
systemctl stop sssd
# Even with selinux in permissive mode the selinux tests will be executed.
# The Cirrus CI user runs as a service from selinux point of view and is
@ -62,9 +63,8 @@ task:
nested_virtualization: true
setup_script: |
scripts/ci/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
contrib/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
sudo kvm-ok
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
build_script: |
make -C scripts/ci vagrant-fedora-rawhide
@ -83,67 +83,11 @@ task:
nested_virtualization: true
setup_script: |
scripts/ci/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
contrib/apt-install make gcc pkg-config git perl-modules iproute2 kmod wget cpu-checker
sudo kvm-ok
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
build_script: |
make -C scripts/ci vagrant-fedora-non-root
task:
name: CentOS Stream 8 based test
environment:
HOME: "/root"
CIRRUS_WORKING_DIR: "/tmp/criu"
compute_engine_instance:
image_project: centos-cloud
image: family/centos-stream-8
platform: linux
cpu: 4
memory: 8G
setup_script: |
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
# Do not fail if latest epel repository definition is already installed
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm || :
yum install -y dnf-plugins-core
yum config-manager --set-enabled powertools
yum install -y --allowerasing asciidoc gcc git gnutls-devel libaio-devel libasan libcap-devel libnet-devel libnl3-devel libbsd-devel libselinux-devel make protobuf-c-devel protobuf-devel python3-devel python3-flake8 python3-PyYAML python3-protobuf python3-importlib-metadata python3-junit_xml xmlto libdrm-devel
alternatives --set python /usr/bin/python3
systemctl stop sssd
# Even with selinux in permissive mode the selinux tests will be executed
# The Cirrus CI user runs as a service from selinux point of view and is
# much more restricted than a normal shell (system_u:system_r:unconfined_service_t:s0)
# The test case above (vagrant-fedora-no-vdso) should run selinux tests in enforcing mode
setenforce 0
build_script: |
make -C scripts/ci local SKIP_CI_PREP=1 CC=gcc CD_TO_TOP=1 ZDTM_OPTS="-x zdtm/static/socket-raw"
task:
name: aarch64 build GCC (native)
arm_container:
image: docker.io/library/ubuntu:jammy
cpu: 4
memory: 4G
script: uname -a
build_script: |
scripts/ci/apt-install make
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
make -C scripts/ci local
task:
name: aarch64 build CLANG (native)
arm_container:
image: docker.io/library/ubuntu:jammy
cpu: 4
memory: 4G
script: uname -a
build_script: |
scripts/ci/apt-install make
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
make -C scripts/ci local CLANG=1
task:
name: aarch64 Fedora Rawhide
arm_container:
@ -153,6 +97,5 @@ task:
script: uname -a
build_script: |
scripts/ci/prepare-for-fedora-rawhide.sh
ln -sf /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto
make -C scripts/ci/ local CC=gcc SKIP_CI_PREP=1 SKIP_CI_TEST=1 CD_TO_TOP=1
make -C test/zdtm -j 4

View file

@ -1,3 +1,3 @@
[codespell]
skip = ./.git,./test/pki
ignore-words-list = creat,fpr,fle,ue,bord,parms,nd,te,testng,inh,wronly,renderd,bui,clen
skip = ./.git,./test/pki,./tags,./plugins/amdgpu/amdgpu_drm.h,./plugins/amdgpu/drm.h,./plugins/amdgpu/drm_mode.h
ignore-words-list = creat,fpr,fle,ue,bord,parms,nd,te,testng,inh,wronly,renderd,bui,clen,sems

34
.github/workflows/aarch64-test.yaml vendored Normal file
View file

@ -0,0 +1,34 @@
name: aarch64 test
on: [push, pull_request]
# Cancel any preceding run on the pull request.
concurrency:
group: aarch64-test-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/criu-dev' }}
jobs:
build:
strategy:
matrix:
os: [ubuntu-24.04-arm, ubuntu-22.04-arm]
target: [GCC=1, CLANG=1]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Run Tests ${{ matrix.target }} on ${{ matrix.os }}
# Following tests are failing on the VMs:
# ./change_mnt_context --pidfile=change_mnt_context.pid --outfile=change_mnt_context.out
# 45: ERR: change_mnt_context.c:23: mount (errno = 22 (Invalid argument))
#
# In combination with '--remote-lazy-pages' following error occurs:
# 138: FAIL: maps05.c:84: Data corrupted at page 1639 (errno = 11 (Resource temporarily unavailable))
run: |
# The 'sched_policy00' needs the following:
sudo sysctl -w kernel.sched_rt_runtime_us=-1
# etc/hosts entry is needed for netns_lock_iptables
echo "127.0.0.1 localhost" | sudo tee -a /etc/hosts
sudo -E make -C scripts/ci local ${{ matrix.target }} RUN_TESTS=1 \
ZDTM_OPTS="-x zdtm/static/change_mnt_context -x zdtm/static/maps05"

View file

@ -9,12 +9,13 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
strategy:
matrix:
os: [ubuntu-22.04, ubuntu-22.04-arm]
target: [GCC=1, CLANG=1]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Alpine ${{ matrix.target }} Test
run: sudo -E make -C scripts/ci alpine ${{ matrix.target }}

View file

@ -9,8 +9,8 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Arch Linux Test
run: sudo -E make -C scripts/ci archlinux

View file

@ -12,14 +12,14 @@ jobs:
# Check if pull request does not have label "not-selfcontained-ok"
if: "!contains(github.event.pull_request.labels.*.name, 'not-selfcontained-ok')"
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
with:
# Needed to rebase against the base branch
fetch-depth: 0
# Checkout pull request HEAD commit instead of merge commit
ref: ${{ github.event.pull_request.head.sha }}
- name: Install dependencies
run: sudo apt-get install -y libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler python3-protobuf libnl-3-dev libnet-dev libcap-dev
run: sudo contrib/apt-install libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler python3-protobuf libnl-3-dev libnet-dev libcap-dev uuid-dev
- name: Configure git user details
run: |
git config --global user.email "checkpoint-restore@users.noreply.github.com"

View file

@ -29,22 +29,22 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install Packages (cpp)
if: ${{ matrix.language == 'cpp' }}
run: |
sudo scripts/ci/apt-install protobuf-c-compiler libprotobuf-c-dev libprotobuf-dev build-essential libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler python3-protobuf libnet-dev pkg-config libnl-3-dev libbsd0 libbsd-dev iproute2 libcap-dev libaio-dev libbsd-dev python3-yaml libnl-route-3-dev gnutls-dev
sudo contrib/apt-install protobuf-c-compiler libprotobuf-c-dev libprotobuf-dev build-essential libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler python3-protobuf libnet-dev pkg-config libnl-3-dev libbsd0 libbsd-dev iproute2 libcap-dev libaio-dev libbsd-dev python3-yaml libnl-route-3-dev gnutls-dev
- name: Initialize CodeQL
uses: github/codeql-action/init@v2
uses: github/codeql-action/init@v3
with:
languages: ${{ matrix.language }}
queries: +security-and-quality
- name: Autobuild
uses: github/codeql-action/autobuild@v2
uses: github/codeql-action/autobuild@v3
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v2
uses: github/codeql-action/analyze@v3
with:
category: "/language:${{ matrix.language }}"

View file

@ -9,13 +9,13 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
strategy:
matrix:
target: [GCC, CLANG]
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Compat Tests (${{ matrix.target }})
run: sudo -E make -C scripts/ci local COMPAT_TEST=y ${{ matrix.target }}=1

View file

@ -10,11 +10,11 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
target: [armv7-stable-cross, aarch64-stable-cross, ppc64-stable-cross, mips64el-stable-cross]
target: [armv7-stable-cross, aarch64-stable-cross, ppc64-stable-cross, mips64el-stable-cross, riscv64-stable-cross]
branches: [criu-dev, master]
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
with:
ref: ${{ matrix.branches }}
- name: Run Cross Compilation Targets

View file

@ -21,6 +21,7 @@ jobs:
aarch64-stable-cross,
ppc64-stable-cross,
mips64el-stable-cross,
riscv64-stable-cross,
]
include:
- experimental: true
@ -33,7 +34,7 @@ jobs:
target: mips64el-unstable-cross
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Cross Compilation Targets
run: >
sudo make -C scripts/ci ${{ matrix.target }}

View file

@ -12,8 +12,8 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-20.04]
os: [ubuntu-22.04]
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Docker Test (${{ matrix.os }})
run: sudo make -C scripts/ci docker-test

View file

@ -9,9 +9,9 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Fedora ASAN Test
run: sudo -E make -C scripts/ci fedora-asan

View file

@ -9,10 +9,10 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Fedora Rawhide Test
# We need to pass environment variables from the CI environment to
# distinguish between CI environments. However, we need to make sure that

View file

@ -9,10 +9,10 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Coverage Tests
run: sudo -E make -C scripts/ci local GCOV=1
- name: Run gcov

View file

@ -9,8 +9,8 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Java Test
run: sudo make -C scripts/ci java-test

View file

@ -14,9 +14,9 @@ jobs:
image: registry.fedoraproject.org/fedora:latest
steps:
- name: Install tools
run: sudo dnf -y install git make python3-flake8 xz clang-tools-extra which codespell git-clang-format ShellCheck
run: sudo dnf -y install git make ruff xz clang-tools-extra codespell git-clang-format ShellCheck
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set git safe directory
# https://github.com/actions/checkout/issues/760

View file

@ -11,5 +11,5 @@ jobs:
build:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- run: sudo make -C scripts/ci loongarch64-qemu-test

24
.github/workflows/nftables-test.yml vendored Normal file
View file

@ -0,0 +1,24 @@
name: Nftables bases testing
on: [push, pull_request]
# Cancel any preceding run on the pull request.
concurrency:
group: nftables-test-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/criu-dev' }}
jobs:
build:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- name: Remove iptables
run: sudo apt remove -y iptables
- name: Install libnftables-dev
run: sudo contrib/apt-install libnftables-dev
- name: chmod 755 /home/runner
# CRIU's tests are sometimes running as some random user and need
# to be able to access the test files.
run: sudo chmod 755 /home/runner
- name: Build with nftables network locking backend
run: sudo make -C scripts/ci local COMPILE_FLAGS="NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES"

View file

@ -9,8 +9,8 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run Podman Test
run: sudo make -C scripts/ci podman-test

View file

@ -9,9 +9,9 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run CRIU Image Streamer Test
run: sudo -E make -C scripts/ci local STREAM_TEST=1

View file

@ -9,8 +9,8 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run X86_64 CLANG Test
run: sudo make -C scripts/ci x86_64 CLANG=1

View file

@ -9,8 +9,8 @@ concurrency:
jobs:
build:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Run X86_64 GCC Test
run: sudo make -C scripts/ci x86_64

8
.gitignore vendored
View file

@ -20,17 +20,9 @@ compel/compel
compel/compel-host-bin
images/*.c
images/*.h
images/google/protobuf/*.c
images/google/protobuf/*.h
.gitid
criu/criu
criu/unittest/unittest
criu/arch/*/sys-exec-tbl*.c
# x86 syscalls-table is not generated
!criu/arch/x86/sys-exec-tbl.c
criu/arch/*/syscalls*.S
criu/include/syscall-codes*.h
criu/include/syscall*.h
criu/include/version.h
criu/pie/restorer-blob.h
criu/pie/parasite-blob.h

View file

@ -23,8 +23,3 @@ extraction:
- "python3-yaml"
- "libnl-route-3-dev"
- "gnutls-dev"
configure:
command:
- "ls -laR images/google"
- "ln -s /usr/include/google/protobuf/descriptor.proto images/google/protobuf/descriptor.proto"
- "ls -laR images/google"

View file

@ -1,35 +0,0 @@
language: c
os: linux
dist: bionic
services:
- docker
jobs:
include:
- os: linux
arch: ppc64le
env: TR_ARCH=local
dist: bionic
- os: linux
arch: ppc64le
env: TR_ARCH=local CLANG=1
dist: bionic
- os: linux
arch: s390x
env: TR_ARCH=local
dist: bionic
- os: linux
arch: arm64-graviton2
env: TR_ARCH=local RUN_TESTS=1
dist: focal
group: edge
virt: vm
- os: linux
arch: arm64-graviton2
env: TR_ARCH=local CLANG=1 RUN_TESTS=1
group: edge
virt: vm
dist: bionic
script:
- sudo make -C scripts/ci $TR_ARCH
after_success:
- make -C scripts/ci after_success

1
CLAUDE.md Symbolic link
View file

@ -0,0 +1 @@
GEMINI.md

View file

@ -8,8 +8,8 @@ Here are some useful hints to get involved.
* We have both -- [very simple](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement) and [more sophisticated](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3A%22new+feature%22) coding tasks;
* CRIU does need [extensive testing](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Atesting);
* Documentation is always hard, we have [some information](https://criu.org/Category:Empty_articles) that is to be extracted from people's heads into wiki pages as well as [some texts](https://criu.org/Category:Editor_help_needed) that all need to be converted into useful articles;
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lists.openvz.org/mailman/listinfo/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lists.openvz.org/mailman/listinfo/criu).
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lore.kernel.org/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lore.kernel.org/criu).
Below we describe in more detail recommend practices for CRIU development.
* Spread the word about CRIU in [social networks](http://criu.org/Contacts);
* If you're giving a talk about CRIU -- let us know, we'll mention it on the [wiki main page](https://criu.org/News/events);
@ -27,19 +27,43 @@ The repository may contain multiple branches. Development happens in the **criu-
To clone CRIU repo and switch to the proper branch, run:
```
git clone https://github.com/checkpoint-restore/criu criu
cd criu
git checkout criu-dev
git clone https://github.com/checkpoint-restore/criu criu
cd criu
git checkout criu-dev
```
### Compile
### Building from source
First, you need to install compile-time dependencies. Check [Installation dependencies](https://criu.org/Installation#Dependencies) for more info.
Follow these steps to compile CRIU from source code.
To compile CRIU, run:
#### Installing build dependencies
First, you need to install the required build dependencies. We provide scripts to simplify this process for several Linux distributions in [contrib/dependencies](contrib/dependencies). For a complete list of dependencies, please refer to the [installation guide](https://criu.org/Installation).
##### On Ubuntu/Debian-based systems:
```
make
./contrib/dependencies/apt-packages.sh
```
##### On Fedora/CentOS-based systems:
```
./contrib/dependencies/dnf-packages.sh
```
##### Using Nix:
```
nix develop
```
#### Compiling CRIU
Once the dependencies are installed, you can compile CRIU by running the `make` command from the root of the source directory:
```
make
```
This should create the `./criu/criu` executable.
@ -59,11 +83,11 @@ Other conventions can be learned from the source code itself. In short, make sur
Important: These tools are there to advise you, but should not be considered as a "source of truth", as tools also make nasty mistakes from time to time which can completely break code readability.
The following command can be used to automatically run a code linter for Python files (flake8), Shell scripts (shellcheck),
The following command can be used to automatically run a code linter for Python files (ruff), Shell scripts (shellcheck),
text spelling (codespell), and a number of CRIU-specific checks (usage of print macros and EOL whitespace for C files).
```
make lint
make lint
```
In addition, we have adopted a [clang-format configuration file](https://www.kernel.org/doc/Documentation/process/clang-format.rst)
@ -73,7 +97,7 @@ results in decreased readability, we may choose to ignore these errors.
Run the following command to check if your changes are compliant with the clang-format rules:
```
make indent
make indent
```
This command is built upon the `git-clang-format` tool and supports two options `BASE` and `OPTS`. The `BASE` option allows you to
@ -83,7 +107,7 @@ can use `BASE=origin/criu-dev`. The `OPTS` option can be used to pass additional
to check the last *N* commits for formatting errors, without applying the changes to the codebase you can use the following command.
```
make indent OPTS=--diff BASE=HEAD~N
make indent OPTS=--diff BASE=HEAD~N
```
Note that for pull requests, the "Run code linter" workflow runs these checks for all commits. If a clang-format error is detected
@ -96,7 +120,7 @@ Here are some bad examples of clang-format-ing:
```
@@ -58,8 +59,7 @@ static int register_membarriers(void)
}
if (!all_ok) {
- fail("can't register membarrier()s - tried %#x, kernel %#x",
- barriers_registered, barriers_supported);
@ -129,16 +153,11 @@ Here are some bad examples of clang-format-ing:
CRIU comes with an extensive test suite. To check whether your changes introduce any regressions, run
```
make test
make test
```
The command runs [ZDTM Test Suite](https://criu.org/ZDTM_Test_Suite). Check for any error messages produced by it.
In case you'd rather have someone else run the tests, you can use travis-ci for your
own GitHub fork of CRIU. It will check the compilation for various supported platforms,
as well as run most of the tests from the suite. See https://travis-ci.org/checkpoint-restore/criu
for more details.
## Describe your changes
Describe your problem. Whether your change is a one-line bug fix or
@ -166,21 +185,21 @@ If your change fixes a bug in a specific commit, e.g. you found an issue using
the SHA-1 ID, and the one line summary. For example:
```
Fixes: 9433b7b9db3e ("make: use cflags/ldflags for config.h detection mechanism")
Fixes: 9433b7b9db3e ("make: use cflags/ldflags for config.h detection mechanism")
```
The following `git config` settings can be used to add a pretty format for
outputting the above style in the `git log` or `git show` commands:
```
[pretty]
fixes = Fixes: %h (\"%s\")
[pretty]
fixes = Fixes: %h (\"%s\")
```
If your change address an issue listed in GitHub, please use `Fixes:` tag with the number of the issue. For instance:
```
Fixes: #339
Fixes: #339
```
The `Fixes:` tags should be put at the end of the detailed description.
@ -263,7 +282,7 @@ can certify the below:
then you just add a line saying
```
Signed-off-by: Random J Developer <random at developer.example.org>
Signed-off-by: Random J Developer <random at developer.example.org>
```
using your real name (please, no pseudonyms or anonymous contributions if
@ -275,14 +294,14 @@ commit message. To append such line to a commit you already made, use
```
From: Random J Developer <random at developer.example.org>
Subject: [PATCH] component: Short patch description
Subject: [PATCH] component: Short patch description
Long patch description (could be skipped if patch
is trivial enough)
Long patch description (could be skipped if patch
is trivial enough)
Signed-off-by: Random J Developer <random at developer.example.org>
---
Patch body here
Signed-off-by: Random J Developer <random at developer.example.org>
---
Patch body here
```
## Submit your work upstream
@ -316,8 +335,8 @@ contains the following:
revisions should be listed. For example:
```
v3: rebase on the current criu-dev
v2: add commit to foo() and update bar() coding style
v3: rebase on the current criu-dev
v2: add commit to foo() and update bar() coding style
```
If there are only minor updates to the commits in a pull request, it is
@ -335,7 +354,7 @@ Historically, CRIU worked with mailing lists and patches so if you still prefer
To create a patch, run
```
git format-patch --signoff origin/criu-dev
git format-patch --signoff origin/criu-dev
```
You might need to read GIT documentation on how to prepare patches
@ -346,8 +365,8 @@ at all.
We recommend to post patches using `git send-email`
```
git send-email --cover-letter --no-chain-reply-to --annotate \
--confirm=always --to=criu@openvz.org criu-dev
git send-email --cover-letter --no-chain-reply-to --annotate \
--confirm=always --to=criu@lists.linux.dev criu-dev
```
Note that the `git send-email` subcommand may not be in
@ -359,14 +378,14 @@ If this is your first time using git send-email, you might need to
configure it to point it to your SMTP server with something like:
```
git config --global sendemail.smtpServer stmp.example.net
git config --global sendemail.smtpServer stmp.example.net
```
If you get tired of typing `--to=criu@openvz.org` all the time,
If you get tired of typing `--to=criu@lists.linux.dev` all the time,
you can configure that to be automatically handled as well:
```
git config sendemail.to criu@openvz.org
git config sendemail.to criu@lists.linux.dev
```
If a developer is sending another version of the patch (e.g. to address
@ -379,7 +398,7 @@ version if needed though).
### Mail patches
The patches should be sent to CRIU development mailing list, `criu AT openvz.org`. Note that you need to be subscribed first in order to post. The list web interface is available at https://openvz.org/mailman/listinfo/criu; you can also use standard mailman aliases to work with it.
The patches should be sent to CRIU development mailing list, `criu AT lists.linux.dev`. Note that you need to be subscribed first in order to post. The list web interface is available at https://lore.kernel.org/criu; you can also use standard mailman aliases to work with it.
Please make sure the email client you're using doesn't screw your patch (line wrapping and so on).
@ -396,5 +415,3 @@ sometimes a patch may fly around a week before it gets reviewed.
Wiki article: [Continuous integration](https://criu.org/Continuous_integration)
CRIU tests are run for each series sent to the mailing list. If you get a message from our patchwork that patches failed to pass the tests, you have to investigate what is wrong.
We also recommend you to [enable Travis CI for your repo](https://criu.org/Continuous_integration#Enable_Travis_CI_for_your_repo) to check patches in your git branch, before sending them to the mailing list.

View file

@ -15,6 +15,7 @@ Checkpoint / Restore inside a docker container
Pytorch
Tensorflow
Using CRIU Image Streamer
Parallel Restore
DESCRIPTION
-----------
@ -27,14 +28,10 @@ to criu to allow Checkpoint / Restore with ROCm.
Dependencies
~~~~~~~~~~~~~~
------------
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of amdkfd(amdgpu) driver. The kernel patches are under
review currently.
*criu 3.16*::
This work is rebased on latest criu release available at this time.
an updated version of amdkfd(amdgpu) driver.
OPTIONS
-------

View file

@ -465,6 +465,30 @@ The 'mode' may be one of the following:
*skip*::: Don't lock the network. If *--tcp-close* is not used, the network
must be locked externally to allow CRIU to dump TCP connections.
*--allow-uprobes*::
Allow dumping when uprobes vma is present. When used on dump, this option is
required on restore as well.
A uprobes vma is automatically created by the kernel once a uprobe is
triggered. This mapping is not removed even once the uprobe is deleted. So,
even if a process once had uprobes attached to it, and they're removed by
the time the process is dumped, this option is still required because criu
has no way of knowing whether there are active uprobes or not.
When using this option on restore, make sure the uprobes (if any) active on
the dumped processes are still active. Otherwise, when execution reaches
a uprobe'd location in any of the restored processes, that process will be
sent a SIGTRAP.
As an example, say a uprobe is set at function foo in the executable of the
process p_bar. Whenever execution in p_bar reaches function foo, the uprobe
is triggered. If the uprobe has been triggered at least once, then the kernel
will have created the uprobes vma. To dump p_bar, this option is
necessary. After dumping, say the uprobe is deleted. Now, on restoring with
this option, once execution reaches function foo, SIGTRAP will be sent to
the restored p_bar. Unless it has a signal handler installed for SIGTRAP,
it will be terminated and core dumped.
*restore*
~~~~~~~~~
Restores previously checkpointed processes.
@ -478,8 +502,8 @@ Restores previously checkpointed processes.
The 'resource' argument can be one of the following:
+
- **tty[**__rdev__**:**__dev__**]**
- **pipe[**__inode__**]**
- **socket[**__inode__*]*
- **pipe:[**__inode__**]**
- **socket:[**__inode__*]*
- **file[**__mnt_id__**:**__inode__**]**
- 'path/to/file'
@ -692,6 +716,10 @@ The 'mode' may be one of the following:
*--skip-file-rwx-check*::
Skip checking file permissions (r/w/x for u/g/o) on restore.
*--allow-uprobes*::
Required when dumped with this option. Refer to this option in the section
on dumping for more details.
*check*
~~~~~~~
Checks whether the kernel supports the features needed by *criu* to

136
Documentation/logo.svg Normal file
View file

@ -0,0 +1,136 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 16.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="560px" height="560px" viewBox="0 0 560 560" enable-background="new 0 0 560 560" xml:space="preserve">
<path opacity="0.3" fill="#990000" d="M315.137,360.271c-18.771-7.159-41.548-8.85-68.479-8.85c-16.661,0-46.255,2.939-74.654,3.38
c11.209-4.884,20.734-10.265,24.842-16.87c14.531-23.346,17.645-65.893,17.645-65.893l-20.758,3.114c0,0-2.591,35.8-16.085,47.733
c-5.35,4.736-15.96,7.834-27.916,10.856c2.447-26.071,29.477-57.552,29.477-57.552l-14.874-3.966l-5.88-7.448
c0,0-3.011,1.761-7.588,5.315c-18.298,4.208-75.946,20.443-75.946,57.983c0,15.292,5.77,26.308,14.768,34.244
c-22.858,26.966-20.755,61.618-20.755,61.618s-8.945,16.61-8.021,31.254c2.083,32.973,34.931,25.097,44.313,26.374
c9.644,1.313,34.313-4.18,34.313-4.18s-16.276-2.639-15.329-18.562c0.5-8.369-0.947-27.628-21.404-37.307
c-1.13-10.066,2.111-18.309,6.379-28.015c18.452,45.263,92.601,53.97,92.601,53.97c0.393-0.097-10.269,20.047,0.221,35.632
c4.652,6.915,18.284,10.019,22.436,19.356c4.151,9.341,2.199,30.354,2.199,30.354s21.267-16.864,27.239-30.18
c3.334-7.432,25.989,0.926,25.989-34.047c0-14.077-12.26-26.841-13.675-29.815c-20.858-20.334-5.427-4.743,2.677-8.236
c12.758-5.499,35.412,11.657,35.412,11.657s-10.402-20.119-11.437-31.013c-0.795-8.335-4.537-16.816-16.624-30.042
c7.166-0.752,20.362,2.327,20.362,2.327s-5.202,11.251-0.879,25.515c3.588,11.84,7.193,7.193,14.736,14.737
c6.599,6.598,3.146,26.284,3.146,26.284s4.674-4.513,18.081-18.235c9.072-9.29,23.645-16.717,23.645-47.86
C355.312,365.969,334.97,360.979,315.137,360.271z M134.108,285.901c-11.5,13.048-23.667,32.329-28.23,58.293
c-4.821-3.519-7.613-8.1-7.613-14.043C98.265,309.699,117.078,295.016,134.108,285.901z"/>
<path fill="#990000" d="M382.184,115.435c3.654,1.208,7.327,2.37,10.968,3.444c14.16,4.183,26.745-9.798,26.745-9.798
s-8.785-2.243-17.857-3.497c12.173-2.653,21.085-18.66,21.085-18.66s-17.366,4.819-27.224,5.087
c-2.042,0.057-4.107,0.118-6.189,0.186c2.464-0.37,4.925-0.847,7.361-1.485c14.201-3.714,21.505-23.382,21.505-23.382
s-15.411,6.743-24.951,9.239c-2.694,0.703-5.438,1.437-8.197,2.185c3.038-1.071,6.008-2.306,8.815-3.82
c12.922-6.965,12.241-29.347,12.241-29.347s-10.162,11.926-18.844,16.605c-3.557,1.916-7.199,3.904-10.846,5.911
c3.798-2.277,7.45-4.743,10.596-7.569c10.918-9.814,7.722-29.605,7.722-29.605s-9.801,12.54-17.135,19.131
c-8.939,8.037-18.775,14.104-27.014,21.81c-6.427,6.011-25.14,35.236-36.812,46.283c-11.671,11.047-18.301,12.476-19.159,14.388
c-0.863,1.913,1.006,30.46-14.078,39.145c-16.476-21.583-50.565-44.007-53.101-72.033c-2.079-22.959,5.209-34.055,19.149-35.316
c14.994-1.359,15.998,24.507,15.998,24.507s-1.379,1.064-1.708,6.391c-0.097,0.629-0.145,1.272-0.083,1.934
c0.004,0.031,0.008,0.06,0.011,0.091c-0.014,1.674,0.065,3.664,0.278,6.039c1.131,12.474,4.53,14.574,4.53,14.574l2.075-0.722
c0,0-2.24-4.079-2.554-7.529c-0.172-1.917-0.187-3.556-0.079-4.977c0.45,0.067,0.949,0.081,1.506,0.031
c4.398-0.399,6.049-4.141,5.65-8.539c-0.042-0.45-0.069-0.885-0.094-1.316c2.485-26.032-1.756-29.637,4.788-41.391
c9.032-16.218,17.279-16.015,17.279-16.015l1.402-8.155c0,0-6.817,2.462-14.819,13.652c-8.833,12.354-8.983,26.229-9.066,47.958
c-0.188-0.761-0.502-1.37-1.017-1.784c-2.457-11.192-9.087-32.13-24.112-30.77c-16.72,1.514-29.419,14.974-26.773,44.171
c3.609,39.832,26.186,52.701,29.829,80.84c-13.47-2.349-23.883-10.656-30.866-20.282c-7.803-10.749-7.297-22.949-8.324-24.779
c-1.027-1.829-7.761-2.662-20.367-12.627c-12.605-9.965-33.845-37.41-40.78-42.824c-8.895-6.942-19.229-12.111-28.848-19.32
c-7.892-5.915-18.769-17.531-18.769-17.531s-1.419,19.995,10.323,28.8c3.386,2.536,7.246,4.665,11.229,6.597
c-3.808-1.674-7.616-3.33-11.327-4.925c-9.062-3.887-20.246-14.861-20.246-14.861s1.31,22.353,14.803,28.143
c2.931,1.257,6,2.223,9.12,3.019c-2.818-0.5-5.615-0.985-8.357-1.447c-9.728-1.636-25.677-6.981-25.677-6.981
s9.025,18.94,23.5,21.376c2.485,0.417,4.975,0.674,7.466,0.822c-2.08,0.118-4.148,0.242-6.183,0.368
c-9.843,0.61-27.566-2.645-27.566-2.645S85.667,120.333,110,120c-8.922,2.057-25.678,6.008-25.678,6.008s13.778,12.806,27.508,7.38
c3.533-1.394,7.087-2.876,10.62-4.404c-3.726,1.804-7.424,3.581-11.005,5.273c-8.963,4.243-19.428,10.176-19.428,10.176
s15.069,9.759,27.305,1.497c0.558-0.378,3.121-1.76,3.678-2.143c-7.904,5.808-19.754,14.937-19.754,14.937
s15.802,6.027,27.092-3.354c4.663-3.875,8.104-7.185,12.238-11.618c-3.773,4.55-6.699,8.018-10.634,12.106
c-6.839,7.104-13.06,19.791-13.06,19.791s15.597,0.39,24.359-11.388c4.488-6.035,7.482-11.633,10.974-18.191
c-3.113,6.479-5.468,11.95-8.911,17.788c-5.018,8.49-7.574,22.624-7.574,22.624s15.342-3.655,21.07-17.17
c2.231-5.266,2.107-9.783,3.694-15.291c-1.257,5.272-0.666,9.475-2.24,14.319c-3.045,9.379,0.011,25.554,0.011,25.554
s9.713-5.855,10.359-20.52c0.006-0.153,0.5-8.47,0.5-8.625L171,171.496c0,9.917,6.295,23.276,6.295,23.276
s11.459-10.649,9.369-25.266c-0.188-1.31-0.1-2.627-0.305-3.947c0.408,1.507,0.998,3.016,1.493,4.524
c3.075,9.429,3.5,15.957,3.5,15.957s6.483,1.251,8.73-1.594c0.764,5.625-0.843,10.2-0.843,10.2s5.471-1.1,8.893-3.756
c0.705,5.331,0.155,8.789,0.155,8.789s5.106-1.603,8.419-4.323c0.611,4.642,1.764,7.542,1.764,7.542s6.398-0.88,9.021-5.393
c0.199,0.038,0.395,0.079,0.59,0.117c2.269,4.875,1.438,8.517,1.438,8.517s7.492-2.14,9.492-6.14c0.003,0,0.007,0,0.01,0
c1.798,4,2.727,6.102,2.727,6.102s4.853-2.349,7.093-6.064c0.189,0.009,0.364-0.093,0.547-0.086
c-4.702,19.629-23.62,29.658-42.207,42.764c-1.392,0.981-2.712,1.925-3.97,2.884c-2.891,1.512-6.788,3.495-11.311,5.724
c-9.829,3.363-23.7,6.057-41.038,4.084c-9.798-1.115-21.037,10.02-21.037,10.02s6.87,4.843,16.565,5.028
c-8.819,3.621-17.438,12.632-17.438,12.632s0.045,0.019,0.069,0.029c-27.096,11.688-51.621,29.917-47.651,57.105
c2.375,16.27,14.692,25.475,31.704,30.254c-17.81,14.742-32.921,36.129-30.707,60.59c0.134,1.487,0.309,2.916,0.508,4.311
c-2.209,5.6-3.288,17.842-2.674,24.886c0.949,10.838,13.686,8.662,18.219,6.729c14.139,12.202,32.258,10.252,32.258,10.252
s-17.301,1.211-30.306-11.156c5.551-2.659,6.424-3.925,6.788-11.579c0.36-7.61-9.104-20.759-20.57-21.966
c-1.25-20.07,9.861-43.32,30.603-60.203c0.02,0.249,0.023,0.491,0.048,0.742c4.248,46.957,30.584,54.634,81.148,63.26
c12.603,2.15,22.04,5.821,29.042,10.457c-3.844,5.388-5.706,21.559-2.895,32.325c3.045,11.655,12.647,14.53,19.429,14.955
c-3.304,16.035-11.235,29.024-11.235,29.024s10.015-11.628,15.04-29.016c0.48-0.031,0.928-0.069,1.319-0.114
c10.922-1.262,16.17-11.338,14.743-23.071c-1.195-9.826-13.974-24.54-28.598-25.992c-33.117-21.52-109.104-9.05-113.877-61.769
c-0.341-3.746-0.517-7.367-0.571-10.888c5.709,1.111,11.782,1.844,18.104,2.244c14.111,28.517,62.158,22.269,95.818,20.694
c1.764,3.09,7.043,7.064,13.929,9.779c11.751,4.633,14.889,3.742,18.869,1.502c1.484-0.835,2.828-1.92,3.979-3.155
c10.822,10.456,25.37,30.251,25.37,30.251s-12.29-22.284-22.733-33.97c2.601-4.923,2.433-10.619-2.559-13.297
c-6.956-3.732-31.321,1.581-36.316,4.981c-30.811,1.668-71.853,6.551-89.576-16.474c41.005,1.192,88.786-9.133,102.385-10.365
c21.726-1.966,47.319,1.367,64.887,8.228c-0.783,5.681,1.867,18.47,4.641,25.318c3.316,8.197,11.561,5.887,16.562,3.028
c-0.588,13.3-4.495,22.638-4.495,22.638s7.86-14.125,9.117-26.183c4.354-4.041,4.774-5.562,2.904-12.887
c-1.849-7.24-14.317-16.821-25.47-15.096c-21.855-8.906-54.594-11.087-75.74-9.175c-18.253,1.653-61.404,10.802-97.611,10.237
c-1.895-3.338-3.402-7.122-4.412-11.479c5.113-2.364,10.551-4.388,16.307-5.975c30.999-8.551,40.97-29.258,42.943-48.579
c1.127,1.303,1.938,2.069,1.938,2.069s7.087-12.679,5.522-27.275c-0.264-2.469-0.429-4.737-0.553-6.911
c2.499,6.741,7.778,13.001,7.778,13.001s16.438-20.208,5.846-27.268c-11.583-7.714-6.836-13.283-4.31-15.299
c3.354-1.984,6.973-3.94,10.859-5.817c26.561-12.817,59.903-20.002,64.443-40.039c0.265-1.172,0.388-2.34,0.443-3.507
c3.701,2.396,9.165,2.053,9.165,2.053s-0.367-2.88-0.601-7.556c3.747,2.081,8.874,1.758,8.874,1.758s-0.986-2.319-1.255-7.689
c3.846,1.998,8.434,2.278,8.434,2.278s-0.725-2.246-1.24-5.573c3.788,0.719,8.84,0.419,8.84,0.419s-3.543-7.302-1.316-16.965
c0.357-1.547,0.666-3.09,0.938-4.626c-0.087,1.332-0.169,2.662-0.238,3.985c-0.783,14.742,10.85,24.47,10.85,24.47
S337,172.178,337,162.303c0-0.021,0-0.042,0-0.061c0,0.153-0.804,0.309-0.782,0.46c1.951,14.548,13.499,20.839,13.499,20.839
s2.388-16.471-1.478-25.542c-1.998-4.686-3.966-9.742-5.688-14.881c2.068,5.344,4.374,10.673,7.067,15.72
c6.909,12.952,20.498,15.406,20.498,15.406s-1.832-14.029-7.581-22.041c-3.952-5.505-7.874-11.654-11.551-17.83
c4.059,6.22,8.622,12.438,13.631,18.048c9.774,10.953,25.27,9.178,25.27,9.178s-7.323-12.085-14.767-18.552
c-4.283-3.722-8.589-7.824-12.754-12.019c4.513,4.047,9.319,7.944,14.31,11.39c12.077,8.341,27.281,0.931,27.281,0.931
s-10.533-7.219-18.926-12.302c0.595,0.332,1.186,0.662,1.777,0.988c12.922,7.14,28.146-3.013,28.146-3.013
s-12.036-5.887-21.343-9.313C389.896,118.341,386.055,116.903,382.184,115.435z M116.917,367.418
c-0.172,0.131-0.344,0.268-0.516,0.398c-17.301-3.899-29.646-12.415-31.124-28.752c-2.244-24.777,21.669-42.631,47.562-54.59
c3.553,1,9.203,1.919,15.541,0.503c-4.694,4.817-7.998,9.859-7.998,9.859s2.076,0.564,5.3,0.733
C133.582,308.673,115.917,333.715,116.917,367.418z M146.295,295.598c1.834,0.062,3.979-0.014,6.326-0.386
c-0.141,0.365-0.274,0.72-0.401,1.069c-10.511,14.57-18.745,34.363-17.404,59.912c-4.522,2.267-9.248,5.074-13.939,8.343
C122.237,330.3,136.218,307.613,146.295,295.598z M121.776,368.86c4.131-2.979,8.589-5.697,13.361-8.115
c0.358,3.527,1.032,6.741,2.025,9.634C131.805,370.131,126.629,369.657,121.776,368.86z M150.478,350.278
c-3.791,0.864-8.16,2.403-12.812,4.546c-0.062-0.425-0.168-0.803-0.224-1.236c-2.557-19.875,3.873-37.276,13.005-51.347
c0,0.005-0.007,0.032-0.007,0.032s13.533-3.395,23.088-14.017c-1.715,7.205,0.158,14.79,0.158,14.79s9.774-5.185,16.654-15.216
c-0.131,5.548,2.84,10.803,5.451,14.331C193.303,321.731,182.711,342.934,150.478,350.278z M259.516,275.357
c0.846-4.127,1.649-8.135,2.42-12.012c2.199-4.002,5.203-6.524,9.011-7.55c3.808-1.04,7.78-1.559,11.919-1.559l1.739-17.042
c-5.942,0.378-11.657,1.419-17.144,3.105c-5.492,1.672-10.946,3.611-16.369,5.8c-4.526,4.131-7.915,8.875-10.169,14.237
c-2.262,5.359-3.755,11.051-4.655,17.055c-0.906,6.007-1.268,12.17-1.268,18.489v18.209c0,3.23,0.201,6.368,0.779,9.393
c0.584,3.045,1.728,5.66,3.543,7.85c3.614,2.588,7.203,3.85,10.822,3.771c3.619-0.066,7.224-0.712,10.842-1.925
c3.611-1.23,7.162-2.757,10.647-4.558c3.484-1.811,6.904-3.293,10.266-4.457l7.159-14.521c-2.066,0.505-4.2,1.23-6.394,2.127
c-2.199,0.9-4.453,1.643-6.777,2.224c-2.322,0.585-4.649,0.773-6.977,0.585c-2.322-0.189-4.649-1.2-6.976-2.994
c-2.063-3.626-3.355-7.475-3.87-11.541c-0.519-4.065-0.612-8.165-0.289-12.296C258.1,283.619,258.674,279.488,259.516,275.357z
M367.6,320.582c-0.196-3.025-1.001-5.908-2.42-8.623c-1.031-3.608-2.649-6.588-4.846-8.905c-2.193-2.333-4.682-4.162-7.458-5.516
c-2.773-1.358-5.712-2.364-8.812-3.014c-3.098-0.643-6.004-1.056-8.717-1.259c-2.711-0.188-5.101-0.285-7.166-0.285
s-3.419-0.062-4.064-0.189c0.25-1.037,0.449-2.302,0.574-3.783c0.133-1.481,0.322-2.866,0.584-4.162
c0.258-1.419,0.512-2.977,0.773-4.65c6.326,0,12.073-0.581,17.242-1.749c5.165-1.148,9.688-3.059,13.558-5.705
c3.876-2.646,7.135-6.131,9.781-10.469c2.649-4.318,4.558-9.583,5.715-15.776c-5.684,0-11.596,0.029-17.727,0.093
s-12.328,0.158-18.593,0.284c-6.266,0.143-12.431,0.332-18.5,0.583c-6.066,0.27-11.812,0.584-17.236,0.979
c0.128,0,0.221,1.387,0.293,4.161c0.062,2.775,0.062,6.465,0,11.035c-0.072,4.588-0.2,9.788-0.386,15.589
c-0.199,5.819-0.49,11.73-0.875,17.734c-0.386,6.007-0.878,11.901-1.451,17.72c-0.584,5.815-1.262,10.908-2.035,15.304
c5.552-0.268,11.432-0.488,17.624-0.677c2.162-0.065,4.33-0.127,6.503-0.176l1.247-5.547c0.385-2.192,0.708-4.776,0.969-7.739
c0.259-2.979,0.513-5.754,0.773-8.338c0.259-3.093,0.386-6.196,0.386-9.286c0.646-0.127,1.677-0.206,3.103-0.206
c1.547,0,3.225,0.269,5.039,0.773c1.804,0.519,3.68,1.292,5.612,2.334c1.938,1.041,3.615,2.522,5.034,4.46
c1.42,1.925,2.45,4.352,3.104,7.252c0.638,2.914,0.638,6.495,0,10.75l0.631,5.39c1.609,0.033,3.207,0.079,4.796,0.144
c6.068,0.189,11.812,0.471,17.234,0.866C367.891,326.747,367.795,323.609,367.6,320.582z M327.506,263.345
c0.707-4.397,1.323-8.133,1.835-11.238c1.168-0.521,2.522-0.835,4.069-0.962c1.549-0.125,3.103-0.205,4.65-0.205
c1.677,0,3.291,0.031,4.845,0.112c1.547,0.062,2.901,0.093,4.069,0.093c0,1.151-0.041,2.586-0.103,4.256
c-0.066,1.688-0.189,3.42-0.389,5.232c-0.189,1.815-0.512,3.578-0.97,5.331c-0.446,1.732-1.127,3.182-2.034,4.347
c-0.896,0.918-2.128,1.657-3.681,2.224c-1.543,0.584-3.159,1.042-4.84,1.357c-1.677,0.33-3.291,0.55-4.838,0.677
c-1.555,0.141-2.78,0.207-3.682,0.207C326.439,271.542,326.798,267.727,327.506,263.345z M393.035,246.385
c-2.517,0.33-4.84,0.584-6.97,0.773c-2.135,0.205-3.781,0.172-4.939-0.096l3.678,2.711c0.899,5.423,1.356,11.051,1.356,16.851
c0,5.818-0.195,11.695-0.584,17.642c-0.385,5.941-0.872,11.805-1.45,17.624c-0.581,5.801-1,11.427-1.261,16.85
c-0.907,4.522-1.519,9.238-1.835,14.139c-0.331,4.901-0.843,9.713-1.554,14.425c-0.708,4.712-1.812,9.3-3.297,13.761
c-1.48,4.443-3.773,8.481-6.869,12.107l-2.908,1.543c0.513,0.52,1.323,0.993,2.42,1.45c1.093,0.457,1.842,0.678,2.23,0.678
c2.708-3.23,4.712-6.558,6.004-9.978c1.286-3.419,2.64-6.746,4.069-9.963c1.544-2.711,2.969-5.626,4.261-8.716
c1.286-3.107,2.774-6.008,4.455-8.719c1.671-2.708,3.681-5.045,6.008-6.984c2.322-1.938,5.285-3.15,8.903-3.67
c0.386-6.319,0.836-13.114,1.354-20.335c0.517-7.235,1.001-14.534,1.451-21.896c0.457-7.361,0.846-14.596,1.168-21.689
c0.323-7.111,0.482-13.684,0.482-19.769c-2.713,0-5.458,0.143-8.229,0.394C398.196,245.785,395.553,246.07,393.035,246.385z
M483.002,245c0,4-0.061,5.618-0.188,7.038c-0.135,1.419-0.323,3.525-0.581,5.259c-0.261,1.751-0.584,4.166-0.972,6.752
c-0.386,2.584-0.843,6.388-1.354,11.165c-0.519,4.791-1.135,11.551-1.839,19.167c-0.715,7.612-1.519,18.619-2.427,29.619h-32.15
c0-15,1.065-26.686,3.192-39.535c2.138-12.847,4.101-25.911,5.911-38.695c-5.034,0.52-9.85,1.042-14.427,1.812
c-4.589,0.773-9.136,0.898-13.662,0.52c-0.513,13.682-1.543,27.507-3.097,41.521c-1.553,13.998-3.23,27.586-5.038,40.749
c4.52,0,9.396-0.166,14.631-0.496c5.224-0.316,10.292-0.479,15.2-0.479c0.649,1.152,1.285,2.776,1.942,4.838
c0.638,2.065,1.22,4.318,1.738,6.779c0.517,2.457,0.997,5.027,1.454,7.753c0.447,2.715,0.873,5.424,1.258,8.135
c0.9,6.32,1.681,13.102,2.327,20.336c2.192-6.196,4.454-12.28,6.777-18.209c1.938-5.045,4.004-10.262,6.196-15.699
c2.199-5.423,4.327-10.073,6.393-13.936c2.323,0.254,4.649,0.316,6.974,0.188c2.326-0.124,4.681-0.25,7.071-0.392
c2.386-0.127,4.775-0.127,7.163,0c2.389,0.142,4.681,0.52,6.88,1.165c-0.257-6.716-0.164-13.619,0.293-20.728
c0.449-7.093,1.096-14.204,1.932-21.297c0.841-7.111,1.707-15.14,2.615-22.062c0.907-6.901,1.742-13.27,2.522-21.27H483.002z"/>
</svg>

After

Width:  |  Height:  |  Size: 15 KiB

136
GEMINI.md Normal file
View file

@ -0,0 +1,136 @@
# CRIU (Checkpoint/Restore In User-space)
CRIU is a tool for saving the state of a running application to a set of files
(checkpointing) and restoring it back to a live state. It is primarily used for
live migration of containers, in-place updates, and fast application startup.
It is implemented as a command-line tool called `criu`. The two primary commands
are `dump` and `restore`.
- `dump`: Saves a process tree and all its related resources (file
descriptors, IPC, sockets, namespaces, etc.) into a collection of image
files.
- `restore`: Restores processes from image files to the same state they were
in before the dump.
## Quick Start
To get a feel for `criu`, you can try checkpointing and restoring a simple
process.
1. **Run a simple process:**
Open a terminal and run a command that will run for a while. Find its PID.
```bash
sleep 1000 &
[1] 12345
```
2. **Dump the process:**
As root, use `criu dump` with the process ID (`-t`) and a directory for the
image files (`-D`).
```bash
sudo criu dump -t 12345 -D /tmp/sleep_images -v4 --shell-job
```
The `sleep` process will no longer be running.
3. **Restore the process:**
Use `criu restore` to bring the process back to life from the images.
```bash
sudo criu restore -D /tmp/sleep_images -v4 --shell-job
```
The `sleep` process will be running again as if nothing happened.
# For Developers and Contributors
This section contains more technical details about CRIU's internals and
development process.
## Dump Process
On dump, CRIU uses available kernel interfaces to collect information about
processes. For properties that can only be retrieved from within the process
itself, CRIU injects a binary blob (called a "parasite") into the process's
address space and executes it in the context of one of the process's threads.
This injection is handled by a subproject called **Compel**.
## Restore Process
On restore, CRIU reads the image files to reconstruct the processes. The goal is
to restore them to the exact state they were in before the dump. The restore
process is divided into several stages (defined as `CR_STATE_*` in
`./criu/include/restorer.h`).
The main `criu` process acts as a coordinator. It first restores resources with
inter-process dependencies (file descriptors, sockets, shared memory,
namespaces, etc.). It then forks the process tree and sets up namespaces.
Finally, it restores process-specific resources like file descriptors and memory
mappings.
A key step involves a small, self-contained binary called the "restorer". All
restored processes switch to executing this code, which unmaps the CRIU-specific
memory and restores the application's original memory mappings. On the final
step, the restorer calls `sigreturn` on a prepared signal frame to resume the
process with the state it had at the moment of the dump.
## Compel
Compel is a subproject responsible for generating the binary blobs used for the
parasite code (for dumping) and the restorer code (for restoring). It provides a
library for injecting and executing this code within the target process's
address space. It is a separate project because the logic for generating and
injecting Position-Independent Executable (PIE) code is complex and
self-contained.
## Coding Style
The C code in the CRIU project follows the
[Linux Kernel Coding Style](https://www.kernel.org/doc/html/latest/process/coding-style.html).
Here are some of the main points:
- **Indentation**: Use tabs, which are set to 8 characters.
- **Line Length**: The preferred line limit is 80 characters, but it can be
extended to 120 if it improves code readability.
- **Braces**:
- The opening brace for a function goes on a new line.
- The opening brace for a block (like `if`, `for`, `while`, `switch`) goes
on the same line.
- **Spaces**: Use spaces around operators (`+`, `-`, `*`, `/`, `%`, `<`, `>`,
`=`, etc.).
- **Naming**: Use descriptive names for functions and variables.
- **Comments**: Use C-style comments (`/* ... */`). For multi-line comments,
the preferred format is:
```c
/*
* This is a multi-line
* comment.
*/
```
## Code Layout
The code is organized into the following directories:
- `./compel`: The Compel sub-project.
- `./criu`: The main `criu` tool source code.
- `./images`: Protobuf descriptions for the image files.
- `./test`: All tests.
- `./test/zdtm`: The Zero-Downtime Migration (ZDTM) test suite.
- `./test/zdtm.py`: The executor script for ZDTM tests.
- `./scripts`: Helper scripts.
- `./scripts/build`: Docker image files used for CI and cross-compilation
checks.
- `./crit`: A tool to inspect and manipulate CRIU image files.
- `./soccr`: A library for TCP socket checkpoint/restore.
## Tests
The main test suite is ZDTM. Here is an example of how to run a single test:
```bash
sudo ./test/zdtm.py run -t zdtm/static/env00
```
Each ZDTM test has three stages: preparation, C/R, and results checks. During
the test, a process calls `test_daemon()` to signal it is ready for C/R, then
calls `test_waitsig()` to wait for the C/R stage to complete. After being
restored, the test checks that all its resources are still in a valid state.

View file

@ -1,11 +1,31 @@
## Building CRIU from source code
First, you need to install compile-time dependencies. Check [Installation dependencies](https://criu.org/Installation#Dependencies) for more info.
To compile CRIU, run:
```
make
```
This should create the `./criu/criu` executable.
To change the default behaviour of CRIU, the following variables can be passed
to the make command:
* **NETWORK_LOCK_DEFAULT**, can be set to one of the following
values: `NETWORK_LOCK_IPTABLES`, `NETWORK_LOCK_NFTABLES`,
`NETWORK_LOCK_SKIP`. CRIU defaults to `NETWORK_LOCK_IPTABLES`
if nothing is specified. If another network locking backend is
needed, `make` can be called like this:
`make NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES`
## Installing CRIU from source code
Once CRIU is built one can easily setup the complete CRIU package
(which includes executable itself, CRIT tool, libraries, manual
and etc) simply typing
make install
```
make install
```
this command accepts the following variables:
* **DESTDIR**, to specify global root where all components will be placed under (empty by default);
@ -16,17 +36,17 @@ this command accepts the following variables:
* **LIBDIR**, to specify directory where to put libraries (guess the correct path by default).
Thus one can type
make DESTDIR=/some/new/place install
```
make DESTDIR=/some/new/place install
```
and get everything installed under `/some/new/place`.
## Uninstalling CRIU
To clean up previously installed CRIU instance one can type
make uninstall
```
make uninstall
```
and everything should be removed. Note though that if some variable (**DESTDIR**, **BINDIR**
and such) has been used during installation procedure, the same *must* be passed with
uninstall action.

View file

@ -19,7 +19,7 @@ endif
#
# Supported Architectures
ifneq ($(filter-out x86 arm aarch64 ppc64 s390 mips loongarch64,$(ARCH)),)
ifneq ($(filter-out x86 arm aarch64 ppc64 s390 mips loongarch64 riscv64,$(ARCH)),)
$(error "The architecture $(ARCH) isn't supported")
endif
@ -43,7 +43,7 @@ ifeq ($(ARCH),arm)
endif
ifeq ($(ARMV),8)
# Running 'setarch linux32 uname -m' returns armv8l on travis aarch64.
# Running 'setarch linux32 uname -m' returns armv8l on aarch64.
# This tells CRIU to handle armv8l just as armv7hf. Right now this is
# only used for compile testing. No further verification of armv8l exists.
ARCHCFLAGS += -march=armv7-a
@ -64,6 +64,8 @@ endif
ifeq ($(ARCH),aarch64)
DEFINES := -DCONFIG_AARCH64
CC_MBRANCH_PROT := $(shell $(CC) -c -x c /dev/null -mbranch-protection=none -o /dev/null >/dev/null 2>&1 && echo "-mbranch-protection=none")
CFLAGS_PIE := $(CC_MBRANCH_PROT)
endif
ifeq ($(ARCH),ppc64)
@ -84,6 +86,10 @@ ifeq ($(ARCH),loongarch64)
DEFINES := -DCONFIG_LOONGARCH64
endif
ifeq ($(ARCH),riscv64)
DEFINES := -DCONFIG_RISCV64
endif
#
# CFLAGS_PIE:
#
@ -106,6 +112,7 @@ export PROTOUFIX DEFINES
#
# Independent options for all tools.
DEFINES += -D_FILE_OFFSET_BITS=64
DEFINES += -D_LARGEFILE64_SOURCE
DEFINES += -D_GNU_SOURCE
WARNINGS := -Wall -Wformat-security -Wdeclaration-after-statement -Wstrict-prototypes
@ -127,7 +134,7 @@ WARNINGS := -rdynamic
endif
ifeq ($(ARCH),loongarch64)
WARNINGS := -Wno-implicit-function-declaration
WARNINGS += -Wno-implicit-function-declaration
endif
ifneq ($(GCOV),)
@ -135,6 +142,10 @@ ifneq ($(GCOV),)
CFLAGS += $(CFLAGS-GCOV)
endif
ifneq ($(NETWORK_LOCK_DEFAULT),)
CFLAGS += -DNETWORK_LOCK_DEFAULT=$(NETWORK_LOCK_DEFAULT)
endif
ifeq ($(ASAN),1)
CFLAGS-ASAN := -fsanitize=address
export CFLAGS-ASAN
@ -164,7 +175,7 @@ HOSTCFLAGS += $(WARNINGS) $(DEFINES) -iquote include/
export AFLAGS CFLAGS USERCLFAGS HOSTCFLAGS
# Default target
all: criu lib crit
all: criu lib crit cuda_plugin
.PHONY: all
#
@ -297,15 +308,19 @@ clean-amdgpu_plugin:
$(Q) $(MAKE) -C plugins/amdgpu clean
.PHONY: clean-amdgpu_plugin
clean-cuda_plugin:
$(Q) $(MAKE) -C plugins/cuda clean
.PHONY: clean-cuda_plugin
clean-top:
$(Q) $(MAKE) -C Documentation clean
$(Q) $(MAKE) $(build)=test/compel clean
$(Q) $(RM) .gitid
.PHONY: clean-top
clean: clean-top clean-amdgpu_plugin
clean: clean-top clean-amdgpu_plugin clean-cuda_plugin
mrproper-top: clean-top clean-amdgpu_plugin
mrproper-top: clean-top clean-amdgpu_plugin clean-cuda_plugin
$(Q) $(RM) $(CONFIG_HEADER)
$(Q) $(RM) $(VERSION_HEADER)
$(Q) $(RM) $(COMPEL_VERSION_HEADER)
@ -337,6 +352,10 @@ amdgpu_plugin: criu
$(Q) $(MAKE) -C plugins/amdgpu all
.PHONY: amdgpu_plugin
cuda_plugin: criu
$(Q) $(MAKE) -C plugins/cuda all
.PHONY: cuda_plugin
crit: lib
$(Q) $(MAKE) -C crit
.PHONY: crit
@ -423,31 +442,44 @@ help:
@echo ' lint - Run code linters'
@echo ' indent - Indent C code'
@echo ' amdgpu_plugin - Make AMD GPU plugin'
@echo ' cuda_plugin - Make NVIDIA CUDA plugin'
.PHONY: help
lint:
flake8 --version
flake8 --config=scripts/flake8.cfg test/zdtm.py
flake8 --config=scripts/flake8.cfg test/inhfd/*.py
flake8 --config=scripts/flake8.cfg test/others/rpc/config_file.py
flake8 --config=scripts/flake8.cfg lib/pycriu/images/pb2dict.py
flake8 --config=scripts/flake8.cfg lib/pycriu/images/images.py
flake8 --config=scripts/flake8.cfg scripts/criu-ns
flake8 --config=scripts/flake8.cfg test/others/criu-ns/run.py
flake8 --config=scripts/flake8.cfg crit/*.py
flake8 --config=scripts/flake8.cfg crit/crit/*.py
flake8 --config=scripts/flake8.cfg scripts/uninstall_module.py
flake8 --config=scripts/flake8.cfg coredump/ coredump/coredump
flake8 --config=scripts/flake8.cfg scripts/github-indent-warnings.py
ruff:
@ruff --version
ruff check ${RUFF_FLAGS} --config=scripts/ruff.toml \
test/zdtm.py \
test/inhfd/*.py \
test/others/rpc/config_file.py \
test/others/action-script/check_actions.py \
test/others/pycriu/*.py \
lib/pycriu/criu.py \
lib/pycriu/__init__.py \
lib/pycriu/images/pb2dict.py \
lib/pycriu/images/images.py \
scripts/criu-ns \
test/others/criu-ns/run.py \
crit/*.py \
crit/crit/*.py \
scripts/uninstall_module.py \
coredump/ coredump/coredump \
scripts/github-indent-warnings.py
shellcheck:
shellcheck --version
shellcheck scripts/*.sh
shellcheck scripts/ci/*.sh scripts/ci/apt-install
shellcheck scripts/ci/*.sh
shellcheck contrib/apt-install contrib/dependencies/*.sh
shellcheck -x test/others/crit/*.sh
shellcheck -x test/others/libcriu/*.sh
shellcheck -x test/others/crit/*.sh test/others/criu-coredump/*.sh
shellcheck -x test/others/config-file/*.sh
shellcheck -x test/others/action-script/*.sh
codespell -S tags
codespell:
codespell
lint: ruff shellcheck codespell
# Do not append \n to pr_perror, pr_pwarn or fail
! git --no-pager grep -E '^\s*\<(pr_perror|pr_pwarn|fail)\>.*\\n"'
# Do not use %m with pr_* or fail
@ -458,9 +490,9 @@ lint:
! git --no-pager grep -En '^\s*\<pr_(err|warn|msg|info|debug)\>.*);$$' | grep -v '\\n'
# No EOL whitespace for C files
! git --no-pager grep -E '\s+$$' \*.c \*.h
.PHONY: lint
.PHONY: lint ruff shellcheck codespell
codecov: SHELL := $(shell which bash)
codecov: SHELL := $(shell command -v bash)
codecov:
curl -Os https://uploader.codecov.io/latest/linux/codecov
chmod +x codecov

View file

@ -50,8 +50,8 @@ compel/plugins/%: $(compel-deps) .FORCE
#
# GNU make 4.x supports targets matching via wide
# match targeting, where GNU make 3.x series (used on
# Travis) is not, so we have to write them here explicitly.
# match targeting, where GNU make 3.x series is not,
# so we have to write them here explicitly.
compel/plugins/std.lib.a: $(compel-deps) .FORCE
$(Q) $(MAKE) $(build)=compel/plugins $@

View file

@ -2,12 +2,15 @@ include $(__nmk_dir)utils.mk
include $(__nmk_dir)msg.mk
include scripts/feature-tests.mak
# This is a kludge for $(info ...) to not eat spaces.
S :=
ifeq ($(call try-cc,$(FEATURE_TEST_LIBBSD_DEV),-lbsd),true)
LIBS_FEATURES += -lbsd
FEATURE_DEFINES += -DCONFIG_HAS_LIBBSD
else
$(info Note: Building without setproctitle() and strlcpy() support.)
$(info $(info) To enable these features, please install libbsd-devel (RPM) / libbsd-dev (DEB).)
$(info Note: Building without setproctitle() support.)
$(info $S Install libbsd-devel (RPM) / libbsd-dev (DEB) to fix.)
endif
ifeq ($(call pkg-config-check,libselinux),y)
@ -23,10 +26,10 @@ endif
ifeq ($(call pkg-config-check,libdrm),y)
export CONFIG_AMDGPU := y
$(info Note: Building criu with amdgpu_plugin.)
$(info Note: Building with amdgpu_plugin.)
else
$(info Note: Building criu without amdgpu_plugin.)
$(info Note: libdrm and libdrm_amdgpu are required to build amdgpu_plugin.)
$(info Note: Building without amdgpu_plugin.)
$(info $S Install libdrm-devel (RPM) or libdrm-dev (DEB) to fix.)
endif
ifeq ($(NO_GNUTLS)x$(call pkg-config-check,gnutls),xy)
@ -34,7 +37,8 @@ ifeq ($(NO_GNUTLS)x$(call pkg-config-check,gnutls),xy)
export CONFIG_GNUTLS := y
FEATURE_DEFINES += -DCONFIG_GNUTLS
else
$(info Note: Building without GnuTLS support)
$(info Note: Building without GnuTLS support.)
$(info $S Install gnutls-devel (RPM) or gnutls-dev (DEB) to fix.)
endif
ifeq ($(call pkg-config-check,libnftables),y)
@ -46,16 +50,19 @@ ifeq ($(call pkg-config-check,libnftables),y)
LIBS_FEATURES += $(LIB_NFTABLES)
FEATURE_DEFINES += -DCONFIG_HAS_NFTABLES_LIB_API_1
else
$(warning Warn: you have libnftables installed but it has incompatible API)
$(warning Warn: Building without nftables support)
$(info Warn: Building without nftables support (incompatible API version).)
endif
else
$(warning Warn: you have no libnftables installed)
$(warning Warn: Building without nftables support)
$(info Warn: Building without nftables support.)
$(info $S Install nftables-devel (RPM) or libnftables-dev (DEB) to fix.)
endif
export LIBS += $(LIBS_FEATURES)
ifneq ($(PLUGINDIR),)
FEATURE_DEFINES += -DCR_PLUGIN_DEFAULT="\"$(PLUGINDIR)\""
endif
CONFIG_FILE = .config
$(CONFIG_FILE):
@ -67,17 +74,17 @@ ifeq ($(call try-asm,$(FEATURE_TEST_X86_COMPAT)),true)
export CONFIG_COMPAT := y
FEATURE_DEFINES += -DCONFIG_COMPAT
else
$(info Note: Building without ia32 C/R, missed ia32 support in gcc)
$(info $(info) That may be related to missing gcc-multilib in your)
$(info $(info) distribution or you may have Debian with buggy toolchain)
$(info $(info) (issue https://github.com/checkpoint-restore/criu/issues/315))
$(info Note: Building without ia32 C/R, missing ia32 support in gcc.)
$(info $S It may be related to missing gcc-multilib in your)
$(info $S distribution, or you may have Debian with buggy toolchain.)
$(info $S See https://github.com/checkpoint-restore/criu/issues/315.)
endif
endif
export DEFINES += $(FEATURE_DEFINES)
export CFLAGS += $(FEATURE_DEFINES)
FEATURES_LIST := TCP_REPAIR STRLCPY STRLCAT PTRACE_PEEKSIGINFO \
FEATURES_LIST := TCP_REPAIR PTRACE_PEEKSIGINFO \
SETPROCTITLE_INIT TCP_REPAIR_WINDOW MEMFD_CREATE \
OPENAT2 NO_LIBC_RSEQ_DEFS

View file

@ -29,6 +29,33 @@ LIBDIR ?= $(PREFIX)/lib
export PREFIX BINDIR SBINDIR MANDIR RUNDIR
export LIBDIR INCLUDEDIR LIBEXECDIR PLUGINDIR
# Detect externally managed Python environment (PEP 668).
PYTHON_EXTERNALLY_MANAGED := $(shell $(PYTHON) -c 'import os, sysconfig; print(int(os.path.isfile(os.path.join(sysconfig.get_path("stdlib"), "EXTERNALLY-MANAGED"))))')
PIP_BREAK_SYSTEM_PACKAGES ?= 0
# If Python environment is externally managed and PIP_BREAK_SYSTEM_PACKAGES is not set, skip pip install.
SKIP_PIP_INSTALL := 0
ifeq ($(PYTHON_EXTERNALLY_MANAGED),1)
ifeq ($(PIP_BREAK_SYSTEM_PACKAGES),0)
SKIP_PIP_INSTALL := 1
$(info Warn: Externally managed python environment)
$(info Consider using PIP_BREAK_SYSTEM_PACKAGES=1)
endif
endif
# Default flags for pip install:
# --ignore-installed: Overwrite already installed pycriu/crit packages
# --no-build-isolation: Use current Python environment to build pycriu/crit packages
# --no-deps: Don't install any dependencies
# --no-index: Don't use PyPI index to find packages
# --progress-bar: Cleaner output
# --upgrade: Treat the install as an upgrade when replacing the installed version
PIPFLAGS ?= --ignore-installed --no-build-isolation --no-deps --no-index --progress-bar off --upgrade
export SKIP_PIP_INSTALL PIPFLAGS
install-man:
$(Q) $(MAKE) -C Documentation install
.PHONY: install-man
@ -49,12 +76,16 @@ install-amdgpu_plugin: amdgpu_plugin
$(Q) $(MAKE) -C plugins/amdgpu install
.PHONY: install-amdgpu_plugin
install-cuda_plugin: cuda_plugin
$(Q) $(MAKE) -C plugins/cuda install
.PHONY: install-cuda_plugin
install-compel: $(compel-install-targets)
$(Q) $(MAKE) $(build)=compel install
$(Q) $(MAKE) $(build)=compel/plugins install
.PHONY: install-compel
install: install-man install-lib install-crit install-criu install-compel install-amdgpu_plugin ;
install: install-man install-lib install-crit install-criu install-compel install-amdgpu_plugin install-cuda_plugin ;
.PHONY: install
uninstall:
@ -65,4 +96,5 @@ uninstall:
$(Q) $(MAKE) $(build)=compel $@
$(Q) $(MAKE) $(build)=compel/plugins $@
$(Q) $(MAKE) -C plugins/amdgpu $@
$(Q) $(MAKE) -C plugins/cuda $@
.PHONY: uninstall

View file

@ -1,10 +1,10 @@
#
# CRIU version.
CRIU_VERSION_MAJOR := 3
CRIU_VERSION_MINOR := 19
CRIU_VERSION_MAJOR := 4
CRIU_VERSION_MINOR := 2
CRIU_VERSION_SUBLEVEL :=
CRIU_VERSION_EXTRA :=
CRIU_VERSION_NAME := Bronze Peacock
CRIU_VERSION_NAME := CRIUTIBILITY
CRIU_VERSION := $(CRIU_VERSION_MAJOR)$(if $(CRIU_VERSION_MINOR),.$(CRIU_VERSION_MINOR))$(if $(CRIU_VERSION_SUBLEVEL),.$(CRIU_VERSION_SUBLEVEL))$(if $(CRIU_VERSION_EXTRA),.$(CRIU_VERSION_EXTRA))
export CRIU_VERSION_MAJOR CRIU_VERSION_MINOR CRIU_VERSION_SUBLEVEL

View file

@ -7,7 +7,7 @@
[![CircleCI](https://circleci.com/gh/checkpoint-restore/criu.svg?style=svg)](
https://circleci.com/gh/checkpoint-restore/criu)
<p align="center"><img src="https://criu.org/w/images/1/1c/CRIU.svg" width="256px"/></p>
<p align="center"><img src="Documentation/logo.svg" width="256px"/></p>
## CRIU -- A project to implement checkpoint/restore functionality for Linux
@ -35,7 +35,7 @@ Pages worth starting with are:
- [Installation instructions](http://criu.org/Installation)
- [A simple example of usage](http://criu.org/Simple_loop)
- [Examples of more advanced usage](https://criu.org/Category:HOWTO)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/FAQ)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/index.php?title=FAQ)
### Checkpoint and restore of simple loop process
<p align="center"><a href="https://asciinema.org/a/232445"><img src="https://asciinema.org/a/232445.png" width="572px" height="412px"/></a></p>

3
compel/.gitignore vendored
View file

@ -4,6 +4,9 @@ arch/arm/plugins/std/syscalls/syscalls.S
arch/aarch64/plugins/std/syscalls/syscalls.S
arch/s390/plugins/std/syscalls/syscalls.S
arch/ppc64/plugins/std/syscalls/syscalls.S
arch/mips/plugins/std/syscalls/syscalls-64.S
arch/loongarch64/plugins/std/syscalls/syscalls-64.S
arch/riscv64/plugins/std/syscalls/syscalls.S
include/version.h
plugins/include/uapi/std/asm/syscall-types.h
plugins/include/uapi/std/syscall-64.h

View file

@ -32,8 +32,8 @@ ifeq ($(ARCH),x86)
lib-y += arch/$(ARCH)/src/lib/thread_area.o
endif
# handle_elf() has no support of ELF relocations on ARM (yet?)
ifneq ($(filter arm aarch64 loongarch64,$(ARCH)),)
# handle_elf() has no support of ELF relocations on ARM and RISCV64 (yet?)
ifneq ($(filter arm aarch64 loongarch64 riscv64,$(ARCH)),)
CFLAGS += -DNO_RELOCS
HOSTCFLAGS += -DNO_RELOCS
endif

View file

@ -0,0 +1,47 @@
#ifndef __UAPI_ASM_GCS_TYPES_H__
#define __UAPI_ASM_GCS_TYPES_H__
#ifndef NT_ARM_GCS
#define NT_ARM_GCS 0x410 /* ARM GCS state */
#endif
/* Shadow Stack/Guarded Control Stack interface */
#define PR_GET_SHADOW_STACK_STATUS 74
#define PR_SET_SHADOW_STACK_STATUS 75
#define PR_LOCK_SHADOW_STACK_STATUS 76
/* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack */
#ifndef PR_SHADOW_STACK_ENABLE
#define PR_SHADOW_STACK_ENABLE (1UL << 0)
#endif
/* Allows explicit GCS stores (eg. using GCSSTR) */
#ifndef PR_SHADOW_STACK_WRITE
#define PR_SHADOW_STACK_WRITE (1UL << 1)
#endif
/* Allows explicit GCS pushes (eg. using GCSPUSHM) */
#ifndef PR_SHADOW_STACK_PUSH
#define PR_SHADOW_STACK_PUSH (1UL << 2)
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN 0x1 /* Set up a restore token in the shadow stack */
#endif
#define PR_SHADOW_STACK_ALL_MODES \
PR_SHADOW_STACK_ENABLE | PR_SHADOW_STACK_WRITE | PR_SHADOW_STACK_PUSH
/* copied from: arch/arm64/include/asm/sysreg.h */
#define GCS_CAP_VALID_TOKEN 0x1
#define GCS_CAP_ADDR_MASK 0xFFFFFFFFFFFFF000ULL
#define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | GCS_CAP_VALID_TOKEN)
#define GCS_SIGNAL_CAP(addr) (((unsigned long)addr) & GCS_CAP_ADDR_MASK)
#include <asm/hwcap.h>
#ifndef HWCAP_GCS
#define HWCAP_GCS (1UL << 32)
#endif
#endif /* __UAPI_ASM_GCS_TYPES_H__ */

View file

@ -2,6 +2,7 @@
#define UAPI_COMPEL_ASM_TYPES_H__
#include <stdint.h>
#include <stdbool.h>
#include <signal.h>
#include <sys/mman.h>
#include <asm/ptrace.h>
@ -16,7 +17,24 @@
*/
typedef struct user_pt_regs user_regs_struct_t;
typedef struct user_fpsimd_state user_fpregs_struct_t;
/*
* GCS (Guarded Control Stack)
*
* This mirrors the kernel definition but renamed to cr_user_gcs
* to avoid conflict with kernel headers (/usr/include/asm/ptrace.h).
*/
struct cr_user_gcs {
__u64 features_enabled;
__u64 features_locked;
__u64 gcspr_el0;
};
struct user_fpregs_struct {
struct user_fpsimd_state fpstate;
struct cr_user_gcs gcs;
};
typedef struct user_fpregs_struct user_fpregs_struct_t;
#define __compel_arch_fetch_thread_area(tid, th) 0
#define compel_arch_fetch_thread_area(tctl) 0
@ -39,4 +57,12 @@ typedef struct user_fpsimd_state user_fpregs_struct_t;
__NR_##syscall; \
})
extern bool __compel_host_supports_gcs(void);
#define compel_host_supports_gcs __compel_host_supports_gcs
struct parasite_ctl;
extern int __parasite_setup_shstk(struct parasite_ctl *ctl,
user_fpregs_struct_t *ext_regs);
#define parasite_setup_shstk __parasite_setup_shstk
#endif /* UAPI_COMPEL_ASM_TYPES_H__ */

View file

@ -1,19 +1,29 @@
#ifndef UAPI_COMPEL_ASM_SIGFRAME_H__
#define UAPI_COMPEL_ASM_SIGFRAME_H__
#include <asm/sigcontext.h>
#include <signal.h>
#include <sys/ucontext.h>
#include <stdint.h>
#include <asm/types.h>
/* Copied from the kernel header arch/arm64/include/uapi/asm/sigcontext.h */
#define FPSIMD_MAGIC 0x46508001
#define GCS_MAGIC 0x47435300
typedef struct fpsimd_context fpu_state_t;
struct gcs_context {
struct _aarch64_ctx head;
__u64 gcspr;
__u64 features_enabled;
__u64 reserved;
};
struct aux_context {
struct fpsimd_context fpsimd;
struct gcs_context gcs;
/* additional context to be added before "end" */
struct _aarch64_ctx end;
};
@ -62,6 +72,7 @@ struct cr_sigcontext {
#define RT_SIGFRAME_AUX_CONTEXT(rt_sigframe) ((struct aux_context *)&(RT_SIGFRAME_SIGCONTEXT(rt_sigframe)->__reserved))
#define RT_SIGFRAME_FPU(rt_sigframe) (&RT_SIGFRAME_AUX_CONTEXT(rt_sigframe)->fpsimd)
#define RT_SIGFRAME_OFFSET(rt_sigframe) 0
#define RT_SIGFRAME_GCS(rt_sigframe) (&RT_SIGFRAME_AUX_CONTEXT(rt_sigframe)->gcs)
#define rt_sigframe_erase_sigset(sigframe) memset(&sigframe->uc.uc_sigmask, 0, sizeof(k_rtsigset_t))
#define rt_sigframe_copy_sigset(sigframe, from) memcpy(&sigframe->uc.uc_sigmask, from, sizeof(k_rtsigset_t))

View file

@ -2,8 +2,8 @@
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/auxv.h>
#include <asm/ptrace.h>
#include <linux/elf.h>
#include <compel/plugins/std/syscall-codes.h>
#include "common/page.h"
@ -13,6 +13,8 @@
#include "infect.h"
#include "infect-priv.h"
#include "asm/breakpoints.h"
#include "asm/gcs-types.h"
#include <linux/prctl.h>
unsigned __page_size = 0;
unsigned __page_shift = 0;
@ -33,24 +35,54 @@ static inline void __always_unused __check_code_syscall(void)
BUILD_BUG_ON(!is_log2(sizeof(code_syscall)));
}
bool __compel_host_supports_gcs(void)
{
unsigned long hwcap = getauxval(AT_HWCAP);
return (hwcap & HWCAP_GCS) != 0;
}
static bool __compel_gcs_enabled(struct cr_user_gcs *gcs)
{
if (!compel_host_supports_gcs())
return false;
return gcs && (gcs->features_enabled & PR_SHADOW_STACK_ENABLE) != 0;
}
int sigreturn_prep_regs_plain(struct rt_sigframe *sigframe, user_regs_struct_t *regs, user_fpregs_struct_t *fpregs)
{
struct fpsimd_context *fpsimd = RT_SIGFRAME_FPU(sigframe);
struct gcs_context *gcs = RT_SIGFRAME_GCS(sigframe);
memcpy(sigframe->uc.uc_mcontext.regs, regs->regs, sizeof(regs->regs));
pr_debug("sigreturn_prep_regs_plain: sp %lx pc %lx\n", (long)regs->sp, (long)regs->pc);
sigframe->uc.uc_mcontext.sp = regs->sp;
sigframe->uc.uc_mcontext.pc = regs->pc;
sigframe->uc.uc_mcontext.pstate = regs->pstate;
memcpy(fpsimd->vregs, fpregs->vregs, 32 * sizeof(__uint128_t));
memcpy(fpsimd->vregs, fpregs->fpstate.vregs, 32 * sizeof(__uint128_t));
fpsimd->fpsr = fpregs->fpsr;
fpsimd->fpcr = fpregs->fpcr;
fpsimd->fpsr = fpregs->fpstate.fpsr;
fpsimd->fpcr = fpregs->fpstate.fpcr;
fpsimd->head.magic = FPSIMD_MAGIC;
fpsimd->head.size = sizeof(*fpsimd);
if (__compel_gcs_enabled(&fpregs->gcs)) {
gcs->head.magic = GCS_MAGIC;
gcs->head.size = sizeof(*gcs);
gcs->reserved = 0;
gcs->gcspr = fpregs->gcs.gcspr_el0 - 8;
gcs->features_enabled = fpregs->gcs.features_enabled;
pr_debug("sigframe gcspr=%llx features_enabled=%llx\n", fpregs->gcs.gcspr_el0 - 8, fpregs->gcs.features_enabled);
} else {
pr_debug("sigframe gcspr=[disabled]\n");
memset(gcs, 0, sizeof(*gcs));
}
return 0;
}
@ -62,7 +94,6 @@ int sigreturn_prep_fpu_frame_plain(struct rt_sigframe *sigframe, struct rt_sigfr
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t tmp, *fpsimd = ext_regs ? ext_regs : &tmp;
struct iovec iov;
int ret;
@ -75,14 +106,28 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
goto err;
}
iov.iov_base = fpsimd;
iov.iov_len = sizeof(*fpsimd);
iov.iov_base = &ext_regs->fpstate;
iov.iov_len = sizeof(ext_regs->fpstate);
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_PRFPREG, &iov))) {
pr_perror("Failed to obtain FPU registers for %d", pid);
goto err;
}
ret = save(arg, regs, fpsimd);
memset(&ext_regs->gcs, 0, sizeof(ext_regs->gcs));
iov.iov_base = &ext_regs->gcs;
iov.iov_len = sizeof(ext_regs->gcs);
if (ptrace(PTRACE_GETREGSET, pid, NT_ARM_GCS, &iov) == 0) {
pr_info("gcs: GCSPR_EL0 for %d: 0x%llx, features: 0x%llx\n",
pid, ext_regs->gcs.gcspr_el0, ext_regs->gcs.features_enabled);
if (!__compel_gcs_enabled(&ext_regs->gcs))
pr_info("gcs: GCS is NOT enabled\n");
} else {
pr_info("gcs: GCS state not available for %d\n", pid);
}
ret = save(pid, arg, regs, ext_regs);
err:
return ret;
}
@ -91,14 +136,44 @@ int compel_set_task_ext_regs(pid_t pid, user_fpregs_struct_t *ext_regs)
{
struct iovec iov;
struct cr_user_gcs gcs;
struct iovec gcs_iov = { .iov_base = &gcs, .iov_len = sizeof(gcs) };
pr_info("Restoring GP/FPU registers for %d\n", pid);
iov.iov_base = ext_regs;
iov.iov_len = sizeof(*ext_regs);
iov.iov_base = &ext_regs->fpstate;
iov.iov_len = sizeof(ext_regs->fpstate);
if (ptrace(PTRACE_SETREGSET, pid, NT_PRFPREG, &iov)) {
pr_perror("Failed to set FPU registers for %d", pid);
return -1;
}
if (ptrace(PTRACE_GETREGSET, pid, NT_ARM_GCS, &gcs_iov) < 0) {
pr_warn("gcs: Failed to get GCS for %d\n", pid);
} else {
ext_regs->gcs = gcs;
compel_set_task_gcs_regs(pid, ext_regs);
}
return 0;
}
int compel_set_task_gcs_regs(pid_t pid, user_fpregs_struct_t *ext_regs)
{
struct iovec iov;
pr_info("gcs: restoring GCS registers for %d\n", pid);
pr_info("gcs: restoring GCS: gcspr=%llx features=%llx\n",
ext_regs->gcs.gcspr_el0, ext_regs->gcs.features_enabled);
iov.iov_base = &ext_regs->gcs;
iov.iov_len = sizeof(ext_regs->gcs);
if (ptrace(PTRACE_SETREGSET, pid, NT_ARM_GCS, &iov)) {
pr_perror("gcs: Failed to set GCS registers for %d", pid);
return -1;
}
return 0;
}
@ -287,3 +362,68 @@ int ptrace_flush_breakpoints(pid_t pid)
return 0;
}
int inject_gcs_cap_token(struct parasite_ctl *ctl, pid_t pid, struct cr_user_gcs *gcs)
{
struct iovec gcs_iov = { .iov_base = gcs, .iov_len = sizeof(*gcs) };
uint64_t token_addr = gcs->gcspr_el0 - 8;
uint64_t sigtramp_addr = gcs->gcspr_el0 - 16;
uint64_t cap_token = ALIGN_DOWN(GCS_SIGNAL_CAP(token_addr), 8);
unsigned long restorer_addr;
pr_info("gcs: (setup) CAP token: 0x%lx at addr: 0x%lx\n", cap_token, token_addr);
/* Inject capability token at gcspr_el0 - 8 */
if (ptrace(PTRACE_POKEDATA, pid, (void *)token_addr, cap_token)) {
pr_perror("gcs: (setup) Inject GCS cap token failed");
return -1;
}
/* Inject restorer trampoline address (gcspr_el0 - 16) */
restorer_addr = ctl->parasite_ip;
if (ptrace(PTRACE_POKEDATA, pid, (void *)sigtramp_addr, restorer_addr)) {
pr_perror("gcs: (setup) Inject GCS restorer failed");
return -1;
}
/* Update GCSPR_EL0 */
gcs->gcspr_el0 = token_addr;
if (ptrace(PTRACE_SETREGSET, pid, NT_ARM_GCS, &gcs_iov)) {
pr_perror("gcs: PTRACE_SETREGS FAILED");
return -1;
}
pr_debug("gcs: parasite_ip=%#lx sp=%#llx gcspr_el0=%#llx\n",
ctl->parasite_ip, ctl->orig.regs.sp, gcs->gcspr_el0);
return 0;
}
int parasite_setup_shstk(struct parasite_ctl *ctl, user_fpregs_struct_t *ext_regs)
{
struct cr_user_gcs gcs;
struct iovec gcs_iov = { .iov_base = &gcs, .iov_len = sizeof(gcs) };
pid_t pid = ctl->rpid;
if(!__compel_host_supports_gcs())
return 0;
if (ptrace(PTRACE_GETREGSET, pid, NT_ARM_GCS, &gcs_iov) != 0) {
pr_perror("GCS state not available for %d", pid);
return -1;
}
if (!__compel_gcs_enabled(&gcs))
return 0;
if (inject_gcs_cap_token(ctl, pid, &gcs)) {
pr_perror("Failed to inject GCS cap token for %d", pid);
return -1;
}
pr_info("gcs: GCS enabled for %d\n", pid);
return 0;
}

View file

@ -39,7 +39,7 @@ recvfrom 207 292 (int sockfd, void *ubuf, size_t size, unsigned int flags, str
sendmsg 211 296 (int sockfd, const struct msghdr *msg, int flags)
recvmsg 212 297 (int sockfd, struct msghdr *msg, int flags)
shutdown 210 293 (int sockfd, int how)
bind 235 282 (int sockfd, const struct sockaddr *addr, int addrlen)
bind 200 282 (int sockfd, const struct sockaddr *addr, int addrlen)
setsockopt 208 294 (int sockfd, int level, int optname, const void *optval, socklen_t optlen)
getsockopt 209 295 (int sockfd, int level, int optname, const void *optval, socklen_t *optlen)
clone 220 120 (unsigned long flags, void *child_stack, void *parent_tid, unsigned long newtls, void *child_tid)
@ -85,7 +85,7 @@ timer_settime 110 258 (kernel_timer_t timer_id, int flags, const struct itimer
timer_gettime 108 259 (int timer_id, const struct itimerspec *setting)
timer_getoverrun 109 260 (int timer_id)
timer_delete 111 261 (kernel_timer_t timer_id)
clock_gettime 113 263 (const clockid_t which_clock, const struct timespec *tp)
clock_gettime 113 263 (clockid_t which_clock, struct timespec *tp)
exit_group 94 248 (int error_code)
set_robust_list 99 338 (struct robust_list_head *head, size_t len)
get_robust_list 100 339 (int pid, struct robust_list_head **head_ptr, size_t *len_ptr)
@ -118,8 +118,10 @@ fsopen 430 430 (char *fsname, unsigned int flags)
fsconfig 431 431 (int fd, unsigned int cmd, const char *key, const char *value, int aux)
fsmount 432 432 (int fd, unsigned int flags, unsigned int attr_flags)
clone3 435 435 (struct clone_args *uargs, size_t size)
close_range 436 436 (unsigned int fd, unsigned int max_fd, unsigned int flags)
pidfd_open 434 434 (pid_t pid, unsigned int flags)
openat2 437 437 (int dirfd, char *pathname, struct open_how *how, size_t size)
pidfd_getfd 438 438 (int pidfd, int targetfd, unsigned int flags)
rseq 293 398 (void *rseq, uint32_t rseq_len, int flags, uint32_t sig)
membarrier 283 389 (int cmd, unsigned int flags, int cpu_id)
map_shadow_stack 453 ! (unsigned long addr, unsigned long size, unsigned int flags)

View file

@ -65,10 +65,9 @@ int sigreturn_prep_fpu_frame_plain(struct rt_sigframe *sigframe, struct rt_sigfr
}
#define PTRACE_GETVFPREGS 27
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *vfp, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t tmp, *vfp = ext_regs ? ext_regs : &tmp;
int ret = -1;
pr_info("Dumping GP/FPU registers for %d\n", pid);
@ -95,7 +94,7 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
}
}
ret = save(arg, regs, vfp);
ret = save(pid, arg, regs, vfp);
err:
return ret;
}

View file

@ -46,7 +46,7 @@ __NR_sys_timer_gettime 108 sys_timer_gettime (int timer_id, const struct itimer
__NR_sys_timer_getoverrun 109 sys_timer_getoverrun (int timer_id)
__NR_sys_timer_settime 110 sys_timer_settime (kernel_timer_t timer_id, int flags, const struct itimerspec *new_setting, struct itimerspec *old_setting)
__NR_sys_timer_delete 111 sys_timer_delete (kernel_timer_t timer_id)
__NR_clock_gettime 113 sys_clock_gettime (const clockid_t which_clock, const struct timespec *tp)
__NR_clock_gettime 113 sys_clock_gettime (clockid_t which_clock, struct timespec *tp)
__NR_sched_setscheduler 119 sys_sched_setscheduler (int pid, int policy, struct sched_param *p)
__NR_restart_syscall 128 sys_restart_syscall (void)
__NR_kill 129 sys_kill (long pid, int sig)

View file

@ -91,7 +91,7 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
goto err;
}
ret = save(arg, regs, fpregs);
ret = save(pid, arg, regs, fpregs);
err:
return 0;
}

View file

@ -84,7 +84,7 @@ __NR_sys_timer_settime 5217 sys_timer_settime (kernel_timer_t timer_id, int fl
__NR_sys_timer_gettime 5218 sys_timer_gettime (int timer_id, const struct itimerspec *setting)
__NR_sys_timer_getoverrun 5219 sys_timer_getoverrun (int timer_id)
__NR_sys_timer_delete 5220 sys_timer_delete (kernel_timer_t timer_id)
__NR_clock_gettime 5222 sys_clock_gettime (const clockid_t which_clock, const struct timespec *tp)
__NR_clock_gettime 5222 sys_clock_gettime (clockid_t which_clock, struct timespec *tp)
__NR_exit_group 5205 sys_exit_group (int error_code)
__NR_set_thread_area 5242 sys_set_thread_area (unsigned long *addr)
__NR_openat 5247 sys_openat (int dfd, const char *filename, int flags, int mode)
@ -115,6 +115,7 @@ __NR_fsopen 5430 sys_fsopen (char *fsname, unsigned int flags)
__NR_fsconfig 5431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux)
__NR_fsmount 5432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags)
__NR_clone3 5435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_close_range 5436 sys_close_range (unsigned int fd, unsigned int max_fd, unsigned int flags)
__NR_pidfd_open 5434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_openat2 5437 sys_openat2 (int dirfd, char *pathname, struct open_how *how, size_t size)
__NR_pidfd_getfd 5438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)

View file

@ -5,18 +5,31 @@
#include "piegen.h"
#include "log.h"
static const unsigned char __maybe_unused elf_ident_64_le[EI_NIDENT] = {
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, /* clang-format */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};
extern int __handle_elf(void *mem, size_t size);
int handle_binary(void *mem, size_t size)
{
if (memcmp(mem, elf_ident_64_le, sizeof(elf_ident_64_le)) == 0)
return __handle_elf(mem, size);
Elf64_Ehdr *ehdr = (Elf64_Ehdr *)mem;
pr_err("Unsupported Elf format detected\n");
return -EINVAL;
/* check ELF magic */
if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
ehdr->e_ident[EI_MAG3] != ELFMAG3) {
pr_err("Invalid ELF magic\n");
return -EINVAL;
}
/* check ELF class and data encoding */
if (ehdr->e_ident[EI_CLASS] != ELFCLASS64 ||
ehdr->e_ident[EI_DATA] != ELFDATA2LSB) {
pr_err("Unsupported ELF class or data encoding\n");
return -EINVAL;
}
if (ehdr->e_ident[EI_ABIVERSION] != 0) {
pr_warn("Unusual ABI version: %d\n", ehdr->e_ident[EI_ABIVERSION]);
}
return __handle_elf(mem, size);
}

View file

@ -119,10 +119,9 @@ int sigreturn_prep_fpu_frame_plain(struct rt_sigframe *sigframe, struct rt_sigfr
return 0;
}
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *xs, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t xsave = {}, *xs = ext_regs ? ext_regs : &xsave;
int ret = -1;
pr_info("Dumping GP/FPU registers for %d\n", pid);
@ -150,7 +149,7 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
regs->regs[0] = 0;
}
ret = save(arg, regs, xs);
ret = save(pid, arg, regs, xs);
return ret;
}

View file

@ -82,7 +82,7 @@ __NR_sys_timer_settime 241 sys_timer_settime (kernel_timer_t timer_id, int flag
__NR_sys_timer_gettime 242 sys_timer_gettime (int timer_id, const struct itimerspec *setting)
__NR_sys_timer_getoverrun 243 sys_timer_getoverrun (int timer_id)
__NR_sys_timer_delete 244 sys_timer_delete (kernel_timer_t timer_id)
__NR_clock_gettime 246 sys_clock_gettime (const clockid_t which_clock, const struct timespec *tp)
__NR_clock_gettime 246 sys_clock_gettime (clockid_t which_clock, struct timespec *tp)
__NR_exit_group 234 sys_exit_group (int error_code)
__NR_waitid 272 sys_waitid (int which, pid_t pid, struct siginfo *infop, int options, struct rusage *ru)
__NR_set_robust_list 300 sys_set_robust_list (struct robust_list_head *head, size_t len)
@ -114,6 +114,7 @@ __NR_fsopen 430 sys_fsopen (char *fsname, unsigned int flags)
__NR_fsconfig 431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux)
__NR_fsmount 432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags)
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_close_range 436 sys_close_range (unsigned int fd, unsigned int max_fd, unsigned int flags)
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_openat2 437 sys_openat2 (int dirfd, char *pathname, struct open_how *how, size_t size)
__NR_pidfd_getfd 438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)

View file

@ -391,17 +391,16 @@ static int __get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_stru
return 0;
}
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *fpregs, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t tmp, *fpregs = ext_regs ? ext_regs : &tmp;
int ret;
ret = __get_task_regs(pid, regs, fpregs);
if (ret)
return ret;
return save(arg, regs, fpregs);
return save(pid, arg, regs, fpregs);
}
int compel_set_task_ext_regs(pid_t pid, user_fpregs_struct_t *ext_regs)

View file

@ -0,0 +1,35 @@
#ifndef __ASM_PROLOGUE_H__
#define __ASM_PROLOGUE_H__
#ifndef __ASSEMBLY__
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <errno.h>
#define sys_recv(sockfd, ubuf, size, flags) sys_recvfrom(sockfd, ubuf, size, flags, NULL, NULL)
typedef struct prologue_init_args {
struct sockaddr_un ctl_sock_addr;
unsigned int ctl_sock_addr_len;
unsigned int arg_s;
void *arg_p;
void *sigframe;
} prologue_init_args_t;
#endif /* __ASSEMBLY__ */
/*
* Reserve enough space for sigframe.
*
* FIXME It is rather should be taken from sigframe header.
*/
#define PROLOGUE_SGFRAME_SIZE 4096
#define PROLOGUE_INIT_ARGS_SIZE 1024
#endif /* __ASM_PROLOGUE_H__ */

View file

@ -0,0 +1,28 @@
#ifndef COMPEL_ARCH_SYSCALL_TYPES_H__
#define COMPEL_ARCH_SYSCALL_TYPES_H__
#define SA_RESTORER 0x04000000
typedef void rt_signalfn_t(int, siginfo_t *, void *);
typedef rt_signalfn_t *rt_sighandler_t;
typedef void rt_restorefn_t(void);
typedef rt_restorefn_t *rt_sigrestore_t;
#define _KNSIG 64 // number of signals
#define _NSIG_BPW 64 // number of signals per word
#define _KNSIG_WORDS (_KNSIG / _NSIG_BPW)
typedef struct {
unsigned long sig[_KNSIG_WORDS];
} k_rtsigset_t;
typedef struct {
rt_sighandler_t rt_sa_handler;
unsigned long rt_sa_flags;
rt_sigrestore_t rt_sa_restorer;
k_rtsigset_t rt_sa_mask;
} rt_sigaction_t;
#endif /* COMPEL_ARCH_SYSCALL_TYPES_H__ */

View file

@ -0,0 +1,4 @@
#ifndef __COMPEL_ARCH_FEATURES_H
#define __COMPEL_ARCH_FEATURES_H
#endif /* __COMPEL_ARCH_FEATURES_H */

View file

@ -0,0 +1,7 @@
#include "common/asm/linkage.h"
.section .head.text, "ax"
ENTRY(__export_parasite_head_start)
jal parasite_service
ebreak
END(__export_parasite_head_start)

View file

@ -0,0 +1,59 @@
ccflags-y += -iquote $(PLUGIN_ARCH_DIR)/std/syscalls/
asflags-y += -iquote $(PLUGIN_ARCH_DIR)/std/syscalls/
sys-types := $(obj)/include/uapi/std/syscall-types.h
sys-codes := $(obj)/include/uapi/std/syscall-codes.h
sys-proto := $(obj)/include/uapi/std/syscall.h
sys-def := $(PLUGIN_ARCH_DIR)/std/syscalls/syscall.def
sys-asm-common-name := std/syscalls/syscall-common.S
sys-asm-common := $(PLUGIN_ARCH_DIR)/$(sys-asm-common-name)
sys-asm-types := $(obj)/include/uapi/std/asm/syscall-types.h
sys-exec-tbl = $(PLUGIN_ARCH_DIR)/std/sys-exec-tbl.c
sys-gen := $(PLUGIN_ARCH_DIR)/std/syscalls/gen-syscalls.pl
sys-gen-tbl := $(PLUGIN_ARCH_DIR)/std/syscalls/gen-sys-exec-tbl.pl
sys-asm := ./$(PLUGIN_ARCH_DIR)/std/syscalls/syscalls.S
std-lib-y += $(sys-asm:.S=).o
ifeq ($(ARCH),arm)
arch_bits := 32
else
arch_bits := 64
endif
sys-exec-tbl := sys-exec-tbl.c
$(sys-asm) $(sys-types) $(sys-codes) $(sys-proto): $(sys-gen) $(sys-def) $(sys-asm-common) $(sys-asm-types)
$(E) " GEN " $@
$(Q) perl \
$(sys-gen) \
$(sys-def) \
$(sys-codes) \
$(sys-proto) \
$(sys-asm) \
$(sys-asm-common-name) \
$(sys-types) \
$(arch_bits)
$(sys-asm:.S=).o: $(sys-asm)
$(sys-exec-tbl): $(sys-gen-tbl) $(sys-def)
$(E) " GEN " $@
$(Q) perl \
$(sys-gen-tbl) \
$(sys-def) \
$(sys-exec-tbl) \
$(arch_bits)
$(sys-asm-types): $(PLUGIN_ARCH_DIR)/include/asm/syscall-types.h
$(call msg-gen, $@)
$(Q) ln -s ../../../../../../$(PLUGIN_ARCH_DIR)/include/asm/syscall-types.h $(sys-asm-types)
$(Q) ln -s ../../../../../$(PLUGIN_ARCH_DIR)/std/syscalls/syscall-aux.S $(obj)/include/uapi/std/syscall-aux.S
$(Q) ln -s ../../../../../$(PLUGIN_ARCH_DIR)/std/syscalls/syscall-aux.h $(obj)/include/uapi/std/syscall-aux.h
std-headers-deps += $(sys-asm) $(sys-codes) $(sys-proto) $(sys-asm-types) $(sys-codes)
mrproper-y += $(std-headers-deps)
mrproper-y += $(obj)/include/uapi/std/syscall-aux.S
mrproper-y += $(obj)/include/uapi/std/syscall-aux.h

View file

@ -0,0 +1,43 @@
#!/usr/bin/perl
use strict;
use warnings;
my $in = $ARGV[0];
my $tblout = $ARGV[1];
my $bits = $ARGV[2];
my $code = "code$bits";
open TBLOUT, ">", $tblout or die $!;
open IN, "<", $in or die $!;
print TBLOUT "/* Autogenerated, don't edit */\n";
print TBLOUT "static struct syscall_exec_desc sc_exec_table[] = {\n";
for (<IN>) {
if ($_ =~ /\#/) {
next;
}
my $sys_name;
my $sys_num;
if (/(?<name>\S+)\s+(?<alias>\S+)\s+(?<code64>\d+|\!)\s+(?<code32>(?:\d+|\!))\s+\((?<args>.+)\)/) {
$sys_name = $+{alias};
} elsif (/(?<name>\S+)\s+(?<code64>\d+|\!)\s+(?<code32>(?:\d+|\!))\s+\((?<args>.+)\)/) {
$sys_name = $+{name};
} else {
unlink $tblout;
die "Invalid syscall definition file: invalid entry $_\n";
}
$sys_num = $+{$code};
if ($sys_num ne "!") {
print TBLOUT "SYSCALL($sys_name, $sys_num)\n";
}
}
print TBLOUT " { }, /* terminator */";
print TBLOUT "};"

View file

@ -0,0 +1,99 @@
#!/usr/bin/perl
use strict;
use warnings;
my $in = $ARGV[0];
my $codesout = $ARGV[1];
my $codes = $ARGV[1];
$codes =~ s/.*include\/uapi\//compel\/plugins\//g;
my $protosout = $ARGV[2];
my $protos = $ARGV[2];
$protos =~ s/.*include\/uapi\//compel\/plugins\//g;
my $asmout = $ARGV[3];
my $asmcommon = $ARGV[4];
my $prototypes = $ARGV[5];
$prototypes =~ s/.*include\/uapi\//compel\/plugins\//g;
my $bits = $ARGV[6];
my $codesdef = $codes;
$codesdef =~ tr/.\-\//_/;
my $protosdef = $protos;
$protosdef =~ tr/.\-\//_/;
my $code = "code$bits";
my $need_aux = 0;
unlink $codesout;
unlink $protosout;
unlink $asmout;
open CODESOUT, ">", $codesout or die $!;
open PROTOSOUT, ">", $protosout or die $!;
open ASMOUT, ">", $asmout or die $!;
open IN, "<", $in or die $!;
print CODESOUT <<"END";
/* Autogenerated, don't edit */
#ifndef $codesdef
#define $codesdef
END
print PROTOSOUT <<"END";
/* Autogenerated, don't edit */
#ifndef $protosdef
#define $protosdef
#include <$prototypes>
#include <$codes>
END
print ASMOUT <<"END";
/* Autogenerated, don't edit */
#include <$codes>
#include "$asmcommon"
END
for (<IN>) {
if ($_ =~ /\#/) {
next;
}
my $code_macro;
my $sys_macro;
my $sys_name;
if (/(?<name>\S+)\s+(?<alias>\S+)\s+(?<code64>\d+|\!)\s+(?<code32>(?:\d+|\!))\s+\((?<args>.+)\)/) {
$code_macro = "__NR_$+{name}";
$sys_macro = "SYS_$+{name}";
$sys_name = "sys_$+{alias}";
} elsif (/(?<name>\S+)\s+(?<code64>\d+|\!)\s+(?<code32>(?:\d+|\!))\s+\((?<args>.+)\)/) {
$code_macro = "__NR_$+{name}";
$sys_macro = "SYS_$+{name}";
$sys_name = "sys_$+{name}";
} else {
unlink $codesout;
unlink $protosout;
unlink $asmout;
die "Invalid syscall definition file: invalid entry $_\n";
}
if ($+{$code} ne "!") {
print CODESOUT "#ifndef $code_macro\n#define $code_macro $+{$code}\n#endif\n";
print CODESOUT "#ifndef $sys_macro\n#define $sys_macro $code_macro\n#endif\n";
print ASMOUT "syscall $sys_name, $code_macro\n";
} else {
$need_aux = 1;
}
print PROTOSOUT "extern long $sys_name($+{args});\n";
}
if ($need_aux == 1) {
print ASMOUT "#include <compel/plugins/std/syscall-aux.S>\n";
print CODESOUT "#include <compel/plugins/std/syscall-aux.h>\n";
}
print CODESOUT "#endif /* $codesdef */";
print PROTOSOUT "#endif /* $protosdef */";

View file

@ -0,0 +1,37 @@
/**
* This source contains emulation of syscalls
* that are not implemented in the riscv64 Linux kernel
*/
ENTRY(sys_open)
add a3, x0, a2
add a2, x0, a1
add a1, x0, a0
addi a0, x0, -100
j sys_openat
END(sys_open)
ENTRY(sys_mkdir)
add a3,x0, a2
add a2, x0, a1
add a1, x0, a0
addi a0, x0, -100
j sys_mkdirat
END(sys_mkdir)
ENTRY(sys_rmdir)
addi a2, x0, 0x200 // flags = AT_REMOVEDIR
add a1, x0, a0
addi a0, x0, -100
j sys_unlinkat
END(sys_rmdir)
ENTRY(sys_unlink)
addi a2, x0, 0 // flags = 0
add a1, x0, a0
addi a0, x0, -100
j sys_unlinkat
END(sys_unlink)

View file

@ -0,0 +1,3 @@
#ifndef __NR_openat
#define __NR_openat 56
#endif

View file

@ -0,0 +1,17 @@
#include "common/asm/linkage.h"
syscall_common:
ecall
ret
.macro syscall name, nr
ENTRY(\name)
li a7, \nr
j syscall_common
END(\name)
.endm
ENTRY(__cr_restore_rt)
li a7, __NR_rt_sigreturn
ecall
END(__cr_restore_rt)

View file

@ -0,0 +1,125 @@
#
# System calls table, please make sure the table consists of only the syscalls
# really used somewhere in the project.
#
# The template is (name and arguments are optional if you need only __NR_x
# defined, but no real entry point in syscalls lib).
#
# name/alias code64 code32 arguments
# -----------------------------------------------------------------------
#
read 63 3 (int fd, void *buf, unsigned long count)
write 64 4 (int fd, const void *buf, unsigned long count)
open ! 5 (const char *filename, unsigned long flags, unsigned long mode)
close 57 6 (int fd)
lseek 62 19 (int fd, unsigned long offset, unsigned long origin)
mmap 222 ! (void *addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long offset)
mprotect 226 125 (const void *addr, unsigned long len, unsigned long prot)
munmap 215 91 (void *addr, unsigned long len)
brk 214 45 (void *addr)
rt_sigaction sigaction 134 174 (int signum, const rt_sigaction_t *act, rt_sigaction_t *oldact, size_t sigsetsize)
rt_sigprocmask sigprocmask 135 175 (int how, k_rtsigset_t *set, k_rtsigset_t *old, size_t sigsetsize)
rt_sigreturn 139 173 (void)
ioctl 29 54 (unsigned int fd, unsigned int cmd, unsigned long arg)
pread64 67 180 (unsigned int fd, char *buf, size_t count, loff_t pos)
ptrace 117 26 (long request, pid_t pid, void *addr, void *data)
mremap 216 163 (unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flag, unsigned long new_addr)
mincore 232 219 (void *addr, unsigned long size, unsigned char *vec)
madvise 233 220 (unsigned long start, size_t len, int behavior)
shmat 196 305 (int shmid, void *shmaddr, int shmflag)
pause 1061 29 (void)
nanosleep 101 162 (struct timespec *req, struct timespec *rem)
getitimer 102 105 (int which, const struct itimerval *val)
setitimer 103 104 (int which, const struct itimerval *val, struct itimerval *old)
getpid 172 20 (void)
socket 198 281 (int domain, int type, int protocol)
connect 203 283 (int sockfd, struct sockaddr *addr, int addrlen)
sendto 206 290 (int sockfd, void *buff, size_t len, unsigned int flags, struct sockaddr *addr, int addr_len)
recvfrom 207 292 (int sockfd, void *ubuf, size_t size, unsigned int flags, struct sockaddr *addr, int *addr_len)
sendmsg 211 296 (int sockfd, const struct msghdr *msg, int flags)
recvmsg 212 297 (int sockfd, struct msghdr *msg, int flags)
shutdown 210 293 (int sockfd, int how)
bind 235 282 (int sockfd, const struct sockaddr *addr, int addrlen)
setsockopt 208 294 (int sockfd, int level, int optname, const void *optval, socklen_t optlen)
getsockopt 209 295 (int sockfd, int level, int optname, const void *optval, socklen_t *optlen)
clone 220 120 (unsigned long flags, void *child_stack, void *parent_tid, unsigned long newtls, void *child_tid)
exit 93 1 (unsigned long error_code)
wait4 260 114 (int pid, int *status, int options, struct rusage *ru)
waitid 95 280 (int which, pid_t pid, struct siginfo *infop, int options, struct rusage *ru)
kill 129 37 (long pid, int sig)
fcntl 25 55 (int fd, int type, long arg)
flock 32 143 (int fd, unsigned long cmd)
mkdir ! 39 (const char *name, int mode)
rmdir ! 40 (const char *name)
unlink ! 10 (char *pathname)
readlinkat 78 332 (int fd, const char *path, char *buf, int bufsize)
umask 166 60 (int mask)
getgroups 158 205 (int gsize, unsigned int *groups)
setgroups 159 206 (int gsize, unsigned int *groups)
setresuid 147 164 (int uid, int euid, int suid)
getresuid 148 165 (int *uid, int *euid, int *suid)
setresgid 149 170 (int gid, int egid, int sgid)
getresgid 150 171 (int *gid, int *egid, int *sgid)
getpgid 155 132 (pid_t pid)
setfsuid 151 138 (int fsuid)
setfsgid 152 139 (int fsgid)
getsid 156 147 (void)
capget 90 184 (struct cap_header *h, struct cap_data *d)
capset 91 185 (struct cap_header *h, struct cap_data *d)
rt_sigqueueinfo 138 178 (pid_t pid, int sig, siginfo_t *info)
setpriority 140 97 (int which, int who, int nice)
sched_setscheduler 119 156 (int pid, int policy, struct sched_param *p)
sigaltstack 132 186 (const void *uss, void *uoss)
personality 92 136 (unsigned int personality)
prctl 167 172 (int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5)
arch_prctl ! 17 (int option, unsigned long addr)
setrlimit 164 75 (int resource, struct krlimit *rlim)
mount 40 21 (char *dev_nmae, char *dir_name, char *type, unsigned long flags, void *data)
umount2 39 52 (char *name, int flags)
gettid 178 224 (void)
futex 98 240 (uint32_t *uaddr, int op, uint32_t val, struct timespec *utime, uint32_t *uaddr2, uint32_t val3)
set_tid_address 96 256 (int *tid_addr)
restart_syscall 128 0 (void)
timer_create 107 257 (clockid_t which_clock, struct sigevent *timer_event_spec, kernel_timer_t *created_timer_id)
timer_settime 110 258 (kernel_timer_t timer_id, int flags, const struct itimerspec *new_setting, struct itimerspec *old_setting)
timer_gettime 108 259 (int timer_id, const struct itimerspec *setting)
timer_getoverrun 109 260 (int timer_id)
timer_delete 111 261 (kernel_timer_t timer_id)
clock_gettime 113 263 (clockid_t which_clock, struct timespec *tp)
exit_group 94 248 (int error_code)
set_robust_list 99 338 (struct robust_list_head *head, size_t len)
get_robust_list 100 339 (int pid, struct robust_list_head **head_ptr, size_t *len_ptr)
signalfd4 74 355 (int fd, k_rtsigset_t *mask, size_t sizemask, int flags)
rt_tgsigqueueinfo 240 363 (pid_t tgid, pid_t pid, int sig, siginfo_t *info)
vmsplice 75 343 (int fd, const struct iovec *iov, unsigned long nr_segs, unsigned int flags)
timerfd_settime 86 353 (int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr)
fanotify_init 262 367 (unsigned int flags, unsigned int event_f_flags)
fanotify_mark 263 368 (int fanotify_fd, unsigned int flags, uint64_t mask, int dfd, const char *pathname)
open_by_handle_at 265 371 (int mountdirfd, struct file_handle *handle, int flags)
setns 268 375 (int fd, int nstype)
kcmp 272 378 (pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2)
openat 56 322 (int dirfd, const char *pathname, int flags, mode_t mode)
mkdirat 34 323 (int dirfd, const char *pathname, mode_t mode)
unlinkat 35 328 (int dirfd, const char *pathname, int flags)
memfd_create 279 385 (const char *name, unsigned int flags)
io_setup 0 243 (unsigned nr_events, aio_context_t *ctx)
io_submit 2 246 (aio_context_t ctx_id, long nr, struct iocb **iocbpp)
io_getevents 4 245 (aio_context_t ctx, long min_nr, long nr, struct io_event *evs, struct timespec *tmo)
seccomp 277 383 (unsigned int op, unsigned int flags, const char *uargs)
gettimeofday 169 78 (struct timeval *tv, struct timezone *tz)
preadv_raw 69 361 (int fd, struct iovec *iov, unsigned long nr, unsigned long pos_l, unsigned long pos_h)
userfaultfd 282 388 (int flags)
fallocate 47 352 (int fd, int mode, loff_t offset, loff_t len)
cacheflush ! 983042 (void *start, void *end, int flags)
ppoll 73 336 (struct pollfd *fds, unsigned int nfds, const struct timespec *tmo, const sigset_t *sigmask, size_t sigsetsize)
fsopen 430 430 (char *fsname, unsigned int flags)
fsconfig 431 431 (int fd, unsigned int cmd, const char *key, const char *value, int aux)
fsmount 432 432 (int fd, unsigned int flags, unsigned int attr_flags)
clone3 435 435 (struct clone_args *uargs, size_t size)
pidfd_open 434 434 (pid_t pid, unsigned int flags)
pidfd_getfd 438 438 (int pidfd, int targetfd, unsigned int flags)
rseq 293 293 (void *rseq, uint32_t rseq_len, int flags, uint32_t sig)
move_mount 429 429 (int from_dfd, const char *from_pathname, int to_dfd, const char *to_pathname, int flags)
open_tree 428 428 (int dirfd, const char *pathname, unsigned int flags)
openat2 437 437 (int dirfd, char *pathname, struct open_how *how, size_t size)
membarrier 283 283 (int cmd, unsigned int flags, int cpu_id)

View file

@ -0,0 +1,32 @@
OUTPUT_ARCH(riscv)
EXTERN(__export_parasite_head_start)
SECTIONS
{
.crblob 0x0 : {
*(.head.text)
ASSERT(DEFINED(__export_parasite_head_start),
"Symbol __export_parasite_head_start is missing");
*(.text*)
. = ALIGN(32);
*(.data*)
. = ALIGN(32);
*(.rodata*)
. = ALIGN(32);
*(.bss*)
. = ALIGN(32);
*(.got*)
. = ALIGN(32);
*(.toc*)
. = ALIGN(32);
} =0x00000000,
/DISCARD/ : {
*(.debug*)
*(.comment*)
*(.note*)
*(.group*)
*(.eh_frame*)
*(*)
}
}

View file

@ -0,0 +1,78 @@
#include <string.h>
#include <stdbool.h>
#include "compel-cpu.h"
#include "common/bitops.h"
#include "log.h"
#undef LOG_PREFIX
#define LOG_PREFIX "cpu: "
static compel_cpuinfo_t rt_info;
static void fetch_rt_cpuinfo(void)
{
static bool rt_info_done = false;
if (!rt_info_done) {
compel_cpuid(&rt_info);
rt_info_done = true;
}
}
void compel_set_cpu_cap(compel_cpuinfo_t *info, unsigned int feature)
{
}
void compel_clear_cpu_cap(compel_cpuinfo_t *info, unsigned int feature)
{
}
int compel_test_cpu_cap(compel_cpuinfo_t *info, unsigned int feature)
{
return 0;
}
int compel_test_fpu_cap(compel_cpuinfo_t *info, unsigned int feature)
{
return 0;
}
int compel_cpuid(compel_cpuinfo_t *info)
{
return 0;
}
bool compel_cpu_has_feature(unsigned int feature)
{
fetch_rt_cpuinfo();
return compel_test_cpu_cap(&rt_info, feature);
}
bool compel_fpu_has_feature(unsigned int feature)
{
fetch_rt_cpuinfo();
return compel_test_fpu_cap(&rt_info, feature);
}
uint32_t compel_fpu_feature_size(unsigned int feature)
{
fetch_rt_cpuinfo();
return 0;
}
uint32_t compel_fpu_feature_offset(unsigned int feature)
{
fetch_rt_cpuinfo();
return 0;
}
void compel_cpu_clear_feature(unsigned int feature)
{
fetch_rt_cpuinfo();
return compel_clear_cpu_cap(&rt_info, feature);
}
void compel_cpu_copy_cpuinfo(compel_cpuinfo_t *c)
{
fetch_rt_cpuinfo();
memcpy(c, &rt_info, sizeof(rt_info));
}

View file

@ -0,0 +1 @@
handle-elf.c

View file

@ -0,0 +1,32 @@
#include <string.h>
#include <errno.h>
#include "handle-elf.h"
#include "piegen.h"
#include "log.h"
static const unsigned char __maybe_unused elf_ident_64_le[EI_NIDENT] = {
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, /* clang-format */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};
static const unsigned char __maybe_unused elf_ident_64_be[EI_NIDENT] = {
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x02, 0x01, 0x00, /* clang-format */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};
int handle_binary(void *mem, size_t size)
{
const unsigned char *elf_ident =
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
elf_ident_64_le;
#else
elf_ident_64_be;
#endif
if (memcmp(mem, elf_ident, sizeof(elf_ident_64_le)) == 0)
return handle_elf_riscv64(mem, size);
pr_err("Unsupported Elf format detected\n");
return -EINVAL;
}

View file

@ -0,0 +1,12 @@
#ifndef COMPEL_HANDLE_ELF_H__
#define COMPEL_HANDLE_ELF_H__
#include "elf64-types.h"
#define __handle_elf handle_elf_riscv64
#define ELF_RISCV
#define arch_is_machine_supported(e_machine) (e_machine == EM_RISCV)
extern int handle_elf_riscv64(void *mem, size_t size);
#endif /* COMPEL_HANDLE_ELF_H__ */

View file

@ -0,0 +1,8 @@
#ifndef __COMPEL_SYSCALL_H__
#define __COMPEL_SYSCALL_H__
#define __NR(syscall, compat) \
({ \
(void)compat; \
__NR_##syscall; \
})
#endif

View file

@ -0,0 +1,15 @@
#ifndef __COMPEL_BREAKPOINTS_H__
#define __COMPEL_BREAKPOINTS_H__
#define ARCH_SI_TRAP TRAP_BRKPT
static inline int ptrace_set_breakpoint(pid_t pid, void *addr)
{
return 0;
}
static inline int ptrace_flush_breakpoints(pid_t pid)
{
return 0;
}
#endif

View file

@ -0,0 +1,7 @@
#ifndef UAPI_COMPEL_ASM_CPU_H__
#define UAPI_COMPEL_ASM_CPU_H__
typedef struct {
} compel_cpuinfo_t;
#endif /* UAPI_COMPEL_ASM_CPU_H__ */

View file

@ -0,0 +1,4 @@
#ifndef __CR_ASM_FPU_H__
#define __CR_ASM_FPU_H__
#endif /* __CR_ASM_FPU_H__ */

View file

@ -0,0 +1,52 @@
#ifndef UAPI_COMPEL_ASM_TYPES_H__
#define UAPI_COMPEL_ASM_TYPES_H__
#include <stdint.h>
#include <signal.h>
#include <sys/mman.h>
#include <asm/ptrace.h>
#define SIGMAX 64
#define SIGMAX_OLD 31
/*
* Copied from the Linux kernel header arch/riscv/include/uapi/asm/ptrace.h
*
* A thread RISC-V CPU context
*/
typedef struct user_regs_struct user_regs_struct_t;
typedef struct __riscv_d_ext_state user_fpregs_struct_t;
#define __compel_arch_fetch_thread_area(tid, th) 0
#define compel_arch_fetch_thread_area(tctl) 0
#define compel_arch_get_tls_task(ctl, tls)
#define compel_arch_get_tls_thread(tctl, tls)
#define REG_RES(registers) ((uint64_t)(registers).a0)
#define REG_IP(registers) ((uint64_t)(registers).pc)
#define SET_REG_IP(registers, val) ((registers).pc = (val))
/*
* REG_SP is also defined in riscv64-linux-gnu/include/sys/ucontext.h
* with a different meaning, and it's not used in CRIU. So we have to
* undefine it here.
*/
#ifdef REG_SP
#undef REG_SP
#endif
#define REG_SP(registers) ((uint64_t)((registers).sp))
#define REG_SYSCALL_NR(registers) ((uint64_t)(registers).a7)
#define user_regs_native(pregs) true
#define ARCH_SI_TRAP TRAP_BRKPT
#define __NR(syscall, compat) \
({ \
(void)compat; \
__NR_##syscall; \
})
#endif /* UAPI_COMPEL_ASM_TYPES_H__ */

View file

@ -0,0 +1,26 @@
#ifndef COMPEL_RELOCATIONS_H__
#define COMPEL_RELOCATIONS_H__
#include <stdint.h>
static inline uint32_t riscv_b_imm(uint32_t val)
{
return (val & 0x00001000) << 19 | (val & 0x000007e0) << 20 | (val & 0x0000001e) << 7 | (val & 0x00000800) >> 4;
}
static inline uint32_t riscv_i_imm(uint32_t val)
{
return val << 20;
}
static inline uint32_t riscv_u_imm(uint32_t val)
{
return val & 0xfffff000;
}
static inline uint32_t riscv_j_imm(uint32_t val)
{
return (val & 0x00100000) << 11 | (val & 0x000007fe) << 20 | (val & 0x00000800) << 9 | (val & 0x000ff000);
}
#endif /* COMPEL_RELOCATIONS_H__ */

View file

@ -0,0 +1,4 @@
#ifndef UAPI_COMPEL_ASM_PROCESSOR_FLAGS_H__
#define UAPI_COMPEL_ASM_PROCESSOR_FLAGS_H__
#endif /* UAPI_COMPEL_ASM_PROCESSOR_FLAGS_H__ */

View file

@ -0,0 +1,68 @@
#ifndef UAPI_COMPEL_ASM_SIGFRAME_H__
#define UAPI_COMPEL_ASM_SIGFRAME_H__
#include <sys/ucontext.h>
#include <stdint.h>
#include <signal.h>
/* Copied from the kernel header arch/riscv/include/uapi/asm/sigcontext.h */
/*
* Signal context structure
*
* This contains the context saved before a signal handler is invoked;
* it is restored by sys_sigreturn / sys_rt_sigreturn.
*/
// struct sigcontext {
// struct user_regs_struct sc_regs;
// union __riscv_fp_state sc_fpregs;
// /*
// * 4K + 128 reserved for vector state and future expansion.
// * This space is enough to store the vector context whose VLENB
// * is less or equal to 128.
// * (The size of the vector context is 4144 byte as VLENB is 128)
// */
// __u8 __reserved[4224] __attribute__((__aligned__(16)));
// };
#define rt_sigcontext sigcontext
#include <compel/sigframe-common.h>
/* Copied from the kernel source arch/riscv/kernel/signal.c */
struct rt_sigframe {
siginfo_t info;
ucontext_t uc; //ucontext_t structure holds the user context, e.g., the signal mask, GP regs
};
/*
generates inline assembly code for triggering the rt_sigreturn system call.
used to return from a signal handler back to the normal execution flow of the process.
*/
/* clang-format off */
#define ARCH_RT_SIGRETURN(new_sp, rt_sigframe) \
asm volatile( \
"mv sp, %0\n" \
"li a7, "__stringify(__NR_rt_sigreturn)" \n" \
"ecall\n" \
: \
: "r"(new_sp) \
: "a7", "memory")
/* clang-format on */
#define RT_SIGFRAME_UC(rt_sigframe) (&rt_sigframe->uc)
#define RT_SIGFRAME_REGIP(rt_sigframe) ((long unsigned int)(rt_sigframe)->uc.uc_mcontext.__gregs[REG_PC])
#define RT_SIGFRAME_HAS_FPU(rt_sigframe) 1
#define RT_SIGFRAME_OFFSET(rt_sigframe) 0
// #define RT_SIGFRAME_SIGCONTEXT(rt_sigframe) ((struct cr_sigcontext *)&(rt_sigframe)->uc.uc_mcontext)
// #define RT_SIGFRAME_AUX_CONTEXT(rt_sigframe) ((struct sigcontext *)&(RT_SIGFRAME_SIGCONTEXT(rt_sigframe)->__reserved))
// #define RT_SIGFRAME_FPU(rt_sigframe) (&RT_SIGFRAME_AUX_CONTEXT(rt_sigframe)->fpsimd)
#define rt_sigframe_erase_sigset(sigframe) \
memset(&sigframe->uc.uc_sigmask, 0, sizeof(k_rtsigset_t)) // erase the signal mask
#define rt_sigframe_copy_sigset(sigframe, from) \
memcpy(&sigframe->uc.uc_sigmask, from, sizeof(k_rtsigset_t)) // copy the signal mask
#endif /* UAPI_COMPEL_ASM_SIGFRAME_H__ */

View file

@ -0,0 +1,224 @@
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <linux/elf.h>
#include <compel/plugins/std/syscall-codes.h>
#include "common/page.h"
#include "uapi/compel/asm/infect-types.h"
#include "log.h"
#include "errno.h"
#include "infect.h"
#include "infect-priv.h"
unsigned __page_size = 0;
unsigned __page_shift = 0;
/*
* Injected syscall instruction
*/
const char code_syscall[] = {
0x73, 0x00, 0x00, 0x00, /* ecall */
0x73, 0x00, 0x10, 0x00 /* ebreak */
};
static const int code_syscall_aligned = round_up(sizeof(code_syscall), sizeof(long));
static inline void __always_unused __check_code_syscall(void)
{
BUILD_BUG_ON(code_syscall_aligned != BUILTIN_SYSCALL_SIZE);
BUILD_BUG_ON(!is_log2(sizeof(code_syscall)));
}
int sigreturn_prep_regs_plain(struct rt_sigframe *sigframe, user_regs_struct_t *regs, user_fpregs_struct_t *fpregs)
{
sigframe->uc.uc_mcontext.__gregs[0] = regs->pc;
sigframe->uc.uc_mcontext.__gregs[1] = regs->ra;
sigframe->uc.uc_mcontext.__gregs[2] = regs->sp;
sigframe->uc.uc_mcontext.__gregs[3] = regs->gp;
sigframe->uc.uc_mcontext.__gregs[4] = regs->tp;
sigframe->uc.uc_mcontext.__gregs[5] = regs->t0;
sigframe->uc.uc_mcontext.__gregs[6] = regs->t1;
sigframe->uc.uc_mcontext.__gregs[7] = regs->t2;
sigframe->uc.uc_mcontext.__gregs[8] = regs->s0;
sigframe->uc.uc_mcontext.__gregs[9] = regs->s1;
sigframe->uc.uc_mcontext.__gregs[10] = regs->a0;
sigframe->uc.uc_mcontext.__gregs[11] = regs->a1;
sigframe->uc.uc_mcontext.__gregs[12] = regs->a2;
sigframe->uc.uc_mcontext.__gregs[13] = regs->a3;
sigframe->uc.uc_mcontext.__gregs[14] = regs->a4;
sigframe->uc.uc_mcontext.__gregs[15] = regs->a5;
sigframe->uc.uc_mcontext.__gregs[16] = regs->a6;
sigframe->uc.uc_mcontext.__gregs[17] = regs->a7;
sigframe->uc.uc_mcontext.__gregs[18] = regs->s2;
sigframe->uc.uc_mcontext.__gregs[19] = regs->s3;
sigframe->uc.uc_mcontext.__gregs[20] = regs->s4;
sigframe->uc.uc_mcontext.__gregs[21] = regs->s5;
sigframe->uc.uc_mcontext.__gregs[22] = regs->s6;
sigframe->uc.uc_mcontext.__gregs[23] = regs->s7;
sigframe->uc.uc_mcontext.__gregs[24] = regs->s8;
sigframe->uc.uc_mcontext.__gregs[25] = regs->s9;
sigframe->uc.uc_mcontext.__gregs[26] = regs->s10;
sigframe->uc.uc_mcontext.__gregs[27] = regs->s11;
sigframe->uc.uc_mcontext.__gregs[28] = regs->t3;
sigframe->uc.uc_mcontext.__gregs[29] = regs->t4;
sigframe->uc.uc_mcontext.__gregs[30] = regs->t5;
sigframe->uc.uc_mcontext.__gregs[31] = regs->t6;
memcpy(sigframe->uc.uc_mcontext.__fpregs.__d.__f, fpregs->f, sizeof(fpregs->f));
sigframe->uc.uc_mcontext.__fpregs.__d.__fcsr = fpregs->fcsr;
return 0;
}
int sigreturn_prep_fpu_frame_plain(struct rt_sigframe *sigframe, struct rt_sigframe *rsigframe)
{
return 0;
}
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t tmp, *fpsimd = ext_regs ? ext_regs : &tmp;
struct iovec iov;
int ret = -1;
pr_info("Dumping FPU registers for %d\n", pid);
iov.iov_base = fpsimd;
iov.iov_len = sizeof(*fpsimd);
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_PRFPREG, &iov))) {
pr_perror("Failed to obtain FPU registers for %d", pid);
return -1;
}
ret = save(pid, arg, regs, fpsimd);
return ret;
}
int compel_set_task_ext_regs(pid_t pid, user_fpregs_struct_t *ext_regs)
{
struct iovec iov;
pr_info("Restoring GP/FPU registers for %d\n", pid);
iov.iov_base = ext_regs;
iov.iov_len = sizeof(*ext_regs);
if (ptrace(PTRACE_SETREGSET, pid, NT_PRFPREG, &iov)) {
pr_perror("Failed to set FPU registers for %d", pid);
return -1;
}
return 0;
}
int compel_syscall(struct parasite_ctl *ctl, int nr, long *ret, unsigned long arg1, unsigned long arg2,
unsigned long arg3, unsigned long arg4, unsigned long arg5, unsigned long arg6)
{
user_regs_struct_t regs = ctl->orig.regs;
int err;
regs.a7 = (unsigned long)nr;
regs.a0 = arg1;
regs.a1 = arg2;
regs.a2 = arg3;
regs.a3 = arg4;
regs.a4 = arg5;
regs.a5 = arg6;
regs.a6 = 0;
err = compel_execute_syscall(ctl, &regs, code_syscall);
*ret = regs.a0;
return err;
}
/*
* Calling the mmap system call in the context of the target (victim) process using the compel_syscall function.
* Used during the infection process to allocate memory for the parasite code.
*/
void *remote_mmap(struct parasite_ctl *ctl, void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
long map;
int err;
err = compel_syscall(ctl, __NR_mmap, &map, (unsigned long)addr, length, prot, flags, fd, offset);
if (err < 0 || (long)map < 0)
map = 0;
return (void *)map;
}
void parasite_setup_regs(unsigned long new_ip, void *stack, user_regs_struct_t *regs)
{
regs->pc = new_ip;
if (stack)
regs->sp = (unsigned long)stack;
}
bool arch_can_dump_task(struct parasite_ctl *ctl)
{
/*
* TODO: Add proper check here.
*/
return true;
}
/*
* Fetch the signal alternate stack (sigaltstack),
* sas is a separate memory area for the signal handler to run on,
* avoiding potential issues with the main process stack
*/
int arch_fetch_sas(struct parasite_ctl *ctl, struct rt_sigframe *s)
{
long ret;
int err;
err = compel_syscall(ctl, __NR_sigaltstack, &ret, 0, (unsigned long)&s->uc.uc_stack, 0, 0, 0, 0);
return err ? err : ret;
}
/*
* Task size is the maximum virtual address space size that a process can occupy in the memory
* Refer to linux kernel arch/riscv/include/asm/pgtable.h,
* task size is:
* - 0x9fc00000 (~2.5GB) for RV32.
* - 0x4000000000 ( 256GB) for RV64 using SV39 mmu
* - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
* - 0x100000000000000 ( 64PB) for RV64 using SV57 mmu
*/
#define TASK_SIZE_MIN (1UL << 38)
#define TASK_SIZE_MAX (1UL << 56)
unsigned long compel_task_size(void)
{
unsigned long task_size;
for (task_size = TASK_SIZE_MIN; task_size < TASK_SIZE_MAX; task_size <<= 1)
if (munmap((void *)task_size, page_size()))
break;
return task_size;
}
/*
* Get task registers (overwrites weak function)
*/
int ptrace_get_regs(int pid, user_regs_struct_t *regs)
{
struct iovec iov;
iov.iov_base = regs;
iov.iov_len = sizeof(user_regs_struct_t);
return ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
}
/*
* Set task registers (overwrites weak function)
*/
int ptrace_set_regs(int pid, user_regs_struct_t *regs)
{
struct iovec iov;
iov.iov_base = regs;
iov.iov_len = sizeof(user_regs_struct_t);
return ptrace(PTRACE_SETREGSET, pid, NT_PRSTATUS, &iov);
}

View file

@ -82,7 +82,7 @@ __NR_sys_timer_settime 255 sys_timer_settime (kernel_timer_t timer_id, int flag
__NR_sys_timer_gettime 256 sys_timer_gettime (int timer_id, const struct itimerspec *setting)
__NR_sys_timer_getoverrun 257 sys_timer_getoverrun (int timer_id)
__NR_sys_timer_delete 258 sys_timer_delete (kernel_timer_t timer_id)
__NR_clock_gettime 260 sys_clock_gettime (const clockid_t which_clock, const struct timespec *tp)
__NR_clock_gettime 260 sys_clock_gettime (clockid_t which_clock, struct timespec *tp)
__NR_exit_group 248 sys_exit_group (int error_code)
__NR_waitid 281 sys_waitid (int which, pid_t pid, struct siginfo *infop, int options, struct rusage *ru)
__NR_set_robust_list 304 sys_set_robust_list (struct robust_list_head *head, size_t len)
@ -114,6 +114,7 @@ __NR_fsopen 430 sys_fsopen (char *fsname, unsigned int flags)
__NR_fsconfig 431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux)
__NR_fsmount 432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags)
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_close_range 436 sys_close_range (unsigned int fd, unsigned int max_fd, unsigned int flags)
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_openat2 437 sys_openat2 (int dirfd, char *pathname, struct open_how *how, size_t size)
__NR_pidfd_getfd 438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)

View file

@ -293,10 +293,9 @@ static int s390_disable_ri_bit(pid_t pid, user_regs_struct_t *regs)
/*
* Prepare task registers for restart
*/
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *fpregs, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
{
user_fpregs_struct_t tmp, *fpregs = ext_regs ? ext_regs : &tmp;
struct iovec iov;
int rewind;
@ -349,7 +348,7 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
}
}
/* Call save_task_regs() */
return save(arg, regs, fpregs);
return save(pid, arg, regs, fpregs);
}
int compel_set_task_ext_regs(pid_t pid, user_fpregs_struct_t *ext_regs)

View file

@ -34,7 +34,21 @@ END(__export_parasite_head_start_compat)
.code64
#endif
/*
* When parasite_service() runs in the daemon mode it will return the stack
* pointer for the sigreturn frame in %rax and we call sigreturn directly
* from here.
* Since a valid stack pointer is positive, it is safe to presume that
* return value <= 0 means that parasite_service() called parasite_trap_cmd()
* in non-daemon mode, and the parasite should stop at int3.
*/
ENTRY(__export_parasite_head_start)
call parasite_service
cmp $0, %rax
jle 1f
movq %rax, %rsp
movq $15, %rax
syscall
1:
int $0x03
END(__export_parasite_head_start)

View file

@ -102,6 +102,7 @@ __NR_fsopen 430 sys_fsopen (char *fsname, unsigned int flags)
__NR_fsconfig 431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux)
__NR_fsmount 432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags)
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_close_range 436 sys_close_range (unsigned int fd, unsigned int max_fd, unsigned int flags)
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_openat2 437 sys_openat2 (int dirfd, char *pathname, struct open_how *how, size_t size)
__NR_pidfd_getfd 438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)

View file

@ -85,7 +85,7 @@ __NR_sys_timer_settime 223 sys_timer_settime (kernel_timer_t timer_id, int fla
__NR_sys_timer_gettime 224 sys_timer_gettime (int timer_id, const struct itimerspec *setting)
__NR_sys_timer_getoverrun 225 sys_timer_getoverrun (int timer_id)
__NR_sys_timer_delete 226 sys_timer_delete (kernel_timer_t timer_id)
__NR_clock_gettime 228 sys_clock_gettime (const clockid_t which_clock, const struct timespec *tp)
__NR_clock_gettime 228 sys_clock_gettime (clockid_t which_clock, struct timespec *tp)
__NR_exit_group 231 sys_exit_group (int error_code)
__NR_openat 257 sys_openat (int dfd, const char *filename, int flags, int mode)
__NR_waitid 247 sys_waitid (int which, pid_t pid, struct siginfo *infop, int options, struct rusage *ru)
@ -113,8 +113,10 @@ __NR_fsopen 430 sys_fsopen (char *fsname, unsigned int flags)
__NR_fsconfig 431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux)
__NR_fsmount 432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags)
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_close_range 436 sys_close_range (unsigned int fd, unsigned int max_fd, unsigned int flags)
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_openat2 437 sys_openat2 (int dirfd, char *pathname, struct open_how *how, size_t size)
__NR_pidfd_getfd 438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)
__NR_rseq 334 sys_rseq (void *rseq, uint32_t rseq_len, int flags, uint32_t sig)
__NR_membarrier 324 sys_membarrier (int cmd, unsigned int flags, int cpu_id)
__NR_map_shadow_stack 453 sys_map_shadow_stack (unsigned long addr, unsigned long size, unsigned int flags)

View file

@ -244,6 +244,7 @@ enum cpuid_leafs {
#define X86_FEATURE_PKU (11 * 32 + 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (11 * 32 + 4) /* OS Protection Keys Enable */
#define X86_FEATURE_AVX512_VBMI2 (11 * 32 + 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
#define X86_FEATURE_SHSTK (11 * 32 + 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (11 * 32 + 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (11 * 32 + 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (11 * 32 + 10) /* Carry-Less Multiplication Double Quadword */

View file

@ -245,6 +245,14 @@ struct pkru_state {
uint32_t pad;
} __packed;
/*
* State component 11 is Control-flow Enforcement user states
*/
struct cet_user_state {
uint64_t cet; /* user control-flow settings */
uint64_t ssp; /* user shadow stack pointer */
};
/*
* This is our most modern FPU state format, as saved by the XSAVE
* and restored by the XRSTOR instructions.
@ -260,7 +268,7 @@ struct pkru_state {
* Of course it was not ;-) Now using four pages...
*
*/
#define EXTENDED_STATE_AREA_SIZE (XSAVE_SIZE - sizeof(struct i387_fxsave_struct) - sizeof(struct xsave_hdr_struct))
#define EXTENDED_STATE_AREA_SIZE (XSAVE_SIZE - sizeof(struct i387_fxsave_struct) - sizeof(struct xsave_hdr_struct) - sizeof(struct cet_user_state))
/*
* cpu requires it to be 64 byte aligned
@ -276,6 +284,7 @@ struct xsave_struct {
struct ymmh_struct ymmh;
uint8_t extended_state_area[EXTENDED_STATE_AREA_SIZE];
};
struct cet_user_state cet;
} __aligned(FP_MIN_ALIGN_BYTES) __packed;
struct xsave_struct_ia32 {

View file

@ -143,4 +143,11 @@ typedef struct xsave_struct user_fpregs_struct_t;
*/
#define __NR32_mmap __NR32_mmap2
extern bool __compel_shstk_enabled(user_fpregs_struct_t *ext_regs);
#define compel_shstk_enabled __compel_shstk_enabled
extern int __parasite_setup_shstk(struct parasite_ctl *ctl,
user_fpregs_struct_t *ext_regs);
#define parasite_setup_shstk __parasite_setup_shstk
#endif /* UAPI_COMPEL_ASM_TYPES_H__ */

View file

@ -177,6 +177,24 @@ static inline void rt_sigframe_erase_sigset(struct rt_sigframe *sigframe)
#define USER32_CS 0x23
/* clang-format off */
/*
* rst_sigreturn in resorer is noninline call which adds an entry to the
* shadow stack above the sigframe token;
* if shadow stack is enabled, increment the shadow stack pointer to remove
* that entry
*/
#define ARCH_SHSTK_POP() \
asm volatile( \
"xor %%rax, %%rax\n" \
"rdsspq %%rax\n" \
"cmpq $0, %%rax\n" \
"jz 1f\n" \
"movq $1, %%rax\n" \
"incsspq %%rax\n" \
"1:\n" \
: : \
: "rax")
#define ARCH_RT_SIGRETURN_NATIVE(new_sp) \
asm volatile( \
"movq %0, %%rax \n" \
@ -203,10 +221,19 @@ static inline void rt_sigframe_erase_sigset(struct rt_sigframe *sigframe)
: "rdi"(new_sp) \
: "eax", "r8", "r9", "r10", "r11", "memory")
#define ARCH_RT_SIGRETURN(new_sp, rt_sigframe) \
#define ARCH_RT_SIGRETURN_RST(new_sp, rt_sigframe) \
do { \
if ((rt_sigframe)->is_native) { \
ARCH_SHSTK_POP(); \
ARCH_RT_SIGRETURN_NATIVE(new_sp); \
} else \
ARCH_RT_SIGRETURN_COMPAT(new_sp); \
} while (0)
#define ARCH_RT_SIGRETURN_DUMP(new_sp, rt_sigframe) \
do { \
if ((rt_sigframe)->is_native) \
ARCH_RT_SIGRETURN_NATIVE(new_sp); \
return new_sp; \
else \
ARCH_RT_SIGRETURN_COMPAT(new_sp); \
} while (0)

View file

@ -26,6 +26,16 @@
#ifndef NT_X86_XSTATE
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
#endif
#ifndef NT_X86_SHSTK
#define NT_X86_SHSTK 0x204 /* x86 shstk state */
#endif
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif
#ifndef NT_PRSTATUS
#define NT_PRSTATUS 1 /* Contains copy of prstatus struct */
#endif
@ -250,7 +260,49 @@ static int get_task_xsave(pid_t pid, user_fpregs_struct_t *xsave)
// [1] Intel® 64 and IA-32 Architectures Software Developer's
// Manual Volume 1: Basic Architecture
// Section 13.6: Processor tracking of XSAVE-managed state
return get_task_fpregs(pid, xsave);
if (get_task_fpregs(pid, xsave))
return -1;
}
/*
* xsave may be on stack, if we don't clear it explicitly we get
* funky shadow stack state
*/
memset(&xsave->cet, 0, sizeof(xsave->cet));
if (compel_cpu_has_feature(X86_FEATURE_SHSTK)) {
unsigned long ssp = 0;
unsigned long features = 0;
if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
/*
* kernels that don't support shadow stack return
* -EINVAL
*/
if (errno == EINVAL)
return 0;
pr_perror("shstk: can't get shadow stack status for %d", pid);
return -1;
}
if (!(features & ARCH_SHSTK_SHSTK))
return 0;
iov.iov_base = &ssp;
iov.iov_len = sizeof(ssp);
if (ptrace(PTRACE_GETREGSET, pid, (unsigned int)NT_X86_SHSTK, &iov) < 0) {
/* ENODEV means CET is not supported by the CPU */
if (errno != ENODEV) {
pr_perror("shstk: can't get SSP for %d", pid);
return -1;
}
}
xsave->cet.cet = features;
xsave->cet.ssp = ssp;
pr_debug("%d: shstk: cet: %lx ssp: %lx\n", pid, xsave->cet.cet, xsave->cet.ssp);
}
return 0;
@ -345,10 +397,9 @@ static int corrupt_extregs(pid_t pid)
return 0;
}
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *xs, save_regs_t save,
void *arg, unsigned long flags)
{
user_fpregs_struct_t xsave = {}, *xs = ext_regs ? ext_regs : &xsave;
int ret = -1;
pr_info("Dumping general registers for %d in %s mode\n", pid, user_regs_native(regs) ? "native" : "compat");
@ -402,7 +453,7 @@ int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct
goto err;
out:
ret = save(arg, regs, xs);
ret = save(pid, arg, regs, xs);
err:
return ret;
}
@ -698,3 +749,59 @@ unsigned long compel_task_size(void)
{
return TASK_SIZE;
}
bool __compel_shstk_enabled(user_fpregs_struct_t *ext_regs)
{
if (!compel_cpu_has_feature(X86_FEATURE_SHSTK))
return false;
if (ext_regs->cet.cet & ARCH_SHSTK_SHSTK)
return true;
return false;
}
int parasite_setup_shstk(struct parasite_ctl *ctl, __maybe_unused user_fpregs_struct_t *ext_regs)
{
pid_t pid = ctl->rpid;
unsigned long sa_restorer = ctl->parasite_ip;
unsigned long long ssp;
unsigned long token;
struct iovec iov;
if (!compel_shstk_enabled(ext_regs))
return 0;
iov.iov_base = &ssp;
iov.iov_len = sizeof(ssp);
if (ptrace(PTRACE_GETREGSET, pid, (unsigned int)NT_X86_SHSTK, &iov) < 0) {
/* ENODEV means CET is not supported by the CPU */
if (errno != ENODEV) {
pr_perror("shstk: %d: cannot get SSP", pid);
return -1;
}
}
/* The token is for 64-bit */
token = ALIGN_DOWN(ssp, 8);
token |= (1UL << 63);
ssp = ALIGN_DOWN(ssp, 8) - 8;
if (ptrace(PTRACE_POKEDATA, pid, (void *)ssp, token)) {
pr_perror("shstk: %d: failed to inject shadow stack token", pid);
return -1;
}
ssp = ssp - sizeof(uint64_t);
if (ptrace(PTRACE_POKEDATA, pid, (void *)ssp, sa_restorer)) {
pr_perror("shstk: %d: failed to inject restorer address", pid);
return -1;
}
ssp = ssp + sizeof(uint64_t);
if (ptrace(PTRACE_SETREGSET, pid, (unsigned int)NT_X86_SHSTK, &iov) < 0) {
pr_perror("shstk: %d: cannot write SSP", pid);
return -1;
}
return 0;
}

View file

@ -72,6 +72,7 @@ extern bool arch_can_dump_task(struct parasite_ctl *ctl);
extern int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *ext_regs, save_regs_t save,
void *arg, unsigned long flags);
extern int compel_set_task_ext_regs(pid_t pid, user_fpregs_struct_t *ext_regs);
extern int compel_set_task_gcs_regs(pid_t pid, user_fpregs_struct_t *ext_regs);
extern int arch_fetch_sas(struct parasite_ctl *ctl, struct rt_sigframe *s);
extern int sigreturn_prep_regs_plain(struct rt_sigframe *sigframe, user_regs_struct_t *regs,
user_fpregs_struct_t *fpregs);

View file

@ -3,11 +3,20 @@
#include "common/compiler.h"
/**
* The length of the hash is based on what libuuid provides.
* According to the manpage this is:
*
* The uuid_unparse() function converts the supplied UUID uu from the binary
* representation into a 36-byte string (plus trailing '\0')
*/
#define RUN_ID_HASH_LENGTH 37
/*
* compel_run_id is a unique value of the current run. It can be used to
* generate resource ID-s to avoid conflicts with other processes.
*/
extern uint64_t compel_run_id;
extern char compel_run_id[RUN_ID_HASH_LENGTH];
struct parasite_ctl;
extern int __must_check compel_util_send_fd(struct parasite_ctl *ctl, int fd);

View file

@ -13,6 +13,15 @@
#define PARASITE_START_AREA_MIN (4096)
#define PARASITE_STACK_SIZE (16 << 10)
/*
* A stack redzone is a small, protected region of memory located immediately
* after a parasite stack. It is intended to remain unchanged. While it can be
* implemented as a guard page, we want to avoid the overhead of additional
* remote system calls.
*/
#define PARASITE_STACK_REDZONE 128
extern int __must_check compel_interrupt_task(int pid);
struct seize_task_status {
@ -97,7 +106,7 @@ extern k_rtsigset_t *compel_thread_sigmask(struct parasite_thread_ctl *tctl);
struct rt_sigframe;
typedef int (*open_proc_fn)(int pid, int mode, const char *fmt, ...) __attribute__((__format__(__printf__, 3, 4)));
typedef int (*save_regs_t)(void *, user_regs_struct_t *, user_fpregs_struct_t *);
typedef int (*save_regs_t)(pid_t pid, void *, user_regs_struct_t *, user_fpregs_struct_t *);
typedef int (*make_sigframe_t)(void *, struct rt_sigframe *, struct rt_sigframe *, k_rtsigset_t *);
struct infect_ctx {
@ -120,6 +129,7 @@ struct infect_ctx {
open_proc_fn open_proc;
int log_fd; /* fd for parasite code to send messages to */
unsigned long remote_map_addr; /* User-specified address where to mmap parasitic code, default not set */
};
extern struct infect_ctx *compel_infect_ctx(struct parasite_ctl *);
@ -182,4 +192,29 @@ void compel_set_thread_ip(struct parasite_thread_ctl *tctl, uint64_t v);
extern void compel_get_stack(struct parasite_ctl *ctl, void **rstack, void **r_thread_stack);
#ifndef compel_host_supports_gcs
static inline bool compel_host_supports_gcs(void)
{
return false;
}
#define compel_host_supports_gcs
#endif
#ifndef compel_shstk_enabled
static inline bool compel_shstk_enabled(user_fpregs_struct_t *ext_regs)
{
return false;
}
#define compel_shstk_enabled
#endif
#ifndef parasite_setup_shstk
static inline int parasite_setup_shstk(struct parasite_ctl *ctl,
user_fpregs_struct_t *ext_regs)
{
return 0;
}
#define parasite_setup_shstk parasite_setup_shstk
#endif
#endif

View file

@ -86,6 +86,19 @@ struct __ptrace_rseq_configuration {
#define PTRACE_EVENT_STOP 128
#endif
/*
* Amazon Linux 2 uses glibc 2.26. PTRACE_ARCH_PRCTL was added in glibc 2.27.
* This allows CRIU to build on Amazon Linux 2.
*
* Note that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
* preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor symbol
* that matches the value of PTRACE_ARCH_PRCTL. So look for PT_ARCH_PRCTL to
* decide if PTRACE_ARCH_PRCTL is available or not.
*/
#if defined(__x86_64__) && !defined(PT_ARCH_PRCTL)
#define PTRACE_ARCH_PRCTL 30 /* From asm/ptrace-abi.h. */
#endif
extern int ptrace_suspend_seccomp(pid_t pid);
extern int __must_check ptrace_peek_area(pid_t pid, void *dst, void *addr, long bytes);

View file

@ -7,7 +7,7 @@ extern int parasite_get_rpc_sock(void);
extern unsigned int __export_parasite_service_cmd;
extern void *__export_parasite_service_args_ptr;
extern int __must_check parasite_service(void);
extern unsigned long __must_check parasite_service(void);
/*
* Must be supplied by user plugins.

View file

@ -16,6 +16,10 @@
#include "rpc-pie-priv.h"
#ifndef ARCH_RT_SIGRETURN_DUMP
#define ARCH_RT_SIGRETURN_DUMP ARCH_RT_SIGRETURN
#endif
static int tsock = -1;
static struct rt_sigframe *sigframe;
@ -79,12 +83,13 @@ static int __parasite_daemon_wait_msg(struct ctl_msg *m)
/* Core infect code */
static noinline void fini_sigreturn(unsigned long new_sp)
static noinline unsigned long fini_sigreturn(unsigned long new_sp)
{
ARCH_RT_SIGRETURN(new_sp, sigframe);
ARCH_RT_SIGRETURN_DUMP(new_sp, sigframe);
return new_sp;
}
static int fini(void)
static unsigned long fini(void)
{
unsigned long new_sp;
@ -96,14 +101,14 @@ static int fini(void)
sys_close(tsock);
std_log_set_fd(-1);
fini_sigreturn(new_sp);
return fini_sigreturn(new_sp);
BUG();
return -1;
}
static noinline __used int noinline parasite_daemon(void *args)
static noinline __used unsigned long parasite_daemon(void *args)
{
struct ctl_msg m;
int ret = -1;
@ -140,12 +145,10 @@ static noinline __used int noinline parasite_daemon(void *args)
}
out:
fini();
return 0;
return fini();
}
static noinline __used int parasite_init_daemon(void *data)
static noinline __used unsigned long parasite_init_daemon(void *data)
{
struct parasite_init_args *args = data;
int ret;
@ -178,14 +181,11 @@ static noinline __used int parasite_init_daemon(void *data)
} else
goto err;
parasite_daemon(data);
return parasite_daemon(data);
err:
futex_set_and_wake(&args->daemon_connected, ret);
fini();
BUG();
return -1;
return fini();
}
#ifndef __parasite_entry
@ -203,7 +203,7 @@ err:
unsigned int __export_parasite_service_cmd = 0;
void *__export_parasite_service_args_ptr = NULL;
int __used __parasite_entry parasite_service(void)
unsigned long __used __parasite_entry parasite_service(void)
{
unsigned int cmd = __export_parasite_service_cmd;
void *args = __export_parasite_service_args_ptr;

View file

@ -7,7 +7,7 @@
#include "infect-rpc.h"
#include "infect-util.h"
uint64_t compel_run_id;
char compel_run_id[RUN_ID_HASH_LENGTH];
int compel_util_send_fd(struct parasite_ctl *ctl, int fd)
{

View file

@ -38,8 +38,6 @@
#define UNIX_PATH_MAX (sizeof(struct sockaddr_un) - (size_t)((struct sockaddr_un *)0)->sun_path)
#endif
#define PARASITE_STACK_SIZE (16 << 10)
#ifndef SECCOMP_MODE_DISABLED
#define SECCOMP_MODE_DISABLED 0
#endif
@ -427,7 +425,7 @@ static int gen_parasite_saddr(struct sockaddr_un *saddr, int key)
int sun_len;
saddr->sun_family = AF_UNIX;
snprintf(saddr->sun_path, UNIX_PATH_MAX, "X/crtools-pr-%d-%" PRIx64, key, compel_run_id);
snprintf(saddr->sun_path, UNIX_PATH_MAX, "X/crtools-pr-%d-%s", key, compel_run_id);
sun_len = SUN_LEN(saddr);
*saddr->sun_path = '\0';
@ -739,6 +737,7 @@ static int parasite_start_daemon(struct parasite_ctl *ctl)
{
pid_t pid = ctl->rpid;
struct infect_ctx *ictx = &ctl->ictx;
user_fpregs_struct_t ext_regs;
/*
* Get task registers before going daemon, since the
@ -746,7 +745,7 @@ static int parasite_start_daemon(struct parasite_ctl *ctl)
* while in daemon it is not such.
*/
if (compel_get_task_regs(pid, &ctl->orig.regs, NULL, ictx->save_regs, ictx->regs_arg, ictx->flags)) {
if (compel_get_task_regs(pid, &ctl->orig.regs, &ext_regs, ictx->save_regs, ictx->regs_arg, ictx->flags)) {
pr_err("Can't obtain regs for thread %d\n", pid);
return -1;
}
@ -759,6 +758,9 @@ static int parasite_start_daemon(struct parasite_ctl *ctl)
if (ictx->make_sigframe(ictx->regs_arg, ctl->sigframe, ctl->rsigframe, &ctl->orig.sigmask))
return -1;
if (parasite_setup_shstk(ctl, &ext_regs))
return -1;
if (parasite_init_daemon(ctl))
return -1;
@ -812,7 +814,7 @@ static int parasite_memfd_exchange(struct parasite_ctl *ctl, unsigned long size,
uint8_t orig_code[MEMFD_FNAME_SZ] = MEMFD_FNAME;
pid_t pid = ctl->rpid;
long sret = -ENOSYS;
int ret, fd, lfd;
int ret, fd, lfd, remote_flags;
if (ctl->ictx.flags & INFECT_NO_MEMFD)
return 1;
@ -856,7 +858,11 @@ static int parasite_memfd_exchange(struct parasite_ctl *ctl, unsigned long size,
goto err_cure;
}
ctl->remote_map = remote_mmap(ctl, NULL, size, remote_prot, MAP_FILE | MAP_SHARED, fd, 0);
remote_flags = MAP_FILE | MAP_SHARED;
if (ctl->ictx.remote_map_addr){
remote_flags |= MAP_FIXED_NOREPLACE;
}
ctl->remote_map = remote_mmap(ctl, (void *)ctl->ictx.remote_map_addr, size, remote_prot, remote_flags, fd, 0);
if (!ctl->remote_map) {
pr_err("Can't rmap memfd for parasite blob\n");
goto err_curef;
@ -1048,6 +1054,16 @@ int compel_infect_no_daemon(struct parasite_ctl *ctl, unsigned long nr_threads,
memcpy(ctl->local_map, ctl->pblob.hdr.mem, ctl->pblob.hdr.bsize);
compel_relocs_apply(ctl->local_map, ctl->remote_map, &ctl->pblob);
/*
* Ensure the infected thread sees the updated code.
*
* On architectures like ARM64, the Data Cache (D-cache) and
* Instruction Cache (I-cache) are not automatically coherent.
* Modifications land in the D-cache, so we must flush (clean) the
* D-cache to push changes to RAM to ensure the CPU fetches the updated
* instructions.
*/
__builtin___clear_cache(ctl->local_map, ctl->local_map + ctl->pblob.hdr.bsize);
p = parasite_size;
@ -1056,7 +1072,7 @@ int compel_infect_no_daemon(struct parasite_ctl *ctl, unsigned long nr_threads,
p += RESTORE_STACK_SIGFRAME;
p += PARASITE_STACK_SIZE;
ctl->rstack = ctl->remote_map + p;
ctl->rstack = ctl->remote_map + p - PARASITE_STACK_REDZONE;
/*
* x86-64 ABI requires a 16 bytes aligned stack.
@ -1070,7 +1086,7 @@ int compel_infect_no_daemon(struct parasite_ctl *ctl, unsigned long nr_threads,
if (nr_threads > 1) {
p += PARASITE_STACK_SIZE;
ctl->r_thread_stack = ctl->remote_map + p;
ctl->r_thread_stack = ctl->remote_map + p - PARASITE_STACK_REDZONE;
}
ret = arch_fetch_sas(ctl, ctl->rsigframe);
@ -1292,7 +1308,7 @@ struct plain_regs_struct {
user_fpregs_struct_t fpregs;
};
static int save_regs_plain(void *to, user_regs_struct_t *r, user_fpregs_struct_t *f)
static int save_regs_plain(pid_t pid, void *to, user_regs_struct_t *r, user_fpregs_struct_t *f)
{
struct plain_regs_struct *prs = to;

View file

@ -60,6 +60,9 @@ static const flags_t flags = {
#elif defined CONFIG_LOONGARCH64
.arch = "loongarch64",
.cflags = COMPEL_CFLAGS_PIE,
#elif defined CONFIG_RISCV64
.arch = "riscv64",
.cflags = COMPEL_CFLAGS_PIE,
#else
#error "CONFIG_<ARCH> not defined, or unsupported ARCH"
#endif

View file

@ -3,6 +3,11 @@ CFLAGS ?= -O2 -g -Wall -Werror
COMPEL := ../../../compel/compel-host
ifeq ($(GCS_ENABLE),1)
CFLAGS += -mbranch-protection=standard -DGCS_TEST_ENABLE=1
LDFLAGS += -z experimental-gcs=check
endif
all: victim spy
run:
@ -17,7 +22,7 @@ clean:
rm -f parasite.o
victim: victim.c
$(CC) $(CFLAGS) -o $@ $^
$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
spy: spy.c parasite.h
$(CC) $(CFLAGS) $(shell $(COMPEL) includes) -o $@ $< $(shell $(COMPEL) --static libs)

Some files were not shown because too many files have changed in this diff Show more