When `opts.pid` is explicitly set to `os.getpid()`, `pycriu` fails to
daemonize the `criu` process. This causes `criu` to run as a child of
the dumped process, leading to the error "The criu itself is within
dumped tree".
This can be fixed by modifying `_send_req_and_recv_resp` to check if the
target PID matches the current process PID. If so, it enables daemon
mode, ensuring `criu` is detached and the dump succeeds.
Signed-off-by: unichronic <ishuvam.pal@gmail.com>
In a simple case where the parent process and the child one are in one
pid namespace we can safely use vpid(item) to prace the child. But, for
the cases where the child is a pid namespace init, or the child is put
into external pid namespace, the parent and the child have different pid
namespaces and using pid vpid(item) (which e.g. for init will always be
1 here) to ptrace the child process is inorrect.
Let's use the pid reported to us from clone as it's always the right pid
of the child from the parent's point of view.
Fixes: 7dd583002 ("restore: add infrastructure to enable shadow stack")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The return value of sys_rseq was previously ignored during
unregistration, under the assumption that it would not fail if the rseq
structure was properly registered.
However, if sys_rseq fails, the kernel retains the registration. If the
memory containing the rseq structure is subsequently unmapped or reused,
kernel updates to the rseq area can cause the process to crash (e.g.,
via SIGSEGV).
Check the return value of sys_rseq. If it fails, log the error code and
abort the restoration process. This makes rseq unregistration failures
fatal and explicit, aiding in debugging and preventing later obscure
crashes.
Signed-off-by: liqiang2020 <liqiang64@huawei.com>
Fedora rawhide ships a pre-release of GCC 16 which produces following
error:
uprobes.c:34:22: error: variable ‘dummy’ set but not used [-Werror=unused-but-set-variable=]
34 | volatile int dummy = 0;
| ^~~~~
Marking this variable as "__maybe_unused" to fix the error.
Signed-off-by: Adrian Reber <areber@redhat.com>
In some cases, CRIU can observe tasks that exit during checkpointing,
and sets the state of these tasks to COMPEL_TASK_DEAD.
This patch adds a string representation of this value that can be used
by CRIT when decoding the images.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU defines the following constants for task state in compel/include/uapi/task-state.h
COMPEL_TASK_ALIVE = 0x01
COMPEL_TASK_STOPPED = 0x03
COMPEL_TASK_ZOMBIE = 0x06
Thus, we need to swap the values for "zombie" and "stopped" used in CRIT.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When running the command 'make docker-test', almost all zdtm tests fail,
logging 'ip: not found'. 'ip' command of the iproute2 package was missing.
So added the package to the list of dependencies in 'apt-packages.sh'. Now
tests run
Signed-off-by: ImranullahKhann <imranullahkhann2004@gmail.com>
As of Alpine Linux 3.19, musl libc no longer contains separate
fopen64(), fseeko64(), or ftello64() functions. This causes building
CRIU with amdgpu plugin to fail with the following error:
amdgpu_plugin.c: In function 'parallel_restore_bo_contents':
amdgpu_plugin.c:2286:17: error: implicit declaration of function 'fseeko64'; did you mean 'fseeko'? [-Wimplicit-function-declaration]
2286 | fseeko64(bo_contents_fp, entry->read_offset + offset, SEEK_SET);
| ^~~~~~~~
| fseeko
make[2]: *** [Makefile:31: amdgpu_plugin.so] Error 1
make[1]: *** [Makefile:363: amdgpu_plugin] Error 2
To fix this, add the missing $(DEFINES) to plugin builds, and since we
always compile with _FILE_OFFSET_BITS=64, we don't need the 64 suffix.
Fixes: #2826
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The initialization of the struct timespec used as clockid input
parameter was removed in commit:
b4441d1bd8 ("restorer.c: rm unneded struct init")
This causes the build to fail on Alpine with clang version 21.1.2:
GEN criu/pie/parasite-blob.h
criu/pie/restorer.c:1230:39: error: variable 'ts' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
1230 | if (sys_clock_gettime(t->clockid, &ts)) {
| ^~
1 error generated.
make[2]: *** [/criu/scripts/nmk/scripts/build.mk:118: criu/pie/restorer.o] Error 1
make[1]: *** [criu/Makefile:59: pie] Error 2
make: *** [Makefile:278: criu] Error 2
To fix this, we remove the "const" from the declaration of
clock_gettime. Since the kernel writes the current time into
the struct timespec provided by the caller, the pointer must
be writable.
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
After commit [1] we accidentally stopped reporting the errors from
kerndat_has_timer_cr_ids(), let's fix that.
Fixes: 1eaa870cc ("kerndat: check that hardware breakpoints work") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The "man 2 close":"Dealing with error returns from close()" says:
"Retrying the close() after a failure return is the wrong thing to do"
We should not leave the fd there, attempting to close it again on next
close()/close_safe() may lead to accidentally closing something else.
It confirms with the kernel code where sys_close() removes fd from
fdtable in this stack:
+-> sys_close
+-> file_close_fd
+-> file_close_fd_locked
+-> rcu_assign_pointer(fdt->fd[fd], NULL)
If there was an fd this stack is always reached and fd is always
removed.
Let's replace the fd with -1 after close no matter what.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The README currently uses an external link to criu.org for the embedded
CRIU logo. Loading this URL when viewing the README on GitHub sometimes
fails with "Error Fetching Resource". Using a local copy of the logo
fixes this issue.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Docker version 28 broke container restore in combination with network
namespaces. The workaround in the CI script was excluding Docker version
28. Now that there is also Docker version 29, which is still broken,
this also excludes Docker version 29.
Signed-off-by: Adrian Reber <areber@redhat.com>
Commit "plugin: Add DUMP_DEVICES_LATE callback" introduced a new plugin
callback that is invoked in cr_dump_tasks(). The return value of this
callback was assigned to the variable ret. However, this variable is later
used as the return value when goto err is triggered in subsequent
conditions. As a result, CRIU exits with "Dumping finished successfully" even
when some actions have failed and inventory.img has not been created.
To fix this, we replace ret with exit_code and use it only when it is
actually needed.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Introduce an opt-in mode for building and running ZDTM static tests
with Guarded Control Stack (GCS) enabled on AArch64.
Changes:
- Support `GCS_ENABLE=1` builds, adding `-mbranch-protection=standard`
and `-z experimental-gcs=check` to CFLAGS/LDFLAGS.
- Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.
- %.pid rules to prefix test binaries with `$(TEST_ENV)`
so the tunables are set when running tests.
- Makefile rules for selectively enabling GCS in tests
Usage:
# Build and run with GCS enabled
make -C zdtm/static GCS_ENABLE=1 posix_timers
GCS_ENABLE=1 ./zdtm.py run --keep-img=always \
-t zdtm/static/posix_timers
By default (`GCS_ENABLE` unset or 0), test builds and runs are
unchanged.
NOTE: This assumes that the test victim was compiled also using
GCS_ENABLE=1 so that the proper GCS AArch64 ELF headers are present
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn aleksandr.mikhalitsyn@canonical.com
This commit finalizes AArch64 Guarded Control Stack (GCS)
support by wiring the full dump and restore flow.
The restore path adds the following steps:
- Define shared AArch64 GCS types and constants in a dedicated header
for both compel and CRIU inclusion
- compel: add get/set NT_ARM_GCS via ptrace, enabling user-space
GCS state save and restore.
- During restore switch to the new GCS (via GCSSTR) to place capability
token sa_restorer address
- arch_shstk_trampoline() — We enable GCS in a trampoline that using
prctl(PR_SET_SHADOW_STACK_STATUS, ...) via inline SVC. The trampoline
ineeded because we can’t RET without a valid GCS.
- restorer: map the recorded GCS VMA, populate contents top-down with
GCSSTR, write the signal capability at GCSPR_EL0 and the valid token at
GCSPR_EL0-8, then switch to the rebuilt GCS (GCSSS1)
- Save and restore registers via ptrace
- Extend restorer argument structures to carry GCS state
into post-restore execution
- Add shstk_set_restorer_stack(): sets tmp_gcs to temporary restorer
shadow stack start
- Add gcs_vma_restore implementation (required for mremap of the GCS VMA)
Tested with:
GCS_ENABLE=1 ./zdtm.py run -t zdtm/static/env00
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Add debug and info messages to log Guarded Control Stack state when
dumping AArch64 threads. This includes the following values:
- gcspr_el0
- features_enabled
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
- Define user_aarch64_gcs_entry in core-aarch64.proto to store
Guarded Control Stack state (gcspr_el0, features_enabled).
- Extend thread_info_aarch64 with an optional gcs field
Also extend thread_info_aarch64 with an optional gcs field
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Introduce an opt-in mode for building and running compel tests
with Guarded Control Stack (GCS) enabled on AArch64.
Changes:
- Extend compel/test/infect to support `GCS_ENABLE=1` builds,
adding `-mbranch-protection=standard` and
`-z experimental-gcs=check` to CFLAGS/LDFLAGS.
- Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.
Usage:
make -C compel/test/infect GCS_ENABLE=1
make -C compel/test/infect GCS_ENABLE=1 run
By default (`GCS_ENABLE` unset or 0), builds and runs are unchanged.
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
When GCS is enabled, the kernel expects a capability token at GCSPR_EL0-8
and sa_restorer at GCSPR_EL0-16 on rt_sigreturn. The sigframe must be
consistent with the kernel’s expectations, with GCSPR_EL0 advanced by -8
having it point to the token on signal entry. On rt_sigreturn, the kernel
verifies the cap at GCSPR_EL0, invalidates it and increments GCSPR_EL0 by 8
at the end of gcs_restore_signal() .
Implement parasite_setup_gcs() to:
- read NT_ARM_GCS via ptrace(PTRACE_GETREGSET)
- write (via ptrace) the computed capability token and restorer address
- update GCSPR_EL0 to point to the token's location
Call parasite_setup_gcs() into parasite_start_daemon() so the sigreturn
frame satisfies kernel's expectation
Tests with GCS remain opt‑in:
make -C compel/test/infect GCS_ENABLE=1 && make -C compel/test/infect run
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Add basic prerequisites for Guarded Control Stack (GCS) state on AArch64.
This adds a gcs_context to the signal frame and extends user_fpregs_struct_t to
carry GCS metadata, preparing the groundwork for GCS in the parasite.
For now, the GCS fields are zeroed during compel_get_task_regs(), technically
ignoring GCS since it does not reach the control logic yet; that will be
introduced in the next commit.
The code path is gated and does not affect normal tests. Can be explicitly
enabled and tested via:
make -C infect GCS_ENABLE=1 && make -C infect run
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: clean up fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Introduce ARM64 Guarded Control Stack (GCS) constants and macros
in a new uapi header for use in both CRIU and compel.
Includes:
- NT_ARM_GCS type
- prctl(2) constants for GCS enable/write/push modes
- Capability token helpers (GCS_CAP, GCS_SIGNAL_CAP)
- HWCAP_GCS definition
These are based on upstream Linux definitions
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Refactor user_fpregs_struct_t to wrap user_fpsimd_state in a
dedicated struct, preparing for future extending by just
adding new members
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
At least on tests running on Fedora rawhide following error could be
seen:
```
criu/tty.c: In function 'pts_fd_get_index':
criu/tty.c:262:21: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
262 | char *pos = strrchr(link->name, '/');
|
```
This fixes it.
Signed-off-by: Adrian Reber <areber@redhat.com>
Static code analysis reported:
1. criu/cr-restore.c:2438:2: var_decl: Declaring variable "end_vma"
without initializer.
4. criu/cr-restore.c:2451:5: assign: Assigning: "s_vma" = "&end_vma",
which points to uninitialized data.
7. criu/cr-restore.c:2449:4: uninit_use: Using uninitialized value
"s_vma->list.next".
This tries to fix it by initializing the variable.
Signed-off-by: Adrian Reber <areber@redhat.com>
Static code analysis reported:
criu/cr-dump.c:2328:2: identical_branches: The same code is executed
when the condition "ret" is true or false, because the code in the
if-then branch and after the if statement is identical. Should the if
statement be removed?
This is a fix for the warning.
Signed-off-by: Adrian Reber <areber@redhat.com>
The syntax of the inherit-fd functionality for unix socket and pipe
includes a colon.
Fixes: 0df3f79fc0 ("criu(8): fix --inherit-fd description")
Fixes: c37324b6d0 ("crtools: describe the inherit-fd option")
Signed-off-by: Mark Polyakov <mark@thundercompute.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
On AMD Instinct MI300 systems, restoring a large GPU application can
fail because the checkpoint size is too large and the maximum value of
an offset (with integer type) is insufficient. This problem occurs when
the total size of all buffer objects exceeds int max, not because any
single buffer is too large, but it can also happen with a large number
of small buffers.
Fixes: #2812
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
These header files are copied directly from the Linux kernel and contain
typos. We skip these files in codespell to simplify maintenance.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch updates drm.h and amdgpu_drm.h kernel headers,
and adds drm_mode.h (included by drm.h) from the rocm-7.1.0
release tag.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable]
167 | int num_bos = 0;
|
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add a comment that explains the purpose of `retry_needed`.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Use `return 0` on success in `post_dump_dmabuf_check()` for consistency
with other functions.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
These pr_info lines begin with "CC3" and "TWI" were not meant to be
included in the patch.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.
The checkpoint of the dmabuf fd does require the device file
it was exported from to have already been dumped
To identify which device this dmabuf fd was exprted from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.
Signed-off-by: David Francis <David.Francis@amd.com>
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.
The number of device files is not consistent; a process may
have multiple copies of the drm device files open.
Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.
Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h
Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These occur over fdstore, with the
handle-fstore id relationships kept in shread memory. There is a
new plugin callback: RESTORE_INIT to create the shared memory.
During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.
During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.
So drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs
Update the protobuf messages with new amdgpu drm information.
Signed-off-by: David Francis <David.Francis@amd.com>
The amdgpu plugin usually calls drm ioctls through the libdrm
wrappers. However, amdgpu restore requires dealing with dmabufs
and gem handles directly, which means drm ioctls must be
called directly.
Add the drm.h header (from the kernel's uapi).
Signed-off-by: David Francis <David.Francis@amd.com>
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.
Signed-off-by: David Francis <David.Francis@amd.com>
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.
Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.
Signed-off-by: David Francis <David.Francis@amd.com>
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.
Ignore all VMAs on device files.
Signed-off-by: David Francis <David.Francis@amd.com>
This patch implements the entire logic to enable the offloading of
buffer object content restoration.
The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.
It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.
This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Major changes:
* plugins/amdgpu: Implement parallel restore
* Handle processes with uprobes vma
* Fix: getsockopt usage for SO_PASSCRED/SO_PASSSEC on Linux 6.16
* Relax ELF magic check to support MIPS libraries
* pagemap: prevent integer overflow in pagemap_len
This release's name is a nod to the growing challenge we face in
maintaining compatibility across the rapidly evolving Linux kernel
ecosystem.
The full changelog can be found here: https://criu.org/Download/criu/4.2.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Using sizeof(hdr) where hdr is a pointer gives the size of the pointer,
not the size of the structure it points to.
Reported-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
vsnprintf does not always return the number of bytes actually written to
the buffer.
If the output was truncated due to the buffer limit, the return value is
the total number of bytes which WOULD have been written to the final
string if enough space had been available.
This means we must cap the return value to the buffer size excluding the
terminating null byte to correctly calculate the log entry size.
Signed-off-by: Andrei Vagin <avagin@google.com>
kerndat_init() can generate a significant volume of logs. If called
before log_init(), all these messages will be saved in the
early_log_buffer, which has a limited capacity. Additionally, saving to
the early_log_buffer can introduce a performance penalty, especially
when verbose mode is not enabled.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
When we compare two list of vma-s, we need to take into account that
some of them could be merged.
Fixes#12286
Signed-off-by: Andrei Vagin <avagin@google.com>
This functionality (#2527) is being reverted and excluded from this
release due to issue #2812.
It will be included in a subsequent release once all associated issues
are resolved.
Signed-off-by: Andrei Vagin <avagin@google.com>
This patch fixes the following error:
$ sudo make -C test/others/criu-coredump run
...
Traceback (most recent call last):
File "/home/circleci/criu/coredump/coredump", line 55, in <module>
main()
File "/home/circleci/criu/coredump/coredump", line 47, in main
coredump(opts)
File "/home/circleci/criu/coredump/coredump", line 14, in coredump
cores = generator(os.path.realpath(opts['in']))
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 192, in __call__
self.coredumps[pid] = self._gen_coredump(pid)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 214, in _gen_coredump
cd.vmas = self._gen_vmas(pid)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 992, in _gen_vmas
v.data = self._gen_mem_chunk(pid, vma, v.filesz)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 879, in _gen_mem_chunk
page_mem = self._get_page(pid, page_no)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 797, in _get_page
num_pages = m.get("nr_pages", m.compat_nr_pages)
AttributeError: 'dict' object has no attribute 'compat_nr_pages'
+ exit 1
make[1]: *** [Makefile:3: run] Error 1
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Use nr_pages when available, falling back to compat_nr_pages
for compatibility.
Signed-off-by: alam0rt <sam@samlockart.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --mntns-compat-mode option is no longer parsed with CHECK.
Use --log-file instead to test the error message.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
_init__.py defines the public API for pycriu. It is important to use
explicit imports to avoid leaking every symbol from criu.py into the
pycriu namespace. This avoids import-time side effects, prevents name
collisions, and circular-import traps.
Fixes the following lint error:
F403 `from .criu import *` used; unable to detect undefined names
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This allows users to specify RPC options when
using the check() functionality.
Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The check() functionality is very different from dump, pre-dump,
and restore. It is used only to check if the kernel supports required
features, and does not need the majority of options set via RPC.
In particular, we don't need to open `image_dir` when running `check()`
because this functionality doesn't create or process image files. In
this case, `image_dir` is used as `work_dir`, only when the latter is
not specified and a log file is used.
This patch updates the RPC options parser so that it only handles the
logging options when check() is used. Logging to a file is required when
log_file is explicitly set or no log_to_stderr is used. In such case, we
also resolve images_dir and work_dir where the log file will be created.
Fixes: #2758
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the logging initialization into a helper function that
can be reused.
No functional change intended.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the images_dir selection logic from setup_opts_from_req() into a
new function: resolve_images_dir_path(). This improves readability and
allows the code to be reused. While at it, use snprintf() instead of
sprintf() for the /proc path and ensure NULL termination after strncpy().
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Commit 9089ce8 ("service: use setproctitle") extended cr-service to
get the full path of images_dir using readlink(). However, the RPC
API was later extended to allow setting a custom path (folder) to
be set instead of passing a file descriptor, which causes readlink()
to fail as the path is not a symbolic link.
It would be better to drop the code setting the images-dir path as a
string in the proctitle.
Fixes: #2794
Suggested-by: Andrei Vagin <avagin@google.com>
Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the code that opens the images directory, resolves its absolute
path via readlink(), selects the work_dir, and chdir()s into it into a
new function: setup_images_and_workdir(). This reduces the size of
`setup_opts_from_req()`, improves its readability, and allows this
functionality to be reused.
While at it, change open_image_dir() to take a const char *dir
parameter, reflecting that the path is not modified by the function and
allowing callers to pass string literals without casts.
No functional changes are intended.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This change allows users to call criu.use_sk() without any
parameters to use the default socket name.
Co-authored-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
[Errno 2] No such file or directory -> Socket file not found.
[Errno 111] Connection refused -> Service not running.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Use system-installed CRIU binary instead of a local file
Thanks to @avagin for suggesting this solution.
Co-authored-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Container runtimes that use libcriu (e.g., crun) need to specify a CRIU
configuration file that allows to overwrite default options set via RPC.
This is particularly useful to set options such as `--tcp-established`
via `/etc/criu/runc.conf` in Kubernetes.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Unlike "which", which is a separate executable not always installed by
default, "command -v" is a shell built-in available at least for bash,
dash, and busybox shell.
Unlike "which", "command -v" is also easier to grep for, and it is
already used in a few places here.
Inspired by commit 57251d811.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
which is used in Makefiles to check for dependencies:
Example:
export USE_ASCIIDOCTOR ?= $(shell which asciidoctor 2>/dev/null)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Don't install external pip dependencies when running `make install`.
As we are not really into developing a Python project, we should
not install additional packages. CRIU does that nowhere else.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The existing test collects all action-script hooks triggered during
`h`, `ns`, and `uns` runs with ZDTM into `actions_called.txt`, then
verifies that each hook appears at least once. However, the test does
not verify that hooks are invoked *exactly once* or in *correct order*.
This change updates the test to run ZDTM only with ns flavour as this
seems to cover all action-script hooks, and checks that all hooks are
called correctly.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch consolidates the action-script tests into
`test/others/action-script` to ensure all tests are executed
consistently and reduce duplication. Since we had two tests that appear
to do the same thing, we can remove the one that doesn't use zdtm.py.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Regardless of the actual error message, "Unknown" was always appended
to the end of the string, resulting in messages like:
"DUMP failed: Error(3): No process with such pidUnknown".
Fixed by changing standalone if statements to else-if blocks so
"Unknown" is only added when no specific error condition matches.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
pycriu depends on protobuf to function correctly. Currently,
it raises an error if protobuf is not installed. Adding
protobuf to the dependencies ensures it is available after
installing pycriu.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
We use LGPL-v2.1 license for the libcriu and pycriu as they are
intended to be usable by both proprietary and open-source applications.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* call shstk_vma_restore() for VMA_AREA_SHSTK in vma_remap()
* delete map/copy/unmap from shstk_restore() and keep token setup + finalize
* before the loop naturally stopped at cet->ssp-8, so a -8 nudge is required here
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
1. create shadow stack vma during vma_remap cycle
2. copy contents from a premapped non-shstk VMA into it
3. unmap premapped non-shstk VMA
4. Mark shstk VMA for remap into the final destination
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Co-Authored-By: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
[ alex: debugging, rework together with Andrei and code cleanup ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
* reserve space for restorer shadow stack
* set tmp_shstk at mem, advance mem by PAGE_SIZE
* forget the extra PAGE_SIZE (shstk) for premapped VMAs
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
* default: return whatever passed in
eg. to be used as
shtk_min_mmap_addr(kdat.mmap_min_addr)
* x86: ignore def and return 4G
On x86, CET shadow stack is required to be mapped above 4GiB
On the other hand forcing 4GiB globally would break 32-bit restores.
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Extend the test for overwriting config options via RPC with
repeatable option (--action-script) and verify that the value
will not be silently duplicated.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When an additional configuration file is specified via RPC, this file is
parsed twice: first at an early stage to load options such as --log-file,
--work-dir, and --images-dir; and again after all RPC options and
configuration files have been evaluated.
This allows users to overwrite options specified via RPC by the
container runtime (e.g., --tcp-established). However, processing
the RPC config file twice leads to silently duplicating the values
of repeatable options such as `--action-script`.
To address this problem, we adjust the order of options parsing so
that the RPC config file is evaluated only once. This change should
not introduce any functional changes. Note that this change does
not affect the logging functionality, as early log messages are
temporarily buffered and only written to the log file once it has
been initialized (see commit 1ff2333 "Printout early log messages").
Fixes#2727
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Program flow:
- Parse the test's own executable to calculate the file offset of the uprobe
target function symbol
- Enable the uprobe at the target function
- Call the target function to trigger the uprobe, and hence the uprobes vma
creation
- C/R
- Call the target function again to check that no SIGTRAP is sent, since the
uprobe is still active
At least v1.7 of libtracefs is required because that's when
tracefs_instance_reset was introduced. The uprobes API was introduced in v1.4,
and the dynamic events API was introduced in v1.3.
Ubuntu Focal doesn't have libtracefs. Jammy has v1.2.5, and Noble has v1.7.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
This commit teaches criu to deal with processes which have a "[uprobes]" vma.
This vma is mapped by the kernel when execution hits a uprobe location. This
is done so as to execute the uprobe'd instruciton out-of-line in the special
vma. The uprobe'd location is replaced by a software breakpoint instruction,
which is int3 on x86. When execution reaches that location, control is
transferred over to the kernel, which then executes whatever handler code
it has to, for the uprobe, and then executed the replaced instruction out-of-line
in the special vma. For more details, refer to this commit:
d4b3b6384f
Reason for adding a new option
------------------------------
A new option is added instead of making the uprobes vma handling transparent
to the user, so that when a dump is attempted on a process tree in which a
process has the uprobes vma, criu will error, asking the user to use this option.
This gives the user a chance to check what uprobes are attached to the processes
being dumped, and try to ensure that those uprobes are active on restore as well.
Again, the same reason for requiring this option on restore as well. Because
if a process is dumped with an active uprobe, and on restore if the uprobe
is not active, then if execution reaches the uprobe location, then the process
will be sent a SIGTRAP, whose default behaviour will terminate and core dump
the process. This is because the code pages are dumped with the software
breakpoint instruction replacement at the uprobe'd locations. On restore, if
execution reaches these locations and the kernel sees no associated active
uprobes, then it'll send a SIGTRAP.
So, using this option is on dump and restore is an implicit guarantee on the
user's behalf that they'll take care of the active uprobes and that any future
SIGTRAPs because of this are not on us! :)
Handling uprobes vma on dump
----------------------------
We don't need to store any information about the uprobes vma because it's
completely handled by the kernel, transparent to userspace. So, when a uprobes
vma is detected, we check if the --allow-uprobes option was specified or not.
If so, then the allow_uprobes boolean in the inventory image is set (this is
used on restore). The uprobes vma is skipped from being added to the vma list.
Handling uprobes vma on restore
-------------------------------
If allow_uprobes is set in the inventory image, then check if --allow-uprobes
is specified or not. Restoring the vma is not required.
Fixes: checkpoint-restore#1961
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Add a ZDTM test case where CRIU uses a helper process to restore
a non-empty process group with a terminated leader and a Unix
domain socket. This reproduces a corner case in which mount
namespace switching can fail during restore:
https://github.com/checkpoint-restore/criu/issues/2687
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
These tests reveal the following build error:
In file included from compel/include/uapi/compel/asm/sigframe.h:4,
from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
In file included from criu/arch/aarch64/include/asm/restorer.h:4,
from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
Inspired by #2766 / #2767.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Compilation on gentoo/arm64 (llvm+musl) fails with:
In file included from compel/include/uapi/compel/asm/sigframe.h:4,
from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
In file included from criu/arch/aarch64/include/asm/restorer.h:4,
from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
This is happening because <asm/sigcontext.h> and <signal.h> are
mutually incompatible on Linux.
To fix, use <signal.h> instead of <asm/sigcontext.h> for arm64
(like all others arches do).
Fixes: #2766
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
page_pipe_read() expects an 'unsigned long *', but pi->nr_pages is u64.
On 32-bit platforms (e.g., armv7), passing &pi->nr_pages directly causes
a compiler error. To fix this we introduce a temporary variable and copy
the result back to pi->nr_pages.
Fixes: #2756
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Our previous mailing list had some technical issues and we created
a new one that is hopefully more reliable.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently we run aarch64 tests on both Cirrus CI and GitHub runners.
However, Cirrus CI fails with "Monthly compute limit exceeded!". This
change removes the redundant tests to streamline our CI process.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Ubuntu Focal Fossa (20.04) reached its end-of-life on 31 May 2025. So, move
over to using Ubuntu Jammy (22.04) base images.
Also, focal repos do not have libtracefs, which the uprobes zdtm test needs.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Travis CI stopped providing CI minutes for open-source projects
some time ago and we have migrated to GitHub actions.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently, adding a package which is required either for development or testing
requires it to be added in multiple places due to many duplicated Dockerfiles
and installation scripts. This makes it difficult to ensure that all scripts
are updated appropriately and can lead to some places being missed.
This patch consolidates the list of dependencies and adds installation
scripts for each package-manager used in our CI (apk, apt, dnf, pacman).
This change also replaces the `debian/dev-packages.lst` as this subfolder
conflicts with the Ubuntu/Debian packing scripts used for CRIU:
https://github.com/rst0git/criu-deb-packages
This patch also removes the CentOS 8 build scripts as it is EOL
and the container registry is no longer available.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit adds the document to provide high-level overviews of the
CRIU project for AI assistants like Claude and Gemini.
These documents are intended to be used as context for AI-powered
developer assistants to help them understand the project's goals,
architecture, and development process. This will allow them to provide
more accurate and helpful responses to developer questions.
The documents include:
- A brief introduction to CRIU
- A quick start guide for checkpointing and restoring a simple process
- An overview of the dump and restore process
- A description of the Compel subproject
- Information about the project's coding style, code layout, and tests
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The previous commit 4cd4a6b1ac ("zdtm: stop importing junit_xml")
removed the junit_xml library, but some variables related to it were
left in the code. This commit removes the unused `tc` variable and a
call to its `add_error_info` method.
Fixes: 4cd4a6b1ac ("zdtm: stop importing junit_xml")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
On some ARM/aarch64 systems, the VDSO ELF header sets EI_OSABI to 3 (Linux),
while CRIU expects 0 (System V). This strict check causes restore to fail
with "ELF header magic mismatch"
This patch relaxes the check to accept both values, improving compatibility
with modern toolchains and kernels (e.g. Linux 6.12+)
Fixes: #2751
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
During investigations, it’s much easier to read logs when regions are
printed in the start - end format rather than `start/size`.
In addition, all page counters and memory sizes are now printed in
hexadecimal, as they are hard to read in decimal form.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Variables storing page counts were previously `unsigned int`, limiting
them to a maximum of 2^32 pages. With a 4k page size, this corresponds
to a 16TB memory mapping, which is insufficient for larger mappings.
This commit changes the type for these variables to `unsigned long` to
support larger memory mappings.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Update the nr_pages field in PagemapEntry to uint64 to prepare for
checkpointing and restoring huge memory mappings.
Backward compatibility with older pagemap images is preserved.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
On restore, CRIU needs to change mount namespaces to properly restore
files and unix sockets. However, the kernel prevents this if a process
is sharing its file system information (fs) with other processes.
Fixes#2687
Signed-off-by: Andrei Vagin <avagin@google.com>
On some kernels, attr/current can be intercepted by BPF LSM, causing
errors (#2033). Using attr/apparmor/current is preferable, because it
is guaranteed to return the apparmor label. attr/current will still be
used as a fallback for older kernels.
Fixes: #2033
Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
On MIPS platforms, shared libraries may use EI_ABIVERSION = 5 to indicate
support for .MIPS.xhash sections. The previous ELF header check in
handle_binary() strictly compared e_ident against a hardcoded value,
causing legitimate shared objects to be rejected.
This patch replaces the memcmp-based check with a structured validation
of ELF magic and class, and allows EI_ABIVERSION values beside 0.
fixes: #2745
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
We are dropping support for generating JUnit XML reports in zdtm.py as we've
migrated testing infrastructure entirely to `GitHub Actions` and other
third-party test runners.
This package has been removed from some distribution repositories (e.g.,
Fedora), making it simpler to remove the dependency than to force installation
via pip.
Signed-off-by: Andrei Vagin <avagin@google.com>
This change modifies the CI script to avoid Docker version 28, which has
a known regression that breaks Checkpoint/Restore (C/R) functionality.
The issue is tracked in the moby/moby project as
https://github.com/moby/moby/issues/50750.
Signed-off-by: Andrei Vagin <avagin@google.com>
Linux 6.16+ restricts SO_PASSCRED and SO_PASSSEC to AF_UNIX, AF_NETLINK, and AF_BLUETOOTH
This patch updates CRIU to check the socket family before dumping these options
Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
SO_PASSCRED and SO_PASSSEC are only valid for AF_UNIX and AF_NETLINK
This patch updates the test logic to use a unix socket for these options,
while preserving the original value consistency check
Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
The `offset` argument to `mmap()` was computed with a direct cast from
pointer to `off_t`:
`(off_t)addr_hint - (off_t)map_base`
This causes a build failure when compiling since pointers and `off_t`
may differ in size on some platforms.
maps12.c: In function 'mmap_pages':
maps12.c:114:50: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
114 | filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);
| ^
maps12.c:114:69: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
114 | filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);
The fix in this patch is to cast both pointers to `intptr_t`,
perform the subtraction in that type, and then cast the result
back to `off_t`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Branch protection uses PAC. It cryptographically "signs" a function's
return address before it is stored on the stack. Upon return, the address
is authenticated using a secret key. If the signature is invalid, the
program will fault.
The PIE code is used for the parasite and the restorer. In both cases, it
runs in a foreign process. The case of the restorer is even trickier
because it needs to restore the original PAC keys, which invalidates
all previously "signed" pointers within the restorer itself.
Fixes#2709
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need at least 6.16 to test MADV_GUARD_INSTALL support, but
our current Fedora Rawhide test uses only Rawhide's user space,
while using Fedora 42 kernel. Let's start using a vanilla kernel.
Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Introduce a new kind of VMA - VMA_AREA_GUARD. In fact, it is not
a real VMA as it is not represented as struct vm_area_struct in
the kernel.
We want to reuse an existing vma infrastructure in CRIU to dump
an information about MADV_GUARD_INSTALL-covered address space
ranges as VMAs. Then, on restore, we need to carefully skip
those fake VMAs everywhere we expect a normal VMAs to be processed.
And only in restorer we use these VMAs to get an information about
where to call MADV_GUARD_INSTALL.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
1. get info about MADV_GUARD_INSTALL-protected pages with
help of pagemap by looking for PME_GUARD_REGION flag if /proc/<pid>/pagemap
is used or by looking for PAGE_IS_GUARD flag if ioctl(PAGEMAP_SCAN) is used
2. skip those pages
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Make should_dump_page to return int to indicate failure, also
return useful data back through the struct page_info structure
passed as a pointer.
Also, correspondingly convert all call sites.
No functional changes intended, except fixing a bug in
should_dump_page() as it could return (-1) when pmc_fill()
fails, while caller didn't expect that before.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The arm64 tests are currently being executed on both actuated and GitHub
runners. This change removes the actuated runner to avoid redundancy and
streamline our CI process.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The tar command was failing with the following message:
$ tar cf criu.tar ../../../criu
tar: Removing leading `../../../' from member names
tar: ../../../criu/scripts/ci/criu.tar: archive cannot contain itself; not dumped
In addition, the /vagrant no-longer exist in the new Fedora images.
bash: line 1: cd: /vagrant: No such file or directory
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Installing this package currently fails with the following message:
Package qemu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'qemu' has no installation candidate
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
See the previous commit for rationale and architecture-specific details.
[ avagin: tweak code comment ]
Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
After the CRIU process saves the parasite code for the target thread in
the shared mmap, it is necessary to call __clear_cache before the target
thread executes the code.
Without this step, the target thread may not see the correct code to
execute, which can result in a SIGILL signal.
For the specific arm64 case. this is important so that the newly copied
code is flushed from d-cache to RAM, so that the target thread sees the
new code.
The change is based on commit 6be10a2 by @fu.lin and on input received
from @adrianreber.
[ avagin: tweak code comment ]
Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
In general, we use "$(E)" instead of "$(Q) echo", but we also have
a msg-gen macro which can be used here.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 68f92b551 removed images/google/protobuf directory, so it is
re-created each time during the build process.
This resulted in a weird behavior change. Previously, one could do
something like this:
git clone $CRURL criu
(cd criu && sudo make install-criu)
rm -rf criu
This worked fine, including running rm -rf as a non-root user, since no
new directories were created under criu -- all directories were still
owned by the original user.
Since commit 68f92b551 the same sequence fails:
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.c': Permission denied
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.d': Permission denied
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.h': Permission denied
A workaround is to keep empty images/google/protobuf directory,
which is what this commit does.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 68f92b551 used `$$(Q)` instead of `$(Q)` in the Makefile target,
which resulted in the following error:
$(Q) echo "Generating descriptor.pb-c.c"
/bin/sh: 1: Q: not found
Generating descriptor.pb-c.c
$(Q) protoc --proto_path=/usr/include --proto_path=images/ --c_out=images/ /usr/include/google/protobuf/descriptor.proto
/bin/sh: 1: Q: not found
as well as:
$(Q) rm -rf images/google
/bin/sh: line 1: Q: command not found
Fix it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently the build scripts create the following symlink:
criu-4.1/images/google/protobuf/descriptor.proto -> /usr/include/google/protobuf/descriptor.proto
This symlink points to a system-wide absolute-path target. Also,
this symlink ends up in the release tarball. The tarball may later be
downloaded and unpacked by e.g. OS distributions. If unpacking is
done using Python 3.14+, it will fail.
This happens because Python 3.14 will switch the default behavior of
extractall() from "fully trusting the content of archive" to
"disallow common attack vectors while extracting the archive".
With this new behavior, extractall() raises an exception when at
least one file in the archive extracts or points to outside of the
extraction directory (these are called path traversal attacks and
zip slip attacks).
Reported-by: Dmitrii Kuvaiskii <dimakuv@amazon.de>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The test creates a file bindmount in criu mntns and binds it into test
mntns, this external file bindmount is autodetected and restored via
"--external mnt[]" criu option.
Note: In previous patch we fix the problem on this code path where file
bindmount restore fails as there is excess "/" in source path.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
E.g. I have a /etc/hosts in workspace mounted from the host, and get the following message.
(00.141008) 1: mnt-v2: Create plain mountpoint /tmp/.criu.mntns.K1biY1/mnt-0000000938 for 938
(00.141546) 1: mnt-v2: Mounting unsupported @938 (0)
(00.141887) 1: mnt-v2: Bind /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/ to /tmp/.criu.mntns.K1biY1/mnt-0000000938
(00.142179) 1: Error (criu/mount-v2.c:319): mnt-v2: Failed to open_tree /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/: Not a directory
(00.143774) Error (criu/cr-restore.c:2320): Restoring FAILED.
Signed-off-by: Chuan Qiu <qiuc12@gmail.com>
Add ZDTM static tests for IP4/ICMP and IP6/ICMP
socket feature.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Currently there is no option to checkpoint/restore programs that use
ICMP sockets, such as `ping`. This patch adds support for the same.
Fixes#2557
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")
Signed-off-by: Andrei Vagin <avagin@google.com>
We dump sysctls from criu user namespace, but restore from restored user
namespace. So group id values should be mapped to the restored user
namespace gid space to restore correctly.
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We have ability to skip sysctl if there is no value, but we still give
n requests to sysctl_op, that is not correct and probably can segfault
on nullptr access. Fix it by adding ri to count non skipped requests.
To be on the safe side, let's add a check that ri == n on read, as we
should not do any skips there.
While on it lets fix bad error message prefix: s/unix/ipv4/.
Remove excess has_iarg set, and add sarg reset to NULL for the case
sysctl_op skipped it.
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL |
CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it
a real distinct flag.
Fixes: 840735aa0 ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which
attempts to open the cpuinfo.img file using `open_image()`. If the
image file is not found, `open_image()` returns an "empty image"
object. As a result, `cpu_validate_cpuinfo()` tries to read from it
and fails with the following error:
(00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
This patch adds a check for an empty image and appropriate error message.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The cpuinfo command requires a "dump" or "check" subcommand. Thus, we
replace `CR_CPUINFO` with `CR_CPUINFO_DUMP` and `CR_CPUINFO_CHECK`.
This allows us to remove unnecessary subcommand check in
`image_dir_mode()` and perform all parsing in `parse_criu_mode()`.
With this change the check for validating the cpuinfo subcommand is
now done only once with `CR_CPUINFO_DUMP` or `CR_CPUINFO_CHECK` enum.
Signed-off-by: Liana Koleva <43767763+lianakoleva@users.noreply.github.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU currently requires a number of dependencies in order to build from
source. The package names vary across distributions and package
managers. A Nix flake allows developers to spin up a dev environment
with `nix develop`, eliminating the hassle of manual dependency
management. It also prevents polluting the global package set on the
machine.
Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>
In this test we want to ensure that contents of droppable mappings
and mappings with MADV_WIPEONFORK is properly restored in
parent/child processes.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Support MAP_DROPPABLE [1] by detecting it from /proc/<pid>/smaps
and restoring it as a normal private mapping flag on vma with only
difference that instead of MAP_PRIVATE we should use MAP_DROPPABLE.
[1] 9651fcedf7
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Support VM_WIPEONFORK [1] by detecting it from /proc/<pid>/smaps
and setting a corresponding MADV_WIPEONFORK flag on vma.
[1] d2cd9ede6e
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The opts['action'] contains actor function and not the action name, so
we should compare it with a function.
While on it let's also add a comment about --criu-bin option if CRIU
binary is missing.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
By default zdtm expects that criu is built from source first and only
then you can run zdtm tests against it. But what if you really want to
run tests against a criu version installed on the system? Yes there is
already a nice option for zdtm to change the criu binary it uses
"--criu-bin", but it would still end up using the pycriu module from
source and you would still have to build everything beforehand.
Let's add an option to change the path where zdtm searches for pycriu
module "--pycriu-search-path". This way we can run zdtm tests on the
criu installed on the system directly without building criu from source,
e.g. on Fedora it works like:
test/zdtm.py run --criu-bin /usr/sbin/criu \
--pycriu-search-path /usr/lib/python3.13/site-packages \
-t zdtm/static/env00
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This patch implements the entire logic to enable the offloading of
buffer object content restoration.
The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.
It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.
This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know if there is only one
process to restore. This patch adds a `has_children` function in
`pstree.h` and replaces some existing implementations with this
function.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, in the target process, device-related restore operations and
other restore operations almost run sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be parallelized with other operations to speed up the
process.
Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.
- POST_FORKING
*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Building CRIU on Ubuntu 20.04 fails with the following error:
criu/sk-inet.c: In function 'can_dump_ipproto':
criu/sk-inet.c:131:16: error: 'IPPROTO_MPTCP' undeclared (first use in this function); did you mean 'IPPROTO_MTP'?
131 | if (proto == IPPROTO_MPTCP)
| ^~~~~~~~~~~~~
| IPPROTO_MTP
Add definition for MPTCP to fix this error.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The container checkpointing procedure in Kubernetes freezes running
containers to create a consistent snapshot of both the runtime state
and the rootfs of the container. However, when checkpointing a GPU
container, the container must be unfrozen before invoking the
cuda-checkpoint tool.
This is achieved in prepare_freezer_for_interrupt_only_mode(), which
needs to be called before the PAUSE_DEVICES hook. The patch introducing
this functionality fixes this problem for containers with multiple
processes. However, if the container has a single process,
prepare_freezer_for_interrupt_only_mode() must be invoked immediately
before the PAUSE_DEVICES hook.
Fixes: #2514
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In 0a7c5fd1bd we swapped the BSD
implementation of strlcat and strlcpy in favor of our own replacement.
The checks and the predefined macros are not needed anymore.
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
In some cases, they might not work in virtual machines if the hypervisor
doesn't virtualize them. For example, they don't work in AMD SEV virtual
machines if the Debug Virtualization extension isn't supported or isn't
enabled in SEV_FEATURES.
Fixes#2658
Signed-off-by: Andrei Vagin <avagin@gmail.com>
With Go version 1.24, ListenConfig now uses MPTCP by default [1].
Checkpoint/restore for this protocol is not currently supported
and adding support requires kernel changes that are not trivial
to implement. As a result, checkpointing of many containers that
run Go programs is likely to fail with the following error [2]:
(00.026522) Error (criu/sk-inet.c:130): inet: Unsupported proto 262 for socket 2f9bc5
This patch adds a message with suggested workaround for this problem.
[1] https://go.dev/doc/go1.24#netpkgnet
[2] https://github.com/checkpoint-restore/criu/issues/2655
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It makes root mount readonly and checks that it is still readonly after
migration.
Make zdtm/static writable for logs via "bind" desc option.
v2: explain why we don't have explicit rw/ro flag check
v3: use new zdtm "bind" desc option
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Add {'bind': 'path/to/bindmount'} zdtm descriptor option, so that in
test mount namespace a directory bindmount can be created before running
the test.
This is useful to leave test directory writable (e.g. for logs) while
the test makes root mount readonly. note: We create this bindmount early
so that all test files are opened on it initially and not on the below
mount. Will be used in mnt_ro_root test.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Mount flags belong to mount and mount namespace of the Container, so we
should preserve them, as Container user will not expect mounts switching
between ro and rw over c/r.
Fixes: #2632
v5: fix both mount-v1 and mount-v2
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Building CRIU package on Debian 11 aarch64 fails with
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:32:31: error: storage size of 'paca' isn't known
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c:33:31: error: storage size of 'pacg' isn't known
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:47:15: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (hwcaps & HWCAP_PACA) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:47:15: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c:53:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:82:15: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (hwcaps & HWCAP_PACG) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:88:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:33:31: error: unused variable 'pacg' [-Werror=unused-variable]
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:32:31: error: unused variable 'paca' [-Werror=unused-variable]
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:227:31: error: storage size of 'upaca' isn't known
struct user_pac_address_keys upaca;
^~~~~
criu/arch/aarch64/crtools.c:228:31: error: storage size of 'upacg' isn't known
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:241:18: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (!(hwcaps & HWCAP_PACA)) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:255:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:268:18: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (!(hwcaps & HWCAP_PACG)) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:275:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:233:6: error: variable 'ret' set but not used [-Werror=unused-but-set-variable]
int ret;
^~~
criu/arch/aarch64/crtools.c:228:31: error: unused variable 'upacg' [-Werror=unused-variable]
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:227:31: error: unused variable 'upaca' [-Werror=unused-variable]
struct user_pac_address_keys upaca;
^~~~~
This patch adds the missing constants and structs if undefined.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU locks the network during restore in an "empty" network namespace.
However, "empty" in this context means CRIU isn't restoring the
namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.
Fixes#2650
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently we save FP regs before parasite code runs, and restore after
for --leave-running, --check-only, and in case of errors. In case of
errors the error may have happened before FP regs were saved, so we
should only restore them if they were actually saved.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
On a RHEL 8 based system building CRIU fails with:
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
criu/arch/aarch64/crtools.c:73:39: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
This adds the missing define if it is undefined.
Signed-off-by: Adrian Reber <areber@redhat.com>
The `goto interrupt` label is unnecessary as the code directly
returns after `cuda_process_checkpoint_action()`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When handing errors for functions such as `ptrace()`, `pipe()`, and
`fork()` it would be better to use `pr_perror` instead of `pr_err`
as it would include a message describing the encountered error.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Thomas Gleixner introduced the new interface to create posix timers
with specifed timer IDs:
ec2d0c0462
Previously, CRIU recreated timers by repeatedly creating and deleting
them until the desired ID was reached. This approach isn't fast,
especially for timers with large IDs. For example, restoring two timers
with IDs 1000000 and 2000000 took approximately 1.5 seconds.
The new `prctl()` based interface allows direct creation of timers with
specified IDs, reducing the restoration time to around 3 microseconds
for the same example.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.
This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This is highly confusing, and it seems that the ret variable
is not handled in the subsequent process.
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
This release of CRIU (4.1.1) addresses a critical compatibility issue
introduced in the Linux kernel and back-ported to all stable releases.
The kernel commit (12f147ddd6de "do_change_type(): refuse to operate on
unmounted/not ours mounts") addressed the security issue introduced
almost 20 years ago. Unfortunately, this change inadvertently broke the
restore functionality of mount namespaces within CRIU. Users attempting
to restore a container on updated kernels would encounter the error:
"mnt-v2: Failed to make mount 476 slave: Invalid argument."
This release contains the necessary adjustments to CRIU, allowing it to
work seamlessly with kernels incorporating this security change.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
A kernel change (commit 12f147ddd6de, "do_change_type(): refuse to
operate on unmounted/not ours mounts") modified how mount propagation
properties can be changed. Previously, these properties could be changed
from any mount namespace. Now, they can only be modified from the
specific mount namespace where the target mount is actually mounted
This commit addresses this new restriction by ensuring that CRIU enters the
correct mount namespace before attempting to restore mount propagation
properties (MS_SLAVE or MS_SHARED) for a mount.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Major changes:
* RISC-V Support
* PIDFD Support
* CUDA Enhancements
* Fixes here and there
The full changelog can be found here: https://criu.org/Download/criu/4.1.
Signed-off-by: Andrei Vagin <avagin@google.com>
When using pr_err in signal handler, locking is used
in an unsafe manner. If another signal happens while holding the
lock, deadlock can happen.
To fix this, we can introduce mutex_trylock similar to
pthread_mutex_trylock that returns immediately. Due to the fact
that lock is used only for writing first_err, this change garantees
that deadlock cannot happen.
Fixes: #358
Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com>
free_userns_maps is called to clean up uid/gid map when the dump
finishes. If we try to clean up these maps in error cases, it can lead
to double free panic. So just skip cleaning up these maps and let
free_userns_maps do its job.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
There are a couple of tests that require the iptables binary.
Instead of adding a checkskip script, which could also handle this,
this change now uses CRIU's feature detection to see if the CRIU
feature 'has_ipt_legacy' exists.
Signed-off-by: Adrian Reber <areber@redhat.com>
If the tests in others/rpc are failing no information about that error
can be seen in a CI run. This change displays the log files if the test
fails.
Signed-off-by: Adrian Reber <areber@redhat.com>
The tests in others/rpc are running as non-root and
fail silently if the nftables network locking backend is used.
This switches those tests to skip the network locking.
Signed-off-by: Adrian Reber <areber@redhat.com>
The building section also contains the information how to change the
network locking backend without source code changes.
Signed-off-by: Adrian Reber <areber@redhat.com>
As different Linux distributions are switching away from iptables
to nftables, this makes it easier to compile CRIU with a different
default network locking backend. Instead of changing the source
code it is now possible to select the nft backend like this:
make NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES
Signed-off-by: Adrian Reber <areber@redhat.com>
Let's change the data types of `nbucket` and `nchain` to uint32.
This should fix the following compile-time error on arm32:
/criu/criu/pie/util-vdso.c:336: undefined reference to `__aeabi_uldivmod'
Signed-off-by: Andrei Vagin <avagin@google.com>
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All this properties have to be C/R-ed.
As they are per-process protperties, we can save/restore them just for
one thread.
Signed-off-by: Andrei Vagin <avagin@google.com>
Threads are put into cgroups through the cgroupd thread, which
communicates with other threads using a socketpair.
Previously, each thread received a dup'd copy of the socket, and did
the following
sendmsg(socket_dup_fd, my_cgroup_set);
// wait for ack.
while (1) {
recvmsg(socket_dup_fd, &h, MSG_PEEK);
if (h.pid != my_pid) continue;
recvmsg(socket_dup_fd, &h, 0);
}
close(socket_dup_fd);
When restoring many threads, many threads would be spinning in the
above loop waiting for their PID to appear.
In my test-case, restoring a process with a 11.5G heap and 491 threads
could take anywhere between 10 seconds and 60 seconds to complete.
To avoid the spinning, we drop the loop and MSG_PEEK, and add a lock
around the above code. This does not decrease parallelism, as the
cgroupd daemon uses a single thread anyway.
With the lock in place, the same restore consistently takes around 10
seconds on my machine (Thinkpad P14s, AMD Ryzen 8840HS).
There is a similar "daemon" thread for user namespaces. That already
is protected with a similar userns_sync_lock in __userns_call().
Fixes#2614
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
* Hash buckets is an array of 32-bit words. While DT_HASH is 32-bit on
most platforms except s390 (where it's 64-bit).
* The bloom filter word size differs between 32-bit and 64-bit ELF
files. This commit adjusts the code to handle both cases.
Signed-off-by: Andrei Vagin <avagin@google.com>
Currently CRIU has the possibility to specify a LSM label during
restore. Unfortunately the information is completely ignored in the case
of SELinux.
This change selects the lsm label from the user if it is provided and
else the label from the checkpoint image is used.
Signed-off-by: Adrian Reber <areber@redhat.com>
With Python 3.13, the `subprocess` module now uses the
`posix_spawn()` function [1], which requires the `signal`
module to be imported.
Fixes: #2607
[1] https://docs.python.org/3/whatsnew/3.13.html#subprocess
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add relevant elf header constants and notes for the arm platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Add relevant elf header constants and notes for the aarch64 platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
strstartswith() function is incorrect choice for finding parent
directory so i change it to issubpath() function
Signed-off-by: Dmitrii Chervov <dschervov1@yandex.ru>
Currently Fedora rawhide based CI runs fail with:
/bin/sh: line 1: awk: command not found
Let's install it.
Signed-off-by: Adrian Reber <areber@redhat.com>
This way,
- Makefile is less cluttered;
- one can run codespell from the command line.
Fixes: fd7e97fcf ("lint: exclude tags file from codespell")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Trying to run latest CRIU on CentOS Stream 10 or Ubuntu 24.04 (aarch64)
fails like this:
# criu/criu check -v4
[...]
(00.096460) vdso: Parsing at ffffb2e2a000 ffffb2e2c000
(00.096539) vdso: PT_LOAD p_vaddr: 0
(00.096567) vdso: DT_STRTAB: 1d0
(00.096592) vdso: DT_SYMTAB: 128
(00.096616) vdso: DT_STRSZ: 8a
(00.096640) vdso: DT_SYMENT: 18
(00.096663) Error (criu/pie-util-vdso.c:193): vdso: Not all dynamic entries are present
(00.096688) Error (criu/vdso.c:627): vdso: Failed to fill self vdso symtable
(00.096713) Error (criu/kerndat.c:1906): kerndat_vdso_fill_symtable failed when initializing kerndat.
(00.096812) Found mmap_min_addr 0x10000
(00.096881) files stat: fs/nr_open 1073741816
(00.096908) Error (criu/crtools.c:267): Could not initialize kernel features detection.
This seems to be related to the kernel (6.12.0-41.el10.aarch64). The
Ubuntu user-space is running in a container on the same kernel.
Looking at the kernel this seems to be related to:
commit 48f6430505c0b0498ee9020ce3cf9558b1caaaeb
Author: Fangrui Song <i@maskray.me>
Date: Thu Jul 18 10:34:23 2024 -0700
arm64/vdso: Remove --hash-style=sysv
glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
for more than one decade in many Linux distributions. Using
--hash-style=sysv might imply unaddressed issues and confuse readers.
Just drop the option and rely on the linker default, which is likely
"both", or "gnu" when the distribution really wants to eliminate sysv
hash overhead.
Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
The commit basically does:
-ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv \
+ldflags-y := -shared -soname=linux-vdso.so.1 \
Which results in only a GNU hash being added to the ELF header. This
change has been merged with 6.11.
Looking at the referenced x86 commit:
commit 6b7e26547fad7ace3dcb27a5babd2317fb9d1e12
Author: Andy Lutomirski <luto@amacapital.net>
Date: Thu Aug 6 14:45:45 2015 -0700
x86/vdso: Emit a GNU hash
Some dynamic loaders may be slightly faster if a GNU hash is
available. Strangely, this seems to have no effect at all on
the vdso size.
This is unlikely to have any measurable effect on the time it
takes to resolve vdso symbols (since there are so few of them).
In some contexts, it can be a win for a different reason: if
every DSO has a GNU hash section, then libc can avoid
calculating SysV hashes at all. Both musl and glibc appear to
have this optimization.
It's plausible that this breaks some ancient glibc version. If
so, then, depending on what glibc versions break, we could
either require COMPAT_VDSO for them or consider reverting.
Which is also a really simple change:
-VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) \
+VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \
The big difference here is that for x86 both hash sections are
generated. For aarch64 only the newer GNU hash is generated. That is why
we only see this error on kernel >= 6.11 and aarch64.
Changing from DT_HASH to DT_GNU_HASH seems to work on aarch64. The test
suite runs without any errors.
Unfortunately I am not aware of all implication of this change and if a
successful test suite run means that it still works.
Looking at the kernel I see following hash styles for the VDSO:
aarch64: not specified (only GNU hash style)
arm: --hash-style=sysv
loongarch: --hash-style=sysv
mips: --hash-style=sysv
powerpc: --hash-style=both
riscv: --hash-style=both
s390: --hash-style=both
x86: --hash-style=both
Only aarch64 on kernels >= 6.11 is a problem right now, because all
other platforms provide the old style hashing.
Signed-off-by: Adrian Reber <areber@redhat.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Co-authored-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
It is per net namespace, we need it to allow creation of unprivileged
ICMP sockets.
Note: in case this sysctl was disabled after unprivileged ICMP
socket was created we still need to somehow handle it on restore.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
For two cases libcriu was setting the RPC protobuf field `has_*` before
checking if the given parameter is valid. This can lead to situations,
if the caller doesn't check the return value, that we pass as RPC struct
to CRIU which has the `has_*` protobuf field set to true, but does not
have a verified value (or non at all) set for the actual RPC entry.
Signed-off-by: Adrian Reber <areber@redhat.com>
Temporarily disable CUDA plugin for `criu pre-dump`.
pre-dump currently fails with the following error:
Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0 [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895 /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1
We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to
ensure that the function's sole responsibility is to use the cgroup
freezer for the process tree. This allows us to avoid a time-out
error when checkpointing applications with large GPU state.
v2: This patch calls `checkpoint_devices()` only for `criu dump`.
Support for GPU checkpointing with `pre-dump` will be introduced in
a separate patch.
Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:
cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
net: Unlock network
cuda_plugin: finished cuda_plugin stage 0 err -1
cuda_plugin: resuming devices on pid 84145
cuda_plugin: Restore thread pid 84202 found for real pid 84145
Unfreezing tasks into 1
Unseizing 84145 into 1
Error (criu/cr-dump.c:2111): Dumping FAILED.
To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:
Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86
The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.
Using the PID of the innermost namespace would lead to the chain be
called 'CRIU-1' most of the time which is also not really unique.
With this commit the change is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.
Signed-off-by: Adrian Reber <areber@redhat.com>
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.
libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).
Signed-off-by: Adrian Reber <areber@redhat.com>
It creates a few timers with log expiration intervals, waites for C/R
and check that timers are armed and their intervals have been restored.
Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
On aarch64 the test cmdlinenv00 was failing with:
FAIL: cmdlinenv00.c:120: auxv corrupted on restore (errno = 11 (Resource temporarily unavailable))
Starting with Linux kernel version 6.3 the size of AUXV was changed:
commit 28c8e088427ad30b4260953f3b6f908972b77c2d
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Wed Jan 4 14:20:54 2023 -0500
rseq: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries
Two new auxiliary vector entries are introduced for rseq without
matching increment of the AT_VECTOR_SIZE_BASE, which causes failures
with CONFIG_HARDENED_USERCOPY=y.
Fixes: 317c8194e6ae ("rseq: Introduce feature size and alignment ELF auxiliary vector entries")
With this change AT_VECTOR_SIZE increases from 40 to 50 on aarch64. CRIU
uses AT_VECTOR_SIZE to read the content of /proc/PID/auxv
auxv_t mm_saved_auxv[AT_VECTOR_SIZE];
ret = read(fd, mm_saved_auxv, sizeof(mm_saved_auxv));
Now the tests works again on aarch64.
Signed-off-by: Adrian Reber <areber@redhat.com>
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in following error:
# ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
userns is supported
=== Run 1/1 ================ zdtm/static/unlink_regular00
==================== Run zdtm/static/unlink_regular00 in ns ====================
Skipping rtc at root
Start test
Test is SUID
./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
Run criu dump
*** buffer overflow detected ***: terminated
############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
Test output: ================================
<<< ================================
Send the 9 signal to 47
Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
##################################### FAIL #####################################
According to the backtrace:
#0 __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
#5 0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
#6 0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
#7 0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
#8 0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
#9 create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
#10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
at criu/files-reg.c:1198
#11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
#12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
#13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
#14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
at criu/files.c:657
#15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
#16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
#17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293
This line is the problem:
snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);
The problem was that the `-1` was on the inside of the braces and not on
the outside. This way the destination size was increase by 1 instead of
being decreased by 1 which triggered the buffer overflow detection.
Signed-off-by: Adrian Reber <areber@redhat.com>
Based on the code, the `ret` variable at this point does not
represent the task state, so this log message should be
moved to a position after the `compel_wait_task()` function.
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
When using the nftables network locking backend and restoring a process
a second time the network locking has already been deleted by the first
restore. The second restore will print out to the console text like:
Error: Could not process rule: No such file or directory
delete table inet CRIU-202621
With this change CRIU's log FD is used by libnftables stdout and stderr.
Signed-off-by: Adrian Reber <areber@redhat.com>
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.
In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).
When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.
Alas, I have absolutely no way to test this, so please review carefully.
[1]: https://github.com/opencontainers/runc/issues/4273
[2]: https://github.com/opencontainers/runc/issues/4457
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are a few issues with the freeze_processes logic:
1. Commit 9fae23fbe2 grossly (by 1000x) miscalculated the number of
attempts required, as a result, we are seeing something like this:
> (00.000340) freezing processes: 100000 attempts with 100 ms steps
> (00.000351) freezer.state=THAWED
> (00.000358) freezer.state=FREEZING
> (00.100446) freezer.state=FREEZING
> ...close to 100 lines skipped...
> (09.915110) freezer.state=FREEZING
> (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
> (10.000563) freezer.state=FREEZING
For 10s with 100ms steps we only need 100 attempts, not 100000.
2. When the timeout is hit, the "failed to freeze cgroup" error is not
printed, and the log_unfrozen_stacks is not called either.
3. The nanosleep at the last iteration is useless (this was hidden by
issue 1 above, as the timeout was hit first).
Fix all these.
While at it,
4. Amend the error message with the number of attempts, sleep duration,
and timeout.
5. Modify the "freezing cgroup" debug message to be in sync with the
above error.
Was:
> freezing processes: 100000 attempts with 100 ms steps
Now:
> freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.
Fixes#2537
Signed-off-by: Andrei Vagin <avagin@google.com>
Right now, this test fails with this error:
Error (criu/files-reg.c:1031): Can't dump ghost file
/criu/test/javaTests/omrvmem_000000626_Mlm48x of 2097152 size,
increase limit
Signed-off-by: Andrei Vagin <avagin@google.com>
cuda-checkpoint returns the positive CUDA error code when it runs into an issue
and passing that along as the return value would cause errors to get ignored
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
The vvar_vclock was introduced by [1]. Basically, the old vvar vma has
been splited on two parts. In term of C/R, these two vma-s can be still
treated as one.
[1] e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Signed-off-by: Andrei Vagin <avagin@google.com>
Fix for the following error when building CRIU on Rocky Linux 8
criu/pidfd.c: In function ‘pidfd_open’:
criu/pidfd.c:119:17: error: ‘__NR_pidfd_open’ undeclared (first use in this function); did you mean ‘pidfd_open’?
return syscall(__NR_pidfd_open, pid, flags);
^~~~~~~~~~~~~~~
pidfd_open
criu/pidfd.c:119:17: note: each undeclared identifier is reported only once for each function it appears in
criu/pidfd.c:120:1: error: control reaches end of non-void function [-Werror=return-type]
}
^
criu/pidfd.c: At top level:
cc1: error: unrecognized command line option ‘-Wno-unknown-warning-option’ [-Werror]
cc1: error: unrecognized command line option ‘-Wno-dangling-pointer’ [-Werror]
cc1: all warnings being treated as errors
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We need to dynamically calculate TASK_SIZE depending
on the MMU on RISC-V system. [We are using analogical
approach on aarch64/ppc64le.]
This change was tested on physical machine:
StarFive VisionFive 2
isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
mmu : sv39
uarch : sifive,u74-mc
mvendorid : 0x489
marchid : 0x8000000000000007
mimpid : 0x4210427
hart isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
We don't need to have compel/arch/riscv64/plugins/std/syscalls/syscalls.S
tracked in git. It is autogenerated. We also need to update our .gitignore
to ignore autogenerated files with syscall tables.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.
This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.
In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.
Another use case for this functionality is optimizing resource
utilization, where CUDA tasks with low-priority are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We have multiple processes open a pidfd to a common dead process.
After C/R we check that the inode numbers for these pidfds are equal or
not.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Currently, the `waitpid()` call on the tmp process can be made by a
process which is not its parent. This causes restore to fail.
This patch instead selects one process to create the tmp process and
open all the fds that point to it. These fds are sent to the correct
process(es).
Fixes: #2496
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.
The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.
[1] https://github.com/NVIDIA/gpu-operatorFixes: #2509
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.
However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads. To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.
To reflect this new behavior, the following changes are applied:
- `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
- `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
- `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`
Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.
Fixes: #2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls
`return ret`. However, the value of `ret` has been set to `0` in the lines
above and CRIU does not handle the error properly.
This problem is related to https://github.com/checkpoint-restore/criu/issues/2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.
The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.
Fix this by:
- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names
With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.
Fixes#2499
[ avagin: minore code changes ]
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
After a fork, both the child and parent processes may trigger a page fault (#PF)
at the same virtual address, referencing the same position in the page image.
If deduplication is enabled, the last process to trigger the page fault will fail.
Therefore, deduplication should be disabled after a fork to prevent this issue.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.
Fixes: #2490
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.
We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:
- When the field is missing, it indicates that the checkpoint was
created with a previous version of CRIU, and all plugins should be
*enabled* during restore.
- When the field is empty, it indicates that no plugins were used during
checkpointing. Thus, all plugins can be *disabled* during restore.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
305 | elif field.type in _basic_cast:
306 | cast = _basic_cast[field.type]
307 | if pretty and (cast == int):
| ^^^^^^^^^^^ E721
308 | if is_hex:
309 | # Fields that have (criu).hex = true option set
|
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
377 | elif field.type in _basic_cast:
378 | cast = _basic_cast[field.type]
379 | if (cast == int) and is_string(value):
| ^^^^^^^^^^^ E721
380 | if _marked_as_dev(field):
381 | return encode_dev(field, value)
|
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.
signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
After, C/R of pidfds that point to dead processes their inodes might
change. But if two pidfds point to same dead process they should
continue to do so after C/R.
This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.
Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.
signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Validate that pidfds can been used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.
For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd
Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similiar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9
Hence, check for pidfds occurs before the check for regular files.
For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.
This commit does not include support for pidfds that point to a specific
thread, i.e pidfds opened with `PIDFD_THREAD` flag.
Fixes: #2258
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.
After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following warnings that appear
when building an RPM package:
+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Major changes:
* CUDA plugin to support checkpointing and restoring NVIDIA CUDA applications.
* Shadow stack support
* Pagemap cache: Added support for PAGEMAP_SCAN ioctl
The full changelog can be found here: https://criu.org/Download/criu/4.0.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The topology parsing assumed that all parameter names were
30 characters or fewer, but
recommended_sdma_engine_id_mask
is 31 characters.
Make the maximum length a macro, and set it to 64.
Signed-off-by: David Francis <David.Francis@amd.com>
The presence of /dev/nvidiactl indicates that the system has a
compatible NVIDIA GPU driver installed and that the GPU is accessible to
the operating system.
Signed-off-by: Andrei Vagin <avagin@google.com>
Some plugins (e.g., CUDA) may not function correctly when processes are
frozen using cgroups. This change introduces a mechanism to disable the
use of freeze cgroups during process seizing, even if explicitly
requested via the --freeze-cgroup option.
The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.
Signed-off-by: Andrei Vagin <avagin@google.com>
This patch fixes the following typos reported by codespell:
./test/others/bers/bers.c:394: dependin ==> depending, depend in
./criu/kerndat.c:837: hitted ==> hit
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The `uninstall_module.py` script is a wrapper for the `pip uninstall`
command that enables support for specifying installation prefix
(i.e., `--prefix`). When this functionality is used, we intentionally
set `sys.path` to include only search paths for the specified prefix
to avoid unintentional uninstallation of packages in system paths.
Since `importlib_metadata` version 8.1.0, the `Distribution.from_name()`
method has been modified [1] to perform additional pre-processing of
Distribution objects [2] that requires loading distribution metadata
and results in the following error:
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 422, in <lambda>
buckets = bucket(dists, lambda dist: bool(dist.metadata))
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 454, in metadata
from . import _adapters
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/_adapters.py", line 3, in <module>
import email.message
File "/usr/lib64/python3.12/email/message.py", line 11, in <module>
import quopri
ModuleNotFoundError: No module named 'quopri'
This error occurs because we have excluded system paths from the list
of search paths (`sys.path`).
However, this pre-processing is not required for our use case, as we
only use the discovery mechanism of importlib_metadata to resolve the
metadata directory path of the module being uninstalled.
To fix this problem, this patch updates `uninstall_module` to avoid the
`from_name()` method and use `discover(name=package_name)` directly.
[1] a65c29adc0
[2] a65c29ad/importlib_metadata/__init__.py (L391)Fixes: #2468
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When attempting to checkpoint a container with CUDA processes,
CRIU could fail with the following error:
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1
Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with
In this situation, the target process is locked, but CRIU fails due to
a timeout and exits with an error. We need to make sure that the target
PID is unlocked in such case.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Some test environments (Actuated runners for example) do not support
maclvan devices. Skip tests depending on it automatically.
Signed-off-by: Adrian Reber <areber@redhat.com>
Previously the check was just if /sys/fs/selinux is mounted. This
extends the check to see if all necessary tools are installed.
Signed-off-by: Adrian Reber <areber@redhat.com>
Running 'crit x ./ rss' on aarch64 crashes with:
File "/home/criu/crit/crit/__main__.py", line 331, in explore_rss
while vmas[vmi]['start'] < pme:
~~~~^^^^^
IndexError: list index out of range
This adds an additional check to the while loop to do access indexes out
of range.
Signed-off-by: Adrian Reber <areber@redhat.com>
Errors on aarch64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
| ^~~~~~~
cc1: all warnings being treated as errors
Errors on ppc64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
| ^~~~~~~
cc1: all warnings being treated as errors
In file included from amdgpu_plugin_util.c:38:
amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
196 | pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
197 | pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
198 | pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
199 | pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
| ^~~~~~~
cc1: all warnings being treated as errors
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Skip cross-compilation on armv7 because, among many other errors,
it fails with the following:
In file included from ../../include/common/lock.h:9,
from ../../criu/include/files.h:9,
from amdgpu_plugin.c:30:
../../include/common/asm/atomic.h:60:2: error: #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
60 | #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
| ^~~~~
../../include/common/asm/atomic.h: In function 'atomic_add_return':
../../include/common/asm/atomic.h:81:9: error: implicit declaration of function 'smp_mb' [-Werror=implicit-function-declaration]
81 | smp_mb();
| ^~~~~~
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
To enable cross-compile we need to use the CC definition from
criu/scripts/nmk/scripts/tools.mk:
CC := $(CROSS_COMPILE)$(HOSTCC)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Here is an example how to run one test:
$ python test/zdtm.py run -t zdtm/static/env00 --ignore-taint --mocked-cuda-checkpoint
Signed-off-by: Andrei Vagin <avagin@google.com>
1. os auto assignment vma addr maybe conflict with vma in gpu living migrate scene;
2. so, we should give choice to user;
Signed-off-by: haozi007 <liuhao27@huawei.com>
New internal glibc types __timeval64 [1] and __suseconds64_t [2] have
been introduced as a solution for the Y2038 problem [3]. These 64-bit
types are used across all architectures. However, this change causes
the following build errors when cross-compiling on ARMv7 (armhf):
criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 5 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
49 | pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
| ^~~~~~~~~~~~~~~~~~~~~~~~
50 | (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
| ~~~~~~~~~~~~~~~~~~~~~
| |
| __suseconds64_t {aka long long int}
criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 7 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
49 | pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
| ^~~~~~~~~~~~~~~~~~~~~~~~
50 | (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
51 | (int64_t)val->it_interval.tv_sec, val->it_interval.tv_usec);
| ~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __suseconds64_t {aka long long int}
ns.c:234:48: error: format '%ld' expects argument of type 'long int', but argument 5 has type 'time_t' {aka 'long long int'} [-Werror=format=]
234 | len = snprintf(buf, sizeof(buf), "%d %ld 0", clk_id, offset);
| ~~^ ~~~~~~
| | |
| long int time_t {aka long long int}
| %lld
msg.c:58:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
58 | off += sprintf(buf + off, ".%.3ld: ", tv.tv_usec / 1000);
| ~~~~^ ~~~~~~~~~~~~~~~~~
| | |
| long int __suseconds64_t {aka long long int}
| %.3lld
../lib/zdtmtst.h:137:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
137 | test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~
pthread_timers_h.c:72:17: note: in expansion of macro 'pr_perror'
72 | pr_perror("wrong interval: %ld:%ld", itimerspec.it_interval.tv_sec, itimerspec.it_interval.tv_nsec);
| ^~~~~~~~~
vdso00.c:22:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
22 | test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
| ~~~~^ ~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %10lli
vdso00.c:29:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
29 | test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
| ~~~~^ ~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %10lli
vdso01.c:357:42: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
357 | test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %lli
vdso01.c:357:72: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
357 | test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
|
vdso01.c:328:43: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
328 | test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %lli
vdso01.c:328:74: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
328 | test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
|
../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'time_t' {aka 'long long int'} [-Werror=format=]
144 | test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~~
mtime_mmap.c:80:17: note: in expansion of macro 'fail'
80 | fail("mtime %ld wasn't updated on mmapped %s file", mtime_new, filename);
| ^~~~
../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
144 | test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~~
mtime_mmap.c:101:17: note: in expansion of macro 'fail'
101 | fail("After migration, mtime changed to %ld", fst.st_mtime);
| ^~~~
[1] https://sourceware.org/git/?p=glibc.git;h=504c98717062cb9bcbd4b3e59e932d04331ddca5
[2] https://sourceware.org/git/?p=glibc.git;h=3fced064f23562ec24f8312ffbc14950993969e6
[3] https://en.wikipedia.org/wiki/Year_2038_problem
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
By default, if the "CRIU_LIBS_DIR" environment variable is not set,
CRIU will load all plugins installed in `/usr/lib/criu`. This may
result in running the ZDTM tests with plugins for a different version
of CRIU (e.g., installed from a package).
This patch updates ZDTM to always set the "CRIU_LIBS_DIR" environment
variable and use a local "plugins" directory. This directory contains
copies of the plugin files built from source. In addition, this patch
adds the `--criu-plugin` option to the `zdtm.py run` command, allowing
tests to be run with specified CRIU plugins.
Example:
- Run test only with AMDGPU plugin
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu
- Run test only with CUDA plugin
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin cuda
- Run test with both AMDGPU and CUDA plugins
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu cuda
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When the cuda-checkpoint tool is not installed, execvp() is expected to
fail and return -1. In this case, we need to call exit() to terminate
the child process that was created earlier with fork().
Since CRIU can be used with applications that do not use CUDA, even
when the CUDA plugin is installed, this patch also updates the log
messages to show debug and warning (instead of error) when the
cuda-checkpoint tool is not found in $PATH.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Show information about mounts available on the host filesystem.
This is useful for debugging.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU provides two plugins for checkpoint/restore of GPU applications:
amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to
enable restore:
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late)
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)
However, CRIU currently does not support running more than one plugin
for the same hook. As a result, when both plugins are installed, the
resume function for CUDA applications is not executed. To fix this,
we need to make sure that both `plugin_resume_devices_late()` functions
return `-ENOTSUP` when restore is not supported.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The plugin hook "PAUSE_DEVICES" was recently introduced in the following
commit. This hook was intended to execute the cuda-checkpoint tool
before the process tree is frozen. However, the run_plugins() call has
been placed immediately *after* freeze_processes(). This causes the
cuda-checkpoint tool to hang indefinitely during the checkpointing
of CUDA applications running in containers, eventually leading to its
termination by the timeout alarm.
a85f488595
criu/plugin: Introduce new plugin hooks PAUSE_DEVICES and CHECKPOINT_DEVICES to be used during pstree collection
This problem can be reproduced with the following example:
sudo podman run -d --rm \
--device nvidia.com/gpu=all --security-opt=label=disable \
quay.io/radostin/cuda-counter
sudo podman container checkpoint -l -e /tmp/checkpoint.tar
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
For historical reasons, some tools like rpm [1] or ldd [2,3]
may expect the executable bit to be present for the correct
identification of shared libraries. The executable bit on .so
files is set by default by compilers (e.g., GCC). It is not
strictly necessary but primarily a convention.
[1] https://docs.fedoraproject.org/en-US/package-maintainers/CommonRpmlintIssues/#unstripped_binary_or_object
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/ldd.bash.in;h=d6b640df;hb=HEAD#l154
[3] $ sudo ldd /usr/lib/criu/*.so
/usr/lib/criu/amdgpu_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/amdgpu_plugin.so'
linux-vdso.so.1 (0x00007fd0a2a3e000)
libdrm.so.2 => /lib64/libdrm.so.2 (0x00007fd0a29eb000)
libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007fd0a29de000)
libc.so.6 => /lib64/libc.so.6 (0x00007fd0a27fc000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd0a2a40000)
/usr/lib/criu/cuda_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/cuda_plugin.so'
linux-vdso.so.1 (0x00007f1806e13000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1806c08000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1806e15000)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch updates the dependencies section of the AMDGPU plugin man
page to reflect that the plugin has been merged upstream and to fix a
formatting issue.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In commit 2e456ccf0c34a056e3ccafac4a0c7effef14d918 ("Linux: Make
__rseq_size useful for feature detection (bug 31965)") glibc 2.40
changed the meaning of __rseq_size slightly: it is now the size
of the active/feature area (20 bytes initially), and not the size
of the entire initially defined struct (32 bytes including padding).
The reason for the change is that the size including padding does not
allow detection of newly added features while previously unused
padding is consumed.
The prep_libc_rseq_info change in criu/cr-restore.c is not necessary
on kernels which have full ptrace support for obtaining rseq
information because the code is not used. On older kernels, it is
a correctness fix because with size 20 (the new value), rseq
registeration would fail.
The two other changes are required to make rseq unregistration work
in tests.
Signed-off-by: Florian Weimer <fweimer@redhat.com>
cgroup testcases live in the same cgroup root zdtmtst and
zdtmtst.defaultroot controller then create child subgroup for testing. This
can cause problems when cgroup testcases run in parallel. For example,
testcase A dumps the child subgroup of testcase B since it's in the cgroup
root but in the middle of restoring of testcase A, testcase B completes and
cleans up the subgroup directory. This causes error in testcase A restore.
This commit adds excl flag to all cgroup testcases description so that
these don't run parallel.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The CI tests with CentOS 7 have been disabled and removed [1,2].
This patch removes the obsolete Makefile targets for these tests.
[1] 24bc083653
[2] f8466ca798
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Sometimes due to sigblockmask inheritance cgroupd can inherit SIGTERM
blocked. That will lead cgroupd ignoring SIGTERM from stop_cgroupd() and
CRIU will get stuck due to waiting for never-stopping cgroupd.
I see this happening in lxc-checkpoint, also saw this in OpenVZ jenkins
on cgroup_inotify00 test.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Duplicate string in irmap_scan_path_add, otherwise it will free before
parsing next configuration input.
[ avagin: handle errors of xstrdup ]
Signed-off-by: Liu Hua <weldonliu@tencent.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:
compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
276 | if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
| ^~~~~~~~~~~~~~~~~
While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:
We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
that most definitions come from either one or another. Alas, on
Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
there is no way they can be used together. Let's rely on libc one.
Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.
Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.
Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an
r555 or higher driver along with the cuda-checkpoint binary.
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
PAUSE_DEVICES is called before a process is frozen and is used by the CUDA
plugin to place the process in a state that's ready to be checkpointed and
quiesce any pending work
CHECKPOINT_DEVICES is called after all processes in the tree have been frozen
and PAUSE'd and performs the actual checkpointing operation for CUDA
applications
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
Restore rseq_cs state before calling RESUME_DEVICES_LATE as the CUDA plugin will
temporarily unfreeze a thread during the plugin hook to assist with device
restore
Run the plugin finalizer later in the dump sequence since the finalizer is used
by the CUDA plugin to handle some process cleanup
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
Move PYTHON_EXTERNALLY_MANAGED and PIP_BREAK_SYSTEM_PACKAGES
into Makefile.install to avoid code duplication. In addition, add
PIPFLAGS variable to enable specifying pip options during installation.
This is particularly useful for packaging, where it is common for `pip install`
to run in an environment with pre-installed dependencies and without internet
access. In such environment, we need to specify the following options:
--no-build-isolation --no-index --no-deps
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Adds a exit_signal static method to criu_cli, criu_config and criu_rpc
used to detect a crash.
Fixes: #350
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
This commit adds a `--preload-libfault` option to ZDTM's run command.
This option runs CRIU with LD_PRELOAD to intercept libc functions
such as pread(). This method allows to simulate special cases,
for example, when a successful call to pread() transfers fewer
bytes than requested.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It is possible for pread() to return fewer number of bytes than
requested. In such case, we need to repeat the read operation
with appropriate offset.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The unix_conf_op function reads the size of the sysctl entry array
twice. gcc thinks that it can lead to a time-of-check to time-of-use
(TOCTOU) race condition if the array size changes between the two reads.
Fixes#2398
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The rawhide tests runs in a container. Containers always have SELinux
disabled from the inside. Somehow /sys/fs/selinux is now mounted. We
used the existence of that directory if SELinux is available. This seems
to be no longer true.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
A fault-injection test was introduced in commit [1] and later removed in
commit [2]. This patch removes the obsolete Makefile target.
[1] b95407e264
test: check, that parasite can rollback itself (v2)
[2] 2cb4532e26
tests: remove zdtm.sh (v2)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Replace sprintf() with snprintf() and specify maximum length of
characters to avoid potential overflow.
Reported-by: GitHub CodeQL (https://codeql.github.com/)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently there are no socket option test cases for TCP_CORK and
TCP_NODELAY, this commit adds related test cases.
The socket option test cases for TCP_KEEPCNT, TCP_KEEPIDLE, and
TCP_KEEPINTVL already exist in socket-tcp_keepalive.c, so they are
not included in this test case.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Currently some TCP socket option information is stored in SkOptsEntry,
which is a little confusing.
SkOptsEntry should only contain socket options that are common to
all sockets.
In this commit move the TCP-specific socket options from SkOptsEntry
to TcpOptsEntry.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Currently some of the TCP socket option information is stored in the
TcpStreamEntry, but the information in the TcpStreamEntry is only
restored after the TCP socket has established connection, which
results in these TCP socket options not being restored for
unconnected TCP sockets.
In this commit move the TCP socket options from TcpStreamEntry to
TcpOptsEntry and add dump_tcp_opts() and restore_tcp_opts() for TCP
socket options dump and restore.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
On some systems, nft binary might not be installed, or some kernel
options might be unconfigured, resulting in something like this:
sudo unshare -n nft create table inet CRIU
Error: Could not process rule: Operation not supported
create table inet CRIU
^^^^^^^^^^^^^^^^^^^^^^^
This is similar to what kerndat_has_nftables_concat() does, and if the
outcome is the same, it returns an error to kerndat_init(), and an error
from kerndat_init() is considered fatal.
Let's relax the check, returning mere "feature not working" instead of
a fatal error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1) In dump_tcp_conn_state, if return from libsoccr_save is >=0, we check
that sizeof(struct libsoccr_sk_data) returned from libsoccr_save is
equal to sizeof(struct libsoccr_sk_data) we see in dump_tcp_conn_state
(probably to check if we use the right library version). And if sizes
are different we go to err_r, which just returns ret, which can
teoretically be 0 (if size in library is zero) and that would lead
dump_one_tcp treat this as success though it is obvious error.
2) In case of dump_opt or open_image fails we don't explicitly set ret
and rely that sizeof(struct libsoccr_sk_data) previously set to ret is
not 0, I don't really like it, it makes reading code too complex.
3) We have a lot of err_* labels which do exactly the same thing, there
is no point in having all of them, also it is better to choose the name
of the label based on what it really does.
So let's refactor error handling to avoid these inconsistencies.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During restore, CRIU prints "Enqueue page-read" messages for
each page-read request [1]. However, this message does not
provide useful information, increases performance overhead
during restore and the size of log file.
$ ./zdtm.py run -t zdtm/static/maps06 -f h -k always
$ grep 'Enqueue page-read' dump/zdtm/static/maps06/56/1/restore.log | wc -l
20493
This commit replaces these log messages with a single message
that shows the number of enqueued page-read requests.
$ grep 'enqueued' dump/zdtm/static/maps06/56/1/restore.log
(00.061449) 56: nr_enqueued: 20493
[1] 91388fc
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
1. Tell which RPMs or DEBs are required in all cases.
2. Use $(info ...) everywhere.
3. Drop extra nested $(info), instead use (a document) a simpler kludge.
4. Simplify and unify the language, add missing periods.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently we have tabs + spaces on the wrapped line but the wrapped part
is not alligned to the opening bracket.
Fixes: bbe26d1b7 ("timer: fix allignment in function definition")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
CircleCI currently prints out the following warning:
This job is using a deprecated image 'ubuntu-2004:202010-01', please update to a newer image
According to https://discuss.circleci.com/t/linux-image-deprecations-and-eol-for-2024/
the recommended image name is: "image: default"
Signed-off-by: Adrian Reber <areber@redhat.com>
This patch extends the sched_policy00 test case to verify that
the SCHED_RESET_ON_FORK flag is restored correctly.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch extends CRIU with support for SCHED_RESET_ON_FORK.
When the SCHED_RESET_ON_FORK flag is set, the following rules
apply for subsequently created children:
- If the calling thread has a scheduling policy of SCHED_FIFO or
SCHED_RR, the policy is reset to SCHED_OTHER in child processes.
- If the calling process has a negative nice value, the nice value
is reset to zero in child processes.
(See 'man 7 sched')
Fixes: #2359
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
A memory interval is a half-open interval, so the condition
when pr->pe->vaddr == vma->e->end should not be interpreted
as an intersection and should cause vma to be marked with VMA_NO_PROT_WRITE.
Fixes: #2364
Signed-off-by: Artem Trushkin <at.120@ya.ru>
The restore of a task with shadow stack enabled adds these steps:
* switch from the default shadow stack to a temporary shadow stack
allocated in the premmaped area
* unmap CRIU mappings; nothing changed here, but it's important that
CRIU mappings can be removed only after switching to a temporary
shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
There are several gotachs when restoring a task with shadow stack:
* depending on the compiler options, glibc version and glibc tunables
CRIU can run with or without shadow stack.
* shadow stack VMAs are special, they must be created using a dedicated
map_shadow_stack() system call and can be modified only by a special
instruction (wrss) that is only available when shadow stack is
enabled.
* once shadow stack is enabled, it is not writable even with wrss;
writes to shadow stack can be only enabled with ptrace() and only when
shadow stack is enabled in the tracee.
* if the shadow stack is enabled during restore rather than by glibc,
calling retq after arch_prctl() that enables the shadow stack causes
#CP, so the function that enables shadow stack can never return.
Add the infrastructure required to cope with all of those:
* modify the restore code to allow trampoline (arch_shstk_trampoline)
that will enable shadow stack and call restore_task_with_children().
* add call to arch_shstk_unlock() right after the tasks are clone()ed;
this will allow unlocking shadow stack features and making shadow
stack writable.
* add stubs for architectures that do not support shadow stacks
* add implementation of arch_shstk_trampoline() and arch_shstk_unlock()
for x86, but keep it disabled; it will be enabled along with addtion
of the code that will restore shadow stack in the restorer blob
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Detect if CRIU runs with shadow stack enabled and store the result in
kerndat.
Unlike most kerndat knobs, kdat_has_shstk() does not check for
availability of the shadow stack in the kernel, but rather checks if
criu runs with shadow stack enabled.
This depends on hardware availabilty, kernel and glibc support, compiler
options and glibc tunables, so kdat_has_shstk() must be called every
time CRIU starts and its result cannot be cached.
The result will be used by the code that controls shadow stack
enablement in the next commit.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Shadow stacks must be populated using special WRSS instruction. This
instruction is only available when shadow stack is enabled, calling it
with disabled shadow stack causes #UD.
Moreover, shadow stack VMAs cannot be mremap()ed and they must be
created using map_shadow_stack() system call. This requires delaying the
restore of shadow stacks to restorer blob after the CRIU mappings are
cleared.
Introduce rst_shstk_info structure to hold shadow stack parameters
required in the restorer blob and populate this structure in
arch_prepare_shstk() method.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Shadow stack VMAs cannot be mmap()ed, they must be created using
map_shadow_stack() system call and populated using special wrss
instruction available only when shadow stack is enabled.
Premap them to reserve virtual address space and populate it to have
there contents available for later copying after enabling shadow stack.
Along with the space required by shadow stack VMAs also reserve an extra
page that will be later used as a temporary shadow stack.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
The shadow stack VMAs require special care because they can only be
created and populated using special system calls.
Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in
/proc/pid/smaps
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.
In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.
For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.
In non-daemon mode parasite_trap_cmd() returns non-positive value
which allows to distinguish daemon and non-daemon mode and properly stop
at int3 in non-daemon mode.
Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
All architectures create on-stack structure for floating point save area
in compel_get_task_regs() if the caller passes NULL rather than a valid
pointer.
The only place that calls compel_get_task_regs() with NULL for floating
point save area is parasite_start_daemon() and it is simpler to define
this strucuture on stack of parasite_start_daemon().
The availability of floating point save data is required in
parasite_start_daemon() to detect shadow stack presence early during
parasite infection and will be used in later patches.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Currently we have checkpoint/restore support only of cgroup v2 threaded
controllers. Threads originating in cgroup v1 environments will be
restored to the main thread's cgroup. This change extends the support
for a cgroups v1.
Signed-off-by: Stepan Pieshkin <stepanpieshkin@google.com>
This patch fixes the following lint error:
scripts/criu-ns:219:16: E713 [*] Test for membership should be `not in`
The change in this patch is auto-generated with `ruff --fix`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Ruff (https://github.com/astral-sh/ruff) is a Python linter
written in Rust, designed to replace Flake8. It is significantly
faster and actively maintained.
In addition to replacing flake8 with ruff, this patch also
creates separate makefile targets for ruff, shellcheck and
codespell, so that they can be tested independently.
RUFF_FLAGS can be used to specify options such as '--fix'.
Example:
make lint
make ruff RUFF_FLAGS=--fix
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following flake8 error:
python3 -m flake8 --config=scripts/flake8.cfg lib/pycriu/images/pb2dict.py
lib/pycriu/images/pb2dict.py:361:43: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The commit introducing PAGE_IS_SOFT_DIRTY has not been merged
in kernel v6.7.x.
fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
e6a9a2cbc1
As a result, CRIU fails with the following error:
Error (criu/pagemap-cache.c:199): pagemap-cache: PAGEMAP_SCAN: Invalid argument'
Error (criu/pagemap-cache.c:225): pagemap-cache: Failed to fill cache for 63 (400000-402000)'
This patch updates check_pagemap() in kerndat to check if PAGE_IS_SOFT_DIRTY is supported.
Fixes: #2334
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
We already don't treat it as error in the plugin itself, but after
returning -1 from RESUME_DEVICES_LATE hook we print debug message in
criu about failed plugin, let's return 0 instead.
While on it let's replace ret to exit_code.
Fixes: a9cbdad76 ("plugin/amdgpu: Don't print error for "No such process" during resume")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During the late stages of restore, each process being resumed gets
an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd
process info, this call with fail with -ESRCH. This is normal
behaviour, so we shouldn't print an error message for it.
Signed-off-by: David Francis <David.Francis@amd.com>
To improve readability, this patch changes the return type of
iptables_has_criu_jump_target() to a boolean, where 'true' indicates
that iptables has CRIU jump target and 'false' indicates otherwise.
Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch removes a leftover declaration for log_closedir()
which has been removed in the following commit:
dc80d6f125
log: get rid of LOG_DIR_FD_OFF and opening cwd in log_init()
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Let's use hooked nft chain which actually affects packets.
Fixes: e5f4d8c6f ("test/nfconntrack: use nft or iptables-legacy")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This change adds a new injectable fault (135) to disable PAGEMAP_SCAN and fault
back to read pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
PAGEMAP_SCAN is a new ioctl that allows to get page attributes in a more
effeciant way than reading pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
In commit [1] was introduced a mechanism to auto-generate the files:
sys-exec-tbl*.c, syscalls*.S, syscall-codes*.h, and syscall*.h.
This commit also updated the gitignore rules to ignore auto-generated
files. However, after commit [2], the path for these files has changed
and the patterns specified in gitignore are no longer needed.
[1] bbc2f133 (x86/build: generate syscalls-{64,32}.built-in.o)
[2] 19fadee9 (compel: plugins,std -- Implement syscalls in std plugin)
Reported-by: @felicitia
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The image has a too old version of nettle which does not work with gnutls.
Just upgrade to the latest to make the error go away.
Signed-off-by: Adrian Reber <areber@redhat.com>
Newer versions of 'tail' rely on inotify and after a restore 'tail' is
unhappy with the state of inotify and just stops.
This replaces 'tail' with a minimal shell based test (thanks Andrei).
Signed-off-by: Adrian Reber <areber@redhat.com>
If ioctl(TIOCSLCKTRMIOS) fails with EPERM it means that a CRIU
process lacks of CAP_SYS_ADMIN capability. But we can use
ioctl(TIOCGLCKTRMIOS) to *read* current ->termios_locked
value from the kernel and if it's the same as we already have
we can skip failing ioctl(TIOCSLCKTRMIOS) safely.
Adrian has recently posted [1] a very good patch to allow ioctl(TIOCSLCKTRMIOS)
for processes that have CAP_CHECKPOINT_RESTORE (right now it requires CAP_SYS_ADMIN).
[1] https://lore.kernel.org/all/20231206134340.7093-1-areber@redhat.com/
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
WARNINGS variable should be amended, not redefined.
We still need, e.g., `-Wno-dangling-pointer` to build
criu on loongarch64 with gcc13.
Signed-off-by: Ivan A. Melnikov <iv@altlinux.org>
Checkpoint/restore with version 25.0.0-beta.1 fails
with the following error:
$ docker start --checkpoint=c1 cr
Error response from daemon: failed to create task for container: content digest fdb1054b00a8c07f08574ce52198c5501d1f552b6a5fb46105c688c70a9acb45: not found: unknown
Release notes:
https://github.com/moby/moby/discussions/46816
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In the compel/arch/arm/plugins/std/syscalls/syscall.def, the syscall number of bind on ARM64 should be 200 instead of 235
Signed-off-by: Sally Kang <snapekang@gmail.com>
Two major highlights of this release:
* LoongArch64 support
* A lot of fixes and improvments form the Google backlog.
The full changelog can be found here: https://criu.org/Download/criu/3.19.
This marks the final release of the 3.x series. The upcoming version
will be 4.0! Additionally, the naming pattern will be changed. Any ideas
are welcome.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Newer versions of pip use an isolated virtual environment when building
Python projects. However, when the source code of CRIT is copied into
the isolated environment, the symlink for `../lib/py` (pycriu) becomes
invalid. As a workaround, we used the `--no-build-isolation` option for
`pip install`. However, this functionality has issues in some versions
of PIP [1, 2]. To fix this problem, this patch adds separate packages
for pycriu and crit, and each package is installed independently.
[1] https://github.com/pypa/pip/pull/8221
[2] https://github.com/pypa/pip/issues/8165#issuecomment-625401463
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
cgroup_ifpriomap test needs net_prio cgroup, which might not be
available. Make the .checkskip script check it.
Signed-off-by: Michał Mirosław <emmir@google.com>
At this point the correct position is already restored, so reading from
the fd results in the position being moved forward by 5 bytes.
Fixes: 9191f8728d ("criu/files-reg.c: add build-id validation functionality")
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Eventpollentry's fields are set only when ret == 3 or ret == 6. The
remaining cases can be grouped together to an error
Signed-off-by: Taemin Ha <taemin.ha@utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
line 131 checks if (ret >= 0). line 133 could be replaced by a simple else statement
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The condition meant to check fd2 instead of fd1, which is checked in
line 24.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The is_native field is a boolean. Therefore, else if() should can be
changed to a simple else{}.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This check is redundant as line 201 checks for this condition.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
read_ns_sys_file() can return an error, but we are trying to parse a
buffer before checking a return code.
CID 417395 (#3 of 3): String not null terminated (STRING_NULL)
2. string_null: Passing unterminated string buf to strtol, which expects
a null-terminated string.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
GCC's lto source:
> To avoid this problem the compiler must assume that it sees the
> whole program when doing link-time optimization. Strictly
> speaking, the whole program is rarely visible even at link-time.
> Standard system libraries are usually linked dynamically or not
> provided with the link-time information. In GCC, the whole
> program option (@option{-fwhole-program}) asserts that every
> function and variable defined in the current compilation
> unit is static, except for function @code{main} (note: at
> link time, the current unit is the union of all objects compiled
> with LTO). Since some functions and variables need to
> be referenced externally, for example by another DSO or from an
> assembler file, GCC also provides the function and variable
> attribute @code{externally_visible} which can be used to disable
> the effect of @option{-fwhole-program} on a specific symbol.
As far as I read gcc's source, ipa_comdats() will avoid placing symbols
that are either already in a user-defined section or have
externally_visible attribute into new optimized gcc sections.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The "ColumnLimit: 120" is not only allowing lines to be longer than 80
characters but it also forces line wrapping at 120 characters. If total
expression length is more than 120 characters, clang-format will try to
wrap it as close to 120 as it can, it would not even allow to wrap at 80
characters if we really want it. But as we all know 80 characters is
Linux kernel coding style default and as far as our coding style is
based on it it is really strange to prohibit wrapping lines at 80
characters...
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
One memfd can be shared by a few restored files. Only of these files is
restored with a file created with memfd_open. Others are restored by reopening
memfd files via /proc/self/fd/.
It seems unnecessary for restoring memfd memory mappings. We can always use the
origin file.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (ret) {
^~~
amdgpu_plugin.c:988:8: note: uninitialized use occurs here
xfree(buffer);
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch adds the `libdrm-dev` package to the list of CRIU
dependencies installed in CI to build CRIU with amdgpu plugin.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It means CRIU has to close it when it is not needed.
It looks more logically correct and matches the behaviour of
the RESTORE_EXT_FILE callback.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently most of the times we don't have problems with VVAR segment and
lazy restore because when VDSO is parked there is an munmap call that
calls UFFDIO_UNREGISTER on the destination address.
But we don't want to enable userfaultfd for VDSO and VVAR at the first
place.
Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
Currently page_size() returns unsigned int value that is after "bitwise
not" is promoted to unsigned long value e.g. in uffd.c
handle_page_fault. Since the value is unsigned promotion is done with 0
MSB that results in lost of MSB pagefault address bits. So make
page_size to return unsigned long to avoid such situation.
Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
When -- after restore -- sockets can't communicate, the test times out
while waiting on recvfrom(). Since the communication is local, send()
works instantaneously - so mark sockets with SOCK_NONBLOCK and report
failure if the message is not received immediately.
Signed-off-by: Michał Mirosław <emmir@google.com>
This fixes a failure to clean up after a failed test, where CRIU didn't start properly.
```
===================== Run zdtm/transition/socket-tcp in h ======================
Start test
./socket-tcp --pidfile=socket-tcp.pid --outfile=socket-tcp.out
Traceback (most recent call last):
File ".../zdtm_py.py", line 1906, in do_run_test
cr(cr_api, t, opts)
File ".../zdtm_py.py", line 1584, in cr
cr_api.dump("dump")
File ".../zdtm_py.py", line 1386, in dump
self.__dump_process = self.__criu_act(action,
File ".../zdtm_py.py", line 1224, in __criu_act
raise test_fail_exc("CRIU %s" % action)
test_fail_exc: CRIU dump
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<embedded module '_launcher'>", line 182, in run_filename_from_loader_as_main
File "<embedded module '_launcher'>", line 34, in _run_code_in_main
File ".../zdtm_py.py", line 2790, in <module>
fork_zdtm()
File ".../zdtm_py.py", line 2782, in fork_zdtm
do_run_test(tinfo[0], tinfo[1], tinfo[2], tinfo[3])
File ".../zdtm_py.py", line 1922, in do_run_test
t.kill()
File ".../zdtm_py.py", line 509, in kill
os.kill(int(self.__pid), sig)
ProcessLookupError: [Errno 3] No such process
```
Signed-off-by: Michał Mirosław <emmir@google.com>
cgroup04 test needs full control over mem and devices cgroup hierarchies.
Make the test's .checkskip script better at detecting if the cgroups are
available for use.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make the errno values reported by cgroup04 always correct and showing
relevant parameters.
Constify constant strings, while at it.
Signed-off-by: Michał Mirosław <emmir@google.com>
At least in Google's VM environment, the kernel taints are unrelated to CRIU
runs. Don't fail tests if taints change, if kernel taints are ignored.
Signed-off-by: Michał Mirosław <emmir@google.com>
They break it with each kernel rebase. More details are here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257
Last time, it was fixed a few month ago and it has been broken again in
5.15.0-1046-azure.
Let's bind-mount the CRIU directory into a test container to make it
independent of a container file system.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The version of CRIU is specified in the Makefile.versions file.
This patch generates '__varion__' value for the pycriu module.
This value can be used by crit to implement `--version`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Power ISA 3.0 added a new syscall instruction. Kernel 5.9 added
corresponding support.
Add CRIU support to recognize the new instruction and kernel ABI changes
to properly dump and restore threads executing in syscalls. Without this
change threads executing in syscalls using the scv instruction will not
be restored to re-execute the syscall, they will be restored to execute
the following instruction and will return unexpected error codes
(ERESTARTSYS, etc) to user code.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The amdgpu plugin would create a memory buffer at the size
of the largest VRAM bo (buffer object). On some systems, VRAM
size exceeds RAM size, so the largest bo might be larger than
the available memory.
Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the
size of this buffer. By default, it is set to 0, and has no
effect. When active, any bo larger than its value will be
saved to/restored from file in multiple passes.
Signed-off-by: David Francis <David.Francis@amd.com>
Check membarrier registration both ways:
1. By issuing membarrier commands and checking if they succeed.
2. By issuing MEMBARRIER_CMD_GET_REGISTRATIONS.
The first way is needed for older kernels. The second way is needed to test
MEMBARRIER_CMD_GLOBAL_EXPEDITED.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
MEMBARRIER_CMD_GET_REGISTRATIONS can tell us whether or not the process used
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED unlike the old probing method.
Falls back to the old method when MEMBARRIER_CMD_GET_REGISTRATIONS is
unavailable.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
There are multiple cases where good human readable code block is
converted to an unreadable mess by clang-format, so we don't want to
rely on clang-format completely. Also there is no way, as far as I can
see, to make clang-format only fix what we want it to fix without
breaking something.
So let's just display hints inline where clang-format is unhappy. When
reviewer sees such a warning it's a good sign that something is broken
in coding-style around this warning.
We add special script which parses diff generated by indent and
generates warning for each hunk.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is highlight that code readability is the real goal of all the
coding-style rules. We should not do coding-style just for coding-style,
e.g. when clang-format suggests crazy formating we should not follow it
if we feel it is bad.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Ctags is mentioned in the beginning of the "Edit the source code" which
is really confusing: Do you need ctags to edit CRIU code? - No. It is
just one helpful tool to browse the code, and we do not want to enforce
it. So, what is it doing in contribution guide? People who really need
it should be able to find it in Makefile or just write oneliner of their
own to collect tags...
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If there is only a single RW opened fd for a memfd, it can be used
to pass it to execveat() with AT_EMPTY_PATH to have its contents
executed. This currently works only for the original fd from
memfd_create(). For now we ignore processes that reopen the memfd's
rw and expect a particular executability trait of it. (Note: for
security purposes recent kernels have SEAL_EXEC to make memfds
non-executable.)
Signed-off-by: Michał Mirosław <emmir@google.com>
Plug a fd leak when returning error from check_pagemap().
(Cosmetic, as the process will exit soon anyway.)
Signed-off-by: Michał Mirosław <emmir@google.com>
This commit is introducing a test for the action-script functionality
of CRIU to verify that pre-dump, post-dump, pre-restore, pre-resume,
post-restore, post-resume hooks are executed during dump/restore.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Fix test of whether the kernel exposes page frame numbers to cope with the
possibility that the top of the stack is swapped out, which was happening
in about one 1 out of 3 million runs. This lead to a later failure when
trying to read the PFN of the zero page, after which criu would exit with
no error message.
Original-From: Ambrose Feinstein <ambrose@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
There is only one user of memfd_open() outside of memfd.c: open_filemap().
It is restoring a file-backed mapping and doesn't need nor expect to
update F_SETOWN nor the fd's position. Check the inherited_fd() handling
in the callers to simplify the code.
Signed-off-by: Michał Mirosław <emmir@google.com>
The 288d6a61e2 change broke all the syscall numbers.
Reported-by: Michał Mirosław <emmir@google.com>
Fixes: (288d6a61e2 "loongarch64: reformat syscall_64.tbl for 8-wide tabs")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
While each preadv() is followed by a fallocate() that removes the data
range from image files on tmpfs, temporarily (between preadv() and
fallocate()) the same data is in two places; this increases the memory
overhead of restore operation by the size of a single preadv.
Uncapped preadv() would read up to 2 GiB of data, thus we limit that to
a smaller block size (128 MiB).
Based-on-work-by: Paweł Stradomski <pstradomski@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Note: Silently drops MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED as it's
not currently detectable. This is still better than silently dropping
all membarrier() registrations.
Signed-off-by: Michał Mirosław <emmir@google.com>
The VMA_AREA_MEMFD constant was introduced with commit
29a1a88bce
memfd: add memory mapping support
This patch extends the status map used in CRIT and coredump with the
value of this constant to recognize it.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This change fixes the issue:
```
The following packages have unmet dependencies:
docker-ce : Depends: containerd.io (>= 1.6.4)
E: Unable to correct problems, you have held broken packages.
```
Signed-off-by: Andrei Vagin <avagin@google.com>
The log prefix "amdgpu_plugin:" is defined with `LOG_PREFIX` in
`amdgpu_plugin.c`. However, the prefix is also included in each
log message. As a result it appears duplicated in the log messages:
(00.044324) amdgpu_plugin: amdgpu_plugin: devices:1 bos:58 objects:148 priv_data:45696
(00.045376) amdgpu_plugin: amdgpu_plugin: Thread[0x5589] started
(00.167172) amdgpu_plugin: amdgpu_plugin: img_path = amdgpu-kfd-62.img
(00.083739) amdgpu_plugin: amdgpu_plugin : amdgpu_plugin_dump_file() called for fd = 235
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Make the scan use the order of paths that came from the user.
Fixes: 4f2e4ab3be ("irmap: add --irmap-scan-path option"; 2015-09-16)
Signed-off-by: Michał Mirosław <emmir@google.com>
Move tcp_cork() and tcp_nodelay() to the only user: page-xfer.c. While
at it, fix error messages (as they do not refer to restoring the sockopt
values) and demote them as they are not fatal to the page transfer.
Signed-off-by: Michał Mirosław <emmir@google.com>
In criu/apparmor.c: write_aa_policy(), the arg path is passed as a char
pointer. The original code used sizeof(path) to get the size of it,
which is incorrect as it always return the size of the char pointer
(typically 8 or 4), not the actual capacity of the char array.
Given that this function is only invoked with path declared as `char
path[PATH_MAX]`, replacing sizeof(path) with PATH_MAX should correctly
represent the maximum size of it.
Fixes: 8723e3f ("check: add a feature test for apparmor_stacking")
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
memfd is created by default with +x permissions set. This can be changed
by a process using fchmod() and expected to prevent using this fd for
exec(). Migrate the permissions.
Signed-off-by: Michał Mirosław <emmir@google.com>
Include the file descriptor and error code in the debug message to make
it more useful.
Fixes: e7ba90955c (2016-03-14 "cr-check: Inspect errno on syscall failures")
Signed-off-by: Michał Mirosław <emmir@google.com>
prctl(NO_NEW_PRIVS) when set prevents child processes gaining
capabilities not in permitted set. In this case, inability to
clear capability from BSET that is not in the permitted set is
harmless.
Signed-off-by: Michał Mirosław <emmir@google.com>
When restoring on a kernel that has different number of supported
capabilities than checkpoint one, check that the extra caps are unset.
There are two directions to consider:
1) dump.cap_last_cap > restore.cap_last_cap
- restoring might reduce the processes' capabilities if restored
kernel doesn't support checkpointed caps. Warn.
2) dump.cap_last_cap < restore.cap_last_cap
- restoring will fill the extra caps with zeroes. No changes.
Note: `last_cap` might change without affecting `n_words`.
Signed-off-by: Michał Mirosław <emmir@google.com>
Skip calling setgroups() when the list of auxiliary groups already has
the values we want. This allows restoring into an unprivileged user
namespace where setgroups() is disabled.
From: Ambrose Feinstein <ambrose@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
When CRIU is run with the task's credentials on restore, don't set uids
and gids. This avoids the need to modify the SECURE_NO_SETUID_FIXUP flag
which requires CAP_SETPCAP.
From: Andy Tucker <agtucker@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Note: This removes the difference in calling convention of
restore_file_perms() returning -errno that was the only call that did
this in the caller.
From: Radosław Burny <rburny@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Add generic wrappers for fchown() and fchmod() that skip the calls if
no changes are needed. This will allow to unify places where we can
avoid errors when no-op requests are not permitted.
Signed-off-by: Michał Mirosław <emmir@google.com>
NR_fstat is a deprecated syscall, some
modern architectures such as riscv and
loongarch64 no longer support this syscall.
It is usually replaced by NR_statx.
NR_statx is supported since linux 4.10.
Signed-off-by: znley <shanjiantao@loongson.cn>
Fixes: #2222
Fixes: f1c8d38 ("kerndat: check if setsockopt IPV6_FREEBIND is supported")
Signed-off-by: Yan Evzman <yevzman@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
These errors originate from the filesystem scanning in irmap.c and are mostly
benign. Nevertheless, if they do result in a failed irmap lookup, that failed
lookup is more interesting from an application perspective.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make logs about inaccessible mounts warnings, as the failures are
normally harmless (e.g. failure to read /dev/cgroup) and don't
make the CRIU run fail. (If it happens that the fsnotify can't
find a file, then to debug, full CRIU logs will be necessary anyway.)
Signed-off-by: Michał Mirosław <emmir@google.com>
Errors in early restore.log for status=1 from a subprocess are confusing,
esp. that they don't show what command failed. Since the result is
either ignored or logged anyway, mark the calls as "can fail".
Signed-off-by: Michał Mirosław <emmir@google.com>
This makes the error to mount cgroup hierarchy a bit less noisy:
Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'
Instead of
Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'
Error (criu/cgroup.c:715): cg: failed walking /proc/self/fd/-1/zdtmtst for empty cgroups: No such file or directory'
Signed-off-by: Michał Mirosław <emmir@google.com>
This patch removes the code for Python 2 compatibility introduced
with commit e65c7b5 (zdtm: Replace imp module with importlib).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch is replacing the set_blocking() function with
os.set_blocking(). This function was introduced for compatibility with
Python 2 in commit 8094df8di (criu-ns: Add tests for criu-ns script).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit removes the checks for the Python 2 binary in the makefile
and makes sure that ZDTM tests always use python3. Since support for
Python 2 has been dropped, these checks are no longer needed.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit removes the dependency on the __future__ module, which was
used to enable Python 3 features in Python 2 code. With support for
Python 2 being dropped, it is no longer necessary to maintain backward
compatibility.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When building with pip version 20.0.2 or older, the pip install
command creates a temporary directory and copies all files from
./crit. This results in the following error message:
ModuleNotFoundError: No module named 'pycriu'
This error appears because the symlink 'pycriu' uses a relative path
that becomes invalid '../lib/py/'.
The '--no-build-isolation' option for pip install is needed to enable
the use of pre-installed dependencies (e.g., protobuf) during build.
The '--ignore-installed' option for pip is needed to avoid an error when
crit is already installed. For example, crit is installed in the GitHub
CI environment as part of the criu OBS package as a dependency for
podman.
Distributions such as Arch Linux have adopted an externally managed
python installation in compliance with PEP 668 [1] that prevents pip
from breaking the system by either installing packages to the system or
locally in the home folder. The '--break-system-packages' [2] option
allows pip to modify an externally managed Python installation.
[1] https://peps.python.org/pep-0668/
[2] https://pip.pypa.io/en/stable/cli/pip_uninstall/
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch reverts changes introduced with the following commits:
4feb07020d
crit: enable python2 or python3 based crit
b78c4e071a
test: fix crit test and extend it
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch reverts changes introduced for Python 2 compatibility
in commits:
1c866db (Add new files for running criu-coredump via python 2 or 3)
3180d35 (Add support for python3 in criu-coredump).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We have disabled CentOS 7 tests in CI. This patch reverts the
changes introduced in the following commit:
24bc083653
ci: disable some tests on CentOS 7
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Instead of opening the image directly, the commit refactors the
asciinema image embedded link to redirect users to the corresponding
video.
Signed-off-by: Abhishek Guleri <abhishekguleri24@gmail.com>
With the parasite socket clash now guaranteed not to happen,
the comment becomes obsolete. netns is steel needed though, so
update the comment to point at the requirement.
Change-Id: I3cfb253cd5c53b91b955fcb001530b4aee5129f4
Signed-off-by: Michał Mirosław <emmir@google.com>
Instead of relying on chance of CLOCK_MONOTONIC reading being unique,
use pid namespace ID that combined with the process ID will make it
unique on the machine level.
If pidns is not enabled on a kernel we'll get ENOENT, but then CRIU's
pid will already be unique. If there is some other error, log it but
continue, as the socket clash (if it happens) will result in a failed
run anyway.
Fixes: 45e048d77a (2022-03-31 "criu: generate unique socket names")
Fixes: 408a7d82d6 (2022-02-12 "util: add an unique ID of the current criu run")
Change-Id: I111c006e1b5b1db8932232684c976a84f4256e49
Signed-off-by: Michał Mirosław <emmir@google.com>
If not dumping netns nor connections, nsid support is not used. Don't
fail the run as if the support is needed, the dumping process will fail
later.
Change-Id: I39a086756f6d520c73bb6b21eaf6d9fb49a18879
Signed-off-by: Michał Mirosław <emmir@google.com>
kerndat_nsid() is not used outside kerndat.c. Make it static.
Change-Id: I52e518ecb7c627cc1866e373411b2be3f71a2c9d
Signed-off-by: Michał Mirosław <emmir@google.com>
If the error is ignored it is not important enough - make it a warning
instead.
From: Mian Luo <mianl@google.com>
Change-Id: If2641c3d4e0a4d57fdf04e4570c49be55f526535
Signed-off-by: Michał Mirosław <emmir@google.com>
Google's RPC client process is in a different pidns and has more privileges --
CRIU can't open its /proc/<pid>/fd/<fd>. For images_dir_fd to be useful here
it would need to refer to a passed or CRIU's fd.
From: Michał Cłapiński <mclapinski@google.com>
Change-Id: Icbfb5af6844b21939a15f6fbb5b02264c12341b1
Signed-off-by: Michał Mirosław <emmir@google.com>
New 'query-ext-files' action for `criu dump` is sent after
freezing the process tree. This allows to defer gathering
the external file list when the process tree is in a stable
state and avoids race with the process creating and deleting
files.
Change-Id: Iae32149dc3992dea086f513ada52cf6863beaa1f
Signed-off-by: Michał Mirosław <emmir@google.com>
Container runtimes commonly use CRIU with RPC. However, this prevents
the use of action-scripts set in a CRIU configuration file due to the
explicit scripts mode introduced with the following commit:
ac78f13bdf
actions: Introduce explicit scripts mode
This patch enables container checkpoint/restore with action-scripts
specified via configuration file.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
nla_get_s32() was added to libnl 3.2.7 in 2015. Remove CRIU's definition
as it breaks build when statically linking the binary.
From: Uros Prestor <urosp@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
When trying to build CRIU with libbsd enabled the compilation fails due
to duplicate definition of __aligned macro. Other such definitions are
already wrapped with #ifndef make __aligned definition consistent and
make it easier in the future to use the libbsd features if needed.
Signed-off-by: Michał Mirosław <emmir@google.com>
$LDFLAGS can contain `-Ldir`s that are required by '-lib's in $LIBS.
Reverse the order so that `-L` options make effect.
Signed-off-by: Michał Mirosław <emmir@google.com>
`make` without `-s` option will normally show the commands executed. In
the case of detecting build environment features current makefile will
cause detected features to be seen as 'echo #define' commands, but not
detected ones will be silent. Change it so that all tried features can
be seen (outside of make's silent mode) regardless of detection result.
Signed-off-by: Michał Mirosław <emmir@google.com>
The test for HAS_MEMFD is empty and noit used. Remove it.
Fixes: 5ee1ac1f28 ("criu: remove FEATURE_TEST_MEMFD")
Change-Id: I43b8f0cfd50ce9bdf93dafb647377318df1deae8
Signed-off-by: Michał Mirosław <emmir@google.com>
During dump, CRIU stores the structs representing sockets in a statically sized
hashmap of size 32. We have some (admittedly crazy) tasks that use tens of
thousands of sockets, and seem to spend most of the dump time iterating over
the linked lists of the map.
16K is chosen arbitrarily, so that it reduces the lengths of the chains to few
elements on average, while not introducing significant memory overhead.
From: Radosław Burny <rburny@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
The fail() macro provides a new line character at the end of the
message. This patch fixes the following lint check that currently
fails in CI:
$ git --no-pager grep -E '^\s*\<(pr_perror|fail)\>.*\\n"'
test/zdtm/static/thp_disable.c: fail("prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c: fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c: fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);
test/zdtm/static/thp_disable.c: fail("post-migration prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c: fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c: fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);
Fixes: #2193
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Make it possible to skip network lock to enable uses that break connections
anyway to work without iptables/nftables being present.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make it clear that the option numbers are indexes not the option
identifiers ("names"). Also show the value change that prompted test
failure.
Signed-off-by: Michał Mirosław <emmir@google.com>
We don't want test framework to change its behaviour on whether we
run a single or multiple tests in a run. When we shard the test suite
it can result in some shards having a single test to run and
unexpectedly change the test output format.
Signed-off-by: Michał Mirosław <emmir@google.com>
This commit revises the error handling in the fdspy test. Previously,
a failure case could have been incorrectly reported as successful because
of a specific check `pass != 0`, leading to potential false positives
when `check_pipe_ends()` returned `-1` due to a read/write pipe error.
To improve this, we've adjusted the error handling to return `0` in case
of any error. As such, the final success condition remains unchanged. This
approach will help accurately differentiate between successful and failed
cases, ensuring the output "All OK" is printed for success, and "Something
went WRONG" for any failure.
Fixes: 5364ca3 ("compel/test: Fix warn_unused_result")
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
Apparently Skylake uses init-optimization when saving FPU state, and ptrace()
returns XSTATE_BV[0] = 0 meaning FPU was not used by a task (in init state).
Since CRIU restore uses sigreturn to restore registers, FPU state is always
restored. Fill the state with default values on dump to make restore happy.
Signed-off-by: Michał Mirosław <emmir@google.com>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.
Fixes: bbbd597b41 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <emmir@google.com>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.
Signed-off-by: Michał Mirosław <emmir@google.com>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.
Signed-off-by: Michał Mirosław <emmir@google.com>
This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.
Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/
It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.
First purpose of it is to fix infinite looping in create_posix_timers on
old pre 3.11 kernels.
Second purpose is to allow kernel interface of creating posix timers
with desired id change from iterating with predictable next id to just
setting next id directly. And at the same time removing predictable next
id so that criu with this patch would not get to infinite loop in
create_posix_timers if this happens.
Thanks a lot to Thomas!
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.
This reverse the changes made in a15a63fce0
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes remove and update the changes introduced in
7177938e60 in favor of the
Python version in CI.
os.waitstatus_to_exitcode() function appeared in Python 3.9
Related to: #1909
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().
Related to: #1909
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.
We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00000988 (2440)
On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008)
This increase the xsave area from one page to four pages.
Without this patch the fpu03 test fails, with this patch it works again.
Signed-off-by: Adrian Reber <areber@redhat.com>
The pipe_size type is unsigned int, when the fcntl call fails and
return -1, it will cause a negative rollover problem.
Signed-off-by: zhoujie <zhoujie133@huawei.com>
The TOS(type of service) field in the ip header allows you specify the
priority of the socket data.
Signed-off-by: Suraj Shirvankar <surajshirvankar@gmail.com>
2023-10-22 13:29:25 -07:00
636 changed files with 27204 additions and 4518 deletions
@ -8,8 +8,8 @@ Here are some useful hints to get involved.
* We have both -- [very simple](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement) and [more sophisticated](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3A%22new+feature%22) coding tasks;
* CRIU does need [extensive testing](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Atesting);
* Documentation is always hard, we have [some information](https://criu.org/Category:Empty_articles) that is to be extracted from people's heads into wiki pages as well as [some texts](https://criu.org/Category:Editor_help_needed) that all need to be converted into useful articles;
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lists.openvz.org/mailman/listinfo/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lists.openvz.org/mailman/listinfo/criu).
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lore.kernel.org/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lore.kernel.org/criu).
Below we describe in more detail recommend practices for CRIU development.
* Spread the word about CRIU in [social networks](http://criu.org/Contacts);
* If you're giving a talk about CRIU -- let us know, we'll mention it on the [wiki main page](https://criu.org/News/events);
@ -27,44 +27,67 @@ The repository may contain multiple branches. Development happens in the **criu-
To clone CRIU repo and switch to the proper branch, run:
First, you need to install compile-time dependencies. Check [Installation dependencies](https://criu.org/Installation#Dependencies) for more info.
Follow these steps to compile CRIU from source code.
To compile CRIU, run:
#### Installing build dependencies
First, you need to install the required build dependencies. We provide scripts to simplify this process for several Linux distributions in [contrib/dependencies](contrib/dependencies). For a complete list of dependencies, please refer to the [installation guide](https://criu.org/Installation).
##### On Ubuntu/Debian-based systems:
```
make
./contrib/dependencies/apt-packages.sh
```
##### On Fedora/CentOS-based systems:
```
./contrib/dependencies/dnf-packages.sh
```
##### Using Nix:
```
nix develop
```
#### Compiling CRIU
Once the dependencies are installed, you can compile CRIU by running the `make` command from the root of the source directory:
```
make
```
This should create the `./criu/criu` executable.
## Edit the source code
If you use ctags, you can generate the ctags file by running
```
make tags
```
When you change the source code, please keep in mind the following code conventions:
* code is written to be read, so the code readability is the most important thing you need to have in mind when preparing patches
* we prefer tabs and indentations to be 8 characters width
* CRIU mostly follows [Linux kernel coding style](https://www.kernel.org/doc/Documentation/process/coding-style.rst), but we are less strict than the kernel community.
* we prefer line length of 80 characters or less, more is allowed if it helps with code readability
* CRIU mostly follows [Linux kernel coding style](https://www.kernel.org/doc/Documentation/process/coding-style.rst), but we are less strict than the kernel community
Other conventions can be learned from the source code itself. In short, make sure your new code
looks similar to what is already there.
Other conventions can be learned from the source code itself. In short, make sure your new code looks similar to what is already there.
The following command can be used to automatically run a code linter for Python files (flake8), Shell scripts (shellcheck),
## Automatic tools to fix coding-style
Important: These tools are there to advise you, but should not be considered as a "source of truth", as tools also make nasty mistakes from time to time which can completely break code readability.
The following command can be used to automatically run a code linter for Python files (ruff), Shell scripts (shellcheck),
text spelling (codespell), and a number of CRIU-specific checks (usage of print macros and EOL whitespace for C files).
```
make lint
make lint
```
In addition, we have adopted a [clang-format configuration file](https://www.kernel.org/doc/Documentation/process/clang-format.rst)
@ -74,7 +97,7 @@ results in decreased readability, we may choose to ignore these errors.
Run the following command to check if your changes are compliant with the clang-format rules:
```
make indent
make indent
```
This command is built upon the `git-clang-format` tool and supports two options `BASE` and `OPTS`. The `BASE` option allows you to
@ -84,27 +107,57 @@ can use `BASE=origin/criu-dev`. The `OPTS` option can be used to pass additional
to check the last *N* commits for formatting errors, without applying the changes to the codebase you can use the following command.
```
make indent OPTS=--diff BASE=HEAD~N
make indent OPTS=--diff BASE=HEAD~N
```
Note that for pull requests, the "Run code linter" workflow runs these checks for all commits. If a clang-format error is detected
we need to review the suggested changes and decide if they should be fixed before merging.
Here are some bad examples of clang-format-ing:
* if clang-format tries to force 120 characters and breaks readability - it is wrong:
```
@@ -58,8 +59,7 @@ static int register_membarriers(void)
If you get tired of typing `--to=criu@openvz.org` all the time,
If you get tired of typing `--to=criu@lists.linux.dev` all the time,
you can configure that to be automatically handled as well:
```
git config sendemail.to criu@openvz.org
git config sendemail.to criu@lists.linux.dev
```
If a developer is sending another version of the patch (e.g. to address
@ -345,7 +398,7 @@ version if needed though).
### Mail patches
The patches should be sent to CRIU development mailing list, `criu AT openvz.org`. Note that you need to be subscribed first in order to post. The list web interface is available at https://openvz.org/mailman/listinfo/criu; you can also use standard mailman aliases to work with it.
The patches should be sent to CRIU development mailing list, `criu AT lists.linux.dev`. Note that you need to be subscribed first in order to post. The list web interface is available at https://lore.kernel.org/criu; you can also use standard mailman aliases to work with it.
Please make sure the email client you're using doesn't screw your patch (line wrapping and so on).
@ -362,5 +415,3 @@ sometimes a patch may fly around a week before it gets reviewed.
Wiki article: [Continuous integration](https://criu.org/Continuous_integration)
CRIU tests are run for each series sent to the mailing list. If you get a message from our patchwork that patches failed to pass the tests, you have to investigate what is wrong.
We also recommend you to [enable Travis CI for your repo](https://criu.org/Continuous_integration#Enable_Travis_CI_for_your_repo) to check patches in your git branch, before sending them to the mailing list.
- [A simple example of usage](http://criu.org/Simple_loop)
- [Examples of more advanced usage](https://criu.org/Category:HOWTO)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/FAQ)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/index.php?title=FAQ)