When `opts.pid` is explicitly set to `os.getpid()`, `pycriu` fails to
daemonize the `criu` process. This causes `criu` to run as a child of
the dumped process, leading to the error "The criu itself is within
dumped tree".
This can be fixed by modifying `_send_req_and_recv_resp` to check if the
target PID matches the current process PID. If so, it enables daemon
mode, ensuring `criu` is detached and the dump succeeds.
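The check described above can be sketched as follows. This is only an illustration, not the actual pycriu code; `needs_daemon_mode` is a hypothetical helper name.

```python
import os


def needs_daemon_mode(target_pid):
    """Return True when the dump target is the calling process itself.

    Hypothetical helper: in that case criu has to be detached, otherwise
    it would end up inside the process tree it is asked to dump.
    """
    return target_pid == os.getpid()
```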
Signed-off-by: unichronic <ishuvam.pal@gmail.com>
In a simple case where the parent process and the child one are in one
pid namespace we can safely use vpid(item) to ptrace the child. But, for
the cases where the child is a pid namespace init, or the child is put
into an external pid namespace, the parent and the child have different
pid namespaces and using vpid(item) (which e.g. for init will always be
1 here) to ptrace the child process is incorrect.
Let's use the pid reported to us from clone as it's always the right pid
of the child from the parent's point of view.
Fixes: 7dd583002 ("restore: add infrastructure to enable shadow stack")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The return value of sys_rseq was previously ignored during
unregistration, under the assumption that it would not fail if the rseq
structure was properly registered.
However, if sys_rseq fails, the kernel retains the registration. If the
memory containing the rseq structure is subsequently unmapped or reused,
kernel updates to the rseq area can cause the process to crash (e.g.,
via SIGSEGV).
Check the return value of sys_rseq. If it fails, log the error code and
abort the restoration process. This makes rseq unregistration failures
fatal and explicit, aiding in debugging and preventing later obscure
crashes.
Signed-off-by: liqiang2020 <liqiang64@huawei.com>
Fedora rawhide ships a pre-release of GCC 16 which produces the following
error:
uprobes.c:34:22: error: variable ‘dummy’ set but not used [-Werror=unused-but-set-variable=]
34 | volatile int dummy = 0;
| ^~~~~
Mark this variable as "__maybe_unused" to fix the error.
Signed-off-by: Adrian Reber <areber@redhat.com>
In some cases, CRIU can observe tasks that exit during checkpointing,
and sets the state of these tasks to COMPEL_TASK_DEAD.
This patch adds a string representation of this value that can be used
by CRIT when decoding the images.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU defines the following constants for task state in compel/include/uapi/task-state.h
COMPEL_TASK_ALIVE = 0x01
COMPEL_TASK_STOPPED = 0x03
COMPEL_TASK_ZOMBIE = 0x06
Thus, we need to swap the values for "zombie" and "stopped" used in CRIT.
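A minimal sketch of such a value-to-name table is shown below. The three values are the ones listed above; the table name, the helper name, the "alive" label, and the hex fallback for unknown values are assumptions made for illustration and may differ from what CRIT does.

```python
# Values as listed above from compel/include/uapi/task-state.h.
TASK_STATE_NAMES = {
    0x01: "alive",
    0x03: "stopped",
    0x06: "zombie",
}


def task_state_name(value):
    """Map a task state value to a readable name (hypothetical helper)."""
    # Unknown values are shown as hex; this fallback is an assumption.
    return TASK_STATE_NAMES.get(value, hex(value))
```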
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When running the command 'make docker-test', almost all zdtm tests fail,
logging 'ip: not found', because the 'ip' command from the iproute2
package is missing. Add the package to the list of dependencies in
'apt-packages.sh' so that the tests can run.
Signed-off-by: ImranullahKhann <imranullahkhann2004@gmail.com>
As of Alpine Linux 3.19, musl libc no longer contains separate
fopen64(), fseeko64(), or ftello64() functions. This causes building
CRIU with amdgpu plugin to fail with the following error:
amdgpu_plugin.c: In function 'parallel_restore_bo_contents':
amdgpu_plugin.c:2286:17: error: implicit declaration of function 'fseeko64'; did you mean 'fseeko'? [-Wimplicit-function-declaration]
2286 | fseeko64(bo_contents_fp, entry->read_offset + offset, SEEK_SET);
| ^~~~~~~~
| fseeko
make[2]: *** [Makefile:31: amdgpu_plugin.so] Error 1
make[1]: *** [Makefile:363: amdgpu_plugin] Error 2
To fix this, add the missing $(DEFINES) to plugin builds, and since we
always compile with _FILE_OFFSET_BITS=64, we don't need the 64 suffix.
Fixes: #2826
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The initialization of the struct timespec used as clockid input
parameter was removed in commit:
b4441d1bd8 ("restorer.c: rm unneded struct init")
This causes the build to fail on Alpine with clang version 21.1.2:
GEN criu/pie/parasite-blob.h
criu/pie/restorer.c:1230:39: error: variable 'ts' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
1230 | if (sys_clock_gettime(t->clockid, &ts)) {
| ^~
1 error generated.
make[2]: *** [/criu/scripts/nmk/scripts/build.mk:118: criu/pie/restorer.o] Error 1
make[1]: *** [criu/Makefile:59: pie] Error 2
make: *** [Makefile:278: criu] Error 2
To fix this, we remove the "const" from the declaration of
clock_gettime. Since the kernel writes the current time into
the struct timespec provided by the caller, the pointer must
be writable.
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
After commit [1] we accidentally stopped reporting the errors from
kerndat_has_timer_cr_ids(), let's fix that.
Fixes: 1eaa870cc ("kerndat: check that hardware breakpoints work") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The "Dealing with error returns from close()" section of "man 2 close" says:
"Retrying the close() after a failure return is the wrong thing to do"
We should not leave the fd there, attempting to close it again on next
close()/close_safe() may lead to accidentally closing something else.
This is consistent with the kernel code where sys_close() removes fd from
fdtable in this stack:
+-> sys_close
+-> file_close_fd
+-> file_close_fd_locked
+-> rcu_assign_pointer(fdt->fd[fd], NULL)
If there was an fd this stack is always reached and fd is always
removed.
Let's replace the fd with -1 after close no matter what.
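The intended behaviour can be illustrated with a small Python analogue. This is a sketch, not the C implementation: a one-element list stands in for the `int *fd` argument, and the return values are an assumption.

```python
import os


def close_safe(fd_holder):
    """Close fd_holder[0] at most once and always leave -1 behind."""
    fd = fd_holder[0]
    if fd < 0:
        return 0  # already closed, nothing to do
    try:
        os.close(fd)
        return 0
    except OSError:
        return -1  # report the failure, but never retry the close
    finally:
        # The descriptor is gone from the fd table whether or not close()
        # reported an error, so forget the number unconditionally.
        fd_holder[0] = -1
```

Because the stored value is reset even on failure, a later call cannot accidentally close an unrelated descriptor that reused the same number.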
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The README currently uses an external link to criu.org for the embedded
CRIU logo. Loading this URL when viewing the README on GitHub sometimes
fails with "Error Fetching Resource". Using a local copy of the logo
fixes this issue.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Docker version 28 broke container restore in combination with network
namespaces. The workaround in the CI script was excluding Docker version
28. Now that there is also Docker version 29, which is still broken,
this also excludes Docker version 29.
Signed-off-by: Adrian Reber <areber@redhat.com>
Commit "plugin: Add DUMP_DEVICES_LATE callback" introduced a new plugin
callback that is invoked in cr_dump_tasks(). The return value of this
callback was assigned to the variable ret. However, this variable is later
used as the return value when goto err is triggered in subsequent
conditions. As a result, CRIU exits with "Dumping finished successfully" even
when some actions have failed and inventory.img has not been created.
To fix this, we replace ret with exit_code and use it only when it is
actually needed.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Introduce an opt-in mode for building and running ZDTM static tests
with Guarded Control Stack (GCS) enabled on AArch64.
Changes:
- Support `GCS_ENABLE=1` builds, adding `-mbranch-protection=standard`
and `-z experimental-gcs=check` to CFLAGS/LDFLAGS.
- Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.
- Update %.pid rules to prefix test binaries with `$(TEST_ENV)`
so the tunables are set when running tests.
- Add Makefile rules for selectively enabling GCS in tests
Usage:
# Build and run with GCS enabled
make -C zdtm/static GCS_ENABLE=1 posix_timers
GCS_ENABLE=1 ./zdtm.py run --keep-img=always \
-t zdtm/static/posix_timers
By default (`GCS_ENABLE` unset or 0), test builds and runs are
unchanged.
NOTE: This assumes that the test victim was also compiled using
GCS_ENABLE=1 so that the proper GCS AArch64 ELF headers are present.
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
This commit finalizes AArch64 Guarded Control Stack (GCS)
support by wiring the full dump and restore flow.
The restore path adds the following steps:
- Define shared AArch64 GCS types and constants in a dedicated header
for both compel and CRIU inclusion
- compel: add get/set NT_ARM_GCS via ptrace, enabling user-space
GCS state save and restore.
- During restore, switch to the new GCS (via GCSSTR) to place the
capability token and the sa_restorer address
- arch_shstk_trampoline(): enable GCS in a trampoline using
prctl(PR_SET_SHADOW_STACK_STATUS, ...) via inline SVC. The trampoline
is needed because we can't RET without a valid GCS.
- restorer: map the recorded GCS VMA, populate contents top-down with
GCSSTR, write the signal capability at GCSPR_EL0 and the valid token at
GCSPR_EL0-8, then switch to the rebuilt GCS (GCSSS1)
- Save and restore registers via ptrace
- Extend restorer argument structures to carry GCS state
into post-restore execution
- Add shstk_set_restorer_stack(): sets tmp_gcs to temporary restorer
shadow stack start
- Add gcs_vma_restore implementation (required for mremap of the GCS VMA)
Tested with:
GCS_ENABLE=1 ./zdtm.py run -t zdtm/static/env00
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Add debug and info messages to log Guarded Control Stack state when
dumping AArch64 threads. This includes the following values:
- gcspr_el0
- features_enabled
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
- Define user_aarch64_gcs_entry in core-aarch64.proto to store
Guarded Control Stack state (gcspr_el0, features_enabled).
- Extend thread_info_aarch64 with an optional gcs field
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Introduce an opt-in mode for building and running compel tests
with Guarded Control Stack (GCS) enabled on AArch64.
Changes:
- Extend compel/test/infect to support `GCS_ENABLE=1` builds,
adding `-mbranch-protection=standard` and
`-z experimental-gcs=check` to CFLAGS/LDFLAGS.
- Export required GLIBC_TUNABLES at runtime via `TEST_ENV`.
Usage:
make -C compel/test/infect GCS_ENABLE=1
make -C compel/test/infect GCS_ENABLE=1 run
By default (`GCS_ENABLE` unset or 0), builds and runs are unchanged.
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
When GCS is enabled, the kernel expects a capability token at GCSPR_EL0-8
and sa_restorer at GCSPR_EL0-16 on rt_sigreturn. The sigframe must be
consistent with the kernel’s expectations, with GCSPR_EL0 advanced by -8
having it point to the token on signal entry. On rt_sigreturn, the kernel
verifies the cap at GCSPR_EL0, invalidates it and increments GCSPR_EL0 by 8
at the end of gcs_restore_signal().
Implement parasite_setup_gcs() to:
- read NT_ARM_GCS via ptrace(PTRACE_GETREGSET)
- write (via ptrace) the computed capability token and restorer address
- update GCSPR_EL0 to point to the token's location
Call parasite_setup_gcs() from parasite_start_daemon() so the sigreturn
frame satisfies the kernel's expectations.
Tests with GCS remain opt‑in:
make -C compel/test/infect GCS_ENABLE=1 && make -C compel/test/infect run
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: cleanup fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Add basic prerequisites for Guarded Control Stack (GCS) state on AArch64.
This adds a gcs_context to the signal frame and extends user_fpregs_struct_t to
carry GCS metadata, preparing the groundwork for GCS in the parasite.
For now, the GCS fields are zeroed during compel_get_task_regs(), technically
ignoring GCS since it does not reach the control logic yet; that will be
introduced in the next commit.
The code path is gated and does not affect normal tests. It can be
explicitly enabled and tested via:
make -C infect GCS_ENABLE=1 && make -C infect run
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: clean up fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Introduce ARM64 Guarded Control Stack (GCS) constants and macros
in a new uapi header for use in both CRIU and compel.
Includes:
- NT_ARM_GCS type
- prctl(2) constants for GCS enable/write/push modes
- Capability token helpers (GCS_CAP, GCS_SIGNAL_CAP)
- HWCAP_GCS definition
These are based on upstream Linux definitions
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
Refactor user_fpregs_struct_t to wrap user_fpsimd_state in a
dedicated struct, so that it can later be extended by simply
adding new members.
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
At least in tests running on Fedora rawhide, the following error can be
seen:
```
criu/tty.c: In function 'pts_fd_get_index':
criu/tty.c:262:21: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
262 | char *pos = strrchr(link->name, '/');
|
```
This fixes it.
Signed-off-by: Adrian Reber <areber@redhat.com>
Static code analysis reported:
1. criu/cr-restore.c:2438:2: var_decl: Declaring variable "end_vma"
without initializer.
4. criu/cr-restore.c:2451:5: assign: Assigning: "s_vma" = "&end_vma",
which points to uninitialized data.
7. criu/cr-restore.c:2449:4: uninit_use: Using uninitialized value
"s_vma->list.next".
This tries to fix it by initializing the variable.
Signed-off-by: Adrian Reber <areber@redhat.com>
Static code analysis reported:
criu/cr-dump.c:2328:2: identical_branches: The same code is executed
when the condition "ret" is true or false, because the code in the
if-then branch and after the if statement is identical. Should the if
statement be removed?
This is a fix for the warning.
Signed-off-by: Adrian Reber <areber@redhat.com>
The syntax of the inherit-fd functionality for unix socket and pipe
includes a colon.
Fixes: 0df3f79fc0 ("criu(8): fix --inherit-fd description")
Fixes: c37324b6d0 ("crtools: describe the inherit-fd option")
Signed-off-by: Mark Polyakov <mark@thundercompute.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
On AMD Instinct MI300 systems, restoring a large GPU application can
fail because the checkpoint size is too large and the maximum value of
an offset (with integer type) is insufficient. This problem occurs when
the total size of all buffer objects exceeds INT_MAX. No single buffer
has to be too large for this; it can also happen with a large number
of small buffers.
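The overflow can be reproduced with a small sketch that uses `ctypes` to mimic C integer widths. The buffer sizes are made-up example values, and the 64-bit type used for comparison is an assumption; the message does not state which type the fix uses.

```python
import ctypes

INT_MAX = 2**31 - 1

# Nine buffers of 256 MiB each: every buffer is far below INT_MAX,
# but their sum is not.
sizes = [256 * 2**20] * 9


def total_offset_as_int32(buffer_sizes):
    """Accumulate sizes into a 32-bit signed int, as a C `int` would."""
    return ctypes.c_int32(sum(buffer_sizes)).value


def total_offset_as_uint64(buffer_sizes):
    """Accumulate sizes into a 64-bit unsigned int."""
    return ctypes.c_uint64(sum(buffer_sizes)).value
```

With these sizes the 32-bit total wraps around to a negative number, while the 64-bit total holds the real offset.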
Fixes: #2812
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
These header files are copied directly from the Linux kernel and contain
typos. We skip these files in codespell to simplify maintenance.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch updates drm.h and amdgpu_drm.h kernel headers,
and adds drm_mode.h (included by drm.h) from the rocm-7.1.0
release tag.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable]
167 | int num_bos = 0;
|
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add a comment that explains the purpose of `retry_needed`.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Use `return 0` on success in `post_dump_dmabuf_check()` for consistency
with other functions.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The pr_info lines beginning with "CC3" and "TWI" were not meant to be
included in the patch.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.
The checkpoint of the dmabuf fd requires the device file
it was exported from to have already been dumped.
To identify which device this dmabuf fd was exported from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.
Signed-off-by: David Francis <David.Francis@amd.com>
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.
The number of device files is not consistent; a process may
have multiple copies of the drm device files open.
Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.
Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h.
Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These transfers occur over fdstore, with
the handle-fdstore id relationships kept in shared memory. There is a
new plugin callback, RESTORE_INIT, to create the shared memory.
During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.
During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.
So that drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs.
Update the protobuf messages with new amdgpu drm information.
Signed-off-by: David Francis <David.Francis@amd.com>
The amdgpu plugin usually calls drm ioctls through the libdrm
wrappers. However, amdgpu restore requires dealing with dmabufs
and gem handles directly, which means drm ioctls must be
called directly.
Add the drm.h header (from the kernel's uapi).
Signed-off-by: David Francis <David.Francis@amd.com>
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.
Signed-off-by: David Francis <David.Francis@amd.com>
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.
Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.
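The retry convention could be pictured roughly as in the sketch below. It is not the logic in files_ext.c: `restore_all`, `restore_one`, and the `max_rounds` limit are hypothetical names and choices made for illustration only.

```python
RETRY = 1  # plugin return value meaning "try this file again later"


def restore_all(files, restore_one, max_rounds=10):
    """Restore every file, re-queueing those whose plugin asks for a retry."""
    pending = list(files)
    for _ in range(max_rounds):
        retry = []
        for f in pending:
            ret = restore_one(f)
            if ret == RETRY:
                retry.append(f)
            elif ret != 0:
                return ret  # any other non-zero value is a real error
        if not retry:
            return 0
        pending = retry
    return -1  # assumption: give up after max_rounds passes
```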
Signed-off-by: David Francis <David.Francis@amd.com>
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.
Ignore all VMAs on device files.
Signed-off-by: David Francis <David.Francis@amd.com>
This patch implements the entire logic to enable the offloading of
buffer object content restoration.
The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.
It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.
This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently the restore of buffer objects consumes a significant amount of
time. However, this part has no logical dependencies on other restore
operations. This patch introduces some structures and helper
functions for the target process to offload this task to the main CRIU
process.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain stream socket and stores this
socket in `fdstore`.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Major changes:
* plugins/amdgpu: Implement parallel restore
* Handle processes with uprobes vma
* Fix: getsockopt usage for SO_PASSCRED/SO_PASSSEC on Linux 6.16
* Relax ELF magic check to support MIPS libraries
* pagemap: prevent integer overflow in pagemap_len
This release's name is a nod to the growing challenge we face in
maintaining compatibility across the rapidly evolving Linux kernel
ecosystem.
The full changelog can be found here: https://criu.org/Download/criu/4.2.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Using sizeof(hdr) where hdr is a pointer gives the size of the pointer,
not the size of the structure it points to.
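The difference can be demonstrated with `ctypes`, which follows the C sizing rules. The `Hdr` layout below is made up for illustration and is not the structure from the patch.

```python
import ctypes


class Hdr(ctypes.Structure):
    # Hypothetical header: 4-byte magic followed by 60 bytes of payload.
    _fields_ = [("magic", ctypes.c_uint32), ("payload", ctypes.c_uint8 * 60)]


hdr = ctypes.pointer(Hdr())

pointer_size = ctypes.sizeof(hdr)          # like sizeof(hdr) in C
struct_size = ctypes.sizeof(hdr.contents)  # like sizeof(*hdr) in C
```

Here `struct_size` is 64, while `pointer_size` is only the width of a pointer on the running platform.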
Reported-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
vsnprintf does not always return the number of bytes actually written to
the buffer.
If the output was truncated due to the buffer limit, the return value is
the total number of bytes which WOULD have been written to the final
string if enough space had been available.
This means we must cap the return value to the buffer size excluding the
terminating null byte to correctly calculate the log entry size.
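The capping rule can be written down as a small helper. This is a sketch of the arithmetic only; the name `written_len` and the handling of negative return values are assumptions.

```python
def written_len(ret, buf_size):
    """Number of bytes actually stored in the buffer, excluding the NUL.

    `ret` is the value vsnprintf returned for a buffer of `buf_size` bytes.
    """
    if ret < 0 or buf_size <= 0:
        return 0  # assumption: treat an error as "nothing written"
    # On truncation ret is the would-be length, so cap it at buf_size - 1.
    return min(ret, buf_size - 1)
```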
Signed-off-by: Andrei Vagin <avagin@google.com>
kerndat_init() can generate a significant volume of logs. If called
before log_init(), all these messages will be saved in the
early_log_buffer, which has a limited capacity. Additionally, saving to
the early_log_buffer can introduce a performance penalty, especially
when verbose mode is not enabled.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
When we compare two lists of VMAs, we need to take into account that
some of them could be merged.
Fixes #12286
Signed-off-by: Andrei Vagin <avagin@google.com>
This functionality (#2527) is being reverted and excluded from this
release due to issue #2812.
It will be included in a subsequent release once all associated issues
are resolved.
Signed-off-by: Andrei Vagin <avagin@google.com>
This patch fixes the following error:
$ sudo make -C test/others/criu-coredump run
...
Traceback (most recent call last):
File "/home/circleci/criu/coredump/coredump", line 55, in <module>
main()
File "/home/circleci/criu/coredump/coredump", line 47, in main
coredump(opts)
File "/home/circleci/criu/coredump/coredump", line 14, in coredump
cores = generator(os.path.realpath(opts['in']))
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 192, in __call__
self.coredumps[pid] = self._gen_coredump(pid)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 214, in _gen_coredump
cd.vmas = self._gen_vmas(pid)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 992, in _gen_vmas
v.data = self._gen_mem_chunk(pid, vma, v.filesz)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 879, in _gen_mem_chunk
page_mem = self._get_page(pid, page_no)
File "/home/circleci/criu/coredump/criu_coredump/coredump.py", line 797, in _get_page
num_pages = m.get("nr_pages", m.compat_nr_pages)
AttributeError: 'dict' object has no attribute 'compat_nr_pages'
+ exit 1
make[1]: *** [Makefile:3: run] Error 1
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Use nr_pages when available, falling back to compat_nr_pages
for compatibility.
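A minimal sketch of such a fallback, assuming pagemap entries are plain dicts as produced when decoding images; `entry_nr_pages` is a hypothetical helper name. Note that with a dict the fallback has to be a key lookup, since attribute access such as `m.compat_nr_pages` raises AttributeError.

```python
def entry_nr_pages(entry):
    """Return the page count of a pagemap entry given as a dict."""
    if "nr_pages" in entry:
        return entry["nr_pages"]
    # Older images only carry the compat field.
    return entry["compat_nr_pages"]
```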
Signed-off-by: alam0rt <sam@samlockart.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --mntns-compat-mode option is no longer parsed with CHECK.
Use --log-file instead to test the error message.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
__init__.py defines the public API for pycriu. It is important to use
explicit imports to avoid leaking every symbol from criu.py into the
pycriu namespace. This avoids import-time side effects, name
collisions, and circular-import traps.
Fixes the following lint error:
F403 `from .criu import *` used; unable to detect undefined names
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This allows users to specify RPC options when
using the check() functionality.
Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The check() functionality is very different from dump, pre-dump,
and restore. It is used only to check if the kernel supports required
features, and does not need the majority of options set via RPC.
In particular, we don't need to open `image_dir` when running `check()`
because this functionality doesn't create or process image files. In
this case, `image_dir` is used as `work_dir`, only when the latter is
not specified and a log file is used.
This patch updates the RPC options parser so that it only handles the
logging options when check() is used. Logging to a file is required when
log_file is explicitly set or log_to_stderr is not used. In that case, we
also resolve images_dir and work_dir, where the log file will be created.
Fixes: #2758
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the logging initialization into a helper function that
can be reused.
No functional change intended.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the images_dir selection logic from setup_opts_from_req() into a
new function: resolve_images_dir_path(). This improves readability and
allows the code to be reused. While at it, use snprintf() instead of
sprintf() for the /proc path and ensure NULL termination after strncpy().
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Commit 9089ce8 ("service: use setproctitle") extended cr-service to
get the full path of images_dir using readlink(). However, the RPC
API was later extended to allow a custom path (folder) to be set
instead of passing a file descriptor, which causes readlink()
to fail as the path is not a symbolic link.
It would be better to drop the code setting the images-dir path as a
string in the proctitle.
Fixes: #2794
Suggested-by: Andrei Vagin <avagin@google.com>
Co-authored-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move the code that opens the images directory, resolves its absolute
path via readlink(), selects the work_dir, and chdir()s into it into a
new function: setup_images_and_workdir(). This reduces the size of
`setup_opts_from_req()`, improves its readability, and allows this
functionality to be reused.
While at it, change open_image_dir() to take a const char *dir
parameter, reflecting that the path is not modified by the function and
allowing callers to pass string literals without casts.
No functional changes are intended.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This change allows users to call criu.use_sk() without any
parameters to use the default socket name.
Co-authored-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
[Errno 2] No such file or directory -> Socket file not found.
[Errno 111] Connection refused -> Service not running.
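The mapping above could look roughly like this. It is a sketch, not the pycriu code; `explain_connect_error` is a hypothetical helper, and errno 111 is the Linux value of ECONNREFUSED.

```python
import errno


def explain_connect_error(err):
    """Translate a socket connect() OSError into a friendlier message."""
    if err.errno == errno.ENOENT:
        return "Socket file not found."
    if err.errno == errno.ECONNREFUSED:
        return "Service not running."
    # Anything else is passed through unchanged.
    return str(err)
```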
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Use the system-installed CRIU binary instead of a local file.
Thanks to @avagin for suggesting this solution.
Co-authored-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Container runtimes that use libcriu (e.g., crun) need to specify a CRIU
configuration file that allows users to overwrite default options set
via RPC.
This is particularly useful to set options such as `--tcp-established`
via `/etc/criu/runc.conf` in Kubernetes.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Unlike "which", which is a separate executable not always installed by
default, "command -v" is a shell built-in available at least for bash,
dash, and busybox shell.
Unlike "which", "command -v" is also easier to grep for, and it is
already used in a few places here.
Inspired by commit 57251d811.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
"which" is used in Makefiles to check for dependencies.
Example:
export USE_ASCIIDOCTOR ?= $(shell which asciidoctor 2>/dev/null)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Don't install external pip dependencies when running `make install`.
As we are not really into developing a Python project, we should
not install additional packages. CRIU does that nowhere else.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The existing test collects all action-script hooks triggered during
`h`, `ns`, and `uns` runs with ZDTM into `actions_called.txt`, then
verifies that each hook appears at least once. However, the test does
not verify that hooks are invoked *exactly once* or in *correct order*.
This change updates the test to run ZDTM only with ns flavour as this
seems to cover all action-script hooks, and checks that all hooks are
called correctly.
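The stricter check can be sketched as below. The function name and the error strings are assumptions, and the hook list used in the usage note is only a subset chosen for illustration.

```python
from collections import Counter


def check_hooks(called, expected):
    """Return a list of problems; an empty list means the run was correct."""
    errors = []
    counts = Counter(called)
    for hook in expected:
        if counts[hook] != 1:
            errors.append("%s called %d times" % (hook, counts[hook]))
    # Only compare the sequence once every hook is known to appear once.
    if not errors and called != expected:
        errors.append("unexpected hook order")
    return errors
```

For example, `check_hooks(["pre-dump", "post-dump"], ["pre-dump", "post-dump"])` returns an empty list, while a repeated or reordered hook is reported.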
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch consolidates the action-script tests into
`test/others/action-script` to ensure all tests are executed
consistently and reduce duplication. Since we had two tests that appear
to do the same thing, we can remove the one that doesn't use zdtm.py.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Regardless of the actual error message, "Unknown" was always appended
to the end of the string, resulting in messages like:
"DUMP failed: Error(3): No process with such pidUnknown".
Fixed by changing standalone if statements to else-if blocks so
"Unknown" is only added when no specific error condition matches.
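The shape of the fix is sketched below. Only the ESRCH message comes from the example above; the EEXIST branch and its text are hypothetical and stand in for the other conditions.

```python
import errno


def describe_error(cr_errno):
    """Build the error suffix with an if/elif chain (illustrative only)."""
    s = "Error(%d): " % cr_errno
    if cr_errno == errno.ESRCH:
        s += "No process with such pid"
    elif cr_errno == errno.EEXIST:
        s += "Already exists"  # hypothetical message for illustration
    else:
        # Reached only when no specific condition matched.
        s += "Unknown"
    return s
```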
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
pycriu depends on protobuf to function correctly. Currently,
it raises an error if protobuf is not installed. Adding
protobuf to the dependencies ensures it is available after
installing pycriu.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
We use LGPL-v2.1 license for the libcriu and pycriu as they are
intended to be usable by both proprietary and open-source applications.
Signed-off-by: Andrii Herheliuk <andrii@herheliuk.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* call shstk_vma_restore() for VMA_AREA_SHSTK in vma_remap()
* delete map/copy/unmap from shstk_restore() and keep token setup + finalize
* previously the loop naturally stopped at cet->ssp-8, so a -8 nudge is now required here
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
1. create shadow stack vma during vma_remap cycle
2. copy contents from a premapped non-shstk VMA into it
3. unmap premapped non-shstk VMA
4. Mark shstk VMA for remap into the final destination
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Co-Authored-By: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
[ alex: debugging, rework together with Andrei and code cleanup ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
* reserve space for restorer shadow stack
* set tmp_shstk at mem, advance mem by PAGE_SIZE
* forget the extra PAGE_SIZE (shstk) for premapped VMAs
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
[ alex: small code cleanups ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
* default: return whatever passed in
e.g. to be used as
shtk_min_mmap_addr(kdat.mmap_min_addr)
* x86: ignore def and return 4G
On x86, CET shadow stack is required to be mapped above 4GiB.
On the other hand, forcing 4GiB globally would break 32-bit restores.
Co-Authored-By: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Extend the test for overwriting config options via RPC with
repeatable option (--action-script) and verify that the value
will not be silently duplicated.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When an additional configuration file is specified via RPC, this file is
parsed twice: first at an early stage to load options such as --log-file,
--work-dir, and --images-dir; and again after all RPC options and
configuration files have been evaluated.
This allows users to overwrite options specified via RPC by the
container runtime (e.g., --tcp-established). However, processing
the RPC config file twice leads to silently duplicating the values
of repeatable options such as `--action-script`.
To address this problem, we adjust the order of options parsing so
that the RPC config file is evaluated only once. This change should
not introduce any functional changes. Note that this change does
not affect the logging functionality, as early log messages are
temporarily buffered and only written to the log file once it has
been initialized (see commit 1ff2333 "Printout early log messages").
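The duplication effect can be sketched with a toy parser in Python. The config syntax and the `parse_config` helper are hypothetical and only model the difference between scalar and repeatable options; they do not mirror CRIU's actual option parser.

```python
def parse_config(lines, opts):
    """Apply config lines to opts; 'action-script' is repeatable."""
    for line in lines:
        name, _, value = line.partition(" ")
        if name == "action-script":
            # Repeatable option: every occurrence is accumulated.
            opts.setdefault(name, []).append(value)
        else:
            # Scalar option: the last value wins.
            opts[name] = value
    return opts


rpc_config = ["log-file dump.log", "action-script /tmp/hook.sh"]

# Evaluating the same file once versus twice.
once = parse_config(rpc_config, {})
twice = parse_config(rpc_config, parse_config(rpc_config, {}))
```

Scalar options are idempotent under repeated parsing, which is why the double evaluation went unnoticed until a repeatable option was involved.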
Fixes: #2727
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Program flow:
- Parse the test's own executable to calculate the file offset of the uprobe
target function symbol
- Enable the uprobe at the target function
- Call the target function to trigger the uprobe, and hence the uprobes vma
creation
- C/R
- Call the target function again to check that no SIGTRAP is sent, since the
uprobe is still active
At least v1.7 of libtracefs is required because that's when
tracefs_instance_reset was introduced. The uprobes API was introduced in v1.4,
and the dynamic events API was introduced in v1.3.
Ubuntu Focal doesn't have libtracefs. Jammy has v1.2.5, and Noble has v1.7.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
This commit teaches criu to deal with processes which have a "[uprobes]" vma.
This vma is mapped by the kernel when execution hits a uprobe location. This
is done so as to execute the uprobe'd instruction out-of-line in the special
vma. The uprobe'd location is replaced by a software breakpoint instruction,
which is int3 on x86. When execution reaches that location, control is
transferred over to the kernel, which then executes whatever handler code
it has to, for the uprobe, and then executes the replaced instruction out-of-line
in the special vma. For more details, refer to this commit:
d4b3b6384f
Reason for adding a new option
------------------------------
A new option is added instead of making the uprobes vma handling transparent
to the user, so that when a dump is attempted on a process tree in which a
process has the uprobes vma, criu will error, asking the user to use this option.
This gives the user a chance to check what uprobes are attached to the processes
being dumped, and try to ensure that those uprobes are active on restore as well.
The same reasoning applies to requiring this option on restore as well. If a
process is dumped with an active uprobe and the uprobe is not active on
restore, then when execution reaches the uprobe location the process
will be sent a SIGTRAP, whose default action terminates the process and
dumps core. This is because the code pages are dumped with the software
breakpoint instruction replacement at the uprobe'd locations. On restore, if
execution reaches these locations and the kernel sees no associated active
uprobes, then it'll send a SIGTRAP.
So, using this option on dump and restore is an implicit guarantee on the
user's behalf that they'll take care of the active uprobes and that any future
SIGTRAPs because of this are not on us! :)
Handling uprobes vma on dump
----------------------------
We don't need to store any information about the uprobes vma because it's
completely handled by the kernel, transparent to userspace. So, when a uprobes
vma is detected, we check if the --allow-uprobes option was specified or not.
If so, then the allow_uprobes boolean in the inventory image is set (this is
used on restore). The uprobes vma is skipped from being added to the vma list.
Handling uprobes vma on restore
-------------------------------
If allow_uprobes is set in the inventory image, then check if --allow-uprobes
is specified or not. Restoring the vma is not required.
Fixes: checkpoint-restore#1961
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Add a ZDTM test case where CRIU uses a helper process to restore
a non-empty process group with a terminated leader and a Unix
domain socket. This reproduces a corner case in which mount
namespace switching can fail during restore:
https://github.com/checkpoint-restore/criu/issues/2687
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
These tests reveal the following build error:
In file included from compel/include/uapi/compel/asm/sigframe.h:4,
from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
In file included from criu/arch/aarch64/include/asm/restorer.h:4,
from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
Inspired by #2766 / #2767.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Compilation on gentoo/arm64 (llvm+musl) fails with:
In file included from compel/include/uapi/compel/asm/sigframe.h:4,
from compel/plugins/std/infect.c:14:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
In file included from criu/arch/aarch64/include/asm/restorer.h:4,
from criu/arch/aarch64/crtools.c:11:
/usr/include/asm/sigcontext.h:28:8: error: redefinition of 'struct sigcontext'
28 | struct sigcontext {
| ^~~~~~~~~~
This is happening because <asm/sigcontext.h> and <signal.h> are
mutually incompatible on Linux.
To fix, use <signal.h> instead of <asm/sigcontext.h> for arm64
(like all other arches do).
Fixes: #2766
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
page_pipe_read() expects an 'unsigned long *', but pi->nr_pages is u64.
On 32-bit platforms (e.g., armv7), passing &pi->nr_pages directly causes
a compiler error. To fix this we introduce a temporary variable and copy
the result back to pi->nr_pages.
Fixes: #2756
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Our previous mailing list had some technical issues and we created
a new one that is hopefully more reliable.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently we run aarch64 tests on both Cirrus CI and GitHub runners.
However, Cirrus CI fails with "Monthly compute limit exceeded!". This
change removes the redundant tests to streamline our CI process.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Ubuntu Focal Fossa (20.04) reached its end-of-life on 31 May 2025. So, move
over to using Ubuntu Jammy (22.04) base images.
Also, focal repos do not have libtracefs, which the uprobes zdtm test needs.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Travis CI stopped providing CI minutes for open-source projects
some time ago and we have migrated to GitHub actions.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently, adding a package which is required either for development or testing
requires it to be added in multiple places due to many duplicated Dockerfiles
and installation scripts. This makes it difficult to ensure that all scripts
are updated appropriately and can lead to some places being missed.
This patch consolidates the list of dependencies and adds installation
scripts for each package-manager used in our CI (apk, apt, dnf, pacman).
This change also replaces the `debian/dev-packages.lst` as this subfolder
conflicts with the Ubuntu/Debian packing scripts used for CRIU:
https://github.com/rst0git/criu-deb-packages
This patch also removes the CentOS 8 build scripts as it is EOL
and the container registry is no longer available.
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit adds documents that provide high-level overviews of the
CRIU project for AI assistants like Claude and Gemini.
These documents are intended to be used as context for AI-powered
developer assistants to help them understand the project's goals,
architecture, and development process. This will allow them to provide
more accurate and helpful responses to developer questions.
The documents include:
- A brief introduction to CRIU
- A quick start guide for checkpointing and restoring a simple process
- An overview of the dump and restore process
- A description of the Compel subproject
- Information about the project's coding style, code layout, and tests
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The previous commit 4cd4a6b1ac ("zdtm: stop importing junit_xml")
removed the junit_xml library, but some variables related to it were
left in the code. This commit removes the unused `tc` variable and a
call to its `add_error_info` method.
Fixes: 4cd4a6b1ac ("zdtm: stop importing junit_xml")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
On some ARM/aarch64 systems, the VDSO ELF header sets EI_OSABI to 3 (Linux),
while CRIU expects 0 (System V). This strict check causes restore to fail
with "ELF header magic mismatch".
This patch relaxes the check to accept both values, improving compatibility
with modern toolchains and kernels (e.g. Linux 6.12+).
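A minimal sketch of such a relaxed check, written in Python for brevity. The constant values come from the ELF specification; the `osabi_acceptable` and `make_ident` helpers are hypothetical and do not correspond to CRIU's C code.

```python
ELFMAG = b"\x7fELF"
EI_OSABI = 7          # index of the OS ABI byte in e_ident
ELFOSABI_NONE = 0     # System V
ELFOSABI_LINUX = 3    # Linux / GNU


def osabi_acceptable(e_ident):
    """Accept an ELF identification with either OS ABI value."""
    if len(e_ident) < 16 or e_ident[:4] != ELFMAG:
        return False
    return e_ident[EI_OSABI] in (ELFOSABI_NONE, ELFOSABI_LINUX)


def make_ident(osabi):
    """Build a 16-byte e_ident: 64-bit, little-endian, version 1."""
    return ELFMAG + bytes([2, 1, 1, osabi]) + bytes(8)
```

Checking the magic and the OS ABI byte separately avoids comparing the whole identification against one hardcoded byte string.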
Fixes: #2751
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
During investigations, it’s much easier to read logs when regions are
printed in the start - end format rather than `start/size`.
In addition, all page counters and memory sizes are now printed in
hexadecimal, as they are hard to read in decimal form.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Variables storing page counts were previously `unsigned int`, limiting
them to a maximum of 2^32 pages. With a 4k page size, this corresponds
to a 16TB memory mapping, which is insufficient for larger mappings.
This commit changes the type for these variables to `unsigned long` to
support larger memory mappings.
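The limit quoted above can be checked with a few lines of Python arithmetic. The 32 TiB mapping is a hypothetical example, and the mask models what a 32-bit counter would keep of its page count.

```python
PAGE_SIZE = 4096            # 4k pages
U32_LIMIT = 2**32           # number of values a 32-bit counter can hold

limit_bytes = U32_LIMIT * PAGE_SIZE
limit_tib = limit_bytes // 2**40

# A hypothetical 32 TiB mapping needs 2**33 pages, which does not fit.
nr_pages_32tib = (32 * 2**40) // PAGE_SIZE
truncated = nr_pages_32tib & 0xFFFFFFFF
```

Here `limit_tib` works out to 16 and the truncated count for the 32 TiB mapping is 0, which shows how silently such a counter can go wrong.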
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Update the nr_pages field in PagemapEntry to uint64 to prepare for
checkpointing and restoring huge memory mappings.
Backward compatibility with older pagemap images is preserved.
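One reason the type change can stay compatible is that protobuf encodes both uint32 and uint64 fields as base-128 varints, so a value that fits in 32 bits has the same bytes either way. The Python encoder below is a sketch of that wire format for illustration, not code from CRIU or the protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

Small page counts keep their existing encoding, and only counts of 2^32 or more need the extra bytes that a uint32 reader could not represent.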
Signed-off-by: Andrei Vagin <avagin@gmail.com>
On restore, CRIU needs to change mount namespaces to properly restore
files and unix sockets. However, the kernel prevents this if a process
is sharing its file system information (fs) with other processes.
Fixes: #2687
Signed-off-by: Andrei Vagin <avagin@google.com>
On some kernels, attr/current can be intercepted by BPF LSM, causing
errors (#2033). Using attr/apparmor/current is preferable, because it
is guaranteed to return the apparmor label. attr/current will still be
used as a fallback for older kernels.
Fixes: #2033
Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
On MIPS platforms, shared libraries may use EI_ABIVERSION = 5 to indicate
support for .MIPS.xhash sections. The previous ELF header check in
handle_binary() strictly compared e_ident against a hardcoded value,
causing legitimate shared objects to be rejected.
This patch replaces the memcmp-based check with a structured validation
of ELF magic and class, and allows EI_ABIVERSION values beside 0.
Fixes: #2745
Signed-off-by: dong sunchao <dongsunchao@gmail.com>
We are dropping support for generating JUnit XML reports in zdtm.py as we've
migrated testing infrastructure entirely to `GitHub Actions` and other
third-party test runners.
This package has been removed from some distribution repositories (e.g.,
Fedora), making it simpler to remove the dependency than to force installation
via pip.
Signed-off-by: Andrei Vagin <avagin@google.com>
This change modifies the CI script to avoid Docker version 28, which has
a known regression that breaks Checkpoint/Restore (C/R) functionality.
The issue is tracked in the moby/moby project as
https://github.com/moby/moby/issues/50750.
Signed-off-by: Andrei Vagin <avagin@google.com>
Linux 6.16+ restricts SO_PASSCRED and SO_PASSSEC to AF_UNIX, AF_NETLINK, and AF_BLUETOOTH.
This patch updates CRIU to check the socket family before dumping these options.
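The family check can be sketched in Python as below. The `should_dump_passcred` helper is hypothetical, and the constants are looked up defensively because not every platform defines all three families in the `socket` module.

```python
import socket

# AF_NETLINK and AF_BLUETOOTH are not defined on every platform, so only
# the families that this Python build knows about are collected.
PASSCRED_FAMILIES = {
    getattr(socket, name)
    for name in ("AF_UNIX", "AF_NETLINK", "AF_BLUETOOTH")
    if hasattr(socket, name)
}


def should_dump_passcred(family):
    """Return True if SO_PASSCRED/SO_PASSSEC apply to this family."""
    return family in PASSCRED_FAMILIES
```

Sockets of any other family, such as AF_INET, simply skip these two options instead of failing the dump.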
Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
SO_PASSCRED and SO_PASSSEC are only valid for AF_UNIX and AF_NETLINK.
This patch updates the test logic to use a unix socket for these options,
while preserving the original value consistency check.
Fixes: #2705
Signed-off-by: Dong Sunchao <dongsunchao@gmail.com>
The `offset` argument to `mmap()` was computed with a direct cast from
pointer to `off_t`:
`(off_t)addr_hint - (off_t)map_base`
This causes a build failure when compiling since pointers and `off_t`
may differ in size on some platforms.
maps12.c: In function 'mmap_pages':
maps12.c:114:50: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
114 | filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);
| ^
maps12.c:114:69: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
114 | filemap ? fd : -1, filemap ? ((off_t)addr_hint - (off_t)map_base) : 0);
The fix in this patch is to cast both pointers to `intptr_t`,
perform the subtraction in that type, and then cast the result
back to `off_t`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Branch protection uses PAC. It cryptographically "signs" a function's
return address before it is stored on the stack. Upon return, the address
is authenticated using a secret key. If the signature is invalid, the
program will fault.
The PIE code is used for the parasite and the restorer. In both cases, it
runs in a foreign process. The case of the restorer is even trickier
because it needs to restore the original PAC keys, which invalidates
all previously "signed" pointers within the restorer itself.
Fixes: #2709
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need at least 6.16 to test MADV_GUARD_INSTALL support, but
our current Fedora Rawhide test uses only Rawhide's user space,
while using Fedora 42 kernel. Let's start using a vanilla kernel.
Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Introduce a new kind of VMA - VMA_AREA_GUARD. In fact, it is not
a real VMA as it is not represented as struct vm_area_struct in
the kernel.
We want to reuse the existing VMA infrastructure in CRIU to dump
information about MADV_GUARD_INSTALL-covered address space
ranges as VMAs. Then, on restore, we need to carefully skip
those fake VMAs everywhere we expect normal VMAs to be processed.
Only in the restorer do we use these VMAs to get information about
where to call MADV_GUARD_INSTALL.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
1. get info about MADV_GUARD_INSTALL-protected pages with
help of pagemap by looking for PME_GUARD_REGION flag if /proc/<pid>/pagemap
is used or by looking for PAGE_IS_GUARD flag if ioctl(PAGEMAP_SCAN) is used
2. skip those pages
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Make should_dump_page() return an int to indicate failure, and
return useful data back through the struct page_info structure
passed as a pointer.
Also, correspondingly convert all call sites.
No functional changes intended, except fixing a bug in
should_dump_page(): it could return (-1) when pmc_fill()
fails, while the caller did not expect that.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The arm64 tests are currently being executed on both actuated and GitHub
runners. This change removes the actuated runner to avoid redundancy and
streamline our CI process.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The tar command was failing with the following message:
$ tar cf criu.tar ../../../criu
tar: Removing leading `../../../' from member names
tar: ../../../criu/scripts/ci/criu.tar: archive cannot contain itself; not dumped
In addition, the /vagrant directory no longer exists in the new Fedora images.
bash: line 1: cd: /vagrant: No such file or directory
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Installing this package currently fails with the following message:
Package qemu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'qemu' has no installation candidate
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
See the previous commit for rationale and architecture-specific details.
[ avagin: tweak code comment ]
Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
After the CRIU process saves the parasite code for the target thread in
the shared mmap, it is necessary to call __clear_cache before the target
thread executes the code.
Without this step, the target thread may not see the correct code to
execute, which can result in a SIGILL signal.
For the specific arm64 case, this is important so that the newly copied
code is flushed from d-cache to RAM, so that the target thread sees the
new code.
The change is based on commit 6be10a2 by @fu.lin and on input received
from @adrianreber.
[ avagin: tweak code comment ]
Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
In general, we use "$(E)" instead of "$(Q) echo", but we also have
a msg-gen macro which can be used here.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 68f92b551 removed images/google/protobuf directory, so it is
re-created each time during the build process.
This resulted in a weird behavior change. Previously, one could do
something like this:
git clone $CRURL criu
(cd criu && sudo make install-criu)
rm -rf criu
This worked fine, including running rm -rf as a non-root user, since no
new directories were created under criu -- all directories were still
owned by the original user.
Since commit 68f92b551 the same sequence fails:
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.c': Permission denied
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.d': Permission denied
rm: cannot remove '/home/runner/criu/images/google/protobuf/descriptor.pb-c.h': Permission denied
A workaround is to keep an empty images/google/protobuf directory,
which is what this commit does.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 68f92b551 used `$$(Q)` instead of `$(Q)` in the Makefile target,
which resulted in the following error:
$(Q) echo "Generating descriptor.pb-c.c"
/bin/sh: 1: Q: not found
Generating descriptor.pb-c.c
$(Q) protoc --proto_path=/usr/include --proto_path=images/ --c_out=images/ /usr/include/google/protobuf/descriptor.proto
/bin/sh: 1: Q: not found
as well as:
$(Q) rm -rf images/google
/bin/sh: line 1: Q: command not found
Fix it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently the build scripts create the following symlink:
criu-4.1/images/google/protobuf/descriptor.proto -> /usr/include/google/protobuf/descriptor.proto
This symlink points to a system-wide absolute-path target. Also,
this symlink ends up in the release tarball. The tarball may later be
downloaded and unpacked by e.g. OS distributions. If unpacking is
done using Python 3.14+, it will fail.
This happens because Python 3.14 will switch the default behavior of
extractall() from "fully trusting the content of archive" to
"disallow common attack vectors while extracting the archive".
With this new behavior, extractall() raises an exception when at
least one file in the archive extracts or points to outside of the
extraction directory (these are called path traversal attacks and
zip slip attacks).
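The failure can be reproduced with the `data` extraction filter, which is available in Python 3.12+ and in security backports of older releases. The sketch below builds an in-memory archive that contains one absolute symlink; the member name only mimics the path quoted above, and the helper names are hypothetical.

```python
import io
import tarfile
import tempfile


def build_archive():
    """Build an in-memory tarball holding one absolute symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        link = tarfile.TarInfo("criu/images/google/protobuf/descriptor.proto")
        link.type = tarfile.SYMTYPE
        link.linkname = "/usr/include/google/protobuf/descriptor.proto"
        tar.addfile(link)
    buf.seek(0)
    return buf


def extraction_rejected():
    """Return True if the 'data' filter refuses to extract the archive."""
    with tarfile.open(fileobj=build_archive(), mode="r") as tar:
        with tempfile.TemporaryDirectory() as dest:
            try:
                tar.extractall(path=dest, filter="data")
            except tarfile.FilterError:
                return True
    return False
```

Passing `filter="data"` explicitly gives the same behaviour on older interpreters that Python 3.14 applies by default.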
Reported-by: Dmitrii Kuvaiskii <dimakuv@amazon.de>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The test creates a file bindmount in criu mntns and binds it into test
mntns, this external file bindmount is autodetected and restored via
"--external mnt[]" criu option.
Note: In the previous patch we fixed the problem on this code path where file
bindmount restore fails because there is an excess "/" in the source path.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
E.g. I have a /etc/hosts in workspace mounted from the host, and get the following message.
(00.141008) 1: mnt-v2: Create plain mountpoint /tmp/.criu.mntns.K1biY1/mnt-0000000938 for 938
(00.141546) 1: mnt-v2: Mounting unsupported @938 (0)
(00.141887) 1: mnt-v2: Bind /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/ to /tmp/.criu.mntns.K1biY1/mnt-0000000938
(00.142179) 1: Error (criu/mount-v2.c:319): mnt-v2: Failed to open_tree /tmp/agent/1-d8c746c6fda3a8b2/workspace/etc/hosts/: Not a directory
(00.143774) Error (criu/cr-restore.c:2320): Restoring FAILED.
Signed-off-by: Chuan Qiu <qiuc12@gmail.com>
Add ZDTM static tests for IP4/ICMP and IP6/ICMP
socket feature.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Currently there is no option to checkpoint/restore programs that use
ICMP sockets, such as `ping`. This patch adds support for the same.
Fixes: #2557
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")
Signed-off-by: Andrei Vagin <avagin@google.com>
We dump sysctls from criu user namespace, but restore from restored user
namespace. So group id values should be mapped to the restored user
namespace gid space to restore correctly.
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We have the ability to skip a sysctl if there is no value, but we still pass
n requests to sysctl_op, which is not correct and can probably segfault
on a NULL pointer access. Fix it by adding ri to count non-skipped requests.
To be on the safe side, let's add a check that ri == n on read, as we
should not do any skips there.
While at it, let's fix the bad error message prefix: s/unix/ipv4/.
Remove the excess has_iarg set, and reset sarg to NULL for the case where
sysctl_op skipped it.
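The counting idea can be sketched in Python. The `build_requests` helper and the sysctl names are hypothetical; the sketch only shows why the number of requests passed on must be `ri`, the count of non-skipped entries, rather than `n`.

```python
def build_requests(entries):
    """Return the requests that carry a value, plus their count (ri)."""
    reqs = []
    for name, value in entries:
        if value is None:
            continue  # skipped entry: must not be counted or passed on
        reqs.append((name, value))
    ri = len(reqs)
    return reqs, ri


# Hypothetical input: n = 3 entries, one of them without a value.
entries = [("sysctl_a", "1"), ("sysctl_b", None), ("sysctl_c", "0 100")]
reqs, ri = build_requests(entries)
```

Passing `n` here would claim three requests while only two are initialized, which is the kind of mismatch the commit guards against.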
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL |
CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it
a real distinct flag.
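Why an overlapping value is a problem can be shown with a short Python sketch. The bit values are hypothetical and were chosen only to reproduce the equality quoted above; the real definitions may differ.

```python
# Hypothetical bit values; the real definitions may differ.
CTL_FLAGS_OPTIONAL = 1
CTL_FLAGS_READ_EIO_SKIP = 2
OLD_IPC_EACCES_SKIP = 3   # equals OPTIONAL | READ_EIO_SKIP
NEW_IPC_EACCES_SKIP = 4   # a bit of its own


def has_flag(flags, flag):
    """Test whether any bit of flag is set in flags."""
    return bool(flags & flag)
```

With the old value, a request marked only as optional also appears to carry the IPC skip flag, and setting the IPC skip flag implicitly sets the other two.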
Fixes: 840735aa0 ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which
attempts to open the cpuinfo.img file using `open_image()`. If the
image file is not found, `open_image()` returns an "empty image"
object. As a result, `cpu_validate_cpuinfo()` tries to read from it
and fails with the following error:
(00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
This patch adds a check for an empty image and appropriate error message.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The cpuinfo command requires a "dump" or "check" subcommand. Thus, we
replace `CR_CPUINFO` with `CR_CPUINFO_DUMP` and `CR_CPUINFO_CHECK`.
This allows us to remove unnecessary subcommand check in
`image_dir_mode()` and perform all parsing in `parse_criu_mode()`.
With this change the check for validating the cpuinfo subcommand is
now done only once with `CR_CPUINFO_DUMP` or `CR_CPUINFO_CHECK` enum.
Signed-off-by: Liana Koleva <43767763+lianakoleva@users.noreply.github.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU currently requires a number of dependencies in order to build from
source. The package names vary across distributions and package
managers. A Nix flake allows developers to spin up a dev environment
with `nix develop`, eliminating the hassle of manual dependency
management. It also prevents polluting the global package set on the
machine.
Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>
In this test we want to ensure that the contents of droppable mappings
and mappings with MADV_WIPEONFORK are properly restored in
parent/child processes.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Support MAP_DROPPABLE [1] by detecting it from /proc/<pid>/smaps
and restoring it like a normal private mapping flag on the vma, with the
only difference being that MAP_DROPPABLE is used instead of MAP_PRIVATE.
[1] 9651fcedf7
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Support VM_WIPEONFORK [1] by detecting it from /proc/<pid>/smaps
and setting a corresponding MADV_WIPEONFORK flag on vma.
[1] d2cd9ede6e
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The opts['action'] contains actor function and not the action name, so
we should compare it with a function.
While on it let's also add a comment about --criu-bin option if CRIU
binary is missing.
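The comparison issue can be sketched as follows. The `run` and `list_tests` functions are hypothetical placeholders for the actor functions, not the real zdtm.py names.

```python
def run(opts):
    return "run"


def list_tests(opts):
    return "list"


# The parsed options hold the actor function itself, not its name.
opts = {"action": run}

by_name = opts["action"] == "run"   # function vs. string: never equal
by_func = opts["action"] == run     # compares the function object
```

The string comparison never matches, so any branch guarded by it was effectively dead code.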
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
By default zdtm expects that criu is built from source first and only
then you can run zdtm tests against it. But what if you really want to
run tests against a criu version installed on the system? Yes there is
already a nice option for zdtm to change the criu binary it uses
"--criu-bin", but it would still end up using the pycriu module from
source and you would still have to build everything beforehand.
Let's add an option to change the path where zdtm searches for pycriu
module "--pycriu-search-path". This way we can run zdtm tests on the
criu installed on the system directly without building criu from source,
e.g. on Fedora it works like:
test/zdtm.py run --criu-bin /usr/sbin/criu \
--pycriu-search-path /usr/lib/python3.13/site-packages \
-t zdtm/static/env00
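A minimal sketch of how such an option could steer the module search, assuming it simply prepends the given directory to `sys.path`. The `add_pycriu_search_path` helper is hypothetical and is not taken from zdtm.py.

```python
import argparse
import sys


def add_pycriu_search_path(argv):
    """Parse --pycriu-search-path and prepend it to the import path."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pycriu-search-path", default=None)
    args, _ = parser.parse_known_args(argv)
    if args.pycriu_search_path:
        # Put the requested directory first so it wins over the source tree.
        sys.path.insert(0, args.pycriu_search_path)
    return args.pycriu_search_path
```

Prepending rather than appending matters here, because an in-tree copy of the module would otherwise still be found first.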
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This patch implements the entire logic to enable the offloading of
buffer object content restoration.
The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.
It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.
This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently the restore of buffer objects consumes a significant amount of
time. However, this part has no logical dependencies on other restore
operations. This patch introduces some structures and helper
functions for the target process to offload this task to the main CRIU
process.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain stream socket and stores this
socket in `fdstore`.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know if there is only one
process to restore. This patch adds a `has_children` function in
`pstree.h` and replaces some existing implementations with this
function.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, in the target process, device-related restore operations and
other restore operations almost run sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be parallelized with other operations to speed up the
process.
Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.
- POST_FORKING
*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Building CRIU on Ubuntu 20.04 fails with the following error:
criu/sk-inet.c: In function 'can_dump_ipproto':
criu/sk-inet.c:131:16: error: 'IPPROTO_MPTCP' undeclared (first use in this function); did you mean 'IPPROTO_MTP'?
131 | if (proto == IPPROTO_MPTCP)
| ^~~~~~~~~~~~~
| IPPROTO_MTP
Add definition for MPTCP to fix this error.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The container checkpointing procedure in Kubernetes freezes running
containers to create a consistent snapshot of both the runtime state
and the rootfs of the container. However, when checkpointing a GPU
container, the container must be unfrozen before invoking the
cuda-checkpoint tool.
This is achieved in prepare_freezer_for_interrupt_only_mode(), which
needs to be called before the PAUSE_DEVICES hook. The patch introducing
this functionality fixes this problem for containers with multiple
processes. However, if the container has a single process,
prepare_freezer_for_interrupt_only_mode() must be invoked immediately
before the PAUSE_DEVICES hook.
Fixes: #2514
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In 0a7c5fd1bd we swapped the BSD
implementation of strlcat and strlcpy in favor of our own replacement.
The checks and the predefined macros are not needed anymore.
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
In some cases, they might not work in virtual machines if the hypervisor
doesn't virtualize them. For example, they don't work in AMD SEV virtual
machines if the Debug Virtualization extension isn't supported or isn't
enabled in SEV_FEATURES.
Fixes: #2658
Signed-off-by: Andrei Vagin <avagin@gmail.com>
With Go version 1.24, ListenConfig now uses MPTCP by default [1].
Checkpoint/restore for this protocol is not currently supported
and adding support requires kernel changes that are not trivial
to implement. As a result, checkpointing of many containers that
run Go programs is likely to fail with the following error [2]:
(00.026522) Error (criu/sk-inet.c:130): inet: Unsupported proto 262 for socket 2f9bc5
This patch adds a message with a suggested workaround for this problem.
[1] https://go.dev/doc/go1.24#netpkgnet
[2] https://github.com/checkpoint-restore/criu/issues/2655
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It makes root mount readonly and checks that it is still readonly after
migration.
Make zdtm/static writable for logs via "bind" desc option.
v2: explain why we don't have explicit rw/ro flag check
v3: use new zdtm "bind" desc option
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Add {'bind': 'path/to/bindmount'} zdtm descriptor option, so that in
test mount namespace a directory bindmount can be created before running
the test.
This is useful to leave test directory writable (e.g. for logs) while
the test makes root mount readonly. note: We create this bindmount early
so that all test files are opened on it initially and not on the below
mount. Will be used in mnt_ro_root test.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Mount flags belong to mount and mount namespace of the Container, so we
should preserve them, as Container user will not expect mounts switching
between ro and rw over c/r.
Fixes: #2632
v5: fix both mount-v1 and mount-v2
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Building CRIU package on Debian 11 aarch64 fails with
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:32:31: error: storage size of 'paca' isn't known
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c:33:31: error: storage size of 'pacg' isn't known
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:47:15: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (hwcaps & HWCAP_PACA) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:47:15: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c:53:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:82:15: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (hwcaps & HWCAP_PACG) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:88:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:33:31: error: unused variable 'pacg' [-Werror=unused-variable]
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:32:31: error: unused variable 'paca' [-Werror=unused-variable]
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:227:31: error: storage size of 'upaca' isn't known
struct user_pac_address_keys upaca;
^~~~~
criu/arch/aarch64/crtools.c:228:31: error: storage size of 'upacg' isn't known
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:241:18: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (!(hwcaps & HWCAP_PACA)) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:255:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:268:18: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (!(hwcaps & HWCAP_PACG)) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:275:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:233:6: error: variable 'ret' set but not used [-Werror=unused-but-set-variable]
int ret;
^~~
criu/arch/aarch64/crtools.c:228:31: error: unused variable 'upacg' [-Werror=unused-variable]
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:227:31: error: unused variable 'upaca' [-Werror=unused-variable]
struct user_pac_address_keys upaca;
^~~~~
This patch adds the missing constants and structs if undefined.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU locks the network during restore in an "empty" network namespace.
However, "empty" in this context means CRIU isn't restoring the
namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.
Fixes: #2650
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently we save FP regs before parasite code runs, and restore after
for --leave-running, --check-only, and in case of errors. In case of
errors the error may have happened before FP regs were saved, so we
should only restore them if they were actually saved.
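The idea can be sketched with a "saved" flag that gates the restore. The
structure and function names below are hypothetical stand-ins, not
CRIU's actual code, and an `int` replaces the real FP register set:

```c
#include <stdbool.h>

/* Hypothetical per-task state; CRIU's real structures differ. */
struct fp_state {
	bool saved; /* set only after the registers were captured */
	int regs;   /* stand-in for the real FP register set */
};

static void save_fp_regs(struct fp_state *st, int live_regs)
{
	st->regs = live_regs;
	st->saved = true;
}

/* Returns true if a restore actually happened. */
static bool restore_fp_regs(const struct fp_state *st, int *live_regs)
{
	if (!st->saved)
		return false; /* the error hit before the save: nothing to restore */
	*live_regs = st->regs;
	return true;
}
```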
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
On a RHEL 8 based system building CRIU fails with:
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
criu/arch/aarch64/crtools.c:73:39: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
This adds the missing define if it is undefined.
Signed-off-by: Adrian Reber <areber@redhat.com>
The `goto interrupt` label is unnecessary as the code directly
returns after `cuda_process_checkpoint_action()`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When handling errors for functions such as `ptrace()`, `pipe()`, and
`fork()` it would be better to use `pr_perror` instead of `pr_err`
as it would include a message describing the encountered error.
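To illustrate the difference, here is a rough stand-in for what a
`pr_perror`-style helper adds to the message. `format_perror` is a
hypothetical function; CRIU's real macros also log the file and line:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for pr_perror(): formats the message and
 * appends the strerror() text for the given errno value.
 */
static int format_perror(char *buf, size_t len, const char *msg, int err)
{
	return snprintf(buf, len, "Error: %s: %s", msg, strerror(err));
}
```

A plain `pr_err`-style message would stop after `msg` and leave the
reader guessing why the call failed.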
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Thomas Gleixner introduced the new interface to create posix timers
with specified timer IDs:
ec2d0c0462
Previously, CRIU recreated timers by repeatedly creating and deleting
them until the desired ID was reached. This approach isn't fast,
especially for timers with large IDs. For example, restoring two timers
with IDs 1000000 and 2000000 took approximately 1.5 seconds.
The new `prctl()` based interface allows direct creation of timers with
specified IDs, reducing the restoration time to around 3 microseconds
for the same example.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.
This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This is highly confusing, and it seems that the ret variable
is not handled in the subsequent process.
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
This release of CRIU (4.1.1) addresses a critical compatibility issue
introduced in the Linux kernel and back-ported to all stable releases.
The kernel commit (12f147ddd6de "do_change_type(): refuse to operate on
unmounted/not ours mounts") addressed the security issue introduced
almost 20 years ago. Unfortunately, this change inadvertently broke the
restore functionality of mount namespaces within CRIU. Users attempting
to restore a container on updated kernels would encounter the error:
"mnt-v2: Failed to make mount 476 slave: Invalid argument."
This release contains the necessary adjustments to CRIU, allowing it to
work seamlessly with kernels incorporating this security change.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
A kernel change (commit 12f147ddd6de, "do_change_type(): refuse to
operate on unmounted/not ours mounts") modified how mount propagation
properties can be changed. Previously, these properties could be changed
from any mount namespace. Now, they can only be modified from the
specific mount namespace where the target mount is actually mounted.
This commit addresses this new restriction by ensuring that CRIU enters the
correct mount namespace before attempting to restore mount propagation
properties (MS_SLAVE or MS_SHARED) for a mount.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Major changes:
* RISC-V Support
* PIDFD Support
* CUDA Enhancements
* Fixes here and there
The full changelog can be found here: https://criu.org/Download/criu/4.1.
Signed-off-by: Andrei Vagin <avagin@google.com>
When using pr_err in a signal handler, locking is used
in an unsafe manner. If another signal happens while holding the
lock, deadlock can happen.
To fix this, we can introduce mutex_trylock similar to
pthread_mutex_trylock that returns immediately. Due to the fact
that the lock is used only for writing first_err, this change guarantees
that deadlock cannot happen.
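A minimal sketch of the trylock idea, using a C11 `atomic_flag` instead
of CRIU's own mutex type. All names here (`err_trylock`, `err_unlock`,
`record_first_err`) are assumptions made for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; CRIU's real lock is a different type. */
static atomic_flag first_err_lock = ATOMIC_FLAG_INIT;
static char first_err[128];

/* Returns immediately: true if the lock was taken, false if it is held. */
static bool err_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&first_err_lock,
						  memory_order_acquire);
}

static void err_unlock(void)
{
	atomic_flag_clear_explicit(&first_err_lock, memory_order_release);
}

/* Never blocks: if the lock is busy, the message is simply not recorded. */
static bool record_first_err(const char *msg)
{
	size_t i;

	if (!err_trylock())
		return false; /* we interrupted the holder: skip, do not deadlock */
	if (first_err[0] == '\0') {
		for (i = 0; i + 1 < sizeof(first_err) && msg[i]; i++)
			first_err[i] = msg[i];
		first_err[i] = '\0';
	}
	err_unlock();
	return true;
}
```

Skipping the update is acceptable here because only the first error is
kept anyway.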
Fixes: #358
Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com>
free_userns_maps is called to clean up uid/gid map when the dump
finishes. If we try to clean up these maps in error cases, it can lead
to double free panic. So just skip cleaning up these maps and let
free_userns_maps do its job.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
There are a couple of tests that require the iptables binary.
Instead of adding a checkskip script, which could also handle this,
this change now uses CRIU's feature detection to see if the CRIU
feature 'has_ipt_legacy' exists.
Signed-off-by: Adrian Reber <areber@redhat.com>
If the tests in others/rpc are failing, no information about that error
can be seen in a CI run. This change displays the log files if the test
fails.
Signed-off-by: Adrian Reber <areber@redhat.com>
The tests in others/rpc are running as non-root and
fail silently if the nftables network locking backend is used.
This switches those tests to skip the network locking.
Signed-off-by: Adrian Reber <areber@redhat.com>
The building section also contains information on how to change the
network locking backend without source code changes.
Signed-off-by: Adrian Reber <areber@redhat.com>
As different Linux distributions are switching away from iptables
to nftables, this makes it easier to compile CRIU with a different
default network locking backend. Instead of changing the source
code it is now possible to select the nft backend like this:
make NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES
Signed-off-by: Adrian Reber <areber@redhat.com>
Let's change the data types of `nbucket` and `nchain` to uint32.
This should fix the following compile-time error on arm32:
/criu/criu/pie/util-vdso.c:336: undefined reference to `__aeabi_uldivmod'
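As a sketch of why the narrower type helps: a modulo on 64-bit operands
needs the `__aeabi_uldivmod` runtime helper on 32-bit ARM, while 32-bit
operands do not. The function name `bucket_index` is hypothetical:

```c
#include <stdint.h>

/*
 * With 32-bit operands no 64-bit division helper
 * (__aeabi_uldivmod) is required on 32-bit ARM.
 */
static uint32_t bucket_index(uint32_t hash, uint32_t nbucket)
{
	return hash % nbucket;
}
```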
Signed-off-by: Andrei Vagin <avagin@google.com>
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All these properties have to be C/R-ed.
As they are per-process properties, we can save/restore them just for
one thread.
Signed-off-by: Andrei Vagin <avagin@google.com>
Threads are put into cgroups through the cgroupd thread, which
communicates with other threads using a socketpair.
Previously, each thread received a dup'd copy of the socket, and did
the following:
sendmsg(socket_dup_fd, my_cgroup_set);
// wait for ack.
while (1) {
recvmsg(socket_dup_fd, &h, MSG_PEEK);
if (h.pid != my_pid) continue;
recvmsg(socket_dup_fd, &h, 0);
}
close(socket_dup_fd);
When restoring many threads, many threads would be spinning in the
above loop waiting for their PID to appear.
In my test-case, restoring a process with an 11.5G heap and 491 threads
could take anywhere between 10 seconds and 60 seconds to complete.
To avoid the spinning, we drop the loop and MSG_PEEK, and add a lock
around the above code. This does not decrease parallelism, as the
cgroupd daemon uses a single thread anyway.
With the lock in place, the same restore consistently takes around 10
seconds on my machine (Thinkpad P14s, AMD Ryzen 8840HS).
There is a similar "daemon" thread for user namespaces. That already
is protected with a similar userns_sync_lock in __userns_call().
Fixes: #2614
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
* Hash buckets are an array of 32-bit words, whereas DT_HASH entries are
32-bit on most platforms except s390 (where they are 64-bit).
* The bloom filter word size differs between 32-bit and 64-bit ELF
files. This commit adjusts the code to handle both cases.
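For reference, the two size rules can be sketched together with the
standard GNU symbol hash. The helper names are hypothetical and this is
not the code from the patch:

```c
#include <elf.h>
#include <stdint.h>

/* Standard GNU hash of a symbol name (h = h * 33 + c, seed 5381). */
static uint32_t gnu_hash(const char *name)
{
	uint32_t h = 5381;

	for (; *name; name++)
		h = h * 33 + (unsigned char)*name;
	return h;
}

/* Bloom filter words: 32 bits for ELFCLASS32, 64 bits for ELFCLASS64. */
static unsigned int bloom_word_bits(unsigned char elf_class)
{
	return elf_class == ELFCLASS64 ? 64 : 32;
}

/* Buckets are 32-bit words regardless of the ELF class. */
static uint32_t gnu_bucket(const uint32_t *buckets, uint32_t nbuckets,
			   const char *name)
{
	return buckets[gnu_hash(name) % nbuckets];
}
```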
Signed-off-by: Andrei Vagin <avagin@google.com>
Currently CRIU has the possibility to specify a LSM label during
restore. Unfortunately the information is completely ignored in the case
of SELinux.
This change uses the LSM label from the user if one is provided;
otherwise the label from the checkpoint image is used.
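The selection rule amounts to a simple fallback. A minimal sketch, with
a hypothetical helper name and the assumption that an empty string
counts as "not provided":

```c
#include <stddef.h>

/*
 * Hypothetical helper: prefer the label given by the user on restore,
 * fall back to the one stored in the checkpoint image.
 */
static const char *pick_lsm_label(const char *user_label,
				  const char *image_label)
{
	if (user_label && user_label[0] != '\0')
		return user_label;
	return image_label;
}
```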
Signed-off-by: Adrian Reber <areber@redhat.com>
With Python 3.13, the `subprocess` module now uses the
`posix_spawn()` function [1], which requires the `signal`
module to be imported.
Fixes: #2607
[1] https://docs.python.org/3/whatsnew/3.13.html#subprocess
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add relevant elf header constants and notes for the arm platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Add relevant elf header constants and notes for the aarch64 platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
The strstartswith() function is the wrong choice for finding a parent
directory, so replace it with the issubpath() function.
Signed-off-by: Dmitrii Chervov <dschervov1@yandex.ru>
Currently Fedora rawhide based CI runs fail with:
/bin/sh: line 1: awk: command not found
Let's install it.
Signed-off-by: Adrian Reber <areber@redhat.com>
This way,
- Makefile is less cluttered;
- one can run codespell from the command line.
Fixes: fd7e97fcf ("lint: exclude tags file from codespell")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Trying to run latest CRIU on CentOS Stream 10 or Ubuntu 24.04 (aarch64)
fails like this:
# criu/criu check -v4
[...]
(00.096460) vdso: Parsing at ffffb2e2a000 ffffb2e2c000
(00.096539) vdso: PT_LOAD p_vaddr: 0
(00.096567) vdso: DT_STRTAB: 1d0
(00.096592) vdso: DT_SYMTAB: 128
(00.096616) vdso: DT_STRSZ: 8a
(00.096640) vdso: DT_SYMENT: 18
(00.096663) Error (criu/pie-util-vdso.c:193): vdso: Not all dynamic entries are present
(00.096688) Error (criu/vdso.c:627): vdso: Failed to fill self vdso symtable
(00.096713) Error (criu/kerndat.c:1906): kerndat_vdso_fill_symtable failed when initializing kerndat.
(00.096812) Found mmap_min_addr 0x10000
(00.096881) files stat: fs/nr_open 1073741816
(00.096908) Error (criu/crtools.c:267): Could not initialize kernel features detection.
This seems to be related to the kernel (6.12.0-41.el10.aarch64). The
Ubuntu user-space is running in a container on the same kernel.
Looking at the kernel this seems to be related to:
commit 48f6430505c0b0498ee9020ce3cf9558b1caaaeb
Author: Fangrui Song <i@maskray.me>
Date: Thu Jul 18 10:34:23 2024 -0700
arm64/vdso: Remove --hash-style=sysv
glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
for more than one decade in many Linux distributions. Using
--hash-style=sysv might imply unaddressed issues and confuse readers.
Just drop the option and rely on the linker default, which is likely
"both", or "gnu" when the distribution really wants to eliminate sysv
hash overhead.
Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
The commit basically does:
-ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv \
+ldflags-y := -shared -soname=linux-vdso.so.1 \
Which results in only a GNU hash being added to the ELF header. This
change has been merged with 6.11.
Looking at the referenced x86 commit:
commit 6b7e26547fad7ace3dcb27a5babd2317fb9d1e12
Author: Andy Lutomirski <luto@amacapital.net>
Date: Thu Aug 6 14:45:45 2015 -0700
x86/vdso: Emit a GNU hash
Some dynamic loaders may be slightly faster if a GNU hash is
available. Strangely, this seems to have no effect at all on
the vdso size.
This is unlikely to have any measurable effect on the time it
takes to resolve vdso symbols (since there are so few of them).
In some contexts, it can be a win for a different reason: if
every DSO has a GNU hash section, then libc can avoid
calculating SysV hashes at all. Both musl and glibc appear to
have this optimization.
It's plausible that this breaks some ancient glibc version. If
so, then, depending on what glibc versions break, we could
either require COMPAT_VDSO for them or consider reverting.
Which is also a really simple change:
-VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) \
+VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \
The big difference here is that for x86 both hash sections are
generated. For aarch64 only the newer GNU hash is generated. That is why
we only see this error on kernel >= 6.11 and aarch64.
Changing from DT_HASH to DT_GNU_HASH seems to work on aarch64. The test
suite runs without any errors.
Unfortunately I am not aware of all implications of this change and if a
successful test suite run means that it still works.
Looking at the kernel I see the following hash styles for the VDSO:
aarch64: not specified (only GNU hash style)
arm: --hash-style=sysv
loongarch: --hash-style=sysv
mips: --hash-style=sysv
powerpc: --hash-style=both
riscv: --hash-style=both
s390: --hash-style=both
x86: --hash-style=both
Only aarch64 on kernels >= 6.11 is a problem right now, because all
other platforms provide the old style hashing.
Signed-off-by: Adrian Reber <areber@redhat.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Co-authored-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
It is per net namespace, we need it to allow creation of unprivileged
ICMP sockets.
Note: in case this sysctl was disabled after unprivileged ICMP
socket was created we still need to somehow handle it on restore.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
In two cases libcriu was setting the RPC protobuf field `has_*` before
checking if the given parameter is valid. If the caller doesn't check
the return value, this can lead to situations where we pass an RPC
struct to CRIU which has the `has_*` protobuf field set to true, but
does not have a verified value (or none at all) set for the actual RPC
entry.
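The safe ordering is to validate first and only then set both the value
and its `has_*` flag. A sketch with a hypothetical, much simplified
request struct (the real libcriu options and their valid ranges differ):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical miniature of a protobuf-c style request struct. */
struct rpc_opts {
	bool has_mode;
	int mode;
};

enum { MODE_MIN = 0, MODE_MAX = 2 };

/* Validate first; touch has_mode and mode only on success. */
static int opts_set_mode(struct rpc_opts *opts, int mode)
{
	if (mode < MODE_MIN || mode > MODE_MAX)
		return -EINVAL; /* opts is left untouched */

	opts->mode = mode;
	opts->has_mode = true;
	return 0;
}
```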
Signed-off-by: Adrian Reber <areber@redhat.com>
Temporarily disable CUDA plugin for `criu pre-dump`.
pre-dump currently fails with the following error:
Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0 [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895 /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1
We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to
ensure that the function's sole responsibility is to use the cgroup
freezer for the process tree. This allows us to avoid a time-out
error when checkpointing applications with large GPU state.
v2: This patch calls `checkpoint_devices()` only for `criu dump`.
Support for GPU checkpointing with `pre-dump` will be introduced in
a separate patch.
Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:
cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
net: Unlock network
cuda_plugin: finished cuda_plugin stage 0 err -1
cuda_plugin: resuming devices on pid 84145
cuda_plugin: Restore thread pid 84202 found for real pid 84145
Unfreezing tasks into 1
Unseizing 84145 into 1
Error (criu/cr-dump.c:2111): Dumping FAILED.
To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:
Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86
The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.
Using the PID of the innermost namespace would lead to the chain being
called 'CRIU-1' most of the time which is also not really unique.
With this commit the chain is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.
Signed-off-by: Adrian Reber <areber@redhat.com>
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.
libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).
Signed-off-by: Adrian Reber <areber@redhat.com>
It creates a few timers with long expiration intervals, waits for C/R
and checks that timers are armed and their intervals have been restored.
Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
On aarch64 the test cmdlinenv00 was failing with:
FAIL: cmdlinenv00.c:120: auxv corrupted on restore (errno = 11 (Resource temporarily unavailable))
Starting with Linux kernel version 6.3 the size of AUXV was changed:
commit 28c8e088427ad30b4260953f3b6f908972b77c2d
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Wed Jan 4 14:20:54 2023 -0500
rseq: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries
Two new auxiliary vector entries are introduced for rseq without
matching increment of the AT_VECTOR_SIZE_BASE, which causes failures
with CONFIG_HARDENED_USERCOPY=y.
Fixes: 317c8194e6ae ("rseq: Introduce feature size and alignment ELF auxiliary vector entries")
With this change AT_VECTOR_SIZE increases from 40 to 50 on aarch64. CRIU
uses AT_VECTOR_SIZE to read the content of /proc/PID/auxv
auxv_t mm_saved_auxv[AT_VECTOR_SIZE];
ret = read(fd, mm_saved_auxv, sizeof(mm_saved_auxv));
Now the test works again on aarch64.
Signed-off-by: Adrian Reber <areber@redhat.com>
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in the following error:
# ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
userns is supported
=== Run 1/1 ================ zdtm/static/unlink_regular00
==================== Run zdtm/static/unlink_regular00 in ns ====================
Skipping rtc at root
Start test
Test is SUID
./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
Run criu dump
*** buffer overflow detected ***: terminated
############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
Test output: ================================
<<< ================================
Send the 9 signal to 47
Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
##################################### FAIL #####################################
According to the backtrace:
#0 __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
#5 0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
#6 0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
#7 0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
#8 0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
#9 create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
#10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
at criu/files-reg.c:1198
#11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
#12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
#13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
#14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
at criu/files.c:657
#15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
#16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
#17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293
This line is the problem:
snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);
The problem was that the `-1` was inside the parentheses and not
outside. This way the destination size was increased by 1 instead of
being decreased by 1, which triggered the buffer overflow detection.
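The arithmetic can be checked in isolation. In this sketch `off` stands
for `tmp - link_name`, and the two function names are hypothetical:

```c
#include <stddef.h>

/* Space left when writing at tmp + 1, where tmp = buf + off. */
static size_t space_fixed(size_t buf_size, ptrdiff_t off)
{
	return buf_size - (size_t)off - 1;
}

/* The buggy form: "- 1" inside the parentheses adds one instead. */
static size_t space_buggy(size_t buf_size, ptrdiff_t off)
{
	return buf_size - (size_t)(off - 1);
}
```

With a 4096-byte buffer and an offset of 41, the buggy form yields 4056
while the fixed form yields 4054, so the buggy size overstates the
available space by two bytes.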
Signed-off-by: Adrian Reber <areber@redhat.com>
Based on the code, the `ret` variable at this point does not
represent the task state, so this log message should be
moved to a position after the `compel_wait_task()` function.
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
When using the nftables network locking backend and restoring a process
a second time the network locking has already been deleted by the first
restore. The second restore will print out to the console text like:
Error: Could not process rule: No such file or directory
delete table inet CRIU-202621
With this change CRIU's log FD is used by libnftables stdout and stderr.
Signed-off-by: Adrian Reber <areber@redhat.com>
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.
In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).
When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.
Alas, I have absolutely no way to test this, so please review carefully.
[1]: https://github.com/opencontainers/runc/issues/4273
[2]: https://github.com/opencontainers/runc/issues/4457
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are a few issues with the freeze_processes logic:
1. Commit 9fae23fbe2 grossly (by 1000x) miscalculated the number of
attempts required; as a result, we are seeing something like this:
> (00.000340) freezing processes: 100000 attempts with 100 ms steps
> (00.000351) freezer.state=THAWED
> (00.000358) freezer.state=FREEZING
> (00.100446) freezer.state=FREEZING
> ...close to 100 lines skipped...
> (09.915110) freezer.state=FREEZING
> (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
> (10.000563) freezer.state=FREEZING
For 10s with 100ms steps we only need 100 attempts, not 100000.
2. When the timeout is hit, the "failed to freeze cgroup" error is not
printed, and the log_unfrozen_stacks is not called either.
3. The nanosleep at the last iteration is useless (this was hidden by
issue 1 above, as the timeout was hit first).
Fix all these.
While at it,
4. Amend the error message with the number of attempts, sleep duration,
and timeout.
5. Modify the "freezing cgroup" debug message to be in sync with the
above error.
Was:
> freezing processes: 100000 attempts with 100 ms steps
Now:
> freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s
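The corrected attempt count from item 1 is plain unit arithmetic. A
sketch with a hypothetical function name, taking the timeout in seconds
and the step in milliseconds:

```c
/* Number of polling attempts needed to cover timeout_s with step_ms sleeps. */
static unsigned long freeze_attempts(unsigned long timeout_s,
				     unsigned long step_ms)
{
	return timeout_s * 1000 / step_ms;
}
```

For a 10s timeout and 100ms steps this gives 100 attempts, matching the
new debug message.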
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.
Fixes: #2537
Signed-off-by: Andrei Vagin <avagin@google.com>
Right now, this test fails with this error:
Error (criu/files-reg.c:1031): Can't dump ghost file
/criu/test/javaTests/omrvmem_000000626_Mlm48x of 2097152 size,
increase limit
Signed-off-by: Andrei Vagin <avagin@google.com>
cuda-checkpoint returns a positive CUDA error code when it runs into an issue,
and passing that along as the return value would cause errors to get ignored.
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
The vvar_vclock was introduced by [1]. Basically, the old vvar vma has
been split into two parts. In terms of C/R, these two vma-s can still be
treated as one.
[1] e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Signed-off-by: Andrei Vagin <avagin@google.com>
Fix for the following error when building CRIU on Rocky Linux 8:
criu/pidfd.c: In function ‘pidfd_open’:
criu/pidfd.c:119:17: error: ‘__NR_pidfd_open’ undeclared (first use in this function); did you mean ‘pidfd_open’?
return syscall(__NR_pidfd_open, pid, flags);
^~~~~~~~~~~~~~~
pidfd_open
criu/pidfd.c:119:17: note: each undeclared identifier is reported only once for each function it appears in
criu/pidfd.c:120:1: error: control reaches end of non-void function [-Werror=return-type]
}
^
criu/pidfd.c: At top level:
cc1: error: unrecognized command line option ‘-Wno-unknown-warning-option’ [-Werror]
cc1: error: unrecognized command line option ‘-Wno-dangling-pointer’ [-Werror]
cc1: all warnings being treated as errors
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We need to dynamically calculate TASK_SIZE depending
on the MMU on RISC-V systems. [We use an analogous
approach on aarch64/ppc64le.]
This change was tested on a physical machine:
StarFive VisionFive 2
isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
mmu : sv39
uarch : sifive,u74-mc
mvendorid : 0x489
marchid : 0x8000000000000007
mimpid : 0x4210427
hart isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
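To illustrate why the task size depends on the MMU mode, here is a
rough Python sketch. It assumes that user space occupies the lower half
of the virtual address space, so sv39 gives 2^38. The table and both
helper names are assumptions for illustration; the actual patch may
determine the value differently (for example, by probing).

```python
# Assumption: user space is the lower half of the virtual address space.
TASK_SIZE_BY_MMU = {
    "sv39": 1 << 38,
    "sv48": 1 << 47,
    "sv57": 1 << 56,
}


def mmu_mode(cpuinfo_text):
    """Return the value of the first "mmu" line in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "mmu":
            return value.strip()
    return None


def task_size(mode):
    """Look up the assumed user-space size for an MMU mode such as "sv39"."""
    try:
        return TASK_SIZE_BY_MMU[mode]
    except KeyError:
        raise ValueError("unsupported MMU mode: %r" % (mode,))
```

For the board listed above, mmu_mode() yields "sv39" and task_size()
yields 1 << 38 under this assumption.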
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
We don't need to have compel/arch/riscv64/plugins/std/syscalls/syscalls.S
tracked in git. It is autogenerated. We also need to update our .gitignore
to ignore autogenerated files with syscall tables.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.
This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.
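The decision described above can be sketched roughly as follows. This
is an illustrative Python sketch, not the plugin's C code; the function
name is hypothetical and the "running" state name is an assumption
(only "locked" and "checkpointed" are named in this message).

```python
def actions_needed(initial_state):
    """Return the cuda-checkpoint actions still required for a task."""
    if initial_state == "running":  # assumed name of the default state
        return ["lock", "checkpoint"]
    if initial_state == "locked":
        # Already locked before CRIU was invoked: skip the "lock" action.
        return ["checkpoint"]
    if initial_state == "checkpointed":
        # Nothing left to do for this task.
        return []
    raise ValueError("unexpected CUDA task state: %r" % (initial_state,))
```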
In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.
Another use case for this functionality is optimizing resource
utilization, where CUDA tasks with low priority are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We have multiple processes open a pidfd to a common dead process.
After C/R we check whether the inode numbers for these pidfds are
equal.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Currently, the `waitpid()` call on the tmp process can be made by a
process which is not its parent. This causes restore to fail.
This patch instead selects one process to create the tmp process and
open all the fds that point to it. These fds are sent to the correct
process(es).
Fixes: #2496
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.
The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.
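The availability check can be sketched like this. It is a Python
illustration of the idea rather than the plugin's C code; the function
name and the overridable directory argument are assumptions made so the
sketch can be exercised without a GPU.

```python
import os


def gpu_available(gpus_dir="/proc/driver/nvidia/gpus/"):
    """Return True if the driver exposes at least one GPU entry."""
    try:
        return len(os.listdir(gpus_dir)) > 0
    except OSError:
        # Directory missing or unreadable: treat as "no GPU available".
        return False
```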
[1] https://github.com/NVIDIA/gpu-operator
Fixes: #2509
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.
However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads. To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.
To reflect this new behavior, the following changes are applied:
- `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
- `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
- `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`
Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.
Fixes: #2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls
`return ret`. However, the value of `ret` has been set to `0` in the lines
above and CRIU does not handle the error properly.
This problem is related to https://github.com/checkpoint-restore/criu/issues/2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.
The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.
Fix this by:
- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names
With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.
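A small Python illustration of the collision, using the naming pattern
quoted above. The helper name and the example run IDs are hypothetical;
only the "crtools-fd-%d-%d" pattern comes from this message.

```python
def socket_name(pid, run_id):
    """Build a socket name from the restore PID and the run ID."""
    return "crtools-fd-%d-%d" % (pid, run_id)


# Two restores in different PID namespaces can both see PID 1. With an
# uninitialized run ID (0) the names are identical and therefore clash
# when the network namespace is shared.
uninitialized = (socket_name(1, 0), socket_name(1, 0))

# With distinct run IDs (values chosen arbitrarily here) they differ.
initialized = (socket_name(1, 0x1A2B), socket_name(1, 0x3C4D))
```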
Fixes: #2499
[ avagin: minor code changes ]
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
After a fork, both the child and parent processes may trigger a page fault (#PF)
at the same virtual address, referencing the same position in the page image.
If deduplication is enabled, the last process to trigger the page fault will fail.
Therefore, deduplication should be disabled after a fork to prevent this issue.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.
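The general technique looks roughly like this in Python. It is only a
sketch of blocking SIGCHLD around fork/kill/waitpid, not the CRIU
implementation; the function name is hypothetical.

```python
import os
import signal


def create_and_reap_tmp_process():
    """Fork a helper, kill it and reap it with SIGCHLD blocked meanwhile."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            # Child: wait until the parent kills us; never return.
            try:
                while True:
                    signal.pause()
            finally:
                os._exit(1)
        os.kill(pid, signal.SIGKILL)
        # No SIGCHLD handler can run between kill() and waitpid() here,
        # so nothing else can reap the child first.
        _, status = os.waitpid(pid, 0)
        return os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
    finally:
        # Restore the caller's original signal mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```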
Fixes: #2490
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows us to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugins are missing,
and migrate a process that does not use a GPU from a GPU-enabled system
to a CPU-only environment.
We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between a *missing* and an *empty* field:
- When the field is missing, it indicates that the checkpoint was
created with a previous version of CRIU, and all plugins should be
*enabled* during restore.
- When the field is empty, it indicates that no plugins were used during
checkpointing. Thus, all plugins can be *disabled* during restore.
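The missing-versus-empty distinction can be sketched as follows. This
Python sketch models the inventory as a plain dict; the dict layout and
the function name are assumptions for illustration, not the actual
image format or restore code.

```python
def plugins_to_enable(inventory, installed):
    """Pick which installed plugins to enable for a given inventory."""
    entry = inventory.get("plugins_entry")
    if entry is None:
        # Field missing: image from an older CRIU, keep everything enabled.
        return list(installed)
    used = entry.get("plugins", [])
    missing = [name for name in used if name not in installed]
    if missing:
        raise RuntimeError("required plugins not installed: %s" % ", ".join(missing))
    # Field present: enable only what the checkpoint used (possibly nothing).
    return [name for name in installed if name in used]
```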
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
305 | elif field.type in _basic_cast:
306 | cast = _basic_cast[field.type]
307 | if pretty and (cast == int):
| ^^^^^^^^^^^ E721
308 | if is_hex:
309 | # Fields that have (criu).hex = true option set
|
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
377 | elif field.type in _basic_cast:
378 | cast = _basic_cast[field.type]
379 | if (cast == int) and is_string(value):
| ^^^^^^^^^^^ E721
380 | if _marked_as_dev(field):
381 | return encode_dev(field, value)
|
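The shape of the fix is to compare types by identity. A minimal sketch,
where the `_basic_cast` contents and the helper name are hypothetical
stand-ins for the real table in pb2dict.py:

```python
# Hypothetical subset of a type-to-cast table.
_basic_cast = {"int32": int, "string": str, "bool": bool}


def is_int_cast(field_type):
    """Return True if the cast registered for field_type is exactly int."""
    cast = _basic_cast[field_type]
    # `is` compares type objects by identity, which is what E721 asks for.
    return cast is int
```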
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
After C/R of pidfds that point to dead processes, their inodes might
change. But if two pidfds point to the same dead process, they should
continue to do so after C/R.
This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.
Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Validate that pidfds can be used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.
For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd
Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9.
Hence, the check for pidfds occurs before the check for regular files.
For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.
This commit does not include support for pidfds that point to a specific
thread, i.e., pidfds opened with the `PIDFD_THREAD` flag.
Fixes: #2258
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.
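Picking the last NSpid value can be sketched as follows. This is a
Python illustration operating on fdinfo text; the function name is
hypothetical and the C parsing code in CRIU differs.

```python
def last_nspid(fdinfo_text):
    """Return the last pid on the NSpid line, or None if there is none."""
    for line in fdinfo_text.splitlines():
        if line.startswith("NSpid:"):
            pids = line.split()[1:]
            if pids:
                # The last value belongs to the innermost pid namespace.
                return int(pids[-1])
    return None
```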
After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following warnings that appear
when building an RPM package:
+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Major changes:
* CUDA plugin to support checkpointing and restoring NVIDIA CUDA applications.
* Shadow stack support
* Pagemap cache: Added support for PAGEMAP_SCAN ioctl
The full changelog can be found here: https://criu.org/Download/criu/4.0.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The topology parsing assumed that all parameter names were
30 characters or fewer, but
recommended_sdma_engine_id_mask
is 31 characters.
Make the maximum length a macro, and set it to 64.
Signed-off-by: David Francis <David.Francis@amd.com>
The presence of /dev/nvidiactl indicates that the system has a
compatible NVIDIA GPU driver installed and that the GPU is accessible to
the operating system.
Signed-off-by: Andrei Vagin <avagin@google.com>
Some plugins (e.g., CUDA) may not function correctly when processes are
frozen using cgroups. This change introduces a mechanism to disable the
use of freeze cgroups during process seizing, even if explicitly
requested via the --freeze-cgroup option.
The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.
Signed-off-by: Andrei Vagin <avagin@google.com>
This patch fixes the following typos reported by codespell:
./test/others/bers/bers.c:394: dependin ==> depending, depend in
./criu/kerndat.c:837: hitted ==> hit
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The `uninstall_module.py` script is a wrapper for the `pip uninstall`
command that enables support for specifying installation prefix
(i.e., `--prefix`). When this functionality is used, we intentionally
set `sys.path` to include only search paths for the specified prefix
to avoid unintentional uninstallation of packages in system paths.
Since `importlib_metadata` version 8.1.0, the `Distribution.from_name()`
method has been modified [1] to perform additional pre-processing of
Distribution objects [2] that requires loading distribution metadata
and results in the following error:
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 422, in <lambda>
buckets = bucket(dists, lambda dist: bool(dist.metadata))
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 454, in metadata
from . import _adapters
File "/usr/local/lib/python3.12/site-packages/importlib_metadata/_adapters.py", line 3, in <module>
import email.message
File "/usr/lib64/python3.12/email/message.py", line 11, in <module>
import quopri
ModuleNotFoundError: No module named 'quopri'
This error occurs because we have excluded system paths from the list
of search paths (`sys.path`).
However, this pre-processing is not required for our use case, as we
only use the discovery mechanism of importlib_metadata to resolve the
metadata directory path of the module being uninstalled.
To fix this problem, this patch updates `uninstall_module` to avoid the
`from_name()` method and use `discover(name=package_name)` directly.
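A minimal sketch of the lookup, assuming the standard library's
`importlib.metadata` mirrors the `importlib_metadata` backport for this
call; the helper name is hypothetical and this is not the exact code in
`uninstall_module.py`.

```python
from importlib.metadata import Distribution


def find_distribution(package_name):
    """Return the first matching distribution, or None if not found.

    discover() only walks the search paths; it does not load the
    distribution metadata, so no `email` import is triggered here.
    """
    return next(iter(Distribution.discover(name=package_name)), None)
```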
[1] a65c29adc0
[2] a65c29ad/importlib_metadata/__init__.py (L391)
Fixes: #2468
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When attempting to checkpoint a container with CUDA processes,
CRIU could fail with the following error:
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1
Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with
In this situation, the target process is locked, but CRIU fails due to
a timeout and exits with an error. We need to make sure that the target
PID is unlocked in such case.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Some test environments (Actuated runners for example) do not support
macvlan devices. Skip tests depending on it automatically.
Signed-off-by: Adrian Reber <areber@redhat.com>
Previously the check was just if /sys/fs/selinux is mounted. This
extends the check to see if all necessary tools are installed.
Signed-off-by: Adrian Reber <areber@redhat.com>
Running 'crit x ./ rss' on aarch64 crashes with:
File "/home/criu/crit/crit/__main__.py", line 331, in explore_rss
while vmas[vmi]['start'] < pme:
~~~~^^^^^
IndexError: list index out of range
This adds an additional check to the while loop to avoid accessing
indexes out of range.
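The guarded loop has roughly this shape. Only the loop condition is
taken from the traceback above; the function name, the loop body and
the return value are hypothetical.

```python
def vmas_starting_below(vmas, vmi, pme):
    """Collect VMAs from index vmi whose start is below address pme."""
    collected = []
    # Check the index first so vmas[vmi] is never evaluated out of range.
    while vmi < len(vmas) and vmas[vmi]["start"] < pme:
        collected.append(vmas[vmi])
        vmi += 1
    return collected, vmi
```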
Signed-off-by: Adrian Reber <areber@redhat.com>
Errors on aarch64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
| ^~~~~~~
cc1: all warnings being treated as errors
Errors on ppc64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
| ^~~~~~~
cc1: all warnings being treated as errors
In file included from amdgpu_plugin_util.c:38:
amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
196 | pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
197 | pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
198 | pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
199 | pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
| ^~~~~~~
cc1: all warnings being treated as errors
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Skip cross-compilation on armv7 because, among many other errors,
it fails with the following:
In file included from ../../include/common/lock.h:9,
from ../../criu/include/files.h:9,
from amdgpu_plugin.c:30:
../../include/common/asm/atomic.h:60:2: error: #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
60 | #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
| ^~~~~
../../include/common/asm/atomic.h: In function 'atomic_add_return':
../../include/common/asm/atomic.h:81:9: error: implicit declaration of function 'smp_mb' [-Werror=implicit-function-declaration]
81 | smp_mb();
| ^~~~~~
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
To enable cross-compile we need to use the CC definition from
criu/scripts/nmk/scripts/tools.mk:
CC := $(CROSS_COMPILE)$(HOSTCC)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Here is an example of how to run one test:
$ python test/zdtm.py run -t zdtm/static/env00 --ignore-taint --mocked-cuda-checkpoint
Signed-off-by: Andrei Vagin <avagin@google.com>
1. In GPU live-migration scenarios, a VMA address assigned automatically
by the OS may conflict with an existing VMA.
2. So, we should give the user a choice.
Signed-off-by: haozi007 <liuhao27@huawei.com>
New internal glibc types __timeval64 [1] and __suseconds64_t [2] have
been introduced as a solution for the Y2038 problem [3]. These 64-bit
types are used across all architectures. However, this change causes
the following build errors when cross-compiling on ARMv7 (armhf):
criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 5 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
49 | pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
| ^~~~~~~~~~~~~~~~~~~~~~~~
50 | (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
| ~~~~~~~~~~~~~~~~~~~~~
| |
| __suseconds64_t {aka long long int}
criu/timer.c:49:17: error: format '%ld' expects argument of type 'long int', but argument 7 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
49 | pr_info("Restored %s timer to %" PRId64 ".%ld -> %" PRId64 ".%ld\n", n,
| ^~~~~~~~~~~~~~~~~~~~~~~~
50 | (int64_t)val->it_value.tv_sec, val->it_value.tv_usec,
51 | (int64_t)val->it_interval.tv_sec, val->it_interval.tv_usec);
| ~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __suseconds64_t {aka long long int}
ns.c:234:48: error: format '%ld' expects argument of type 'long int', but argument 5 has type 'time_t' {aka 'long long int'} [-Werror=format=]
234 | len = snprintf(buf, sizeof(buf), "%d %ld 0", clk_id, offset);
| ~~^ ~~~~~~
| | |
| long int time_t {aka long long int}
| %lld
msg.c:58:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type '__suseconds64_t' {aka 'long long int'} [-Werror=format=]
58 | off += sprintf(buf + off, ".%.3ld: ", tv.tv_usec / 1000);
| ~~~~^ ~~~~~~~~~~~~~~~~~
| | |
| long int __suseconds64_t {aka long long int}
| %.3lld
../lib/zdtmtst.h:137:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
137 | test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~
pthread_timers_h.c:72:17: note: in expansion of macro 'pr_perror'
72 | pr_perror("wrong interval: %ld:%ld", itimerspec.it_interval.tv_sec, itimerspec.it_interval.tv_nsec);
| ^~~~~~~~~
vdso00.c:22:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
22 | test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
| ~~~~^ ~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %10lli
vdso00.c:29:32: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
29 | test_msg("%d time: %10li\n", getpid(), tv.tv_sec);
| ~~~~^ ~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %10lli
vdso01.c:357:42: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
357 | test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %lli
vdso01.c:357:72: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
357 | test_msg("gettimeofday: tv_sec %li vdso_gettimeofday: tv_sec %li\n", tv1.tv_sec, tv2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
|
vdso01.c:328:43: error: format '%li' expects argument of type 'long int', but argument 2 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
328 | test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
| %lli
vdso01.c:328:74: error: format '%li' expects argument of type 'long int', but argument 3 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
328 | test_msg("clock_gettime: tv_sec %li vdso_clock_gettime: tv_sec %li\n", ts1.tv_sec, ts2.tv_sec);
| ~~^ ~~~~~~~~~~
| | |
| long int __time64_t {aka long long int}
|
../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'time_t' {aka 'long long int'} [-Werror=format=]
144 | test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~~
mtime_mmap.c:80:17: note: in expansion of macro 'fail'
80 | fail("mtime %ld wasn't updated on mmapped %s file", mtime_new, filename);
| ^~~~
../lib/zdtmtst.h:144:26: error: format '%ld' expects argument of type 'long int', but argument 4 has type '__time64_t' {aka 'long long int'} [-Werror=format=]
144 | test_msg("FAIL: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, \
| ^~~~~~~~~~~~~~~
mtime_mmap.c:101:17: note: in expansion of macro 'fail'
101 | fail("After migration, mtime changed to %ld", fst.st_mtime);
| ^~~~
[1] https://sourceware.org/git/?p=glibc.git;h=504c98717062cb9bcbd4b3e59e932d04331ddca5
[2] https://sourceware.org/git/?p=glibc.git;h=3fced064f23562ec24f8312ffbc14950993969e6
[3] https://en.wikipedia.org/wiki/Year_2038_problem
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
By default, if the "CRIU_LIBS_DIR" environment variable is not set,
CRIU will load all plugins installed in `/usr/lib/criu`. This may
result in running the ZDTM tests with plugins for a different version
of CRIU (e.g., installed from a package).
This patch updates ZDTM to always set the "CRIU_LIBS_DIR" environment
variable and use a local "plugins" directory. This directory contains
copies of the plugin files built from source. In addition, this patch
adds the `--criu-plugin` option to the `zdtm.py run` command, allowing
tests to be run with specified CRIU plugins.
Example:
- Run test only with AMDGPU plugin
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu
- Run test only with CUDA plugin
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin cuda
- Run test with both AMDGPU and CUDA plugins
./zdtm.py run -t zdtm/static/busyloop00 --criu-plugin amdgpu cuda
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When the cuda-checkpoint tool is not installed, execvp() is expected to
fail and return -1. In this case, we need to call exit() to terminate
the child process that was created earlier with fork().
Since CRIU can be used with applications that do not use CUDA, even
when the CUDA plugin is installed, this patch also updates the log
messages to show debug and warning (instead of error) when the
cuda-checkpoint tool is not found in $PATH.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Show information about mounts available on the host filesystem.
This is useful for debugging.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU provides two plugins for checkpoint/restore of GPU applications:
amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to
enable restore:
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late)
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)
However, CRIU currently does not support running more than one plugin
for the same hook. As a result, when both plugins are installed, the
resume function for CUDA applications is not executed. To fix this,
we need to make sure that both `plugin_resume_devices_late()` functions
return `-ENOTSUP` when restore is not supported.
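The intended fall-through can be sketched as follows. This Python
sketch models plugins as dicts of callables; the structure and the
function name are assumptions for illustration, not the run_plugins()
implementation.

```python
import errno


def run_hook(plugins, hook):
    """Call `hook` on each plugin in turn until one of them handles it."""
    for plugin in plugins:
        handler = plugin.get(hook)
        if handler is None:
            continue
        ret = handler()
        if ret == -errno.ENOTSUP:
            # This plugin declined; give the next one a chance.
            continue
        return ret
    # No plugin handled the hook.
    return -errno.ENOTSUP
```

If the first plugin returned 0 for a restore it does not support, the
loop would stop there and the second plugin would never run.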
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The plugin hook "PAUSE_DEVICES" was recently introduced in the following
commit. This hook was intended to execute the cuda-checkpoint tool
before the process tree is frozen. However, the run_plugins() call has
been placed immediately *after* freeze_processes(). This causes the
cuda-checkpoint tool to hang indefinitely during the checkpointing
of CUDA applications running in containers, eventually leading to its
termination by the timeout alarm.
a85f488595
criu/plugin: Introduce new plugin hooks PAUSE_DEVICES and CHECKPOINT_DEVICES to be used during pstree collection
This problem can be reproduced with the following example:
sudo podman run -d --rm \
--device nvidia.com/gpu=all --security-opt=label=disable \
quay.io/radostin/cuda-counter
sudo podman container checkpoint -l -e /tmp/checkpoint.tar
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
For historical reasons, some tools like rpm [1] or ldd [2,3]
may expect the executable bit to be present for the correct
identification of shared libraries. The executable bit on .so
files is set by default by compilers (e.g., GCC). It is not
strictly necessary but primarily a convention.
[1] https://docs.fedoraproject.org/en-US/package-maintainers/CommonRpmlintIssues/#unstripped_binary_or_object
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/ldd.bash.in;h=d6b640df;hb=HEAD#l154
[3] $ sudo ldd /usr/lib/criu/*.so
/usr/lib/criu/amdgpu_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/amdgpu_plugin.so'
linux-vdso.so.1 (0x00007fd0a2a3e000)
libdrm.so.2 => /lib64/libdrm.so.2 (0x00007fd0a29eb000)
libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007fd0a29de000)
libc.so.6 => /lib64/libc.so.6 (0x00007fd0a27fc000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd0a2a40000)
/usr/lib/criu/cuda_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/cuda_plugin.so'
linux-vdso.so.1 (0x00007f1806e13000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1806c08000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1806e15000)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch updates the dependencies section of the AMDGPU plugin man
page to reflect that the plugin has been merged upstream and to fix a
formatting issue.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In commit 2e456ccf0c34a056e3ccafac4a0c7effef14d918 ("Linux: Make
__rseq_size useful for feature detection (bug 31965)") glibc 2.40
changed the meaning of __rseq_size slightly: it is now the size
of the active/feature area (20 bytes initially), and not the size
of the entire initially defined struct (32 bytes including padding).
The reason for the change is that the size including padding does not
allow detection of newly added features while previously unused
padding is consumed.
The prep_libc_rseq_info change in criu/cr-restore.c is not necessary
on kernels which have full ptrace support for obtaining rseq
information because the code is not used. On older kernels, it is
a correctness fix because with size 20 (the new value), rseq
registration would fail.
The two other changes are required to make rseq unregistration work
in tests.
Signed-off-by: Florian Weimer <fweimer@redhat.com>
All cgroup test cases share the same cgroup roots (zdtmtst and the
zdtmtst.defaultroot controller) and create child subgroups there for
testing. This can cause problems when cgroup test cases run in parallel.
For example, test case A dumps the child subgroup of test case B, since
it is under the same cgroup root, but while test case A is being
restored, test case B completes and cleans up its subgroup directory.
This causes an error in the restore of test case A.
This commit adds the excl flag to the description of all cgroup test
cases so that they don't run in parallel.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The CI tests with CentOS 7 have been disabled and removed [1,2].
This patch removes the obsolete Makefile targets for these tests.
[1] 24bc083653
[2] f8466ca798
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Sometimes, due to sigblockmask inheritance, cgroupd can inherit a
blocked SIGTERM. This leads to cgroupd ignoring the SIGTERM sent by
stop_cgroupd(), and CRIU gets stuck waiting for a cgroupd that never stops.
I see this happening in lxc-checkpoint, also saw this in OpenVZ jenkins
on cgroup_inotify00 test.
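The inheritance described above can be reproduced outside CRIU. The following is a minimal sketch, not the actual cgroupd code: it blocks SIGTERM, forks, and lets the child report its mask before and after unblocking the signal. The helper names are hypothetical, and unblocking in the child is only one plausible remedy assumed for illustration.

```python
import os
import signal


def sigterm_blocked():
    # SIG_BLOCK with an empty set changes nothing and returns the current mask.
    return signal.SIGTERM in signal.pthread_sigmask(signal.SIG_BLOCK, set())


def child_view_of_sigterm():
    """Fork with SIGTERM blocked; return the child's (inherited, after_unblock) state."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the blocked-signal mask was inherited across fork().
        os.close(rfd)
        inherited = sigterm_blocked()
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
        os.write(wfd, bytes([inherited, sigterm_blocked()]))
        os._exit(0)
    # Parent: collect the child's report and restore the original mask.
    os.close(wfd)
    data = os.read(rfd, 2)
    os.close(rfd)
    os.waitpid(pid, 0)
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return bool(data[0]), bool(data[1])
```

A child that starts with SIGTERM blocked never sees the signal delivered until it unblocks it, which matches the hang described in this message.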
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Duplicate the string in irmap_scan_path_add; otherwise it will be freed
before the next configuration input is parsed.
[ avagin: handle errors of xstrdup ]
Signed-off-by: Liu Hua <weldonliu@tencent.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:
compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
276 | if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
| ^~~~~~~~~~~~~~~~~
While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:
We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
that most definitions come from either one or another. Alas, on
Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
there is no way they can be used together. Let's rely on libc one.
Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.
Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.
Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Add support for the NVIDIA cuda-checkpoint utility. This requires an
r555 or newer driver along with the cuda-checkpoint binary.
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
PAUSE_DEVICES is called before a process is frozen and is used by the CUDA
plugin to place the process in a state that's ready to be checkpointed and
to quiesce any pending work.
CHECKPOINT_DEVICES is called after all processes in the tree have been frozen
and PAUSE'd and performs the actual checkpointing operation for CUDA
applications.
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
Restore rseq_cs state before calling RESUME_DEVICES_LATE, as the CUDA plugin will
temporarily unfreeze a thread during the plugin hook to assist with device
restore.
Run the plugin finalizer later in the dump sequence, since the finalizer is used
by the CUDA plugin to handle some process cleanup.
Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
Move PYTHON_EXTERNALLY_MANAGED and PIP_BREAK_SYSTEM_PACKAGES
into Makefile.install to avoid code duplication. In addition, add
PIPFLAGS variable to enable specifying pip options during installation.
This is particularly useful for packaging, where it is common for `pip install`
to run in an environment with pre-installed dependencies and without internet
access. In such environment, we need to specify the following options:
--no-build-isolation --no-index --no-deps
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Adds an exit_signal static method to criu_cli, criu_config and criu_rpc,
used to detect a crash.
Fixes: #350
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
This commit adds a `--preload-libfault` option to ZDTM's run command.
This option runs CRIU with LD_PRELOAD to intercept libc functions
such as pread(). This makes it possible to simulate special cases,
for example, when a successful call to pread() transfers fewer
bytes than requested.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It is possible for pread() to return fewer bytes than
requested. In such a case, we need to repeat the read operation
with appropriate offset.
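The retry logic can be sketched as follows. This is an illustration in Python rather than the CRIU C code: `pread_full` and `short_pread` are hypothetical names, and `short_pread` merely emulates a pread() that transfers at most three bytes per call.

```python
import os
import tempfile


def pread_full(fd, size, offset, pread=os.pread):
    """Read up to `size` bytes at `offset`, retrying after short reads."""
    chunks = []
    while size > 0:
        chunk = pread(fd, size, offset)
        if not chunk:
            break  # end of file: nothing more to read
        chunks.append(chunk)
        # Advance past what was actually transferred, not what was requested.
        offset += len(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def short_pread(fd, size, offset):
    # Emulates a successful pread() that returns fewer bytes than requested.
    return os.pread(fd, min(size, 3), offset)


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        return pread_full(f.fileno(), 8, 1, pread=short_pread)
```

Here `demo()` needs three underlying reads to collect the eight requested bytes.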
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The unix_conf_op function reads the size of the sysctl entry array
twice. gcc thinks that it can lead to a time-of-check to time-of-use
(TOCTOU) race condition if the array size changes between the two reads.
Fixes: #2398
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The rawhide tests run in a container. Containers always have SELinux
disabled from the inside. Somehow /sys/fs/selinux is now mounted. We
used the existence of that directory to detect whether SELinux is
available. This seems to be no longer reliable.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
A fault-injection test was introduced in commit [1] and later removed in
commit [2]. This patch removes the obsolete Makefile target.
[1] b95407e264
test: check, that parasite can rollback itself (v2)
[2] 2cb4532e26
tests: remove zdtm.sh (v2)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Replace sprintf() with snprintf() and specify maximum length of
characters to avoid potential overflow.
Reported-by: GitHub CodeQL (https://codeql.github.com/)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Currently there are no socket option test cases for TCP_CORK and
TCP_NODELAY, this commit adds related test cases.
The socket option test cases for TCP_KEEPCNT, TCP_KEEPIDLE, and
TCP_KEEPINTVL already exist in socket-tcp_keepalive.c, so they are
not included in this test case.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Currently some TCP socket option information is stored in SkOptsEntry,
which is a little confusing.
SkOptsEntry should only contain socket options that are common to
all sockets.
In this commit move the TCP-specific socket options from SkOptsEntry
to TcpOptsEntry.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Currently some of the TCP socket option information is stored in the
TcpStreamEntry, but the information in the TcpStreamEntry is only
restored after the TCP socket has established connection, which
results in these TCP socket options not being restored for
unconnected TCP sockets.
In this commit move the TCP socket options from TcpStreamEntry to
TcpOptsEntry and add dump_tcp_opts() and restore_tcp_opts() for TCP
socket options dump and restore.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
On some systems, nft binary might not be installed, or some kernel
options might be unconfigured, resulting in something like this:
sudo unshare -n nft create table inet CRIU
Error: Could not process rule: Operation not supported
create table inet CRIU
^^^^^^^^^^^^^^^^^^^^^^^
This is similar to what kerndat_has_nftables_concat() does, and if the
outcome is the same, it returns an error to kerndat_init(), and an error
from kerndat_init() is considered fatal.
Let's relax the check, returning mere "feature not working" instead of
a fatal error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1) In dump_tcp_conn_state, if return from libsoccr_save is >=0, we check
that sizeof(struct libsoccr_sk_data) returned from libsoccr_save is
equal to sizeof(struct libsoccr_sk_data) we see in dump_tcp_conn_state
(probably to check if we use the right library version). And if sizes
are different we go to err_r, which just returns ret, which can
theoretically be 0 (if size in library is zero) and that would lead
dump_one_tcp to treat this as success though it is an obvious error.
2) If dump_opt or open_image fails we don't explicitly set ret
and rely on sizeof(struct libsoccr_sk_data), previously assigned to ret,
not being 0. I don't really like it; it makes reading the code too complex.
3) We have a lot of err_* labels which do exactly the same thing, there
is no point in having all of them, also it is better to choose the name
of the label based on what it really does.
So let's refactor error handling to avoid these inconsistencies.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During restore, CRIU prints "Enqueue page-read" messages for
each page-read request [1]. However, this message does not
provide useful information, increases performance overhead
during restore and the size of log file.
$ ./zdtm.py run -t zdtm/static/maps06 -f h -k always
$ grep 'Enqueue page-read' dump/zdtm/static/maps06/56/1/restore.log | wc -l
20493
This commit replaces these log messages with a single message
that shows the number of enqueued page-read requests.
$ grep 'enqueued' dump/zdtm/static/maps06/56/1/restore.log
(00.061449) 56: nr_enqueued: 20493
[1] 91388fc
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
1. Tell which RPMs or DEBs are required in all cases.
2. Use $(info ...) everywhere.
3. Drop extra nested $(info), instead use (and document) a simpler kludge.
4. Simplify and unify the language, add missing periods.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently we have tabs + spaces on the wrapped line but the wrapped part
is not aligned to the opening bracket.
Fixes: bbe26d1b7 ("timer: fix allignment in function definition")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
CircleCI currently prints out the following warning:
This job is using a deprecated image 'ubuntu-2004:202010-01', please update to a newer image
According to https://discuss.circleci.com/t/linux-image-deprecations-and-eol-for-2024/
the recommended image name is: "image: default"
Signed-off-by: Adrian Reber <areber@redhat.com>
This patch extends the sched_policy00 test case to verify that
the SCHED_RESET_ON_FORK flag is restored correctly.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch extends CRIU with support for SCHED_RESET_ON_FORK.
When the SCHED_RESET_ON_FORK flag is set, the following rules
apply for subsequently created children:
- If the calling thread has a scheduling policy of SCHED_FIFO or
SCHED_RR, the policy is reset to SCHED_OTHER in child processes.
- If the calling process has a negative nice value, the nice value
is reset to zero in child processes.
(See 'man 7 sched')
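On Linux the flag is reported as a bit ORed into the policy value returned by sched_getscheduler(2), so a checkpoint tool has to separate the two before saving them. A minimal sketch of that split, using hypothetical helper names and not the actual CRIU implementation:

```python
import os


def split_policy(value):
    """Split a sched_getscheduler()-style value into (policy, reset_on_fork)."""
    reset_on_fork = bool(value & os.SCHED_RESET_ON_FORK)
    return value & ~os.SCHED_RESET_ON_FORK, reset_on_fork


def current_policy(pid=0):
    # pid 0 means the calling thread.
    return split_policy(os.sched_getscheduler(pid))
```

On restore, the two parts would be ORed back together before being passed to sched_setscheduler(2).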
Fixes: #2359
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
A memory interval is a half-open interval, so the condition
when pr->pe->vaddr == vma->e->end should not be interpreted
as an intersection and should cause vma to be marked with VMA_NO_PROT_WRITE.
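The half-open convention can be illustrated with a small overlap check. This is a sketch with hypothetical names, not the CRIU code: each range is [start, end), so a range that begins exactly where another ends does not intersect it.

```python
def intersects(a_start, a_end, b_start, b_end):
    """Return True if half-open ranges [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


# A VMA covering [0x1000, 0x3000) and a page entry starting at the VMA's end.
vma_start, vma_end = 0x1000, 0x3000
touching = intersects(0x3000, 0x4000, vma_start, vma_end)
overlapping = intersects(0x2000, 0x4000, vma_start, vma_end)
```

With strict comparisons, `touching` is false and `overlapping` is true; using `<=` instead would wrongly report the touching case as an intersection.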
Fixes: #2364
Signed-off-by: Artem Trushkin <at.120@ya.ru>
The restore of a task with shadow stack enabled adds these steps:
* switch from the default shadow stack to a temporary shadow stack
allocated in the premapped area
* unmap CRIU mappings; nothing changed here, but it's important that
CRIU mappings can be removed only after switching to a temporary
shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
There are several gotchas when restoring a task with shadow stack:
* depending on the compiler options, glibc version and glibc tunables
CRIU can run with or without shadow stack.
* shadow stack VMAs are special, they must be created using a dedicated
map_shadow_stack() system call and can be modified only by a special
instruction (wrss) that is only available when shadow stack is
enabled.
* once shadow stack is enabled, it is not writable even with wrss;
writes to shadow stack can be only enabled with ptrace() and only when
shadow stack is enabled in the tracee.
* if the shadow stack is enabled during restore rather than by glibc,
calling retq after arch_prctl() that enables the shadow stack causes
#CP, so the function that enables shadow stack can never return.
Add the infrastructure required to cope with all of those:
* modify the restore code to allow trampoline (arch_shstk_trampoline)
that will enable shadow stack and call restore_task_with_children().
* add call to arch_shstk_unlock() right after the tasks are clone()ed;
this will allow unlocking shadow stack features and making shadow
stack writable.
* add stubs for architectures that do not support shadow stacks
* add implementation of arch_shstk_trampoline() and arch_shstk_unlock()
for x86, but keep it disabled; it will be enabled along with addition
of the code that will restore shadow stack in the restorer blob
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Detect if CRIU runs with shadow stack enabled and store the result in
kerndat.
Unlike most kerndat knobs, kdat_has_shstk() does not check for
availability of the shadow stack in the kernel, but rather checks if
criu runs with shadow stack enabled.
This depends on hardware availability, kernel and glibc support, compiler
options and glibc tunables, so kdat_has_shstk() must be called every
time CRIU starts and its result cannot be cached.
The result will be used by the code that controls shadow stack
enablement in the next commit.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Shadow stacks must be populated using special WRSS instruction. This
instruction is only available when shadow stack is enabled, calling it
with disabled shadow stack causes #UD.
Moreover, shadow stack VMAs cannot be mremap()ed and they must be
created using map_shadow_stack() system call. This requires delaying the
restore of shadow stacks to restorer blob after the CRIU mappings are
cleared.
Introduce rst_shstk_info structure to hold shadow stack parameters
required in the restorer blob and populate this structure in
arch_prepare_shstk() method.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Shadow stack VMAs cannot be mmap()ed, they must be created using
map_shadow_stack() system call and populated using special wrss
instruction available only when shadow stack is enabled.
Premap them to reserve virtual address space and populate it to have
their contents available for later copying after enabling shadow stack.
Along with the space required by shadow stack VMAs also reserve an extra
page that will be later used as a temporary shadow stack.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
The shadow stack VMAs require special care because they can only be
created and populated using special system calls.
Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in
/proc/pid/smaps
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they are pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.
In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.
For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.
In non-daemon mode parasite_trap_cmd() returns a non-positive value,
which makes it possible to distinguish daemon and non-daemon mode and
to properly stop at int3 in non-daemon mode.
Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
All architectures create on-stack structure for floating point save area
in compel_get_task_regs() if the caller passes NULL rather than a valid
pointer.
The only place that calls compel_get_task_regs() with NULL for floating
point save area is parasite_start_daemon() and it is simpler to define
this structure on stack of parasite_start_daemon().
The availability of floating point save data is required in
parasite_start_daemon() to detect shadow stack presence early during
parasite infection and will be used in later patches.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Currently, checkpoint/restore is supported only for cgroup v2 threaded
controllers. Threads originating in cgroup v1 environments are
restored to the main thread's cgroup. This change extends the support
to cgroup v1.
Signed-off-by: Stepan Pieshkin <stepanpieshkin@google.com>
This patch fixes the following lint error:
scripts/criu-ns:219:16: E713 [*] Test for membership should be `not in`
The change in this patch is auto-generated with `ruff --fix`.
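For reference, E713 is about spelling a negated membership test. A small illustration follows; the function and variable names are hypothetical and are not taken from scripts/criu-ns.

```python
def missing_keys(required, present):
    # `key not in present` is the idiomatic form. The flagged spelling,
    # `not key in present`, evaluates to the same result but reads as if
    # `not` applied to `key` alone.
    return [key for key in required if key not in present]
```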
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Ruff (https://github.com/astral-sh/ruff) is a Python linter
written in Rust, designed to replace Flake8. It is significantly
faster and actively maintained.
In addition to replacing flake8 with ruff, this patch also
creates separate makefile targets for ruff, shellcheck and
codespell, so that they can be tested independently.
RUFF_FLAGS can be used to specify options such as '--fix'.
Example:
make lint
make ruff RUFF_FLAGS=--fix
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following flake8 error:
python3 -m flake8 --config=scripts/flake8.cfg lib/pycriu/images/pb2dict.py
lib/pycriu/images/pb2dict.py:361:43: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
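The two alternatives named in the E721 message behave differently for subclasses. A short sketch, with hypothetical helper names unrelated to pb2dict.py:

```python
def is_exact_int(value):
    # Exact type check: uses `is` rather than `==`, and rejects subclasses
    # such as bool.
    return type(value) is int


def is_int_like(value):
    # Instance check: accepts int and any subclass, so True and False pass.
    return isinstance(value, int)
```

Which form is appropriate depends on whether subclasses should be accepted at the call site.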
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The commit introducing PAGE_IS_SOFT_DIRTY has not been merged
in kernel v6.7.x.
fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
e6a9a2cbc1
As a result, CRIU fails with the following error:
Error (criu/pagemap-cache.c:199): pagemap-cache: PAGEMAP_SCAN: Invalid argument'
Error (criu/pagemap-cache.c:225): pagemap-cache: Failed to fill cache for 63 (400000-402000)'
This patch updates check_pagemap() in kerndat to check if PAGE_IS_SOFT_DIRTY is supported.
Fixes: #2334
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
We already don't treat it as error in the plugin itself, but after
returning -1 from RESUME_DEVICES_LATE hook we print debug message in
criu about failed plugin, let's return 0 instead.
While at it, let's rename ret to exit_code.
Fixes: a9cbdad76 ("plugin/amdgpu: Don't print error for "No such process" during resume")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During the late stages of restore, each process being resumed gets
an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd
process info, this call will fail with -ESRCH. This is normal
behaviour, so we shouldn't print an error message for it.
Signed-off-by: David Francis <David.Francis@amd.com>
To improve readability, this patch changes the return type of
iptables_has_criu_jump_target() to a boolean, where 'true' indicates
that iptables has CRIU jump target and 'false' indicates otherwise.
Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch removes a leftover declaration for log_closedir()
which has been removed in the following commit:
dc80d6f125
log: get rid of LOG_DIR_FD_OFF and opening cwd in log_init()
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Let's use hooked nft chain which actually affects packets.
Fixes: e5f4d8c6f ("test/nfconntrack: use nft or iptables-legacy")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This change adds a new injectable fault (135) to disable PAGEMAP_SCAN and fall
back to reading pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
PAGEMAP_SCAN is a new ioctl that makes it possible to get page attributes
in a more efficient way than reading pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Commit [1] introduced a mechanism to auto-generate the files:
sys-exec-tbl*.c, syscalls*.S, syscall-codes*.h, and syscall*.h.
This commit also updated the gitignore rules to ignore auto-generated
files. However, after commit [2], the path for these files has changed
and the patterns specified in gitignore are no longer needed.
[1] bbc2f133 (x86/build: generate syscalls-{64,32}.built-in.o)
[2] 19fadee9 (compel: plugins,std -- Implement syscalls in std plugin)
Reported-by: @felicitia
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The image has a too old version of nettle which does not work with gnutls.
Just upgrade to the latest to make the error go away.
Signed-off-by: Adrian Reber <areber@redhat.com>
Newer versions of 'tail' rely on inotify and after a restore 'tail' is
unhappy with the state of inotify and just stops.
This replaces 'tail' with a minimal shell based test (thanks Andrei).
Signed-off-by: Adrian Reber <areber@redhat.com>
If ioctl(TIOCSLCKTRMIOS) fails with EPERM, it means that the CRIU
process lacks the CAP_SYS_ADMIN capability. But we can use
ioctl(TIOCGLCKTRMIOS) to *read* current ->termios_locked
value from the kernel and if it's the same as we already have
we can skip failing ioctl(TIOCSLCKTRMIOS) safely.
Adrian has recently posted [1] a very good patch to allow ioctl(TIOCSLCKTRMIOS)
for processes that have CAP_CHECKPOINT_RESTORE (right now it requires CAP_SYS_ADMIN).
[1] https://lore.kernel.org/all/20231206134340.7093-1-areber@redhat.com/
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
WARNINGS variable should be amended, not redefined.
We still need, e.g., `-Wno-dangling-pointer` to build
criu on loongarch64 with gcc13.
Signed-off-by: Ivan A. Melnikov <iv@altlinux.org>
Checkpoint/restore with version 25.0.0-beta.1 fails
with the following error:
$ docker start --checkpoint=c1 cr
Error response from daemon: failed to create task for container: content digest fdb1054b00a8c07f08574ce52198c5501d1f552b6a5fb46105c688c70a9acb45: not found: unknown
Release notes:
https://github.com/moby/moby/discussions/46816
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In the compel/arch/arm/plugins/std/syscalls/syscall.def, the syscall number of bind on ARM64 should be 200 instead of 235
Signed-off-by: Sally Kang <snapekang@gmail.com>
Two major highlights of this release:
* LoongArch64 support
* A lot of fixes and improvements from the Google backlog.
The full changelog can be found here: https://criu.org/Download/criu/3.19.
This marks the final release of the 3.x series. The upcoming version
will be 4.0! Additionally, the naming pattern will be changed. Any ideas
are welcome.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Newer versions of pip use an isolated virtual environment when building
Python projects. However, when the source code of CRIT is copied into
the isolated environment, the symlink for `../lib/py` (pycriu) becomes
invalid. As a workaround, we used the `--no-build-isolation` option for
`pip install`. However, this functionality has issues in some versions
of PIP [1, 2]. To fix this problem, this patch adds separate packages
for pycriu and crit, and each package is installed independently.
[1] https://github.com/pypa/pip/pull/8221
[2] https://github.com/pypa/pip/issues/8165#issuecomment-625401463
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
cgroup_ifpriomap test needs net_prio cgroup, which might not be
available. Make the .checkskip script check it.
Signed-off-by: Michał Mirosław <emmir@google.com>
At this point the correct position is already restored, so reading from
the fd results in the position being moved forward by 5 bytes.
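The effect can be reproduced with a temporary file. The sketch below only illustrates the difference between a plain read and a positional read; whether the actual fix uses pread() or re-seeks afterwards is an assumption not stated in this message, and the function name is hypothetical.

```python
import os
import tempfile


def positions_after_peek(data=b"0123456789", start=2, count=5):
    """Return the fd offset after a pread() and then after a read()."""
    with tempfile.TemporaryFile() as f:
        f.write(data)
        f.flush()
        fd = f.fileno()
        os.lseek(fd, start, os.SEEK_SET)  # the already restored position
        os.pread(fd, count, 0)            # positional read: offset untouched
        after_pread = os.lseek(fd, 0, os.SEEK_CUR)
        os.read(fd, count)                # plain read: offset moves by `count`
        after_read = os.lseek(fd, 0, os.SEEK_CUR)
    return after_pread, after_read
```

With the defaults, the offset stays at 2 after the positional read and ends up at 7 after the plain five-byte read.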
Fixes: 9191f8728d ("criu/files-reg.c: add build-id validation functionality")
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Eventpollentry's fields are set only when ret == 3 or ret == 6. The
remaining cases can be grouped together as an error.
Signed-off-by: Taemin Ha <taemin.ha@utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
line 131 checks if (ret >= 0). line 133 could be replaced by a simple else statement
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The condition meant to check fd2 instead of fd1, which is checked in
line 24.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The is_native field is a boolean. Therefore, the else if() can be
changed to a simple else{}.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This check is redundant as line 201 checks for this condition.
Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
read_ns_sys_file() can return an error, but we are trying to parse a
buffer before checking a return code.
CID 417395 (#3 of 3): String not null terminated (STRING_NULL)
2. string_null: Passing unterminated string buf to strtol, which expects
a null-terminated string.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
GCC's lto source:
> To avoid this problem the compiler must assume that it sees the
> whole program when doing link-time optimization. Strictly
> speaking, the whole program is rarely visible even at link-time.
> Standard system libraries are usually linked dynamically or not
> provided with the link-time information. In GCC, the whole
> program option (@option{-fwhole-program}) asserts that every
> function and variable defined in the current compilation
> unit is static, except for function @code{main} (note: at
> link time, the current unit is the union of all objects compiled
> with LTO). Since some functions and variables need to
> be referenced externally, for example by another DSO or from an
> assembler file, GCC also provides the function and variable
> attribute @code{externally_visible} which can be used to disable
> the effect of @option{-fwhole-program} on a specific symbol.
As far as I read gcc's source, ipa_comdats() will avoid placing symbols
that are either already in a user-defined section or have
externally_visible attribute into new optimized gcc sections.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The "ColumnLimit: 120" is not only allowing lines to be longer than 80
characters but it also forces line wrapping at 120 characters. If total
expression length is more than 120 characters, clang-format will try to
wrap it as close to 120 as it can, it would not even allow to wrap at 80
characters if we really want it. But as we all know 80 characters is
Linux kernel coding style default and as far as our coding style is
based on it it is really strange to prohibit wrapping lines at 80
characters...
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
One memfd can be shared by a few restored files. Only one of these files is
restored with a file created with memfd_open. The others are restored by reopening
memfd files via /proc/self/fd/.
It seems unnecessary for restoring memfd memory mappings. We can always use the
origin file.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (ret) {
^~~
amdgpu_plugin.c:988:8: note: uninitialized use occurs here
xfree(buffer);
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch adds the `libdrm-dev` package to the list of CRIU
dependencies installed in CI to build CRIU with amdgpu plugin.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It means CRIU has to close it when it is not needed.
It looks more logically correct and matches the behaviour of
the RESTORE_EXT_FILE callback.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently, most of the time we don't have problems with the VVAR segment and
lazy restore, because when the VDSO is parked there is an munmap call that
calls UFFDIO_UNREGISTER on the destination address.
But we don't want to enable userfaultfd for VDSO and VVAR in the first
place.
Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
Currently page_size() returns an unsigned int value that, after a
"bitwise not", is promoted to unsigned long, e.g. in uffd.c
handle_page_fault. Since the value is unsigned, the promotion fills the
upper bits with zeros, which results in the loss of the upper bits of
the page fault address. So make page_size() return unsigned long to
avoid this situation.
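The arithmetic can be emulated with explicit bit widths. This sketch assumes a mask expression of the form `~(page_size - 1)` and a 4096-byte page; the helper names and the sample address are made up for illustration.

```python
PAGE_SIZE = 4096
U32 = 0xFFFFFFFF
U64 = 0xFFFFFFFFFFFFFFFF


def page_mask_from_u32(page_size=PAGE_SIZE):
    # "Bitwise not" computed in 32 bits; zero-extension to 64 bits leaves
    # the upper 32 bits of the mask cleared.
    return ~(page_size - 1) & U32


def page_mask_from_u64(page_size=PAGE_SIZE):
    # "Bitwise not" computed in 64 bits; the upper bits of the mask are set.
    return ~(page_size - 1) & U64


addr = 0x7F1234567ABC
truncated = addr & page_mask_from_u32()  # upper address bits are lost
aligned = addr & page_mask_from_u64()    # page-aligned, upper bits kept
```

Only the 64-bit mask yields the page-aligned address that still identifies the faulting page.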
Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
When -- after restore -- sockets can't communicate, the test times out
while waiting on recvfrom(). Since the communication is local, send()
works instantaneously - so mark sockets with SOCK_NONBLOCK and report
failure if the message is not received immediately.
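A minimal sketch of the idea, not the test's actual code: a local AF_UNIX datagram pair stands in for the test sockets, setblocking(False) stands in for SOCK_NONBLOCK, and recv_immediately() is a hypothetical helper.

```python
import socket


def recv_immediately(sock, size=64):
    """Return a queued datagram, or None instead of blocking."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        return None


a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
a.setblocking(False)
b.setblocking(False)

nothing_yet = recv_immediately(b)  # nothing was sent: returns at once
a.send(b"ping")                    # local delivery completes inside send()
received = recv_immediately(b)

a.close()
b.close()
```

A missing message is then reported right away rather than after a timeout in recvfrom().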
Signed-off-by: Michał Mirosław <emmir@google.com>
This fixes a failure to clean up after a failed test, where CRIU didn't start properly.
```
===================== Run zdtm/transition/socket-tcp in h ======================
Start test
./socket-tcp --pidfile=socket-tcp.pid --outfile=socket-tcp.out
Traceback (most recent call last):
File ".../zdtm_py.py", line 1906, in do_run_test
cr(cr_api, t, opts)
File ".../zdtm_py.py", line 1584, in cr
cr_api.dump("dump")
File ".../zdtm_py.py", line 1386, in dump
self.__dump_process = self.__criu_act(action,
File ".../zdtm_py.py", line 1224, in __criu_act
raise test_fail_exc("CRIU %s" % action)
test_fail_exc: CRIU dump
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<embedded module '_launcher'>", line 182, in run_filename_from_loader_as_main
File "<embedded module '_launcher'>", line 34, in _run_code_in_main
File ".../zdtm_py.py", line 2790, in <module>
fork_zdtm()
File ".../zdtm_py.py", line 2782, in fork_zdtm
do_run_test(tinfo[0], tinfo[1], tinfo[2], tinfo[3])
File ".../zdtm_py.py", line 1922, in do_run_test
t.kill()
File ".../zdtm_py.py", line 509, in kill
os.kill(int(self.__pid), sig)
ProcessLookupError: [Errno 3] No such process
```
Signed-off-by: Michał Mirosław <emmir@google.com>
cgroup04 test needs full control over mem and devices cgroup hierarchies.
Make the test's .checkskip script better at detecting if the cgroups are
available for use.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make the errno values reported by cgroup04 always correct and show the
relevant parameters.
Constify constant strings, while at it.
Signed-off-by: Michał Mirosław <emmir@google.com>
At least in Google's VM environment, the kernel taints are unrelated to CRIU
runs. Don't fail tests if taints change, if kernel taints are ignored.
Signed-off-by: Michał Mirosław <emmir@google.com>
They break it with each kernel rebase. More details are here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257
Last time, it was fixed a few months ago and it has been broken again in
5.15.0-1046-azure.
Let's bind-mount the CRIU directory into a test container to make it
independent of a container file system.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The version of CRIU is specified in the Makefile.versions file.
This patch generates the '__version__' value for the pycriu module.
This value can be used by crit to implement `--version`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Power ISA 3.0 added a new syscall instruction. Kernel 5.9 added
corresponding support.
Add CRIU support to recognize the new instruction and kernel ABI changes
to properly dump and restore threads executing in syscalls. Without this
change threads executing in syscalls using the scv instruction will not
be restored to re-execute the syscall, they will be restored to execute
the following instruction and will return unexpected error codes
(ERESTARTSYS, etc) to user code.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The amdgpu plugin would create a memory buffer at the size
of the largest VRAM bo (buffer object). On some systems, VRAM
size exceeds RAM size, so the largest bo might be larger than
the available memory.
Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the
size of this buffer. By default, it is set to 0, and has no
effect. When active, any bo larger than its value will be
saved to/restored from file in multiple passes.
Signed-off-by: David Francis <David.Francis@amd.com>
Check membarrier registration both ways:
1. By issuing membarrier commands and checking if they succeed.
2. By issuing MEMBARRIER_CMD_GET_REGISTRATIONS.
The first way is needed for older kernels. The second way is needed to test
MEMBARRIER_CMD_GLOBAL_EXPEDITED.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
MEMBARRIER_CMD_GET_REGISTRATIONS can tell us whether or not the process used
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED unlike the old probing method.
Falls back to the old method when MEMBARRIER_CMD_GET_REGISTRATIONS is
unavailable.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
There are multiple cases where a good human-readable code block is
converted to an unreadable mess by clang-format, so we don't want to
rely on clang-format completely. Also there is no way, as far as I can
see, to make clang-format only fix what we want it to fix without
breaking something.
So let's just display hints inline where clang-format is unhappy. When
reviewer sees such a warning it's a good sign that something is broken
in coding-style around this warning.
We add a special script which parses the diff generated by indent and
generates a warning for each hunk.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is to highlight that code readability is the real goal of all the
coding-style rules. We should not do coding-style just for coding-style,
e.g. when clang-format suggests crazy formatting we should not follow it
if we feel it is bad.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Ctags is mentioned in the beginning of the "Edit the source code" which
is really confusing: Do you need ctags to edit CRIU code? - No. It is
just one helpful tool to browse the code, and we do not want to enforce
it. So, what is it doing in contribution guide? People who really need
it should be able to find it in Makefile or just write oneliner of their
own to collect tags...
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If there is only a single RW opened fd for a memfd, it can be used
to pass it to execveat() with AT_EMPTY_PATH to have its contents
executed. This currently works only for the original fd from
memfd_create(). For now we ignore processes that reopen the memfd's
rw and expect a particular executability trait of it. (Note: for
security purposes recent kernels have SEAL_EXEC to make memfds
non-executable.)
Signed-off-by: Michał Mirosław <emmir@google.com>
Plug a fd leak when returning error from check_pagemap().
(Cosmetic, as the process will exit soon anyway.)
Signed-off-by: Michał Mirosław <emmir@google.com>
This commit is introducing a test for the action-script functionality
of CRIU to verify that pre-dump, post-dump, pre-restore, pre-resume,
post-restore, post-resume hooks are executed during dump/restore.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Fix test of whether the kernel exposes page frame numbers to cope with the
possibility that the top of the stack is swapped out, which was happening
in about 1 out of 3 million runs. This led to a later failure when
trying to read the PFN of the zero page, after which criu would exit with
no error message.
Original-From: Ambrose Feinstein <ambrose@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
There is only one user of memfd_open() outside of memfd.c: open_filemap().
It is restoring a file-backed mapping and doesn't need nor expect to
update F_SETOWN nor the fd's position. Check the inherited_fd() handling
in the callers to simplify the code.
Signed-off-by: Michał Mirosław <emmir@google.com>
The 288d6a61e2 change broke all the syscall numbers.
Reported-by: Michał Mirosław <emmir@google.com>
Fixes: 288d6a61e2 ("loongarch64: reformat syscall_64.tbl for 8-wide tabs")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
While each preadv() is followed by a fallocate() that removes the data
range from image files on tmpfs, temporarily (between preadv() and
fallocate()) the same data is in two places; this increases the memory
overhead of restore operation by the size of a single preadv.
Uncapped preadv() would read up to 2 GiB of data, thus we limit that to
a smaller block size (128 MiB).
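A rough sketch of capping the read size, under stated assumptions: read_capped() is a hypothetical helper, os.pread() stands in for preadv(), and the fallocate() hole punching is left out.

```python
import os
import tempfile

MAX_BLOCK = 128 * 1024 * 1024  # block size named in the commit message


def read_capped(fd, offset, length, max_block=MAX_BLOCK):
    """Yield (offset, data) chunks, never requesting more than max_block."""
    end = offset + length
    while offset < end:
        data = os.pread(fd, min(max_block, end - offset), offset)
        if not data:
            break  # unexpected end of file
        yield offset, data
        offset += len(data)


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 10000)
    f.flush()
    # A tiny cap is used here only to make the chunking visible.
    chunk_sizes = [len(d) for _, d in read_capped(f.fileno(), 0, 10000, max_block=4096)]
```

At any moment at most one block of this size exists both in the image file and in the destination memory.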
Based-on-work-by: Paweł Stradomski <pstradomski@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Note: Silently drops MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED as it's
not currently detectable. This is still better than silently dropping
all membarrier() registrations.
Signed-off-by: Michał Mirosław <emmir@google.com>
The VMA_AREA_MEMFD constant was introduced with commit
29a1a88bce
memfd: add memory mapping support
This patch extends the status map used in CRIT and coredump with the
value of this constant to recognize it.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This change fixes the issue:
```
The following packages have unmet dependencies:
docker-ce : Depends: containerd.io (>= 1.6.4)
E: Unable to correct problems, you have held broken packages.
```
Signed-off-by: Andrei Vagin <avagin@google.com>
The log prefix "amdgpu_plugin:" is defined with `LOG_PREFIX` in
`amdgpu_plugin.c`. However, the prefix is also included in each
log message. As a result it appears duplicated in the log messages:
(00.044324) amdgpu_plugin: amdgpu_plugin: devices:1 bos:58 objects:148 priv_data:45696
(00.045376) amdgpu_plugin: amdgpu_plugin: Thread[0x5589] started
(00.167172) amdgpu_plugin: amdgpu_plugin: img_path = amdgpu-kfd-62.img
(00.083739) amdgpu_plugin: amdgpu_plugin : amdgpu_plugin_dump_file() called for fd = 235
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Make the scan use the order of paths that came from the user.
Fixes: 4f2e4ab3be ("irmap: add --irmap-scan-path option"; 2015-09-16)
Signed-off-by: Michał Mirosław <emmir@google.com>
Move tcp_cork() and tcp_nodelay() to the only user: page-xfer.c. While
at it, fix error messages (as they do not refer to restoring the sockopt
values) and demote them as they are not fatal to the page transfer.
Signed-off-by: Michał Mirosław <emmir@google.com>
In criu/apparmor.c: write_aa_policy(), the arg path is passed as a char
pointer. The original code used sizeof(path) to get the size of it,
which is incorrect as it always returns the size of the char pointer
(typically 8 or 4), not the actual capacity of the char array.
Given that this function is only invoked with path declared as `char
path[PATH_MAX]`, replacing sizeof(path) with PATH_MAX should correctly
represent the maximum size of it.
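The difference between the two sizes can be illustrated with ctypes. This is a sketch only: PATH_MAX is assumed to be 4096 as on Linux, and the variable names are made up.

```python
import ctypes

PATH_MAX = 4096  # assumed value (Linux)

# What the caller declares: char path[PATH_MAX]
path_array = (ctypes.c_char * PATH_MAX)()
# What the callee receives: char *path
path_pointer = ctypes.cast(path_array, ctypes.c_char_p)

array_size = ctypes.sizeof(path_array)      # capacity of the buffer
pointer_size = ctypes.sizeof(path_pointer)  # size of the pointer itself
print(array_size, pointer_size)
```

Inside the callee only the pointer size is visible, so the buffer capacity has to come from PATH_MAX or an explicit length argument.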
Fixes: 8723e3f ("check: add a feature test for apparmor_stacking")
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
memfd is created by default with +x permissions set. This can be changed
by a process using fchmod() and expected to prevent using this fd for
exec(). Migrate the permissions.
Signed-off-by: Michał Mirosław <emmir@google.com>
Include the file descriptor and error code in the debug message to make
it more useful.
Fixes: e7ba90955c (2016-03-14 "cr-check: Inspect errno on syscall failures")
Signed-off-by: Michał Mirosław <emmir@google.com>
prctl(NO_NEW_PRIVS) when set prevents child processes gaining
capabilities not in permitted set. In this case, inability to
clear capability from BSET that is not in the permitted set is
harmless.
Signed-off-by: Michał Mirosław <emmir@google.com>
When restoring on a kernel that has a different number of supported
capabilities than the checkpointed one, check that the extra caps are unset.
There are two directions to consider:
1) dump.cap_last_cap > restore.cap_last_cap
- restoring might reduce the processes' capabilities if restored
kernel doesn't support checkpointed caps. Warn.
2) dump.cap_last_cap < restore.cap_last_cap
- restoring will fill the extra caps with zeroes. No changes.
Note: `last_cap` might change without affecting `n_words`.
Signed-off-by: Michał Mirosław <emmir@google.com>
Skip calling setgroups() when the list of auxiliary groups already has
the values we want. This allows restoring into an unprivileged user
namespace where setgroups() is disabled.
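A minimal sketch of the check, assuming an order-insensitive comparison (CRIU's real comparison may differ). set_groups_if_needed() is a hypothetical helper, and the injectable callables exist only to make it easy to exercise.

```python
import os


def set_groups_if_needed(wanted, getgroups=os.getgroups, setgroups=os.setgroups):
    """Call setgroups() only if the supplementary groups differ from `wanted`.

    Returns True if setgroups() was called.
    """
    if sorted(set(getgroups())) == sorted(set(wanted)):
        return False  # nothing to change, so no privilege is needed
    setgroups(list(wanted))
    return True


# Asking for the groups we already have never reaches setgroups().
unchanged = set_groups_if_needed(os.getgroups())
```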
From: Ambrose Feinstein <ambrose@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
When CRIU is run with the task's credentials on restore, don't set uids
and gids. This avoids the need to modify the SECURE_NO_SETUID_FIXUP flag
which requires CAP_SETPCAP.
From: Andy Tucker <agtucker@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Note: This removes the difference in calling convention of
restore_file_perms() returning -errno that was the only call that did
this in the caller.
From: Radosław Burny <rburny@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Add generic wrappers for fchown() and fchmod() that skip the calls if
no changes are needed. This will allow unifying places where we can
avoid errors when no-op requests are not permitted.
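Such wrappers could look roughly like this. The sketch uses hypothetical names (fchmod_if_needed, fchown_if_needed) and is not CRIU's C implementation.

```python
import os
import stat
import tempfile


def fchmod_if_needed(fd, mode):
    """Call fchmod() only if the permission bits differ. True if called."""
    if stat.S_IMODE(os.fstat(fd).st_mode) == mode:
        return False
    os.fchmod(fd, mode)
    return True


def fchown_if_needed(fd, uid, gid):
    """Call fchown() only if the owner or group differs. True if called."""
    st = os.fstat(fd)
    if st.st_uid == uid and st.st_gid == gid:
        return False
    os.fchown(fd, uid, gid)
    return True


with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    first = fchmod_if_needed(fd, 0o640)   # mode differs: fchmod() is called
    second = fchmod_if_needed(fd, 0o640)  # already set: call is skipped
    st = os.fstat(fd)
    chown_called = fchown_if_needed(fd, st.st_uid, st.st_gid)
```

Skipping the no-op call matters where even a request that changes nothing would be rejected.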
Signed-off-by: Michał Mirosław <emmir@google.com>
NR_fstat is a deprecated syscall; some modern architectures such as
riscv and loongarch64 no longer support it. It is usually replaced by
NR_statx, which is supported since Linux 4.11.
Signed-off-by: znley <shanjiantao@loongson.cn>
Fixes: #2222
Fixes: f1c8d38 ("kerndat: check if setsockopt IPV6_FREEBIND is supported")
Signed-off-by: Yan Evzman <yevzman@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
These errors originate from the filesystem scanning in irmap.c and are mostly
benign. Nevertheless, if they do result in a failed irmap lookup, that failed
lookup is more interesting from an application perspective.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make logs about inaccessible mounts warnings, as the failures are
normally harmless (e.g. failure to read /dev/cgroup) and don't
make the CRIU run fail. (If it happens that the fsnotify can't
find a file, then to debug, full CRIU logs will be necessary anyway.)
Signed-off-by: Michał Mirosław <emmir@google.com>
Errors in early restore.log for status=1 from a subprocess are confusing,
esp. that they don't show what command failed. Since the result is
either ignored or logged anyway, mark the calls as "can fail".
Signed-off-by: Michał Mirosław <emmir@google.com>
This makes the error to mount cgroup hierarchy a bit less noisy:
Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'
Instead of
Error (criu/cgroup.c:623): cg: Unable to mount cgroup2 : Invalid argument'
Error (criu/cgroup.c:715): cg: failed walking /proc/self/fd/-1/zdtmtst for empty cgroups: No such file or directory'
Signed-off-by: Michał Mirosław <emmir@google.com>
This patch removes the code for Python 2 compatibility introduced
with commit e65c7b5 (zdtm: Replace imp module with importlib).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch is replacing the set_blocking() function with
os.set_blocking(). This function was introduced for compatibility with
Python 2 in commit 8094df8d (criu-ns: Add tests for criu-ns script).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit removes the checks for the Python 2 binary in the makefile
and makes sure that ZDTM tests always use python3. Since support for
Python 2 has been dropped, these checks are no longer needed.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This commit removes the dependency on the __future__ module, which was
used to enable Python 3 features in Python 2 code. With support for
Python 2 being dropped, it is no longer necessary to maintain backward
compatibility.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When building with pip version 20.0.2 or older, the pip install
command creates a temporary directory and copies all files from
./crit. This results in the following error message:
ModuleNotFoundError: No module named 'pycriu'
This error appears because the symlink 'pycriu' uses a relative path
('../lib/py/') that becomes invalid in the temporary directory.
The '--no-build-isolation' option for pip install is needed to enable
the use of pre-installed dependencies (e.g., protobuf) during build.
The '--ignore-installed' option for pip is needed to avoid an error when
crit is already installed. For example, crit is installed in the GitHub
CI environment as part of the criu OBS package as a dependency for
podman.
Distributions such as Arch Linux have adopted an externally managed
python installation in compliance with PEP 668 [1] that prevents pip
from breaking the system by either installing packages to the system or
locally in the home folder. The '--break-system-packages' [2] option
allows pip to modify an externally managed Python installation.
[1] https://peps.python.org/pep-0668/
[2] https://pip.pypa.io/en/stable/cli/pip_uninstall/
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch reverts changes introduced with the following commits:
4feb07020d
crit: enable python2 or python3 based crit
b78c4e071a
test: fix crit test and extend it
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch reverts changes introduced for Python 2 compatibility
in commits:
1c866db (Add new files for running criu-coredump via python 2 or 3)
3180d35 (Add support for python3 in criu-coredump).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We have disabled CentOS 7 tests in CI. This patch reverts the
changes introduced in the following commit:
24bc083653
ci: disable some tests on CentOS 7
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Instead of opening the image directly, the commit refactors the
asciinema image embedded link to redirect users to the corresponding
video.
Signed-off-by: Abhishek Guleri <abhishekguleri24@gmail.com>
With the parasite socket clash now guaranteed not to happen,
the comment becomes obsolete. netns is still needed though, so
update the comment to point at the requirement.
Change-Id: I3cfb253cd5c53b91b955fcb001530b4aee5129f4
Signed-off-by: Michał Mirosław <emmir@google.com>
Instead of relying on chance of CLOCK_MONOTONIC reading being unique,
use pid namespace ID that combined with the process ID will make it
unique on the machine level.
If pidns is not enabled on a kernel we'll get ENOENT, but then CRIU's
pid will already be unique. If there is some other error, log it but
continue, as the socket clash (if it happens) will result in a failed
run anyway.
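A sketch of the scheme under stated assumptions: criu_run_id() is a hypothetical helper, the pidns ID is taken as the inode of /proc/self/ns/pid, and the way the two numbers are combined here is illustrative, not CRIU's actual encoding.

```python
import os


def criu_run_id(pid=None, pidns_path="/proc/self/ns/pid"):
    """Combine the pid namespace ID with the pid into one machine-wide ID."""
    pid = os.getpid() if pid is None else pid
    try:
        ns_id = os.stat(pidns_path).st_ino
    except FileNotFoundError:
        ns_id = 0  # no pidns support: the pid alone is already unique
    except OSError as e:
        print("can't stat %s: %s, continuing" % (pidns_path, e))
        ns_id = 0
    return (ns_id << 32) | pid


print(hex(criu_run_id()))
```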
Fixes: 45e048d77a (2022-03-31 "criu: generate unique socket names")
Fixes: 408a7d82d6 (2022-02-12 "util: add an unique ID of the current criu run")
Change-Id: I111c006e1b5b1db8932232684c976a84f4256e49
Signed-off-by: Michał Mirosław <emmir@google.com>
If neither netns nor connections are dumped, nsid support is not used.
Don't fail the run in that case: if the support turns out to be needed,
the dumping process will fail later.
Change-Id: I39a086756f6d520c73bb6b21eaf6d9fb49a18879
Signed-off-by: Michał Mirosław <emmir@google.com>
kerndat_nsid() is not used outside kerndat.c. Make it static.
Change-Id: I52e518ecb7c627cc1866e373411b2be3f71a2c9d
Signed-off-by: Michał Mirosław <emmir@google.com>
If the error is ignored it is not important enough - make it a warning
instead.
From: Mian Luo <mianl@google.com>
Change-Id: If2641c3d4e0a4d57fdf04e4570c49be55f526535
Signed-off-by: Michał Mirosław <emmir@google.com>
Google's RPC client process is in a different pidns and has more privileges --
CRIU can't open its /proc/<pid>/fd/<fd>. For images_dir_fd to be useful here
it would need to refer to a passed or CRIU's fd.
From: Michał Cłapiński <mclapinski@google.com>
Change-Id: Icbfb5af6844b21939a15f6fbb5b02264c12341b1
Signed-off-by: Michał Mirosław <emmir@google.com>
New 'query-ext-files' action for `criu dump` is sent after
freezing the process tree. This allows deferring the gathering
of the external file list until the process tree is in a stable
state and avoids a race with the process creating and deleting
files.
Change-Id: Iae32149dc3992dea086f513ada52cf6863beaa1f
Signed-off-by: Michał Mirosław <emmir@google.com>
Container runtimes commonly use CRIU with RPC. However, this prevents
the use of action-scripts set in a CRIU configuration file due to the
explicit scripts mode introduced with the following commit:
ac78f13bdf
actions: Introduce explicit scripts mode
This patch enables container checkpoint/restore with action-scripts
specified via configuration file.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
nla_get_s32() was added to libnl 3.2.7 in 2015. Remove CRIU's definition
as it breaks build when statically linking the binary.
From: Uros Prestor <urosp@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
When trying to build CRIU with libbsd enabled the compilation fails due
to a duplicate definition of the __aligned macro. Other such definitions
are already wrapped with #ifndef. Make the __aligned definition consistent
and make it easier in the future to use the libbsd features if needed.
Signed-off-by: Michał Mirosław <emmir@google.com>
$LDFLAGS can contain `-Ldir`s that are required by '-lib's in $LIBS.
Reverse the order so that `-L` options take effect.
Signed-off-by: Michał Mirosław <emmir@google.com>
`make` without `-s` option will normally show the commands executed. In
the case of detecting build environment features current makefile will
cause detected features to be seen as 'echo #define' commands, but not
detected ones will be silent. Change it so that all tried features can
be seen (outside of make's silent mode) regardless of detection result.
Signed-off-by: Michał Mirosław <emmir@google.com>
The test for HAS_MEMFD is empty and not used. Remove it.
Fixes: 5ee1ac1f28 ("criu: remove FEATURE_TEST_MEMFD")
Change-Id: I43b8f0cfd50ce9bdf93dafb647377318df1deae8
Signed-off-by: Michał Mirosław <emmir@google.com>
During dump, CRIU stores the structs representing sockets in a statically sized
hashmap of size 32. We have some (admittedly crazy) tasks that use tens of
thousands of sockets, and seem to spend most of the dump time iterating over
the linked lists of the map.
16K is chosen arbitrarily, so that it reduces the lengths of the chains to few
elements on average, while not introducing significant memory overhead.
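As a rough illustration of the effect, assuming uniformly distributed hashes and a hypothetical task with 50,000 sockets:

```python
def average_chain_length(n_items, n_buckets):
    """Expected chain length with uniformly distributed hashes."""
    return n_items / n_buckets


sockets = 50_000  # hypothetical socket count
old = average_chain_length(sockets, 32)         # previous table size
new = average_chain_length(sockets, 16 * 1024)  # table size chosen here
print(old, round(new, 2))
```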
From: Radosław Burny <rburny@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
The fail() macro provides a new line character at the end of the
message. This patch fixes the following lint check that currently
fails in CI:
$ git --no-pager grep -E '^\s*\<(pr_perror|fail)\>.*\\n"'
test/zdtm/static/thp_disable.c: fail("prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c: fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c: fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);
test/zdtm/static/thp_disable.c: fail("post-migration prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c: fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c: fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);
Fixes: #2193
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Make it possible to skip network lock to enable uses that break connections
anyway to work without iptables/nftables being present.
Signed-off-by: Michał Mirosław <emmir@google.com>
Make it clear that the option numbers are indexes not the option
identifiers ("names"). Also show the value change that prompted test
failure.
Signed-off-by: Michał Mirosław <emmir@google.com>
We don't want test framework to change its behaviour on whether we
run a single or multiple tests in a run. When we shard the test suite
it can result in some shards having a single test to run and
unexpectedly change the test output format.
Signed-off-by: Michał Mirosław <emmir@google.com>
This commit revises the error handling in the fdspy test. Previously,
a failure case could have been incorrectly reported as successful because
of a specific check `pass != 0`, leading to potential false positives
when `check_pipe_ends()` returned `-1` due to a read/write pipe error.
To improve this, we've adjusted the error handling to return `0` in case
of any error. As such, the final success condition remains unchanged. This
approach will help accurately differentiate between successful and failed
cases, ensuring the output "All OK" is printed for success, and "Something
went WRONG" for any failure.
Fixes: 5364ca3 ("compel/test: Fix warn_unused_result")
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
Apparently Skylake uses init-optimization when saving FPU state, and ptrace()
returns XSTATE_BV[0] = 0 meaning FPU was not used by a task (in init state).
Since CRIU restore uses sigreturn to restore registers, FPU state is always
restored. Fill the state with default values on dump to make restore happy.
Signed-off-by: Michał Mirosław <emmir@google.com>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.
Fixes: bbbd597b41 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <emmir@google.com>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.
Signed-off-by: Michał Mirosław <emmir@google.com>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.
Signed-off-by: Michał Mirosław <emmir@google.com>
This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.
Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/
It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.
The first purpose of it is to fix infinite looping in
create_posix_timers on old pre-3.11 kernels.
The second purpose is to allow the kernel interface for creating posix
timers with a desired id to change from iterating with a predictable
next id to just setting the next id directly. If the predictable next id
is removed at the same time, criu with this patch will not get into an
infinite loop in create_posix_timers.
Thanks a lot to Thomas!
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The CentOS 7 CI environment uses Python 2. To execute the criu-ns
script in CentOS 7, the current shebang line has to be changed to
python.
This reverses the changes made in a15a63fce0
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes remove and update the changes introduced in
7177938e60 in favor of the
Python version in CI.
os.waitstatus_to_exitcode() function appeared in Python 3.9
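For older interpreters the wait status can be decoded by hand. The sketch below shows one possible fallback; it is not necessarily what criu-ns does, and waitstatus_to_exitcode_compat() is a hypothetical name.

```python
import os


def waitstatus_to_exitcode_compat(status):
    """Decode a wait status the way os.waitstatus_to_exitcode() does."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    raise ValueError("unexpected wait status: %r" % status)


# Prefer the standard function when the interpreter provides it (3.9+).
waitstatus_to_exitcode = getattr(os, "waitstatus_to_exitcode",
                                 waitstatus_to_exitcode_compat)

status = os.system("exit 3")  # on Unix this returns a wait status
exit_code = waitstatus_to_exitcode(status)
```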
Related to: #1909
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().
Related to: #1909
Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.
We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00000988 (2440)
On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008)
This increases the xsave area from one page to four pages.
Without this patch the fpu03 test fails, with this patch it works again.
Signed-off-by: Adrian Reber <areber@redhat.com>
The pipe_size type is unsigned int, so when the fcntl call fails and
returns -1, the value wraps around to a large positive number.
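The wraparound can be shown with ctypes. This is a sketch with hypothetical helper names; the second helper shows the general shape of checking the signed result before it is stored.

```python
import ctypes


def store_as_unsigned(ret):
    """Emulate assigning an int return value to an unsigned int."""
    return ctypes.c_uint(ret).value


def checked_pipe_size(ret):
    """Check the signed result first; None signals an fcntl() failure."""
    if ret < 0:
        return None
    return ret


wrapped = store_as_unsigned(-1)  # what an unchecked failure turns into
print(wrapped)
```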
Signed-off-by: zhoujie <zhoujie133@huawei.com>
The TOS (type of service) field in the IP header allows you to specify
the priority of the socket data.
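For reference, this is how an application sets and reads the value on Linux. It is a sketch: the choice of IPTOS_LOWDELAY (0x10) is only an example, and the value read back is the kind of state that has to be carried from dump to restore.

```python
import socket

IPTOS_LOWDELAY = 0x10  # example TOS value

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, IPTOS_LOWDELAY)
    tos = s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)

print(hex(tos))
```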
Signed-off-by: Suraj Shirvankar <surajshirvankar@gmail.com>
The highlight feature of this release is the ability to use CRIU for
non-root users. Adrian Reber implemented the kernel part and created the
initial version of CRIU changes. Then Younes Manton joined the effort
and pushed it to the finish line.
The full change log is here: https://criu.org/Download/criu/3.18
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We do kerndat_has_nspid in kerndat_init already and save result to
kerndat cache, we don't need to recheck it each time.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previously when tv_sec>=100, the line would look like this:
(269.189615 Error [...]
Now the last char is overwritten with ')'.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
In parse_pid_status there are 13 places where we do done++, so when
"done" is 13 it means that we have matched each of those 13 places and
we are ready to stop. In next lines we are not going to find anything.
So the right condition for the while loop is (done < 13).
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During the restore process, netlink fd uses the flags in the
NetlinkSkEntry structure to restore the file state, but during
the dump process, the flags value is not saved to the structure.
Signed-off-by: zhoujie <zhoujie133@huawei.com>
Signed-off-by: hejingxian <hejingxian@huawei.com>
Previously fixup was done before threads' registers were dumped so it
didn't actually work. This commit splits rseq fixup into thread leader
fixup and other threads fixup and applies them after the entities are
seized.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Kernel shouldn't clean up rseq_cs inside a critical section.
If rseq_cs has been cleaned up, it means there is a bug in migration.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
This patch adds concurrency groups to the CI workflows to automatically
cancel any in-progress workflows when a pull request has been updated.
A `concurrency` group ensures that only a single job or workflow
runs at a time. For example, when a pull request is updated with
a force-push, the GitHub CI workflows currently in progress will be
automatically cancelled, and the CI will run only with the updated
commits.
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#concurrency
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
- use exit_code instead of returning ret
- replace -errno return with -1
- move fallback to if (!kdat.sk_unix_file)
- fix readlinkat error checking (ret < 0 && ret >= PATH_MAX) by using
read_fd_link helper
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
As we now don't have any calls to free in this function we can replace
all labels with explicit returns.
While at it: Replace useless -errno and 1 returns with -1 as from the
very first implementation of unix_resolve_name (it changed name to _old
later) in [1] any non-zero return was treated as error.
6d785e6cd ("unix: resolve a socket file when a socket descriptor is
available") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
It is strange to free a pointer which is already in unix_sk_desc, either
on error path or on skip as we leave freed pointer in desc and it can
probably be used after free later and lead to some corruption. So I
would prefer not to free it as we don't have full control over it here.
Fixes: 6d785e6cd ("unix: resolve a socket file when a socket descriptor is available")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Fix cwd freeing on error path in get_cwd_check_perm and
on non-error-path in unix_fill_sock_name.
v2: use cleanup_free attribute in unix_fill_sock_name
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
First, let's move lookup_create_item-s to the end so that on pgid
replacement we don't have a false positive pstree_pid_by_virt check
finding an item created by sid replacement. (note: we need those
lookup_create_item-s for the sake of free pid selection mechanism)
Second, let's add checks for sid/pgid in images intersecting with
current_sid/pgid, as this would also bring problems on restore.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
In Virtuozzo tests we have seen uninformative errors:
(26.575039) 151187 fdinfo 6: pos: 0 flags: 2/0
(26.575076) sockets: Searching for socket 0x346d1 family 1
(666.230281 ----------------------------------------
(666.230586 Error (criu/cr-dump.c:1850): Dump files (pid: 151187) failed
with -1
So let's add some error messages to this stack.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
With this macro we can easily declare struct mntns_zdtm variables with
all lists properly initialized. Let's use it in mount_complex_sharing
as without it we can have segfault on error path when accessing
uninitialized list pointers.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Currently we only allow external fuse mount itself, let's allow
bindmount for it too. Other mount code is ready for this change and will
be able to bindmount it from corresponding external mount.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
When installing packages within Archlinux container, pacman fails with
the following errors:
(3/7) Creating temporary files...
/usr/lib/tmpfiles.d/journal-nocow.conf:26: Failed to replace specifiers in '/var/log/journal/%m': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:23: Failed to replace specifiers in '/run/log/journal/%m': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:25: Failed to replace specifiers in '/run/log/journal/%m': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:26: Failed to replace specifiers in '/run/log/journal/%m/*.journal*': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:29: Failed to replace specifiers in '/var/log/journal/%m': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:30: Failed to replace specifiers in '/var/log/journal/%m/system.journal': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:32: Failed to replace specifiers in '/var/log/journal/%m': No such file or directory
/usr/lib/tmpfiles.d/systemd.conf:33: Failed to replace specifiers in '/var/log/journal/%m/system.journal': No such file or directory
To solve this problem we need to initialize the machine ID.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch simplifies shell code that uses a 'cat' command to feed a single file to a program as input.
This is known as a Useless Use of Cat (UUOC).
It's more efficient to simply use redirection.
In some cases, even the redirection operator '<' is unnecessary.
Signed-off-by: KKrypt <sankalpacharya1211@gmail.com>
When we collect external mount namespace we don't want to dump mounts in
it, so let's remove this flag. This way we can e.g. use for_dump in
->parse() callbacks to separate in-container mounts from others.
This only affects rare case of `--ext-mount-map auto` but to be
absolutely correct let's fix it anyway.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The new field cg_set is currently marked as required, which causes a backward
compatibility problem when using a newer CRIU version to restore an image dumped
by an older version. This commit makes this field optional and reworks the
logic to fall back to using cg_set from task_core when it is not in
thread_core.
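The fallback can be sketched as follows. This is only an illustration in
Python, not the CRIU C code: the TaskCore/ThreadCore classes and
effective_cg_set() are hypothetical stand-ins for the protobuf entries, and
None models an optional field that is absent from the image.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskCore:
    # Legacy location of cg_set, present in images from older versions.
    cg_set: Optional[int] = None


@dataclass
class ThreadCore:
    # Per-thread location of cg_set, absent in images from older versions.
    cg_set: Optional[int] = None


def effective_cg_set(thread_core, task_core):
    """Prefer the per-thread value, fall back to the per-task one."""
    if thread_core.cg_set is not None:
        return thread_core.cg_set
    return task_core.cg_set
```

With an old image only TaskCore.cg_set is populated, so every thread ends up
with the task-wide value, which matches the behaviour before the field moved.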
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The new field is_threaded is currently marked as required, which causes a
backward compatibility problem when using a newer CRIU version to restore
an image dumped by an older version. This commit makes this field optional and
reworks the logic to skip fixing up threaded cgroup controllers if there
is no information in the dumped image.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The patch is similar to what has been done in linux kernel, as this
warning effectively prevents us from adding list elements to local list
head. See 49beadbd47
Else we have:
CC criu/mount.o
In file included from criu/include/cr_options.h:7,
from criu/mount.c:13:
In function '__list_add',
inlined from 'list_add' at include/common/list.h:41:2,
inlined from 'mnt_tree_for_each' at criu/mount.c:1977:2:
include/common/list.h:35:19: error: storing the address of local variable 'postpone' in
'((struct list_head *)((char *)start + 8))[24].prev' [-Werror=dangling-pointer=]
35 | new->prev = prev;
| ~~~~~~~~~~^~~~~~
criu/mount.c: In function 'mnt_tree_for_each':
criu/mount.c:1972:19: note: 'postpone' declared here
1972 | LIST_HEAD(postpone);
| ^~~~~~~~
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Setting all features supported by the CPU in xstate_bv may bring it into
dirty-upper-state as documented in specs, resulting in lower
performance. Let's not do this and set only those that have been used by
the dumpee.
P.S.
Of course it has to be a one-liner!
Fixes: #1171
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
This patch documents how we use `make lint` and `make indent` and
adds a note about their integration with CI.
Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Nothing serious since OS will close it anyway but still to be precise.
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
New message:
ERROR: Required file /usr/lib64/libcrypto.so.3.0.1 not found.
Exiting
Old message:
File "/home/criu/coredump/criu_coredump/coredump.py", line 693, in _gen_mem_chunk
f = open(fname, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib64/libcrypto.so.3.0.1'
Signed-off-by: Adrian Reber <areber@redhat.com>
This fixes errors with long command-lines:
File "/home/criu/coredump/criu_coredump/coredump.py", line 320, in _gen_prpsinfo
prpsinfo.pr_psargs = self._gen_cmdline(pid)
^^^^^^^^^^^^^^^^^^
ValueError: bytes too long (88, maximum length 80)
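A minimal sketch of where this error comes from, assuming that clipping the
command line to the field size is an acceptable fix (this log does not show
the actual change). The Prpsinfo class below is a cut-down hypothetical
stand-in for the real structure and set_psargs() is an illustrative helper:

```python
import ctypes

ELF_PRARGSZ = 80  # size of pr_psargs in the ELF prpsinfo note


class Prpsinfo(ctypes.Structure):
    # Only the field relevant here; the real structure has more members.
    _fields_ = [("pr_psargs", ctypes.c_char * ELF_PRARGSZ)]


def set_psargs(info, cmdline):
    # ctypes refuses to store more bytes than the array can hold,
    # so clip the command line to the field size first.
    info.pr_psargs = cmdline[:ELF_PRARGSZ]
```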
Signed-off-by: Adrian Reber <areber@redhat.com>
Refactor lib/py/images/images.py to reduce code duplication
by extracting repetitive code into helper functions and
private methods. This improves code readability and maintainability,
as well as reducing the risk of bugs caused by duplicated code.
Additionally, in Makefile, lib/py/images/images.py is added to the
list of files to run by flake8 during CI.
Fixes: #340
Signed-off-by: Kouame Behouba Manasse <behouba@gmail.com>
In a previous commit, we set the default runtime to runc and
"manage-cgroups" to ignore. We remove the installation script
for crun as it is not used with this test.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch disables the checkpoint/restore of cgroups for
the tests using Podman as a temporary workaround for
https://github.com/checkpoint-restore/criu/issues/2091
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This addresses Andrei comments from
https://github.com/checkpoint-restore/criu/pull/2064
- Add comment about '\n' fixing
- Replace ret with more self-explanatory is_read
- Print warnings if we failed to print comm for some reason
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
In Python 3 b'' == '' is False. This causes the info action to fail with
File "/usr/lib/python3.11/site-packages/crit-3.17-py3.11.egg/pycriu/images/images.py", line 178, in count
size, = struct.unpack('i', buf)
^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 4 bytes
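A minimal sketch of the pitfall, not the actual pycriu code; read_size() is
a hypothetical helper that reads one 4-byte size field:

```python
import io
import struct


def read_size(f):
    """Return the next 4-byte size field from a binary stream, or None at EOF."""
    buf = f.read(4)
    # In Python 3 a binary stream returns bytes, so end of file is b''.
    # A comparison with '' is always False, which lets the empty buffer
    # reach struct.unpack() and raise struct.error.
    if buf == b'':
        return None
    size, = struct.unpack('i', buf)
    return size


stream = io.BytesIO(struct.pack('i', 42))
print(read_size(stream))  # 42
print(read_size(stream))  # None
```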
Reported-by: Sankalp Acharya (@sankalp-12)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When an error happens at the file dumping stage, the only information about the
task we are dumping is its PID. For debugging purposes show the task's @comm early.
It proves useful when trying to understand which of the dumped applications
is "guilty" in a broken dump when the pid is not there anymore.
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If we build tags for our repo:
[criu]$ make tags
GEN tags
And then run codespell, we get an error:
[criu]$ codespell
./tags:3755: struc ==> struct
Let's exclude tags file from codespell search, this would add usability
to `make lint`.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The --ghost-fiemap option was introduced with #1963.
It enables an optimized algorithm based on fiemap ioctl that can reduce
the number of syscalls used to checkpoint highly sparse ghost files. This
option is enabled by default. It can be disabled with --no-ghost-fiemap
when using SEEK_HOLE/SEEK_DATA is preferred. In addition, an automatic
fallback to SEEK_HOLE/SEEK_DATA is used for filesystems that do not
support fiemap.
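For reference, the SEEK_HOLE/SEEK_DATA walk can be sketched as below. This
is a Python illustration of the general technique, not CRIU's C
implementation, and data_chunks() is a hypothetical name. Note that it costs
two lseek() calls per data region, which is what becomes expensive for
highly sparse files:

```python
import errno
import os


def data_chunks(fd):
    """Yield (offset, length) for each data region of the file behind fd."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # only a hole remains up to the end of the file
            raise
        # The end of the file counts as an implicit hole, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end - start
        pos = end
```

On filesystems without hole support the kernel reports the whole file as a
single data region, so the loop still terminates with correct (if coarse)
results.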
Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Just creates ipv4/ipv6 raw/dgram sockets with IP_PKTINFO and IP_FREEBIND
socket options enabled/disabled and checks that these options persist.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see systemd-resolved relying on these options, and after migration
the options are lost and systemd-resolved stops serving dns requests.
The socket options make kernel add cmsg with destination address to
packets, see more how systemd-resolved uses them:
00a60eaf5f/src/resolve/resolved-manager.c (L826)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The IP_FREEBIND option is supported for RAW sockets, why not save it
while we do this for other ip sockets anyway?
One difference is that for SOCK_RAW there is no fallback between
IP_FREEBIND and IPV6_FREEBIND, see:
ef4d3ea405/net/ipv6/ipv6_sockglue.c (L1497)
So let's have explicit IPV6_FREEBIND for ipv6.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If we can't access a map_files entry directly and instead have to follow
the link and access the file via a filesystem path we need to properly
deal with files on btrfs subvolumes.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
CAP_CHECKPOINT_RESTORE does not give access to /proc/$pid/map_files in
user namespaces. In order to test that CRIU in unprivileged mode can
dump and restore anonymous shared memory pages we will run the maps00
tests in a user namespace.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
If we don't have access to map_files and instead have to get the data
from /proc/$pid/mem we can close and reset the fd before passing it to
do_dump_one_shmem() which can then check it before trying to seek past
holes, eliminating the need for a separate seek_data_supported boolean.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This is done to follow 'Linux kernel coding style', same change was
added to .clang-format in linux kernel source recently:
d7f6604341
We don't change it in current code base but let's follow it in all
future uses.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Simplify code a bit: make exit codes of those functions more
transparent, rename ret to exit_code.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Checking errno in outer function is really strange, also saving errno of
mount syscall after calling pr_perror is completely wrong. So let's try
to simplify things.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see that when lint is called for push action git has only one last
commit which makes make indent with git-clang-format fail to operate.
Fix it by increasing fetch depth to one more commit.
Fixes: #2066
Fixes: d6db3333a ("clang-format: rework make indent to check specific commits")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If trying to open /proc/$pid/map_files/x-x for a given VMA fails with
EPERM (can happen in unprivileged mode when running in a non-init user
ns), fall back to reading the content from /proc/$pid/mem.
Co-authored-by: Ivanq <imachug@yandex.ru>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This patch sets VMA_AREA_REGULAR on hugetlb and anon shmem VMAs since
they can be handled the same way as other kinds of regular memory.
Co-authored-by: Ivanq <imachug@yandex.ru>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
We see that libbsd redefines __has_include to be always true, which
breaks such checks for rseq. The idea behind this patch is to put all
uses of libbsd functions to separate c files and only export wrapper
functions for them.
Using __setproctitle and __setproctitle_init everywhere in existing
code:
git grep --files-with-matches "setproctitle" | xargs sed -i 's/setproctitle/__setproctitle/g'
git grep --files-with-matches "setproctitle_init" | xargs sed -i 's/setproctitle_init/__setproctitle_init/g'
Fixes: #2036
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see that libbsd redefines __has_include to be always true, which
breaks such checks for rseq. The idea behind this patch is to remove the
use of libbsd functions and always export our replacement functions.
Using __strlcat and __strlcpy everywhere in existing code:
git grep --files-with-matches "strlcat" | xargs sed -i 's/strlcat/__strlcat/g'
git grep --files-with-matches "strlcpy" | xargs sed -i 's/strlcpy/__strlcpy/g'
Fixes: #2036
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.
Let's just use %s + strerror(errno) instead of %m with pr_* functions to
be explicit that errno to string transformation happens before calling
anything else.
Note: tcp_repair_off is called from pie with no pr_perror defined due to
CR_NOGLIBC set and if I use errno variable there I get "Unexpected
undefined symbol: `__errno_location'. External symbol in PIE?", so it
seems there is no way to print errno there, so let's just skip it.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This kernel feature contained some bugs initially. Those logs are useful in identifying what the
underlying issue is and which kernel patch to backport.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
This way we can check that mount tree topology (including sharing
groups) is the same before and after c/r.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Now we can compare mount tree and sharing group tree topology before and
after c/r with mntns_compare() helper.
Algorithm here is:
1) build mount tree based on mnt_id and parent_mnt_id from mountinfo
2) sort mount tree children based on path comparison
3) at the same time set topology_id for mounts by DFS order and order
mounts in list accordingly
4) build shared groups tree based on sharing_id and master_id
5) at the same time set topology_id for sharings as smallest topology_id
of its mounts, also sharings are put in their list in order of
their topology_id
6) walk sorted mounts lists for both namespaces simultaneously; each
pair of mounts should have matching ids and parent ids
7) walk sorted sharings lists for both namespaces simultaneously; each
pair of sharings should have matching ids and parent ids
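The mount part of the algorithm (steps 1-3 and 6) can be sketched in Python
as below. This is an illustration only: the real helper is C code in zdtm,
and mount_topology()/same_topology() and the tuple layout are hypothetical.
Sharing groups (steps 4, 5 and 7) would follow the same pattern and are
omitted. Mounts with identical paths are ordered by mnt_id here, which is an
arbitrary tie-break chosen for the sketch.

```python
from collections import defaultdict


def mount_topology(mounts):
    """mounts: list of (mnt_id, parent_mnt_id, path) tuples.

    Returns [(topology_id, parent_topology_id), ...] in DFS order.
    """
    ids = {m[0] for m in mounts}
    children = defaultdict(list)
    roots = []
    for mnt_id, parent_id, path in mounts:
        # A mount whose parent is not listed is the root of the tree.
        if parent_id in ids and parent_id != mnt_id:
            children[parent_id].append((path, mnt_id))
        else:
            roots.append((path, mnt_id))

    topo = {}  # mnt_id -> topology_id
    result = []
    stack = [(mnt_id, None) for _path, mnt_id in sorted(roots, reverse=True)]
    while stack:
        mnt_id, parent_topo = stack.pop()
        topo[mnt_id] = len(topo)
        result.append((topo[mnt_id], parent_topo))
        # Push children in reverse path order so they pop in path order.
        for _path, child in sorted(children[mnt_id], reverse=True):
            stack.append((child, topo[mnt_id]))
    return result


def same_topology(mounts_a, mounts_b):
    """Compare two mount namespaces regardless of their raw mnt_ids."""
    return mount_topology(mounts_a) == mount_topology(mounts_b)
```

Because the comparison is done on topology ids rather than mnt_ids, a
restored namespace with freshly allocated mount ids still compares equal as
long as the tree shape is unchanged.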
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
For mount testing it is nice to be able to parse mountinfo from zdtm
test itself, for instance to be able to compare mountinfo topology
before and after c/r, or for anything else. So let's add a helper
mntns_parse_mountinfo() which parses current mount namespace mountinfo.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Need it to use linux lists in zdtm.
Also copy container_of from compiler.h to zdtmtst.h like we already do
for e.g. __stack_aligned__ macro.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previously "make indent" checked all files in criu source directory for
coding style flaws. We have several problems with it:
- clang-format default format sometimes changes in new versions of the
package and we need to reformat all our code base each time it happens
- on different systems we may have different versions of clang-format
and on latest criu-dev "make indent" may be still unhappy on your system
- when we want to update clang-format rules ourselves we need to update
all our code base each time
- sometimes clang-format rules are not fitting all our cases, (e.g.: an
option IndentGotoLabels works nice for simple C code, but is a no go for
assembler and C macros) and putting "clang-format off" everywhere is a
mess
- sometimes we intentionally want to break clang-format rules (e.g.:
we want to put function arguments on a new line separating them
"logically" not "mechanically" following 120-char rule like clang-format
does).
This adds a BASE option for "make indent" where all commits in range
BASE..HEAD would be checked with git-clang-format for coding style
flaws. For instance when developing on top of criu-dev, one can use
"make BASE=origin/criu-dev indent" to check all their commits for
compliance with the clang-format rules. Default base is HEAD~1 to make
last commit checked when "make indent" is called. The closest thing to
the old behaviour would then be "make indent BASE=init", note that only
committed files would be checked.
Extra options to git-clang-format may be passed through OPTS variable.
Also reuse "make indent" in github lint workflow.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The command ./zdtm.py list currently fails with
if opts['rootless']:
~~~~^^^^^^^^^^^^
KeyError: 'rootless'
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
memory.kmem.limit_in_bytes has been deprecated. Look at e7c4184164f7
("memcg, kmem: further deprecate kmem.limit_in_bytes") for more details.
Signed-off-by: Andrei Vagin <avagin@google.com>
Restoring SO_MARK requires root or CAP_NET_ADMIN. If the value
is 0 we will avoid dumping it so that we don't need to do a
privileged call on restore.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
SO_SNDBUFFORCE/SO_RCVBUFFORCE require root or CAP_NET_ADMIN.
We can use SO_SNDBUF/SO_RCVBUF in some cases and avoid
needing elevated privileges.
This patch renames sk_setbufs() to sk_setbufs_ns() and
makes sk_setbufs() a general helper that sets socket
send and receive buffer sizes. The helper tries to use
SO_SNDBUFFORCE/SO_RCVBUFFORCE first and falls back to
SO_SNDBUF/SO_RCVBUF if we're in unprivileged mode.
The existing sk_setbufs_ns() which takes a pid parameter
and is intended to be called via userns_call() is rewritten
to call sk_setbufs().
Existing code that sets buffer sizes via setsockopt() is
modified to call sk_setbufs() instead.
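The try-privileged-then-fall-back idea can be sketched in Python as below.
This is not the CRIU helper: set_bufs() is a hypothetical name, the sketch
decides on the fallback by catching EPERM rather than by checking an
unprivileged-mode option, and the numeric constants are the asm-generic
Linux values, used only if the socket module does not export the names.

```python
import socket

# Linux values from asm-generic/socket.h (assumption: a common architecture).
SO_SNDBUFFORCE = getattr(socket, "SO_SNDBUFFORCE", 32)
SO_RCVBUFFORCE = getattr(socket, "SO_RCVBUFFORCE", 33)


def set_bufs(sk, sndbuf, rcvbuf):
    """Set send/receive buffer sizes, preferring the privileged options."""
    for force_opt, plain_opt, value in (
        (SO_SNDBUFFORCE, socket.SO_SNDBUF, sndbuf),
        (SO_RCVBUFFORCE, socket.SO_RCVBUF, rcvbuf),
    ):
        try:
            sk.setsockopt(socket.SOL_SOCKET, force_opt, value)
        except PermissionError:
            # No CAP_NET_ADMIN: the plain option still works, but the
            # kernel caps the value at the wmem_max/rmem_max sysctls.
            sk.setsockopt(socket.SOL_SOCKET, plain_opt, value)
```

The trade-off of the fallback is that a buffer size above the system-wide
limit cannot be restored exactly without privileges.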
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
ghost_multi_hole00 and ghost_multi_hole01 are tests which create a ghost file
with a lot of holes: every 8K of the file contains 4K of data and a 4K hole.
The only difference between them is ghost-fiemap option, 01 is a
test for the fiemap dumping algorithm, and we want to test the
behavior of EXTENT_MAX_COUNT part, so the file size should be 8M, thus there
will be 1024 chunks in the ghost file.
On some file systems, such as xfs, we cannot easily create a highly sparse
file as on ext4 or btrfs, therefore we need `fallocate` to forcibly create holes.
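The layout arithmetic can be checked with a small Python sketch. The names
are hypothetical, and placing the data in the first half of each 8K chunk is
an assumption, since the text above does not say which half holds the data.
The sketch leaves holes by simply not writing them and does not use
`fallocate`:

```python
CHUNK = 8 * 1024             # every 8K of the file ...
DATA = 4 * 1024              # ... holds 4K of data and a 4K hole
FILE_SIZE = 8 * 1024 * 1024  # 8M in total


def ghost_layout(file_size=FILE_SIZE):
    """Return the (offset, length) data extents of the test file."""
    return [(off, DATA) for off in range(0, file_size, CHUNK)]


def write_ghost(path, file_size=FILE_SIZE):
    with open(path, "wb") as f:
        f.truncate(file_size)  # start as one big hole
        for off, length in ghost_layout(file_size):
            f.seek(off)
            f.write(b"\xff" * length)  # fill only the data half


print(len(ghost_layout()))  # 1024
```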
Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
To reduce the number of system calls, based on
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/misc/create_inode.c#n519,
I created a new algorithm for dumping chunks via fiemap (copy_file_to_chunks_fiemap).
Also, I added another BOOL_OPT for users to determine which algorithm they
want to use. Moreover, for filesystems that do not support fiemap, criu
will fall back to the original algorithm (SEEK_HOLE/SEEK_DATA).
v2: don't call copy_chunk_from_file on outstanding extent; rearrange
headers to work around "redeclaration of ‘enum fsconfig_command’" problem
Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
This patch applies the changes required by clang-format v15.0.5
for `make indent`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The python3 package in Alpine has recently been updated to install
a symbolic link for /usr/bin/python.
https://git.alpinelinux.org/aports/commit/main/python3?id=d91da210b1614eb75517d59b7f348fee01699f35
This causes the following error in CI:
Step 10/11 : RUN ln -s /usr/bin/python3 /usr/bin/python
---> Running in a5a94be9dc93
ln: failed to create symbolic link '/usr/bin/python': File exists
The command '/bin/sh -c ln -s /usr/bin/python3 /usr/bin/python' returned a non-zero code: 1
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The way ShellCheck is installed was changed in commit c056f99
(ci/gha/lint: install a recent shellcheck) to use the latest version
v0.8.0 and remove some of the "shellcheck disable=..." annotations.
Since then, Fedora 37 has been released and the ShellCheck package
has been updated to v0.8.0.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
While building on a machine that has a HOL clang compiler,
I ran into warnings regarding the changed line. It appears
this warning is on by default because of anticipated changes
to the C standard.
Signed-off-by: Drew Wock <ajwock@gmail.com>
This patch adds a missing definition for `__nmk_dir` in the Makefile
for the amdgpu plugin. This definition is required, for example, when
building the `test_topology_remap` target:
make -C plugins/amdgpu/ test_topology_remap
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Zombie tasks are dumped in dump_zombies() so it is redundant to handle them
in dump_one_task().
Deprecate cg_set in task_core_entry as this field must be per thread now.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Some users on Raspberry Pi report that the kerndat checking for
memfd_create(MFD_HUGETLB) support returns ENOSYS even when memfd_create
syscall is available. We currently treat this error as unexpected and
return error. This commit marks the memfd_create(MFD_HUGETLB) as
unavailable when ENOSYS is returned.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
A previous commit added a cgroup cpuset unmounting to
scripts/ci/Makefile. We are sometimes running in a container without the
necessary privileges to unmount certain cgroups.
This commit moves the cgroup unmounting to a place in run-ci-tests.sh
which already requires privileged access and does not break unprivileged
build-only CI runs.
Signed-off-by: Adrian Reber <areber@redhat.com>
As cgroupv2_00 and cgroupv2_01 need cpuset in the cgroup-v2 hierarchy to check that CRIU
handles cgroup-v2 properly, umount cpuset in cgroup-v1 to make it move to
cgroup-v2.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
This test creates a process with 2 threads in different threaded controllers and
checks if CRIU restores these threads' cgroup controllers properly.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
As threads in a process may be in different threaded controllers, we need to
move those threads to the correct controllers.
Because the threads of a process are restored in later stage in restorer.c, we
need to create a cgroupd service to help to move those threads into correct
controllers when they are restored. We cannot use usernsd as the code in
restorer does not know the address of outside function to pass to userns_call.
However, this cgroupd service still reuses a lot of code from usernsd.
The main logic is that restored threads receive the cg_set number they belong to
before restorer stage in case their cg_set are different from main thread. When
these threads are restored, they send the cg_set number and their thread ids
through unix socket to cgroupd. cgroupd receives the cg_set number and thread
ids and moves those threads into correct controllers. Thread ids are sent
through SCM_CREDENTIALS of unix socket so they are translated into correct
thread ids in the receiving end.
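The credential passing can be sketched in Python with a unix socket pair.
This is only an illustration of SCM_CREDENTIALS, not the cgroupd code: the
function names are hypothetical, and the sketch sends the process id from
os.getpid() where the restorer would send a thread id.

```python
import os
import socket
import struct

UCRED = struct.Struct("iII")  # struct ucred: pid, uid, gid


def make_pair():
    thread_sk, daemon_sk = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    # The receiver must opt in to get SCM_CREDENTIALS control messages.
    daemon_sk.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    return thread_sk, daemon_sk


def send_cg_set(sk, cg_set):
    """Sender side: pass the cg_set number plus our credentials."""
    creds = UCRED.pack(os.getpid(), os.getuid(), os.getgid())
    sk.sendmsg([struct.pack("i", cg_set)],
               [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_cg_set(sk):
    """Receiver side: return (cg_set, sender id as seen by the receiver)."""
    data, ancdata, _flags, _addr = sk.recvmsg(4, socket.CMSG_SPACE(UCRED.size))
    cg_set, = struct.unpack("i", data)
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(cdata[:UCRED.size])
            return cg_set, pid
    raise RuntimeError("no credentials attached to the message")
```

The kernel fills in the id as seen from the receiver's pid namespace, so the
receiver does not need to know anything about the sender's namespace.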
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Currently, we assume all threads in a process are in the same cgroup controllers.
However, with threaded controllers, threads in a process may be in different
controllers. So we need to dump the cgroup controllers of every thread in a process
and fix up the procfs cgroup parsing to parse from self/task/<tid>/cgroup.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
This commit adds checkpoint/restore support for some new global properties in cgroup-v2:
cgroup.subtree_control
cgroup.max.descendants
cgroup.max.depth
cgroup.freeze
cgroup.type
Only cgroup.subtree_control and cgroup.type need extra handling code.
cgroup.subtree_control value needs to be set with "+", "-" prefix and
cgroup.type can only be written with value "threaded" if we want to make this
controller threaded. cgroup.type is a special property because this property
must be restored before any processes can move into this controller.
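The "+"/"-" prefix handling can be sketched as below. This is a Python
illustration with a hypothetical helper name, not the CRIU code: reading
cgroup.subtree_control yields bare controller names, while writing it
expects each name prefixed with "+" (enable) or "-" (disable).

```python
def subtree_control_cmd(current, wanted):
    """Build the string to write to cgroup.subtree_control.

    current/wanted are iterables of controller names, e.g. {"cpu", "memory"}.
    """
    current, wanted = set(current), set(wanted)
    enable = ["+" + c for c in sorted(wanted - current)]
    disable = ["-" + c for c in sorted(current - wanted)]
    return " ".join(enable + disable)


print(subtree_control_cmd([], ["memory", "cpu"]))  # +cpu +memory
```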
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
It seems like drone.io no longer provides free aarch64/armhf CI runs.
This switches the aarch64 CI runs to Cirrus CI. armhf CI runs have been
dropped for now as they are not directly supported.
Signed-off-by: Adrian Reber <areber@redhat.com>
Since commit 5563cabdde, user with
enough capability can open IPC sysctl files and write to them. Therefore, we
don't need to use usernsd process in the outside user namespace to help with
that anymore. Furthermore, some later commits:
1f5c135ee5,
0889f44e28 bind the IPC namespace to
the opened file descriptor of IPC sysctl at the open() time, the changed value
does not depend on the IPC namespace of write() time anymore. This breaks the
current usernsd approach.
So, we prioritize opening/writing IPC sysctl files in the context of restored
process directly without usernsd help. This approach succeeds in the newer
kernel since the restored process has enough capabilities at this restore stage.
With older kernels, the open() fails and we fall back to the usernsd approach.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
In Virtuozzo we've faced out-of-bound access when calling this function
on a short path string, which corrupted other memory and led to a
segmentation fault. So it may be useful to have this comment in code to
avoid such a misuse of this function in future.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
These are the minimal changes to make zdtm.py successfully run the
env00 and pthread test case as non-root using the '--rootless' zdtm option.
Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This adds the non-root section and information about the parameter
--unprivileged to the man page.
Co-authored-by: Anna Singleton <annabeths111@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Anna Singleton <annabeths111@gmail.com>
This patch modifies how kerndat is handled in unprivileged mode.
Initialization and functionality that can only be done as root is
made separate from common code. The kerndat file's location is
defined as $XDG_RUNTIME_DIR/criu.kdat in unprivileged mode. Since
we expect that directory to be on tmpfs we maintain the same behavior
as the root-mode kerndat which lives in /run.
Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This commit enables checkpointing and restoring of applications as
non-root.
First goal was to enable checkpoint and restore of the env00 and
pthread00 test case.
This uses the information from opts.unprivileged and opts.cap_eff to
skip certain code paths which do not work as non-root.
Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This adds the function check_caps() which checks if CRIU is running
with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability
CRIU needs to do a minimal checkpoint and restore from it.
In addition, helper functions are added to easily query for other
capabilities for enhanced checkpoint/restore support.
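The kind of check described above can be sketched in Python as below. This
is not the check_caps() implementation: the helper names are hypothetical
and the sketch parses the CapEff line of /proc/<pid>/status text instead of
using whatever interface CRIU uses. The capability numbers are the Linux
values.

```python
CAP_NET_ADMIN = 12
CAP_CHECKPOINT_RESTORE = 40


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def has_cap(cap_eff, cap):
    """Query a single capability bit in the effective mask."""
    return bool(cap_eff & (1 << cap))


def has_min_caps(cap_eff):
    """True if the minimum capability for checkpoint/restore is present."""
    return has_cap(cap_eff, CAP_CHECKPOINT_RESTORE)
```

Keeping the mask around (as the global opts structure does) lets later code
paths ask has_cap() for optional capabilities such as CAP_NET_ADMIN without
re-reading anything.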
Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The idea behind the rootless CRIU code is, that CRIU reads out its
effective capabilities and stores that in the global opts structure.
Different parts of CRIU can then, based on the existing capabilities,
automatically enable or disable certain code paths.
Currently at least CAP_CHECKPOINT_RESTORE is required. CRIU will not
start without this capability.
Signed-off-by: Adrian Reber <areber@redhat.com>
python2-future, python2-junit_xml, python-flake8 and libbsd-devel are
now provided from EPEL.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The ppc64le ABI allows functions to store data in caller frames.
When initializing the stack pointer prior to executing parasite code
we need to pre-allocate the minimum sized stack frame before
jumping to the parasite code.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Some ABIs allow functions to store data in caller frame, which
means that we have to allocate an initial stack frame before
executing code on the parasite stack.
This test saves the contents of writable memory that follows the stack
after the victim has been infected but before we start using the
parasite stack. It later checks that the saved data matches the
current contents of the two memory areas. This is done while the
victim is halted so we expect a match unless executing parasite code
caused memory corruption. The test doesn't detect cases where we
corrupted memory by writing the same value.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
return zero on chk success
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Starting the daemon is the first time we run code in the victim
using the parasite stack.
It's useful for testing to be able to infect the victim without starting
the daemon so that we can inspect the victim's state, set up stack
guards, and so on before stack-related corruption can happen.
Add compel_infect_no_daemon() to infect the victim but not start the
daemon and compel_start_daemon() to start the daemon after the victim
is infected.
Add compel_get_stack() to get the victim's main and thread parasite
stacks.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
In fact an array (aptly named array) is already used in run_test2,
so let's just make it an array right from the start.
While at it, remove ls invocation.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This basically replaces
for x in $(sed ...); do
with
sed ... | while IFS= read -r x; do
The only caveat is, sed program was amended to remove empty lines
(there was one right above the PB_AUTOGEN_STOP).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a preferred way of fixing SC2086 shellcheck warning.
Note that since ZDTM_OPTS is passed as a string (via make or docker),
we are converting it to an array using read -a.
Remove all "shellcheck disable=SC2086" annotations.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of using shellcheck v0.7.2 from fedora repo,
let's install the latest version (v0.8.0).
This allows removing some "shellcheck disable=..." annotations
and (I hope) improves checking quality overall.
While at it, remove findutils from dnf install as this package is
already installed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When we restore a shell job we inherit its ttys, so even if the container
has no matching mount for a tty on dump, the tty will simply be correct
on restore.
Otherwise, when dumping a second time via criu-ns, we get:
(00.005678) Error (criu/files-reg.c:1710): Can't lookup mount=29 for fd=0 path=/dev/pts/20
Fixes: #1893
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
When we are restoring in a new pidns we specifically do setsid() from
criu-ns init so that sids of restored tasks are non-zero in this pidns
and on the next dump CRIU would not have problems with zero sids, see [1].
But after this CRIU tries to inherit and set up a tty for the restored
process, and it fails to set its process group via TIOCSPGRP to be the
foreground group for its tty, because the tty is already a controlling
tty for another session (the one we had before setsid).
So to make restore work we need to reset the tty to be a controlling tty
of criu-ns init via TIOCSCTTY before calling criu.
Otherwise, when restoring for the first time via criu-ns (from a criu-ns
dump), we get:
Error (criu/tty.c:689): tty: Failed to set group 40816 on 0: Inappropriate ioctl for device
https://github.com/checkpoint-restore/criu/issues/232 [1]
v2: add why and what comment in code, set controlling tty only for
--shell-job and fail if stdin is not a tty.
Fixes: #1893
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
A recent change in glibc introduced `enum fsconfig_command` [1] and as a
result the compilation of criu fails with the following errors:
In file included from criu/pie/util.c:3:
/usr/include/sys/mount.h:240:6: error: redeclaration of 'enum fsconfig_command'
240 | enum fsconfig_command
| ^~~~~~~~~~~~~~~~
In file included from /usr/include/sys/mount.h:32:
criu/include/linux/mount.h:11:6: note: originally defined here
11 | enum fsconfig_command {
| ^~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:242:3: error: redeclaration of enumerator 'FSCONFIG_SET_FLAG'
242 | FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
| ^~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:12:9: note: previous definition of 'FSCONFIG_SET_FLAG' with type 'enum fsconfig_command'
12 | FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
| ^~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:244:3: error: redeclaration of enumerator 'FSCONFIG_SET_STRING'
244 | FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
| ^~~~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:14:9: note: previous definition of 'FSCONFIG_SET_STRING' with type 'enum fsconfig_command'
14 | FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
| ^~~~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:246:3: error: redeclaration of enumerator 'FSCONFIG_SET_BINARY'
246 | FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
| ^~~~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:16:9: note: previous definition of 'FSCONFIG_SET_BINARY' with type 'enum fsconfig_command'
16 | FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
| ^~~~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:248:3: error: redeclaration of enumerator 'FSCONFIG_SET_PATH'
248 | FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
| ^~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:18:9: note: previous definition of 'FSCONFIG_SET_PATH' with type 'enum fsconfig_command'
18 | FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
| ^~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:250:3: error: redeclaration of enumerator 'FSCONFIG_SET_PATH_EMPTY'
250 | FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
| ^~~~~~~~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:20:9: note: previous definition of 'FSCONFIG_SET_PATH_EMPTY' with type 'enum fsconfig_command'
20 | FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
| ^~~~~~~~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:252:3: error: redeclaration of enumerator 'FSCONFIG_SET_FD'
252 | FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
| ^~~~~~~~~~~~~~~
criu/include/linux/mount.h:22:9: note: previous definition of 'FSCONFIG_SET_FD' with type 'enum fsconfig_command'
22 | FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
| ^~~~~~~~~~~~~~~
/usr/include/sys/mount.h:254:3: error: redeclaration of enumerator 'FSCONFIG_CMD_CREATE'
254 | FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
| ^~~~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:24:9: note: previous definition of 'FSCONFIG_CMD_CREATE' with type 'enum fsconfig_command'
24 | FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
| ^~~~~~~~~~~~~~~~~~~
/usr/include/sys/mount.h:256:3: error: redeclaration of enumerator 'FSCONFIG_CMD_RECONFIGURE'
256 | FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
| ^~~~~~~~~~~~~~~~~~~~~~~~
criu/include/linux/mount.h:26:9: note: previous definition of 'FSCONFIG_CMD_RECONFIGURE' with type 'enum fsconfig_command'
26 | FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
This patch adds a definition for FSOPEN_CLOEXEC to solve this problem. In particular,
sys/mount.h includes an #ifndef check for FSOPEN_CLOEXEC surrounding `enum fsconfig_command`.
[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=7eae6a91e9b1670330c9f15730082c91c0b1d570
Reported-by: Younes Manton (@ymanton)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch changes top-level OpenJ9 filename and data references to Java
to make them generic and launches tests against both HotSpot and OpenJ9
JVMs.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Semeru builds (which use OpenJ9 instead of HotSpot) are the successors
of AdoptOpenJDK's OpenJ9 builds.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
We used to pull AdoptOpenJDK's OpenJ9 builds but switched to
Eclipse Temurin, which uses the HotSpot VM instead of OpenJ9.
Rename the corresponding Dockerfiles to hotspot.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The entry "build/" will ignore any directory named "build" at any level
of the source tree, including our scripts/build directory. We only want
to ignore the top-level build directory created by `make install`.
As the git manpage suggests, entries with slashes at the start or in the
middle will only match at the same level as the .gitignore, hence use
build/** instead.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This allows making test code more compact:
    if (ret == -1) {
        pr_perror("XXX");
        return 1;
    }
vs
    if (ret == -1)
        return pr_perror("XXX");
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Before this change, CRIU would just lose that data upon migration. So
it's better to fail migration in this case.
To reproduce the bug one can:
1. Create an AF_UNIX socket and call listen on it.
2. Create a second AF_UNIX socket and call connect to the first one.
3. Send data on the second socket.
4. Migrate.
5. Call accept on the first socket and then read. There would be no data
available.
It should even be possible to close the second socket before migration.
This would cause accept to hang because CRIU totally misses a closed
in-flight socket.
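As a rough illustration, the reproduction steps above can be sketched in Python
with the migration step left out. The helper name and the socket file name are
made up for this sketch; without a checkpoint/restore in between, the queued
data is expected to arrive intact, which is exactly what was lost after
migration.

```python
import os
import socket
import tempfile


def inflight_unix_data(payload: bytes) -> bytes:
    """Steps 1-3 and 5 of the reproducer; step 4 (migration) is omitted."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "listener.sock")  # hypothetical name

        # 1. Create an AF_UNIX socket and call listen on it.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(1)

        # 2. Connect a second socket. Nobody has called accept() yet,
        #    so the connection stays in flight.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)

        # 3. Queue data on the not-yet-accepted connection.
        client.sendall(payload)

        # 4. (A checkpoint/restore would happen here.)

        # 5. Accept on the first socket and read the queued data.
        conn, _ = listener.accept()
        data = conn.recv(len(payload))

        conn.close()
        client.close()
        listener.close()
        return data
```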
Signed-off-by: Michal Clapinski <mclapinski@google.com>
x86 uses a hardware breakpoint to speed up syscall tracing
instead of `ptrace(PTRACE_SYSCALL)`. arm64 has the same
capability according to <<Learn the architecture: Armv8-A self-hosted
debug>>[[1]].
<<Arm Architecture Reference Manual for A-profile architecture>>[[2]]
describes the usage in detail:
- D2.8 Breakpoint Instruction exceptions
- D2.9 Breakpoint exceptions
- D13.3.2 DBGBCR<n>_EL1, Debug Breakpoint Control Registers, n
Note:
[1]: https://developer.arm.com/documentation/102120/0100
[2]: https://developer.arm.com/documentation/ddi0487/latest
Signed-off-by: fu.lin <fulin10@huawei.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Breakpoints are used to stop as close as possible to a target system call.
First, we don't need them after this point.
Second, PTRACE_CONT can't pass through a breakpoint on arm64.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
When delivering system call traps, set bit 7 in the signal number (i.e.,
deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
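A minimal sketch of how a tracer could use this, assuming it already has the
stop signal extracted from the wait status (for example via os.WSTOPSIG()).
The helper name is hypothetical; the behaviour described above is what the
ptrace option PTRACE_O_TRACESYSGOOD provides.

```python
import signal

# Stop signal reported for syscall traps once bit 7 is set.
SYSCALL_TRAP = signal.SIGTRAP | 0x80


def is_syscall_stop(stop_signal: int) -> bool:
    """Return True if the stop was caused by a system call trap."""
    return stop_signal == SYSCALL_TRAP
```

A plain SIGTRAP (for example from a breakpoint) does not have bit 7 set, so
the two cases no longer look the same to the tracer.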
Signed-off-by: Andrei Vagin <avagin@gmail.com>
1. Rename CentOS 8 to CentOS Stream 8 (which it is).
2. Install junit_xml from the repo rather than via pip.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Mostly a copy-paste from the CentOS 8 task, with a few differences:
- Use dnf instead of yum
- Enable crb instead of powertools
- Different way of installing EPEL
- No need to switch to python3 as this is the default
- junit_xml is now available as an rpm
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is a race condition in docker/containerd that causes docker to
occasionally fail when starting a container from a checkpoint immediately
after the checkpoint has been created.
This problem is unrelated to criu and has been reported in
https://github.com/moby/moby/issues/42900
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Let's use the dynamic approach to detect the built-in *libc rseq in all cases,
and the "old" static approach as a fallback if the user's kernel
lacks support for the ptrace_get_rseq_conf feature.
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Before this patch we assumed that CRIU is compiled against
the same glibc as it runs with. But as we see from real-world
examples like #1935, that's not always true.
The idea of this patch is to detect the rseq configuration
for the main CRIU process and use it to unregister
rseq for all further child processes. This is correct
because we restore the pstree using clone*() syscalls and
don't use exec*() (!) syscalls, so rseq gets inherited
in the kernel and the rseq configuration remains the same
for all child processes.
This will prevent issues like this:
https://github.com/checkpoint-restore/criu/issues/1935
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
The result of check_aa_ns_dumping() is stored in kdat. Instead of doing
the same check twice - once on kerndat_init(), and again in
check_apparmor_stacking(), we can check the stored value.
Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The feature check for AppArmor stacking was introduced in
commit:
8723e3f998
check: add a feature test for apparmor_stacking
However, on systems that don't support AppArmor, this check always
fails. As a result, `criu check --all` shows the following message:
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
Reported-by: André Rösti (@andrej)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In commits [1, 2] the version of containerd installed by default in the
GitHub CI virtual environment was replaced with the latest release from
GitHub as a workaround to a bug in containerd. This bug was fixed
some time ago and the current default version of containerd (1.6.6) does
not require this workaround. However, with the latest release, the
containerd binaries uploaded on GitHub have been built for Ubuntu 22.04
[3]. Our tests are still running on Ubuntu 20.04 and this results in the
following error:
/usr/bin/containerd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/bin/containerd)
/usr/bin/containerd: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/bin/containerd)
[1] 046cad8
[2] 81a68ad
[3] 6b2dc9a37
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
There are several changes in glibc 2.36 that make sys/mount.h header
incompatible with kernel headers:
https://sourceware.org/glibc/wiki/Release/2.36#Usage_of_.3Clinux.2Fmount.h.3E_and_.3Csys.2Fmount.h.3E
This patch removes conflicting includes for `<linux/mount.h>` and
updates the content of `criu/include/linux/mount.h` to match
`/usr/include/sys/mount.h`. In addition, inline definitions of sys_*()
functions have been moved from "linux/mount.h" to "syscall.h" to
avoid conflicts with `uapi/compel/plugins/std/syscall.h` and
`<unistd.h>`. The include for `<linux/aio_abi.h>` has been replaced
with a local include to avoid conflicts with `<sys/mount.h>`.
Fixes: #1949
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
We need to pass environment variables from the CI environment to
distinguish between CI environments. However, when `sudo -E` is
used to run Podman, the XDG_RUNTIME_DIR environment
variable is set incorrectly, which prevents Podman from running.
This patch fixes the following error in the GitHub Action virtual
environment:
error running container: error from /usr/bin/crun creating
container for [/bin/sh -c /bin/prepare-for-fedora-rawhide.sh]:
sd-bus call: Connection reset by peer
Fixes: #1942
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
I've been contributing to CRIU for some time and I'm hoping that my
familiarity with the project would be sufficient to self-nominate as a
maintainer. I would like to help with code reviews, submitting patches,
implementing new features, and maintaining the project in general.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
ghost_holes_large00 is a test which creates a large ghost sparse file with a 1GiB
hole (pwrite can only handle 2GiB maximum on 32-bit systems) and 8KiB of data; criu
should be able to handle this kind of situation.
ghost_holes_large01 is a test which creates a large ghost sparse file with a 1GiB
hole and 2MiB of data; since 2MiB is larger than the default ghost_limit (1MiB),
criu should fail on this test.
v2: fix overflow on 32-bit arch.
Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
unlink_largefile test
In the past, the unlink_largefile test was expected to fail on a large ghost file.
However, it uses a sparse file, so it passes with current criu now that the large
ghost sparse file issue is fixed.
So the crfail flag of this test should be removed.
Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
files-reg.c checks whether the file size is larger than ghost_limit using st_size
(in dump_ghost_remap), which cannot deal with large ghost sparse files, since
the space such a file actually occupies is not what st_size shows.
Therefore, this commit replaces st_size with st_blocks, which reflects the
space actually allocated (1 block = 512B), so criu can deal with large ghost
sparse files.
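The difference between the two stat fields can be seen with a small Python
sketch. The helper names are invented for illustration and this is not the
code in files-reg.c; it only shows that a file with a hole reports the hole in
st_size while st_blocks * 512 counts allocated space.

```python
import os


def make_sparse(path: str, hole: int, data: bytes) -> None:
    """Create a file consisting of a hole followed by some real data."""
    with open(path, "wb") as f:
        f.seek(hole)   # skipping ahead leaves a hole on most filesystems
        f.write(data)


def apparent_and_allocated(path: str):
    """Return (st_size, st_blocks * 512) for the given file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units regardless of the fs block size.
    return st.st_size, st.st_blocks * 512
```

On a filesystem that supports holes, a file made with a 1GiB hole and 8KiB of
data would report an apparent size slightly over 1GiB but an allocated size
close to 8KiB, so only the latter is a meaningful value to compare against
ghost_limit.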
Signed-off-by: Liang-Chun Chen <featherclc@gmail.com>
This test specifically wants to create an external bind-mount of "/" from
criu mntns to test mntns, and it wants "/" in criu mntns to be a shared
mount so that the "external" mount in the test mntns is its slave. This is
to trigger the specific dirname() resolution which happens only when sharing
restore is involved for external mounts, and only if rootfs is involved.
But initially I missed that when we create the external mount in the test's
temporary mntns it creates a propagation in criu mntns on top of the root
mount. This mount may influence the restore of other tests, as a child mount in
the root mount converts to a locked child mount in criu service mntns (for uns
flavour), and when criu restores the root container mount it fails
with EINVAL on a non-recursive bind with locked children.
To fix this mess we just need to prohibit propagation from the test's
temporary mntns to criu mntns by making mounts slave.
Fixes: #1941
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If root mount in criu mntns is slave, it would be slave of host mount
where criu is stored, so if someone mounts something in subdir of
{criu-dir}/test/ on host while tests are running this mount can
influence the test as it appears on top of root mount in criu mntns.
1) With mount-compat this mount can get into restored test mntns, which
means wrong restore, as this mount was not there on dump.
2) With mount-v2 this mount would just fail container restore, as root
container mount is mounted non-recursively to protect from unexpected
mounts appearing after restore.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
On Arch Linux with 5.18.3-zen1-1-zen kernel, the vdso's size is 3 pages which
exceeds the current 2-page reserved buffer. This commit simply increases the
reserved buffer size to 4 pages.
Fixes: https://github.com/checkpoint-restore/criu/issues/1916
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Normally, vsyscall vma has VM_READ, VM_EXEC permission. However, when
CONFIG_LEGACY_VSYSCALL_XONLY=y, that vma only has VM_EXEC. This commit removes
the permission part when checking to skip vsyscall vma in x32 tests.
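A sketch of the idea in Python, assuming the check works on lines in the
/proc/<pid>/maps format. The function name and the sample lines in the usage
note are illustrative only and are not taken from the test sources.

```python
def is_vsyscall(maps_line: str) -> bool:
    """Identify the vsyscall vma by name only, ignoring its permissions."""
    fields = maps_line.split()
    # Format: address perms offset dev inode [pathname]
    return len(fields) >= 6 and fields[5] == "[vsyscall]"
```

With this check both "ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]"
and the execute-only variant with "--xp" are skipped, while other mappings
such as "[stack]" are not.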
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Error from:
./test/zdtm.py run -t zdtm/static/fpu00 --fault 134 -f h --norst
(00.003111) Dumping GP/FPU registers for 56
(00.003121) Error (compel/arch/x86/src/lib/infect.c:310): Corrupting fpuregs for 56, seed 1651766595
(00.003125) Error (compel/arch/x86/src/lib/infect.c:314): Can't set FPU registers for 56: Invalid argument
(00.003129) Error (compel/src/lib/infect.c:688): Can't obtain regs for thread 56
(00.003174) Error (criu/cr-dump.c:1564): Can't infect (pid: 56) with parasite
See also:
145e9e0d8c6 ("x86/fpu: Fail ptrace() requests that try to set invalid MXCSR values")
145e9e0d8c
We decided to move away from the mxcsr clean-up scheme and use the mxcsr mask
(0x0000ffbf) as the kernel does. Thanks to Dmitry Safonov for pointing this out.
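For illustration, applying the mask is a single bitwise AND. The sketch below
uses a hypothetical helper name and only demonstrates the arithmetic: the mask
clears the upper 16 bits and bit 6, and leaves a value such as the default
MXCSR (0x1f80) untouched.

```python
MXCSR_MASK = 0x0000FFBF  # value quoted in the commit message


def mask_mxcsr(value: int) -> int:
    """Drop MXCSR bits that are outside the mask."""
    return value & MXCSR_MASK
```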
Tested-on: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
Reported-by: Mr. Jenkins
Suggested-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
1. For some reason, the Mariner distribution headers
do not correctly define the __GLIBC_HAVE_KERNEL_RSEQ
compile-time constant. It remains undefined,
but in fact the header files provide the corresponding
rseq type declarations, which leads to a conflict.
2. Another issue is that they use uint*_t types
instead of __u* types as in the original rseq.h.
This leads to compile-time issues like this:
format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type 'uint64_t' {aka 'long unsigned int'}
and we can't even replace %llx with %PRIx64 because it will break
compilation on other distros (like Fedora) with an analogous error:
error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘__u64’ {aka ‘long long unsigned int’}
Let's use our own copy of struct rseq, fully equal to the kernel one;
it's safe because this structure is a part of the Linux kernel ABI.
Fixes: #1934
Reported-by: Nikola Bojanic
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Add a simple test using tail to check that processes can't be restored
by default when the r/w/x mode of an open file changes, unless
--skip-file-rwx-check is used.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
A file's r/w/x changing between checkpoint and restore does
not necessarily imply that something is wrong. For example,
if a process opens a file having perms rw- for reading and
we change the perms to r--, the process can be restored and
will function as expected.
Therefore, this patch adds an option
--skip-file-rwx-check
to disable this check on restore. File validation is unaffected
and should still function as expected with respect to the content
of files.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
stopped03 checks that tasks stopped by SIGTSTP are restored correctly.
stopped04 checks that tasks stopped by SIGSTOP which have SIGTSTP blocked and
pending are restored correctly.
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image, save it only if a task is in the stopped state.
Restore task state by sending desired stop signal if it is present
in the image. Fall back to SIGSTOP if it's absent.
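The fallback rule can be sketched as follows. The dictionary stands in for a
decoded image entry and the "stop_signo" key is a hypothetical name chosen for
this example, not necessarily the field name used in the image format.

```python
import signal


def stop_signal_for(core_entry: dict) -> int:
    """Pick the signal used to put a restored task back into the stopped state."""
    signo = core_entry.get("stop_signo")  # hypothetical field name
    # Older images have no such field: fall back to SIGSTOP.
    return signal.SIGSTOP if signo is None else signo
```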
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Else we trigger BUG in task_reset_dirty_track():
Error (criu/mem.c:45): BUG at criu/mem.c:45
The check in kerndat_get_dirty_track() does not work right.
https://github.com/checkpoint-restore/criu/issues/1917
Reported-by: @mrc1119
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Currently, the content of anonymous private hugetlb mapping is dumped in 2
different images: memfd approach and normal private mapping dumping. In memfd
approach, we dump the content of the backing pseudo file (/anon_hugepage). This
is incorrect and redundant since the mapping is private, the content of backing
file may differ from the content of the mapping. With this commit, we remove the
redundant memfd approach dump and only do the normal private mapping dump on
anonymous hugetlb mapping.
Run zdtm.py run -f h --keep-img always -t zdtm/static/maps09, du -h in the
dumped image directory
Before this commit
13M test/dump/zdtm/static/maps09/55/1
After this commit
8.5M test/dump/zdtm/static/maps09/55/1
The reduction in size is approximately 4MB which is the size of anonymous
private hugetlb mapping in the test.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Before this patch, if we had a unixsk with incoming scm packets (with
fds) and with the sender side fd closed, we got an error:
Error (criu/sk-unix.c:1125): unix: Can't find sender for 0x1e
First part of the problem is that unix_note_scm_rights() expects to see
a "queuer" which would send scm packets to the unixsk, and there is
none as the sender side is closed.
Second part of the problem is that we already have "fake" queuers
feature so that it already creates a unix socket pair and leaves other
end open for later queuing packets. But function add_fake_unix_queuers()
is called after unix_note_scm_rights() thus there is no chance to find
queuer at the point of failure.
Third part is that when we look for a queuer in find_queuer_for() we
actually look for a socket for which we are a queuer and not for the
socket which is a queuer for us, which is opposite to the name. For
cases where both ends are alive both are queuers for each other so this
was not important, but for our closed sender case it breaks.
So let's reorder add_fake_unix_queuers() before unix_note_scm_rights()
and make find_queuer_for() actually do what its name implies.
This situation started to reproduce on Virtuozzo start/stop tests
with the unixsk belonging to systemd, we suppose that this state where
the sender fd side is closed happens rarely only on systemd start/stop,
so we don't see it in regular suspend resume of long-living containers.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The criu-ns script incorrectly compared the pidns fd with the mntns fd; fix that.
Also reverse the condition in the is_my_namespace function to align it
with the function name.
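A minimal sketch of a helper whose result matches its name, assuming that two
namespace references point at the same namespace when they resolve to the same
device and inode. This is an illustration under that assumption, not the exact
code from the criu-ns script.

```python
import os


def is_my_namespace(fd: int, ns: str) -> bool:
    """Return True if fd refers to the calling process's own `ns` namespace."""
    mine = os.stat("/proc/self/ns/%s" % ns)
    other = os.fstat(fd)
    return (mine.st_dev, mine.st_ino) == (other.st_dev, other.st_ino)
```

The caller has to pass the namespace type that matches the fd: an fd opened on
a pid namespace should be checked against "pid", not "mnt".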
Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
As private hugetlb mappings are not pre-mapped, the content of them is restored
in the restorer which cannot use page_read->read_pages. As a result, we
cannot recursively read the content of pre-dumped image in the parent directory
and use preadv to read the content from the last dumped image only. Therefore,
it may freeze while restoring when the content of mapping is in pre-dumped image
in parent directory.
We need to skip pre-dumping on hugetlb mappings to resolve the issue.
Suggested-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
It can be confusing to see an error from the post-dump action script and a
non-zero return from criu while at the same time seeing "Dumping finished
successfully" in the log. I believe it is logical to consider the post-dump
action script a part of the "dump" process, so a failure in it means that
the whole dump failed.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Fixes for pre-dump read mode
* Fixes for mount-v2
* amdgpu plugin build and installation fixes
* Some minor CI related fixes
Signed-off-by: Adrian Reber <areber@redhat.com>
This test has one external mount [criumntns] /zdtm_root_ext.tmp ->
[testmntns] /mnt_root_ext.test, and it specifically gives '--external
mnt[MNT]:.zdtm_root_ext.tmp' option on restore without '/' to make
dirname on it return static '.' path (see glibc dirname() code) and
reproduce a segfault in resolve_mountpoint().
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Else we have a Segmentation fault in __move_mount_set_group() on
xfree(source_mp) if resolve_mountpoint() returned statically allocated
path.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
It's a problem when, while restoring a sharing group, we need to copy
sharing between two mounts with non-intersecting roots, because the kernel
does not allow it.
We have a case https://github.com/opencontainers/runc/pull/3442, where
runc adds different devtmpfs file-bindmounts to container and there is
no fsroot mount in container for this devtmpfs, thus mount-v2 faces the
above problem.
Luckily for the case of external mounts which are in one sharing group
and which have non-intersecting roots, these mounts likely only have
external master with no sharing, so we can just copy sharing from
external source and make it slave as a workaround.
https://github.com/checkpoint-restore/criu/issues/1886
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This helper restores master_id and shared_id of first mount in the
sharing group. It first copies sharing from either external source or
internal parent sharing group and makes master_id from shared_id. Next
it creates new shared_id when needed.
All other mounts except first are just copied from the first one.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Building the criu packages for Ubuntu/Debian fails with:
mkdir: cannot create directory '/var/lib/criu': Permission denied
This patch updates PLUGINDIR with the value /usr/lib/criu
Fixes: #1877
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When building packages for CRIU the source directory might have a
name different than 'criu'.
Fixes: #1877
Reported-by: @siris
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* handle unexpected errors of process_vm_readv
* adjust riovs in analyze_iov
* call handle_faulty_iov only if process_vm_readv returns EFAULT.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
But actually, 5a92f100b8 probably has to be reverted as a whole.
PIPE_MAX_SIZE is the hard limit to avoid PAGE_ALLOC_COSTLY_ORDER
allocations in the kernel. But F_SETPIPE_SZ rounds up a requested pipe
size to a power-of-2 pages. It means that when we request PIPE_MAX_SIZE
that isn't a power-of-2 number, we actually request a pipe size greater
than PIPE_MAX_SIZE.
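The rounding can be illustrated with a small calculation. The function name
and the 4096-byte page size default are assumptions made for this sketch, and
the real value of PIPE_MAX_SIZE is deliberately not used here.

```python
def effective_pipe_size(requested: int, page_size: int = 4096) -> int:
    """Size a pipe ends up with if the request is rounded up to a
    power-of-2 number of pages, as described above."""
    pages = max(1, -(-requested // page_size))   # ceiling division
    pow2_pages = 1 << (pages - 1).bit_length()   # next power of two >= pages
    return pow2_pages * page_size
```

For example, with a limit of 3 pages a request for exactly that limit would be
rounded up to 4 pages, which is above the limit.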
Fixes: 5a92f100b8 ("page-pipe: Resize up to PIPE_MAX_SIZE")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Name collision with an abandoned project named 'crit' in pypi causes pip
to show crit (CRiu Image Tool) as outdated. This patch updates crit to
use the same version and license as criu.
Fixes: #1878
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Amongst a huge number of fixes all over the place this release introduces:
* mount-v2 engine
* support for MAP_HUGETLB mappings
* support for Linux Restartable Sequences
* support for SOCK_SEQPACKET unix sockets
* CRIU AMD GPU plugin
* setsockopt(SO_BUF_LOCK) support for tcp sockets
Signed-off-by: Adrian Reber <areber@redhat.com>
Currently we check memfd_hugetlb by doing memfd_create("", MFD_HUGETLB).
If we see EINVAL we report that it's not supported, but we can also
get ENOENT error in such case in hugetlb_file_setup() while trying
to find proper hugetlbfs mount.
Reference:
06fb4ecfea/fs/hugetlbfs/inode.c (L1465)
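A sketch of such a probe in Python, assuming os.memfd_create() and
os.MFD_HUGETLB are available in the running interpreter. The helper names are
hypothetical; the only point is that both EINVAL and ENOENT are treated as
"not supported" while any other error is still reported.

```python
import errno
import os


def means_unsupported(err: int) -> bool:
    """errno values that indicate memfd + hugetlb is simply not available."""
    return err in (errno.EINVAL, errno.ENOENT)


def memfd_hugetlb_supported() -> bool:
    flag = getattr(os, "MFD_HUGETLB", None)
    if flag is None or not hasattr(os, "memfd_create"):
        return False  # this Python build cannot even ask the kernel
    try:
        fd = os.memfd_create("", flag)
    except OSError as e:
        if means_unsupported(e.errno):
            return False
        raise  # unexpected failure: do not hide it
    os.close(fd)
    return True
```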
Fixes: 4245e6b02f ("check: Add a check for using memfd with hugetlb")
Reported-by: Mr. Jenkins (ppc64le)
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
GitHub Actions comes with criu pre-installed in /usr. Configure scripts
looking for CRIU will pick up the pre-installed version in /usr if we do
not also install the CI criu in /usr.
Signed-off-by: Adrian Reber <areber@redhat.com>
The bind_on_deleted() return code is only used for setting errno for pr_perror().
This is mostly useless since a lot of syscalls already set it. All
non-syscall errors already have prints in case of failure.
Fix bind_on_deleted() always returning 0 and simplify error juggling to
returning -1 in case of errors.
Fixes: #1771
Fixes: d0308e5ecc ("sk-unix: make criu respect existing files while restoring ghost unix socket fd")
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
The map_extra field has been introduced in Linux Kernel release 5.16
and does not exist in older kernel versions. The current parsing
implementation fails when map_extra is missing.
In particular, it tries to parse the `memlock` field as `map_extra` and
fails but it does not exit with an error because map_extra is marked as
"optional". It then tries to parse the `map_id` field as `memlock` and
fails with an error because map_id is not optional:
Error (criu/proc_parse.c:2161): parse_fdinfo_pid_s: error parsing [map_type:\t2] for 19: Success'
To correctly handle this, we should try to parse again the next field
when parsing of `map_extra` fails, without reading the next line from
the bpfmap.
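The retry logic can be sketched like this. The field list is shortened and the
function name is made up for the example; it is not the parser in
proc_parse.c, only an illustration of retrying the same line against the next
expected field when an optional field is missing.

```python
# Hypothetical, shortened field list: (name, optional).
FIELDS = [
    ("map_flags", False),
    ("map_extra", True),   # absent on kernels that predate the field
    ("memlock", False),
    ("map_id", False),
]


def parse_bpfmap_fdinfo(lines):
    parsed = {}
    pos = 0
    for name, optional in FIELDS:
        line = lines[pos] if pos < len(lines) else ""
        key, sep, value = line.partition(":")
        if sep and key == name:
            parsed[name] = value.strip()
            pos += 1  # consume the line only on a successful match
        elif not optional:
            raise ValueError("error parsing [%s] as %s" % (line, name))
        # Optional field missing: keep the same line for the next field.
    return parsed
```

With this approach the `memlock` line is never consumed as `map_extra`, so the
following fields stay aligned on older kernels.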
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
bpf_create_map_xattr() has been replaced with bpf_map_create()
6cfb97c
DECLARE_LIBBPF_OPTS has been renamed to LIBBPF_OPTS
ea6c242
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It looks like we've got broken fhandles from fdinfo
for inotifies/fanotifies for btrfs. I will look into that.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
We have a separate target for alpine in scripts/ci/Makefile
which defines some extra opts for zdtm using the ZDTM_OPTIONS
variable. But it doesn't actually work. First of all, the variable
should be named ZDTM_OPTS, and we also have to specify
it directly in the CONTAINER_RUNTIME cmdline to make it work.
I've also changed the variable value to make it consistent
with the docker.env value which was really used.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Strangely, the rseq02 test fails with:
09:06:16.222: 51: exit 555f52082120 555f52082120
09:06:16.282: 51: exit 555f52082120 555f52082120
09:06:16.340: 51: exit 555f52082120 555f52082120
09:06:16.397: 51: exit 555f52082120 555f52082120
09:06:16.503: 51: exit 0 555f52082120
09:06:16.503: 51: FAIL: rseq02.c:235: Failed to increment per-cpu counter (errno = 2 (No such file or directory))
09:06:16.503: 51: FAIL: rseq02.c:246: (errno = 16 (Device or resource busy))
It means that the rseq_cs pointer was cleaned up by the kernel despite the
NO_RESTART* flags. It's hard to reproduce and I will investigate it.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Userspace may configure the rseq cs abort policy by
setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
In ("cr-dump: fixup thread IP when inside rseq cs") we added support for
the case when a process is caught by CRIU during rseq cs execution by
fixing up the IP to abort_ip. That's the common case, but there is a special flag
called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; in this case we have to leave
the process IP as it was before CRIU seized it. Unfortunately, that's not
all that we need here. We also must preserve the (struct rseq)->rseq_cs field.
You may ask: "Why do we need to preserve it by hand? CRIU dumps
all process memory and restores it." That's true, but it's not so easy. The problem
here is that the kernel clears this field when it realizes that
the process has left the rseq cs. But during the dump/restore procedures we are
executing the parasite/restorer from the process context. It means that the process
will leave the rseq cs in any case and (struct rseq)->rseq_cs will be cleared
by the kernel. So we need to restore this field by hand at the *last* stage
of restore, just before releasing the processes.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
If we catch the process while it is inside an rseq
critical section, we have to handle it properly.
From the kernel's point of view, if the process
is executing inside the rseq cs and gets a signal,
rseq critical section execution will be interrupted
and, after signal handler execution, we will proceed
to the rseq cs abort handler instead of continuing normal
rseq cs execution (if RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
isn't set).
When CRIU seizes processes, that's the same thing as
getting a signal from the rseq point of view. So we need
to fix up the instruction pointer to the rseq cs abort handler
address.
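A small Python model of this fixup, offered as a sketch rather than CRIU's code. The field names follow the kernel's struct rseq_cs (start_ip, post_commit_offset, abort_ip); `fixup_ip` is a hypothetical helper and the flag value is stated as an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed to match the kernel uapi value of this flag.
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = 1 << 1


@dataclass
class RseqCs:
    start_ip: int
    post_commit_offset: int
    abort_ip: int
    flags: int = 0


def fixup_ip(ip: int, cs: Optional[RseqCs]) -> int:
    """Return the IP a seized thread should resume from."""
    if cs is None:
        return ip  # no critical section descriptor is set
    inside = cs.start_ip <= ip < cs.start_ip + cs.post_commit_offset
    if not inside:
        return ip  # the thread was not executing the critical section
    if cs.flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL:
        return ip  # policy says: do not abort on signal
    return cs.abort_ip  # behave as if a signal interrupted the section
```

Only a thread whose IP lies in the half-open range [start_ip, start_ip + post_commit_offset) is redirected; everything else keeps its IP.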
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Recent Glibc registers rseq by default. We need to unregister
it before registering our own.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Let's take the thread_pointer() implementation from Glibc.
It will be useful in the future because Glibc stores
struct rseq in the TLS. The absolute address can be calculated
as __criu_thread_pointer() + __rseq_offset.
__rseq_offset is a symbol exported by Glibc itself.
We need to be able to determine where struct
rseq is stored in order to unregister it in CRIU during the restore
stage.
For a different libc, such as musl-libc, we will have to handle
rseq separately, depending on how struct rseq is stored.
Right now that's not a problem because musl-libc has no
rseq support, so we don't need to unregister it.
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=8dbeb0561eeb876f557ac9eef5721912ec074ea5
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=cb976fba4c51ede7bf8cee5035888527c308dfbc
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
We are able to use nested virtualization on
Cirrus, and already have the "Vagrant Fedora based test (no VDSO)"
test; let's add a similar one for Fedora Rawhide to get a fresh kernel.
Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Here we just want to check that if rseq was registered before C/R
it remains registered after it.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Many kernel versions lack support for ptrace(PTRACE_GET_RSEQ_CONFIGURATION).
But the userspace may be fresh (for instance, containers with a fresh Fedora
running on a CentOS 7 host). Consider two scenarios:
- kernel has no ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support
1. there is a process which uses rseq => fail dump
2. there is no process which uses rseq => we can dump without any problems
But how do we determine whether a process uses rseq without the get_rseq_conf
feature? Let's just try to register rseq from the parasite. If rseq is already
registered, we'll get an EBUSY error. If not, the registration will succeed.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support
On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image
On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite
Fixes: #1696
Reported-by: Radostin Stoyanov <radostin@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Add "get_rseq_conf" feature corresponding to the
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
The code expected that the cgroup directory ends with a ',' and
unconditionally removed the last character. For the "unified" case this
resulted in the last 'd' being removed instead of the nonexistent comma.
This just adds a comma after "unified" so that the last removed
character is not the 'd'.
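The effect can be reproduced with a tiny Python model. `strip_last_char` is a hypothetical stand-in for the C code that drops the final character, and the controller names are only examples.

```python
def strip_last_char(name: str) -> str:
    # Model of code that always drops the final character, assuming it is ','.
    return name[:-1]


print(strip_last_char("cpu,cpuacct,"))  # trailing comma present, name stays intact
print(strip_last_char("unified"))       # bug: the final 'd' is lost
print(strip_last_char("unified,"))      # fix: the added comma is removed instead
```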
Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Fix the typos for which codespell offers several variants:
./soccr/soccr.c:219: thise ==> these, this
./soccr/soccr.c:444: sence ==> sense, since
./criu/net.c:665: ot ==> to, of, or
./criu/net.c:775: ot ==> to, of, or
./criu/files.c:1244: wan't ==> want, wasn't
./criu/kerndat.c:1141: happend ==> happened, happens, happen
./criu/mount-v2.c:781: carefull ==> careful, carefully
./test/zdtm/static/socket_aio.c:54: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen6.c:73: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen.c:73: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen4v6.c:73: Chiled ==> Child, chilled
./test/zdtm/static/sk-unix-dgram-ghost.c:201: childs ==> children, child's
./test/zdtm/static/sk-unix-dgram-ghost.c:205: childs ==> children, child's
./compel/arch/x86/src/lib/infect.c:297: automatical ==> automatically, automatic, automated
While at it, do some other minor fixes in the same lines.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I am not sure if this is going to bring any compatibility issues.
If yes, we need to remove this patch and add "useable" to the list of
ignored words instead.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Codespell thinks that tThe is a typo. Fix it by separating the "\t",
which also improves readability (a bit).
[v2: run via make indent]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It is mapped, not maped. Same applies for mmap I guess.
Found by codespell, except it wants to change it to mapped,
which will make it less specific.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Codespell thinks that NODEL is a misspelled MODEL. Indeed it looks that
way. Add an underscore.
Do the same for the file names.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Codespell thinks that "inot" is a misspelled "into".
Rename to infd ("inotify fd") to make it happy.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
CRIU has a few places where it creates unix sockets and their names have to be
unique for each criu run.
Fixes: #1798
Signed-off-by: Andrei Vagin <avagin@google.com>
Since https://reviews.llvm.org/D122271, Clang's -Wunused-but-set-variable
has become smarter and also warns about unused post-increments.
Signed-off-by: Fangrui Song <maskray@google.com>
```
criu/apparmor.c:679:26: error: 'fscanf' may overflow; destination buffer in argument 3 has size 48, but the corresponding specifier may require size 49 [-Werror,-Wfortify-source]
ret = fscanf(f, "%48s", contents);
```
The buffer size should be at least one larger than the fscanf maximum
field width.
Fixes: 8d992a680e ("lsm: support checkpoint/restore of stacked apparmor profiles")
Signed-off-by: Fangrui Song <maskray@google.com>
The init process can exit if it doesn't have any child processes, and its
pidns is destroyed in this case. CRIU dump runs in the target pid
namespace and kills the dumped processes at the end. We need to create a
holder process to be sure that the pid namespace will not be destroyed
before criu exits.
Fixes: #1775
Signed-off-by: Andrei Vagin <avagin@gmail.com>
zdtm.py mounts two named controllers for tests. In CI, we run zdtm.py a few
times, so we can mount (create) these controllers once to avoid any unwanted
effects.
Signed-off-by: Andrei Vagin <avagin@google.com>
The idea is that each zdtm.py should have its own helder, so that two zdtm.py
that are running on the same host don't affect each other.
Fixes: #1774
Signed-off-by: Andrei Vagin <avagin@google.com>
We have three of "Can't mount at %s", let's distinguish simple mount
from bind-mount and re-mount to make log reading easier.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
On pre-v5.15 kernels we don't have MOVE_MOUNT_SET_GROUP support and thus
all our ci logs are filled with "fallback" messages. Let's lower the log
level to debug, so that we don't see them in ci logs.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
[root@fedora criu]# ./test/zdtm.py run -t zdtm/static/pty-console --iters 2 --keep-going --ignore-taint
[WARNING] Option --keep-going is more useful when running multiple tests
userns is supported
=== Run 1/1 ================ zdtm/static/pty-console
====================== Run zdtm/static/pty-console in uns ======================
Start test
Test is SUID
./pty-console --pidfile=pty-console.pid --outfile=pty-console.out
Run criu dump
Run criu restore
Run criu dump
=[log]=> dump/zdtm/static/pty-console/62/2/dump.log
------------------------ grep Error ------------------------
b'(00.009325) 101 fdinfo 3: pos: 0 flags: 100000/0'
b'(00.009332) Dumping path for 3 fd via self 19 [/zdtm/static]'
b'(00.009345) 101 fdinfo 4: pos: 0 flags: 100002/0'
b'(00.009352) tty: Dumping tty 20 with id 0xc'
b"(00.009358) Error (criu/files-reg.c:1710): Can't lookup mount=1647 for fd=4 path=/ptmx"
b'(00.009361) ----------------------------------------'
b'(00.009369) Error (criu/cr-dump.c:1368): Dump files (pid: 101) failed with -1'
b'(00.009696) Running network-unlock scripts'
b'(00.012401) Unfreezing tasks into 1'
b'(00.012410) \tUnseizing 86 into 1'
b'(00.012415) \tUnseizing 101 into 1'
b'(00.012428) Error (criu/cr-dump.c:1788): Dumping FAILED.'
------------------------ ERROR OVER ------------------------
################ Test zdtm/static/pty-console FAIL at CRIU dump ################
Test output: ================================
<<< ================================
Send the 9 signal to 86
Wait for zdtm/static/pty-console(86) to die for 0.100000
##################################### FAIL #####################################
Restore on the second iteration with mount-v2 fails. That is because
devpts_restore, which is called from do_new_mount_v2 via fstype->restore,
opens the ptmx file in the service mntns and saves it to fdstore for later
use. So after the first c/r, the open ptmx fd changes mnt_id in fdinfo to a
detached mount. Let's just disable mount-v2 for this test for now.
FIXME: We should create a separate fstype hook in do_mount_in_right_mntns,
so that we can open files from this hook in the actual restored mntns.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Let's run zdtm in jenkins with --mntns-compat-mode option and same for
device-external mount test from others.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Now when we switched to mount-v2 by default to check old mount engine we
need to explicitly run with --mntns-compat-mode option.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We can have tracefs as a separate mount from debugfs, and that's why the
/sys/kernel/debug external mount now has children. Binding an external mount
with children into a container is not supported, because we don't want
external mounts to introduce some unexpected extra external mounts, so we
bind them without MS_REC in mount-v2, unlike in the old mount engine.
We can either bind without MS_REC when constructing the test or provide all
children mounts as separate external mounts to criu; let's just disable
it for now.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/87875c023
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Before mounts-v2 we have seen mounts losing their readonly mount flags
when they were in a propagation group, because CRIU "forgot" to set
them. With the new mount engine it should work now, as all propagations are
now created on the same path where all other normal mounts are created,
and all mount flags are restored.
This test actually mounts only one mount, the other three are propagations;
let's set the mount ro flag for half of them.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/22584993d
FIXME: need to check options restored right as we don't have
--check-mounts to do this job for us.
Reviewed-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The mounts-v2 engine should fix multiple problems of the old engine related
to sharing options; let's add a test for such problems.
Add all four types of shared groups: 1) private, 2) shared, 3) slave
and 4) slave+shared for mounts. Propagate them into sharing and after
propagation change sharing in four ways: 1) don't change, 2) make
private, 3) make slave and 4) make private + make shared.
This brings 16 cases of different sharing options for mount propagation;
let's check that they all are restored fine.
Let's create mounts from a description to make it easier to improve this
test in future.
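The 16 cases are simply the cross product of the two four-element lists. A Python sketch of that enumeration follows; the labels and the `sharing_cases` helper are illustrative and are not taken from the test source.

```python
from itertools import product

# Labels are illustrative, derived from the two lists in the description.
INITIAL_SHARING = ("private", "shared", "slave", "slave+shared")
CHANGE_AFTER_PROPAGATION = (
    "keep",
    "make-private",
    "make-slave",
    "make-private+make-shared",
)


def sharing_cases():
    """Every (initial sharing, change after propagation) combination."""
    return list(product(INITIAL_SHARING, CHANGE_AFTER_PROPAGATION))


for initial, change in sharing_cases():
    print(f"{initial:>12} -> {change}")
```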
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8bcd0034d
FIXME: need to check options restored right as we don't have
--check-mounts to do this job for us.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This test simply checks that sharing between two mounts in a container,
1) an external mount and 2) its bind, persists (the case when the bind has
the same mountpoint).
Note: with the old mount engine, mounts inside the container also become
shared with the mount in the criu mount namespace (outside the container)
after c/r, which is not right.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/76a09e850
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Now when we switched to mount-v2 by default to check old mount engine we
need to explicitly run with --mntns-compat-mode option.
Note that if the feature move_mount_set_group is not supported then
regular run will just fallback to old mount engine and then we don't
need separate run with --mntns-compat-mode.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e4a430e1f
Changes: prepend --mntns-compat-mode to r_opts in zdtm.py so that we
can disable this option with --no-mntns-compat-mode from test desc
files.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Design of mounts-v2:
As a preparation step we classify mounts in groups by (shared_id,
master_id) in new resolve_shared_mounts_v2 (just after reading images).
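As a rough illustration of this classification step (a Python sketch, not the C implementation), mounts can be bucketed by their (shared_id, master_id) pair. The `Mount` tuple and `classify_sharing_groups` are hypothetical names, and using 0 for "no id" is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Mount(NamedTuple):
    mnt_id: int
    shared_id: int  # 0 is assumed to mean "not shared"
    master_id: int  # 0 is assumed to mean "not a slave"


def classify_sharing_groups(mounts):
    """Map each (shared_id, master_id) pair to the mnt_ids that carry it."""
    groups = defaultdict(list)
    for m in mounts:
        groups[(m.shared_id, m.master_id)].append(m.mnt_id)
    return dict(groups)
```

Mounts that land in the same bucket can later receive identical sharing, which is what makes a "set the same group on all other mounts" pass possible.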
New function prepare_mnt_ns_v2 is our main entry point, where the switch
from the old mount engine to the new one actually happens.
First we pre-create each mount namespace nearly empty, only with root
yard in place (pre_create_mount_namespaces).
We walk the mount tree and mount each mount similar to old mount
engine but not in mount tree but as a sub-directory of root yard
(plain mountpoint) in service (criu) mount namespace. Also we
bind this mount from service mntns to real mntns just after creation.
(do_mount_in_right_mntns)
Note: this way we initially have the final mount, which would be
visible to the restored container user, with the right mnt_id for the sake
of e.g. creating unix sockets on it (for unix socket bindmounts), and
we also have a copy of the mount in the service mntns so that old code which
accesses files on mounts through the service mntns can still access them.
New can_mount_now_v2 is now free from heuristics we had for restoring
shared groups, we will restore them later via MOVE_MOUNT_SET_GROUP,
for now everything is private.
Now when all plain mounts are created in real mount namespaces, we can
move them to the tree for each namespace. Also we open fds on the
mountpoint: one mp_fd_id before moving and another mnt_fd_id after,
so that we can access each file later from final mntns via those fds.
(assemble_mount_namespaces)
New restore_mount_sharing_options walks each root sharing group and
their descendants with dfs tree walk. It creates sharing for the first
mount in the sharing group and then sets the same sharing on all other
mounts in this group.
Sharing creation for the first mount is a two-step process:
a) If the mount has a master_id we copy the shared_id either from the parent
sharing group or from an external source and then make the mount a slave,
thus converting it to the right master_id.
b) Next, if the mount has a shared_id we just make it shared, creating the
right shared_id.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/596651d02
Changes:
- Split all "exporting" to separate preparational patches
- Rework cr_time
- Switch to MOVE_MOUNT_SET_GROUP
- Use resolve_mountpoint for external mounts (for MOVE_MOUNT_SET_GROUP)
- Mounting plain mounts both in service and in restored-final mntns
- Call MOVE_MOUNT_SET_GROUP from usernsd
- Rework can_mount_now_v2 to handle bind of both root and external.
- Use sys_move_mount for mount assembling.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a preparation for the new mounts-v2 mount restore algorithm: we
add an alternative mountpoint to each mount, so that if we mount mounts
at these mountpoints they will be "plain": each mount in a separate
sub-directory of root_yard, mounted without a tree. Tree
reconstruction will be done in a separate step.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/5e6de171a
Changes: improve get_plain_mountpoint().
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We plan to switch to the Mounts-v2 engine for restoring mounts by default;
this option allows switching back to the old engine. This patch only adds
the option, there is no engine behind it yet.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/503f9ad2c
Changes: allow --mntns-compat-mode option only on restore and only if
MOVE_MOUNT_SET_GROUP is supported (this also requires change in
unittest/mock.c), change id in rpc criu_opts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This helper is useful for getting the mountpoint of the source path of an
external mount without parsing the host mountinfo. When we restore a
mountpoint-external mount and need to copy sharing from the source via
MOVE_MOUNT_SET_GROUP, we have to give it the real
mountpoint of the source path to be able to copy the sharing group.
This uses the openat2 RESOLVE_NO_XDEV feature, which detects crossing a
mountpoint boundary, instead of potentially slow mountinfo parsing.
v3: coverity CID 389209: close fd only when it was opened
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Will use this for cross mount namespace bindmounts.
Note: we don't need a separate kdat for mount-v2, as MOVE_MOUNT_SET_GROUP
was added much later than open_tree and all related fixups.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Mounts-v2 requires new kernel feature MOVE_MOUNT_SET_GROUP to be able to
restore propagation between mounts right.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/7da7f9a17
Changes: define move_mount syscall, check mainstream kernel
MOVE_MOUNT_SET_GROUP feature, use our "linux/mount.h" to overcome
possible problems of non-existing header on older kernels.
v3: coverity CID 389201: check ret of umount2 and rmdir at cleanup stage
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
While mounts-v2 mounts all mounts plain without tree in service mntns we can't
just use path relative to mntns to find remap. Make it mount related, it is
also compatible with mounts-v1.
Also we don't need openat and unlinkat here as we've opened rmntns_root
just before that, lets switch to "non-at" variants.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/dc9ac0c80
Changes: rework to skip vz-specific hunks.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
While mounts-v2 would mount all mounts plain without tree in service
mntns we can't just use path relative to mntns to find remap. Make it
mount related, it is also compatible with current mount engine.
Also handle no-mntns case separately in nomntns_create_ghost.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/9cdf0b3e4
Changes: make gf->remap.rpath always relative else we get:
Error (criu/files-reg.c:779): Couldn't unlink remap
/tmp/.criu.mntns.BCurDL/13-0000000000 /zdtm/static/cwd02.test:
No such file or directory
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Will use it to make create_ghost work with mount-v2.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/156fa4877
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/069bba0ad
Changes: merge fixup.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This getter should be used when we want to access the mount on the filesystem.
In next patches we want to be able to change the location of the mount on
restore in service mount namespace, while not changing ->mountpoint string.
All places where we don't want to access the mount but instead want to
determine relations between mounts in the initial mount tree or just print path
should use ns_mountpoint.
This change effectively brings no change of behaviour; everything is the same
for now.
Still leave ->mountpoint references for remap, cr_time and initialization which
need to work with exact variable.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/235c761e0
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
On dump ->mountpoint and ->ns_mountpoint are the same, but on restore
->mountpoint can be changed by mount tree yard setup and remap (and who
knows what else =) ). It is not good to use ->mountpoint for path
comparison between mounts unless we explicitly need to compare
"changed" paths. Imagine the remap change makes two mounts have
different prefixes in ->mountpoint, and we won't be able to understand
that those mounts originally were subpaths.
This patch handles 2 simple cases:
a) These functions called ONLY ON DUMP so for them there is no effective
change: fixup_overlayfs, fusectl_dump, check_one_mark, __lookup_overlayfs,
mount_resolve_path, try_resolve_ext_mount, validate_mounts (first and third),
resolve_external_mounts, get_clean_mnt, __umount_children_overmounts,
__umount_overmounts, ns_open_mountpoint, open_mountpoint, dump_one_fs,
dump_one_mountpoint, clean_cr_time_mounts, collect_unix_bindmounts.
b) In these functions ONLY LOGS changed, so no algorithm change:
always_fail, mnt_build_ids_tree, mnt_tree_show, unsupported_nfs_bindmounts,
unsupported_nfs_mount, unsupported_mount, validate_mounts (second),
__search_bindmounts, resolve_shared_mounts, mnt_tree_for_each, resolve_source,
propagate_siblings, propagate_mount, do_mount_one, get_mp_root,
collect_mnt_from_image, merge_mount_trees, ns_remount_writable,
__remount_readonly_mounts, parse_mountinfo.
All complex cases are handled in separate patches.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/4972888dd
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Function mnt_depth is only used on real mounts when building the mount tree
for a single namespace, that's why we can safely compare those mounts by
ns_mountpoint.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/2be0ff276
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
At this point ns_mountpoint is equal to mountpoint.
Moreover, let's use the robust is_same_path helper in should_skip_mount so
that we don't need to rely on ->mountpoint + 1 hacks.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/d4c4271a0
Changes: use is_same_path helper.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previous code did:
1) get rpath: mount's mountpoint relative to its parent mountpoint
2) get cut_root: parent's root relative to parent's slave root or vice
versa (will be "-" if parent's root is wider or "+" if thicker)
3) return parent's slave mountpoint +/- cut_root + rpath
It can be done more robustly with get_relative_path:
1) get rpath: mount's mountpoint relative to its parent mountpoint
2) get fsrpath: add rpath to parent's root (path relative to fs root)
3) get rpath: fsrpath relative to parent's slave root
4) return parent's slave mountpoint + rpath
In the latter approach we do not need to open-code workarounds for
consecutive slashes in paths (get_relative_path does this for us),
and we also do not need complex logic with +/-.
While at it, let's also switch ->mountpoint to ->ns_mountpoint where
possible, as mountpoint can have unexpected prefixes.
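The four get_relative_path steps can be sketched in Python as follows. This models the idea only: both functions are hypothetical re-implementations, and the real C helper may differ in its handling of corner cases.

```python
import posixpath


def _norm(path: str) -> str:
    # Collapse repeated and trailing slashes; always return an absolute path.
    return posixpath.normpath("/" + path.lstrip("/"))


def get_relative_path(path: str, base: str):
    """Return `path` relative to `base`, "" if equal, None if not below it."""
    path, base = _norm(path), _norm(base)
    if path == base:
        return ""
    prefix = base.rstrip("/") + "/"
    return path[len(prefix):] if path.startswith(prefix) else None


def sibling_path(mnt_mp, parent_mp, parent_root, slave_root, slave_mp):
    """Where a mount would appear under the parent's slave, or None."""
    rpath = get_relative_path(mnt_mp, parent_mp)       # step 1
    if rpath is None:
        return None
    fsrpath = posixpath.join(_norm(parent_root), rpath)  # step 2
    rpath = get_relative_path(fsrpath, slave_root)     # step 3
    if rpath is None:
        return None  # the path is not visible through the slave's root
    return _norm(posixpath.join(_norm(slave_mp), rpath))  # step 4
```

For example, a mount at /a/b/c whose parent is at /a with root /dir maps to /s/c under a slave mounted at /s with root /dir/b.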
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/0fd09f8571
Changes: rework mnt_get_sibling_path more.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We need to skip root_yard_mp parent as it has no ns_mountpoint, it also
has no children overmounts so we are safe, all others can be compared by
ns_mountpoints.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e5665c976
Changes: add mi->parent pre-check, reword commit message.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Fail root_path_from_parent if parent is root_yard, we want to only
lookup root path in real parent mounts.
Now it is safe to use ns_mountpoint instead of mountpoint as both
children and parent have it and they are relative.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e58a91883
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Function validate_children_collision is both called on dump and on
restore. On dump mountpoint and ns_mountpoint are the same. On restore
as we never call validate_children_collision on helper mounts
(root_yard_mp and cr_time are not in mntinfo list), for all other mounts
strcmp results would be the same with mountpoint and ns_mountpoint.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8f4fda5ac
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
There is no point in remapping ns root mounts; they can't overmount anybody.
This also allows us to switch mnt_needs_remap from ->mountpoint to
->ns_mountpoint for mount comparison in overmount detection.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/9475bf843
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Let's use ->ns_mountpoint in comparison as ->mountpoint can change (e.g.
see how we add ns root in get_mp_mountpoint and in do_remap_mount we can
change it again). We plan to get rid of ->mountpoint everywhere where we
can use unchanged ->ns_mountpoint.
Cherry-picked hunks from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e98e1456d
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Replace ->mountpoint with ->ns_mountpoint for determining relations
between mounts.
Also let's use get_relative_path in autofs_create_dentries as it is more
robust, before that we've missed the case where mountpoint of child of
autofs mount is multilevel subdirectory of parent mountpoint, and always
created them as single level subdirectory.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/5d5462202
Changes: skip children overmount as it does not need a subdirectory.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Put remounted_rw to it. This allows us to easily add some more of such
variables without allocating each one of them separately.
Due to the existence of shfree_last, a shmalloc'ed region can be inherited
from the previous caller, so it needs to be explicitly zero-initialized.
Fixes: 0a2d380e6 ("ghost/mount: allocate remounted_rw in shmem to get
info from other processes")
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/6750e5793
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Expression (x && REMOUNTED_RW) is always the same as just (x).
It should've been (x & REMOUNTED_RW) to check if the mount is marked as
temporarily remounted writable and needs to be switched back.
By fixing this check we eliminate excess readonly remounts.
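The difference between the logical and the bitwise test can be shown with a short Python model. The bit values and `OTHER_FLAG` are hypothetical, chosen only to demonstrate the behaviour.

```python
REMOUNTED_RW = 1 << 0  # assumed bit value
OTHER_FLAG = 1 << 1    # hypothetical unrelated flag stored in the same variable


def buggy_needs_ro_remount(x: int) -> bool:
    # Logical AND: true for any non-zero x, because REMOUNTED_RW is non-zero.
    return bool(x and REMOUNTED_RW)


def fixed_needs_ro_remount(x: int) -> bool:
    # Bitwise AND: true only when the REMOUNTED_RW bit itself is set.
    return bool(x & REMOUNTED_RW)


for value in (0, REMOUNTED_RW, OTHER_FLAG, REMOUNTED_RW | OTHER_FLAG):
    print(value, buggy_needs_ro_remount(value), fixed_needs_ro_remount(value))
```

The two checks disagree exactly when some other bit is set while REMOUNTED_RW is clear, which is the case that produced the excess readonly remounts.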
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/167f8ac67
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Let's merge mount trees under root_yard just after reading from image.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8e8ecdfdc
Changes: split only root yard part as a separate patch, and put root
yard alloc into merge_mount_trees.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Function mnt_is_overmounted is designed to detect if mount is overmounted in
current tree using comparison of mountpoints of neighbour mounts for detection.
We want to get actual overmounts in dumped tree, we don't expect that helper
mounts we add or merging will introduce new overmounts. So let's do overmount
detection earlier before adding helpers.
Set is_overmounted = false for root yard and binfmt helper mounts.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e98e1456d
Changes: rename set_is_overmounted to prepare_is_overmounted, move it
just after collecting mounts from images to mount tree, handle helper
mounts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Will use it to find shared mount we can bind from and also can inherit
external slavery. Device-external can't give us external slavery.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/dcd952c4c
Changes: switch to mnt_bind_pick helper.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
There is no point in losing this information; having -1 everywhere in
mount images instead of the actual master id can be confusing.
Note that now need_master is true for bindmounts of root mounts with the
same master_id as the root mount, so now they are handled by common
code; we've added the can_receive_master_from_root check specially to handle
this case right. Also note that in propagate_mount we no longer set ->bind
for this case, this is handled by mnt_ext_slave list related code.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/b3c9dc05e
Stripped only master_id relative part of original patch, add
preparational patches before this one.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We need to put mounts which need to inherit master_id from external
mounts or from root mount into separate list, so that we can set ->bind
on them right in propagate_siblings.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/ea592cf6e
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If mount has external master_id it can inherit it as a bind of external
mount, but also it can inherit it as a bind of container root mount, so
let's add similar condition to allow such mounts.
Note: need_master is false for binds of root mount which can inherit
master_id from root mounts yet, this would change in next patch.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Root yard mount also has mnt_id == 0 so it will look better with a new
name. Let's explicitly initialize root yard mnt_id to HELPER_MNT_ID
for the sake of code readability.
Also in near future we might want to create additional mount helpers to support
mounts in CT with no fsroot mounted.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/45bf6f0ee
Changes: split umount hunk to previous patch, set HELPER_MNT_ID for root
yard.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
On dump mountpoint and ns_mountpoint are indeed the same, but on restore
they are not, and putting something like "<root_yard>/binfmt_misc" into
ns_mountpoint is wrong. Let's leave ns_mountpoint NULL; this mount
should not be compared by ns_mountpoint with other mounts anyway.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Put our auxiliary binfmt_misc mount in "<root_yard>/binfmt_misc" instead
of "<root_yard>/<mntns>/proc/sys/fs/binfmt_misc". Thus we can restore
binfmt_misc without altering the actual mount tree, which looks much
safer.
For that we need to remove "fake top mount_info" handling from
add_cr_time_mount as now we intentionally add binfmt_misc mount as a
child of ("fake") root yard. On dump this does not change anything.
Also we need to create mountpoint for binfmt_misc in root yard.
As now mount is out of restored mount tree we don't need to umount it,
so remove corresponding CRTIME_MNT_ID umount hunk in do_new_mount.
Note: to make binfmt_misc c/r work criu should be compiled with
CONFIG_BINFMT_MISC_VIRTUALIZED and binfmt_misc should be actually
virtualized and this is only done in Virtuozzo kernel per ve.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/2eb535843
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/d79c7f441
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/34002bef4
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/45bf6f0ee
Changes: merge all fixups together to one consistent patch.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Before this change we didn't apply sb-flags if we mount the root mount of
non-root mntns. There is no point in it, if we got to do_new_mount this root
mount is not external bind, so we won't change sb-flags on host if we change it
for this mount. So we just lose sb-flags on some regular container mount for
no reason. Fix it.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e7ffe4c60
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This creates nested mntns and does pivot_root to tmpfs mount, so that
roots of original test mntns and in nested mntns are different.
Before allowing nested mntnses with different roots in previous patch
this would fail.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Only the root in the root mntns is special (see rst_mnt_is_root); all other
mounts are mounted regularly, and there is no difference between an ns
root and any other mount or bind-mount.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/f41e41dd5
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Helper mnt_is_root_bind indicates that mount can be bind-mounted from
the root mount (which in its turn is from opts.root).
Use it in validate_mounts: we should skip unsupported mount from fsroot check
if we know it will be bindmounted from root mount, is_ns_root check was wrong.
Also fix root mount check in dump_one_fs, root mounts in non root mntns should
be dumped normally if they are not bind-mounts of root mount.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/25d078971
Changes: switch to mnt_bind_pick helper, export to mount.h, also add
mnt_get_root_bind helper for future use in mount-v2, remove excess root
yard hunk.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This test creates two mount namespaces, one "root" with external mount
at /mnt_ext_collision.test/dst and one "nested" with different internal
mount at /mnt_ext_collision.test/dst instead.
This case is important for nested containers, if we dump a container
with some external mount in /mnt we should not also replace mounts in
/mnt for nested containers with the external one. (One example is docker
containers inside Virtuozzo containers.)
Without previous patch which restricts external mounts resolution to
only root mntns of container this test fails as internal mount is
replaced by external one after migration.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We resolve mountpoint-external mounts on dump by mountpoint comparison,
so if we have other mount (other superblock e.g. in nested mntns) with
same mountpoint we would also resolve this mount as external and restore
it as external: replacing it completely with different mount... That's
wrong, so to make this interface more robust let's only resolve
mountpoint-external mounts in root mntns of container, not in all
mntnses as it was before.
Note: if actual external mount (bind of external) gets to nested mntns
it's ok not to resolve it as external as criu would bind it from the
resolved mount in root mntns. So external mounts in nested mntns are
still supported after this patch.
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/034498b28
Changes: apply mntns check only to mountpoint-external mounts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This test simply creates a) root external mount and b) "deeper"
bindmount for it (deeper in terms of mnt_depth). Our mount restore code
tries to mount (b) first and fails (without previous patch ordering
external mounts before their binds).
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/d31954669
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The problem is that when we don't order these mounts we can get to mounting
non-external bind first via do_new_mount and fail c/r. For instance for
tmpfs we would fail on no image to get contents from. See the test
mnt_ext_root for more info.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/baf3f8db8
Changes: switch to mnt_bind_pick helper, export to mount.h, make check
in can_mount_now skip mounts with ->bind set.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Function dump_one_fs already has mnt_is_external_bind check inside, so
there is no point to check pm->external one more time.
Function check_bindmount is intended to check devpts bindmount's master
was opened in right mount namespace, but if bindmount is external mount
there is no point to check this. Let's also skip check for bindmounts of
external mounts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We use mnt_is_external():
1) In validate_mounts() to skip fsroot existence check for mounts which
will be bind-mounted from external mounts.
2) In resolve_shared_mounts() to skip error on slave mounts without
master mount, if they can receive these master_id through external
mount.
3) In dump_one_fs to skip dump of mounts which will be bind-mounted from
external mounts.
Cases (1) and (3) are the same, but case (2) is quite different. Let's
split these cases, thus making things simpler.
Effectively this patch does not change criu's behaviour at all. While
I can't say that the old mnt_is_external was wrong, it was too complex and
hard to understand, so it's worth switching to a lookup across the
bindmounts list via the general mnt_bind_pick() helper. And now that it is
obvious that mnt_is_external looks for an external bindmount, let's also
change its name to mnt_is_external_bind.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/494b52ba8
Changes: use mnt_bind_pick helper, use is_sub_path helper to be more
robust, rename mnt_is_external to mnt_is_external_bind, fix
clang-format, export to mount.h, use mnt_is_nodev_external as we can not
inherit master from device-external mounts.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
By adding different pick functions we can search for different
things like mounted bind with wider root, or external bind, or external
bind with same sharing group and so on and so forth.
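A minimal sketch of the idea, not the actual mnt_bind_pick() code: the
lookup walks a list of binds and a caller-supplied predicate decides
which entry to return. The struct, its fields and all names ending in
_sketch are hypothetical and only illustrate the pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a mount entry. */
struct mnt_sketch {
	const char *root;        /* root of the mount inside its superblock */
	bool external;           /* mount is marked as external */
	struct mnt_sketch *next; /* next bind of the same superblock */
};

/* A pick function says whether candidate is acceptable for self. */
typedef bool (*pick_fn)(const struct mnt_sketch *candidate, const struct mnt_sketch *self);

/* Return the first bind other than self that pick accepts, or NULL. */
static struct mnt_sketch *bind_pick_sketch(struct mnt_sketch *binds, const struct mnt_sketch *self, pick_fn pick)
{
	assert(pick != NULL);
	for (struct mnt_sketch *m = binds; m; m = m->next)
		if (m != self && pick(m, self))
			return m;
	return NULL;
}

/* One possible pick function: accept any external bind. */
static bool pick_external(const struct mnt_sketch *candidate, const struct mnt_sketch *self)
{
	(void)self;
	return candidate->external;
}
```

Other searches (wider root, same sharing group) would only need another
predicate with the same signature; the list walk stays in one place.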
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a smart way of getting relative paths:
1) Always returns relative path, no unexpected starting '/';
2) Detects subpath even if path formats are different, only real directory
and file names matter;
3) No path modification/allocation, returns shifted pointer to the
original path.
We have many places where we need to cut subpath from path. Different code
blocks doing this job spread widely across the codebase for instance see:
cut_root_for_bind and root_path_from_parent. But those implementations rely on
the fact that subpath's and path's formats are the same.
When we modify or concatenate paths we can accidentally get strange
path formats, paths given by user can have strange format, and the job
to manually maintain all paths in "simple" format everywhere is too
hard. So let's just add a tool to compare "strange" paths.
E.g.:
get_relative_path("./a////.///./b//././c", "///./a/b") == "c"
Note: ".." in path is not supported, and we just can't support it right
without full filesystem tree information to resolve paths like
"../../a", so we just treat ".." as a directory name which should work
in simple cases.
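The behaviour described above can be sketched as follows. This is an
independent illustration written for this note, not the helper added by
the patch; rel_path_sketch and its two static helpers are hypothetical
names. It compares the paths component by component, skipping repeated
'/' and "." components, and treats ".." as an ordinary name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip any run of '/' and "." components at the start of p. */
static const char *skip_noise(const char *p)
{
	for (;;) {
		if (*p == '/')
			p++;
		else if (p[0] == '.' && (p[1] == '/' || p[1] == '\0'))
			p++;
		else
			return p;
	}
}

/* Length of the path component starting at p (up to '/' or the end). */
static size_t comp_len(const char *p)
{
	return strcspn(p, "/");
}

/*
 * If sub_path is a prefix of path in terms of components, return a pointer
 * into path just past that prefix (never starting with '/'); else NULL.
 */
static const char *rel_path_sketch(const char *path, const char *sub_path)
{
	assert(path != NULL && sub_path != NULL);
	for (;;) {
		size_t n;

		path = skip_noise(path);
		sub_path = skip_noise(sub_path);
		if (*sub_path == '\0')
			return path; /* all components of sub_path matched */

		n = comp_len(sub_path);
		if (comp_len(path) != n || strncmp(path, sub_path, n) != 0)
			return NULL; /* component names differ */

		path += n;
		sub_path += n;
	}
}
```

Because only a shifted pointer into the first argument is returned,
nothing has to be freed by the caller.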
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/73a771348
Changes: add other useful robust path comparison helpers is_sub_path and
is_same_path based on get_relative_path, fix clang-format.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Before this patch mnt_is_external() used non-populated mnt_bind list
when called from resolve_shared_mounts(), thus it could work not as
intended.
Let's add separate helper search_bindmounts() for populating mnt_bind
list, and add mnt_bind_is_populated to differentiate between
non-populated list and just empty populated list. This way we can add a
BUG_ON to mnt_is_external to catch such order problems in future.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e464c1c6d
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8b22b30d5
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/ca9de41e3
Changes: simplify commit message, merge fixups: search bindmounts
earlier so that we have bindmounts info as early as possible, rename
mnt_no_bind to mnt_bind_is_populated and simplify its logic a bit.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Fstype and source fields can be changed by resolve_external_mounts() or
by try_resolve_ext_mount() for external mounts, but we can have other
mounts from same superblock which are not detected as external, for
instance bind of subdirectory from device-external or bind of
mountpoint-external mount to other mountpoint. So we need to still be
able to find bindmounts between mounts with changed fstype or source and
unchanged mounts.
So let's make fstype/source checks in mounts_sb_equal ignored for
external mounts. Leave only fstype->sb_equal checks if we have them.
Signed-off-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/fadc38d84
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/f9700cb12
Changes: merge two commits in one and rework, remove ":)", reword
commit-message to make patch self-sufficient.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previously only autodetected and mountpoint external mounts had
mount_info->external field set, let's fix this injustice so that we can
operate all external mounts in a similar manner.
Also:
Print info message when device external mount is detected similar to
mountpoint external mounts detection.
Add helper mnt_is_nodev_external to let do_mount_one, can_mount_now and
do_bind_mount handle device external mounts separately as it was before.
Handle device external mount right in get_mp_root to set ->external on
restore. (note: calling ext_mount_lookup is only meaningful for
mountpoint external mounts)
Add helper mnt_is_dev_external to use in resolve_source to make it more
clear that it is a device external mount restore path.
All other "if (mi->external)" checks now also handle device external
mounts, but they all look safe to do so and could've done it initially,
here is a list: fusectl_dump, mnt_is_external, dump_one_mountpoint,
propagate_mount.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/afd899539
Changes: cleanup commit message, add some helpers.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Device-external mounts are restored via do_new_mount(), but function
do_new_mount only allows creating mounts with root "/", as it does
simple mount (not bind) without any later root change. Restoring
non-root mounts via do_new_mount is just impossible.
So let's detect mounts as device-external only when they have fsroot
root, all other non-fsroot binds of this device would be restored as
bindmounts of fsroot ones.
This is a cosmetic change: although non-root mounts were detected as
device-external before this patch, they would not be created with
do_new_mount() anyway, because the fsroot/bind check in can_mount_now
orders them to be restored as binds.
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/afd899539
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Use this helper everywhere instead of manually adding mounts to the head
of the list, this way it is much easier to track all places where we do
add to mntinfo list.
Signed-off-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/7bca9397b
Changes: skip hunk adding root_yard_mp to the list because root yard has
not fully initialized mountinfo structure (can break code which uses
mntinfo fallback in lookup_nsid_by_mnt_id), let's only have real mounts
in mntinfo list. Also skip cr_time mount from mntinfo list for the same
reason.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Before this change the on-host-"zdtm_auto_ext_mnt" mount with
mountpoint "/tmp/zdtm_ext_auto.XXXXXX" was private/shared depending on
its parent mount "/tmp". And e.g. on my setup the parent mount on
"/tmp" is private and our "host" mount becomes private too. So
in-container-"zdtm_auto_ext_mnt" external mount is also private but test
name hints it should be slave.
E.g. if I run mnt_ext_master before this patch, in the mnt_ext_master
process mntns we see that our "external" mount is private, not slave:
[root@fedora criu]# grep zdtm_auto_ext_mnt /proc/167077/mountinfo
1239 1238 0:138 /test /ext_mounts rw,relatime - tmpfs zdtm_auto_ext_mnt rw,seclabel,inode64
After this patch:
[root@fedora criu]# grep zdtm_auto_ext_mnt /proc/166385/mountinfo
1239 1238 0:138 /test /ext_mounts rw,relatime master:413 - tmpfs zdtm_auto_ext_mnt rw,seclabel,inode64
^^^^^^^^^^
So we just explicitly make on-host-"zdtm_auto_ext_mnt" shared, and this
makes in-container-"zdtm_auto_ext_mnt" external mount slave.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/a1a221fe9
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389197:
CID 389197 (#1 of 1): Invalid printf format string (PRINTF_ARGS)
format_error: Length modifier L not applicable to conversion specifier in %Lu. [show details]
284 pr_err("Incompatible uffd API: expected %Lu, got %Lu\n", UFFD_API, uffdio_api.api);
Looking at the C11 standard it seems that "%Lu" is undefined, so we
better not use it, see:
"L Specifies that a following a, A, e, E, f, F, g, or G conversion
specifier applies to a long double argument."
http://port70.net/~nsz/c/c11/n1570.html#7.21.6.1p7
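For illustration only, one portable way to print 64-bit unsigned values
is the PRIu64 macro from <inttypes.h>. This sketch does not claim to be
the format string the patch switched to; format_api_mismatch is a
hypothetical helper.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format two 64-bit values without the undefined "%Lu" conversion.
 * Returns whatever snprintf returns.
 */
static int format_api_mismatch(char *buf, size_t len, uint64_t expected, uint64_t got)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "expected %" PRIu64 ", got %" PRIu64, expected, got);
}
```

For a value of type unsigned long long, "%llu" is the matching standard
conversion.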
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389191:
int unix_sk_id_add(unsigned int ino)
2327{
2328 char *e_str;
2329
1. alloc_fn: Storage is returned from allocation function malloc.
2. var_assign: Assigning: ___p = storage returned from malloc(20UL).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: e_str = ({...; ___p;}).
2330 e_str = xmalloc(20);
6. Condition !e_str, taking false branch.
2331 if (!e_str)
2332 return -1;
7. noescape: Resource e_str is not freed or pointed-to in snprintf.
2333 snprintf(e_str, 20, "unix[%u]", ino);
8. noescape: Resource e_str is not freed or pointed-to in add_external. [show details]
CID 389191 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
2334 return add_external(e_str);
2335}
We should free e_str string after we finish its use in unix_sk_id_add,
easiest way to do it is to use cleanup_free attribute.
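A minimal sketch of how such a cleanup attribute works with GCC or
clang. The names free_ptr_sketch, cleanup_free_sketch, remember_key and
unix_key_add_sketch are hypothetical; remember_key stands in for a
consumer that copies the string, which is an assumption of this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int frees_seen; /* demo counter: how many times cleanup ran */

/* Called with the address of the annotated variable when it leaves scope. */
static void free_ptr_sketch(void *p)
{
	assert(p != NULL);
	free(*(void **)p);
	frees_seen++;
}

#define cleanup_free_sketch __attribute__((cleanup(free_ptr_sketch)))

static char last_key[32];

/* Stand-in consumer: copies the key, so the caller may free it afterwards. */
static int remember_key(const char *key)
{
	if (strlen(key) >= sizeof(last_key))
		return -1;
	strcpy(last_key, key);
	return 0;
}

static int unix_key_add_sketch(unsigned int ino)
{
	cleanup_free_sketch char *e_str = malloc(20);

	if (!e_str)
		return -1; /* cleanup still runs; free(NULL) is a no-op */
	snprintf(e_str, 20, "unix[%u]", ino);
	/* The return value is computed first, then e_str is freed. */
	return remember_key(e_str);
}
```

Every return path frees the buffer, so no explicit free() call has to
be added before each return.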
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Modifications to support criu image streamer when using amdgpu_plugin.
When running with criu image streamer, fseek/lseek is not available so
we store the file size in the first 8-bytes of the actual file.
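A rough sketch of the size-prefix idea, assuming the size is written as
a host-endian uint64_t (the byte order is not stated above). The
function names are hypothetical and this is not the plugin's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write an 8-byte length header followed by the payload. 0 on success. */
static int write_sized_sketch(FILE *f, const void *buf, uint64_t len)
{
	assert(f != NULL);
	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (len && fwrite(buf, 1, len, f) != len)
		return -1;
	return 0;
}

/* Read the header, then exactly that many bytes; no seeking is needed. */
static void *read_sized_sketch(FILE *f, uint64_t *len)
{
	void *buf;

	assert(f != NULL && len != NULL);
	if (fread(len, sizeof(*len), 1, f) != 1)
		return NULL;
	buf = malloc(*len ? *len : 1);
	if (!buf)
		return NULL;
	if (*len && fread(buf, 1, *len, f) != *len) {
		free(buf);
		return NULL;
	}
	return buf; /* caller frees */
}
```

The reader learns the payload size from the stream itself, so it works
on a pipe or socket where fseek/lseek would fail.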
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Store BO contents directly to file (1 per GPU) instead of using
protobuf.
Bug Fix:
Fixes an issue where we could not handle BOs bigger than 4GB because
protobuf has an internal limit of 4GB for the Bytes structure.
Performance Improvements:
This significantly reduces CR duration on multi-GPU systems as it allows
reading and writing to disk in parallel. During checkpoint, instead of
waiting for all the BO contents to be read from the one protobuf file,
we can now start writing the BO contents as soon as the first BO is read
from disk. During restore, we can start writing BO contents to disk
after the first BO from VRAM. This also reduces the peak amount of
system memory used as we only need to keep 1 BO content in memory per
GPU at a time instead of all the BO contents.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
This sets up the pytorch environment for BERT Transformers and also sets
up CRIU along with all its dependencies including amdgpu plugin for
supporting CR with AMDGPUs.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
On newer kernels (> 5.13), KFD & DRM drivers will only allow the
/dev/renderD* file descriptors that were used during the CRIU_RESTORE
ioctl when calling mmap for the vma's.
During restore, after opening /dev/renderD*, amdgpu_plugin keeps the
FDs opened and instead returns a copy of the FDs to CRIU. The same FDs
are then returned during the UPDATE_VMAMAP hooks so that they can be
used by CRIU to call mmap. Duplicated FDs created using dup are
references to the same struct file inside the kernel so they are also
allowed to mmap.
To prevent the opened FDs inside amdgpu_plugin from conflicting with
FDs used by the target restore application, we make sure that the
lowest-numbered FD that amdgpu_plugin will use is greater than the
highest-numbered FD that is used by the target application.
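One way to place a descriptor above a given number is
fcntl(fd, F_DUPFD, min_fd), which returns the lowest free descriptor
that is greater than or equal to min_fd. The sketch below is only an
illustration of that call; move_fd_above_sketch is a hypothetical name
and the plugin may choose its descriptors differently.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor >= min_fd and close the
 * original. Returns the new descriptor, or -1 on error (fd stays open).
 */
static int move_fd_above_sketch(int fd, int min_fd)
{
	int new_fd;

	assert(min_fd >= 0);
	new_fd = fcntl(fd, F_DUPFD, min_fd);
	if (new_fd < 0)
		return -1;
	close(fd); /* both numbers referred to the same struct file */
	return new_fd;
}
```

Here min_fd would be one more than the highest descriptor used by the
restored application.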
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
AMD Radeon GPUs have special sDMA (system dma engines) IPs that can be
used to speed up the read write operations from the VRAM and GTT memory.
Depends on:
* The kernel mode driver (kfd) creating the dmabuf objects for the kfd
BOs in both checkpoint and restore operation.
* libdrm and libdrm_amdgpu libraries
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Libhsakmt(thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem
and its semaphore in /dev/shm/hsakmt_shared_mem. Adding a check during
checkpoint to see if these two files exist. If they exist then the
plugin will try to restore them during restore.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Implement multi-threaded code to read and write contents of each GPU
VRAM BOs in parallel in order to speed up dumping process when using
multiple GPUs.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Adding unit tests for GPU remapping code when checkpointing and
restoring on different nodes with different topologies.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add optional parameters to override default behavior during restore.
These parameters are passed in as environment variables before executing
CRIU.
List of parameters:
KFD_FW_VER_CHECK - disable firmware version check
KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable caches count check
KFD_NUM_GWS_CHECK - disable num_gws check
KFD_VRAM_SIZE_CHECK - disable VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
KFD_CAPABILITY_CHECK - disable capability check
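A sketch of reading such a switch from the environment. The accepted
values are an assumption made for this example ("0" or "NO" turns a
check off, anything else leaves it on); the list above does not say
which values the plugin recognizes. env_check_enabled_sketch is a
hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return dflt if the variable is unset or empty.
 * Assumed semantics: "0" or "NO" means disabled, anything else enabled.
 */
static bool env_check_enabled_sketch(const char *name, bool dflt)
{
	const char *v;

	assert(name != NULL);
	v = getenv(name);
	if (!v || !*v)
		return dflt;
	return !(strcmp(v, "0") == 0 || strcmp(v, "NO") == 0);
}
```

Under that assumption, running CRIU with KFD_FW_VER_CHECK=0 in the
environment would skip the firmware version check on restore.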
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The device topology on the restore node can be different from the
topology on the checkpointed node. The GPUs on the restore node may
have different gpu_ids or minor numbers, or some GPUs may have different
properties than on the checkpointed node. During restore, the CRIU plugin
determines the target GPUs to avoid restore failures caused by trying
to restore a process on a GPU that is different.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Parse local system topology in /sys/class/kfd/kfd/topology/nodes/ and
store properties for each gpu in the CRIU image files. The gpu
properties can then be used later during restore to make sure the process
is restored on GPUs with similar properties.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
To support Checkpoint Restore with AMDGPUs for ROCm workloads, introduce
a new plugin to assist CRIU with the help of AMD KFD kernel driver. This
initial commit just provides the basic framework to build up further
capabilities. Like CRIU, the amdgpu plugin also uses protobuf to
serialize and save the amdkfd data which is mostly VRAM contents with
some metadata.
We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore
this file is read and extracted to re-create various types of buffer
objects that belonged to the previously checkpointed process. Upon
restore the mmap page offset within a device file might change so we use
the new hook to update and adjust the mmap offsets for newly created
target process. This is needed for sys_mmap call in pie restorer phase.
Support for queues and events is added in future patches of this series.
With the current implementation (amdgpu_plugin), we support:
- Only compute workloads (non-Gfx)
- GPU visible inside a container
- AMD GPU Gfx 9 Family
- Pytorch Benchmarks such as BERT Base
amdgpu plugin depends on libdrm and libdrm_amdgpu which are typically
installed with libdrm-dev package. We build amdgpu_plugin only when the
dependencies are met on the target system and when user intends to
install the amdgpu plugin and not by default with criu build.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
kfd_ioctl.h contains the definitions for the APIs and required arguments
to call the ioctls so simply copy the header as is for amdgpu plugin.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
During premap phase, skip vmas that are handled by external plugins as
their offsets may change when the plugin restores them. This change is
needed when running with criu image streamer.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Adding a dedicated flag for vma's that are handled by an external plugin
as previously used VMA_UNSUPP flag depends on vma not having
VMA_FILE_SHARED flag.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add a new global function to return unused FD based on the pid. This
function can be used in situations where we need a FD that will not
conflict with FDs used by target restore process, but
struct pstree_item is not available (e.g plugins)
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Some device drivers (e.g DRM) only allow the file descriptor that was
used to create the vma to be used when calling mmap.
In this case, instead of opening a new FD, the plugin will return a
valid FD that can be used for mmap later. The plugin needs to close the
returned FD later. Copies of the returned FD that are created using dup
or fcntl(..,F_DUPFD,..) are references to the same struct file inside
kernel so they are also allowed to mmap.
The plugin does not need to update the path anymore as the plugin can
return a FD for the correct path.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
coverity CID 389187:
3193int veth_pair_add(char *in, char *out)
3194{
3195 char *e_str;
3196
1. alloc_fn: Storage is returned from allocation function malloc.
2. var_assign: Assigning: ___p = storage returned from malloc(200UL).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: e_str = ({...; ___p;}).
3197 e_str = xmalloc(200); /* For 3 IFNAMSIZ + 8 service characters */
6. Condition !e_str, taking false branch.
3198 if (!e_str)
3199 return -1;
7. noescape: Resource e_str is not freed or pointed-to in snprintf.
3200 snprintf(e_str, 200, "veth[%s]:%s", in, out);
8. noescape: Resource e_str is not freed or pointed-to in add_external. [show details]
CID 389187 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
3201 return add_external(e_str);
3202}
We should free e_str string after we finish its use in veth_pair_add,
easiest way to do it is to use cleanup_free attribute.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389192:
550static int parse_join_ns(const char *ptr)
551{
...
553 char *ns;
554
1. alloc_fn: Storage is returned from allocation function strdup.
2. var_assign: Assigning: ___p = storage returned from strdup(ptr).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: ns = ({...; ___p;}).
555 ns = xstrdup(ptr);
6. Condition ns == NULL, taking false branch.
556 if (ns == NULL)
557 return -1;
558
7. noescape: Resource ns is not freed or pointed-to in strchr.
559 aux = strchr(ns, ':');
8. Condition aux == NULL, taking true branch.
560 if (aux == NULL)
CID 389192 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable ns going out of scope leaks the storage it points to.
561 return -1;
We should free ns string after we finish its use in parse_join_ns,
easiest way to do it is to use cleanup_free attribute.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The config_inotify_irmap test duplicates inotify_irmap with slight
change to add the --force-irmap and --irmap-scan-path options in
a configuration file.
The --criu-config option of ZDTM provides more general solution
for testing CRIU options provided in configuration files.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --criu-config option allows running tests with CRIU options provided
via configuration files instead of command-line arguments.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using long-form command-line options allows us to provide
them via config file to CRIU.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch improves the readability of zdtm by refactoring the top-level
code into a main function.
https://docs.python.org/3/library/__main__.html
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
coverity CID 389193:
CID 389193 (#1 of 1): Printf format string issue (PW.BAD_PRINTF_FORMAT_STRING)
1. bad_printf_format_string: invalid format string conversion
598 pr_warn("Can't stat socket %#x(%s), skipping: %m (err %d)\n", id, rpath, errno);
Specifier "%#x" is wrong for id as it is of type uint32_t, let's change
it to "%#" PRIx32 "" to fix the problem.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389205:
452int dump_tun_link(NetDeviceEntry *nde, struct cr_imgset *fds, struct nlattr **info)
453{
...
458 struct tun_link *tl;
...
2. alloc_fn: Storage is returned from allocation function get_tun_link_fd. [show details]
3. var_assign: Assigning: tl = storage returned from get_tun_link_fd(nde->name, nde->peer_nsid, tle.flags).
475 tl = get_tun_link_fd(nde->name, nde->peer_nsid, tle.flags);
4. Condition !tl, taking false branch.
476 if (!tl)
477 return ret;
478
479 tle.vnethdr = tl->dmp.vnethdr;
480 tle.sndbuf = tl->dmp.sndbuf;
481
482 nde->tun = &tle;
CID 389205 (#1 of 1): Resource leak (RESOURCE_LEAK)5. leaked_storage: Variable tl going out of scope leaks the storage it points to.
483 return write_netdev_img(nde, fds, info);
484}
Function get_tun_link_fd() can both return tun_link entry from tun_links
list and a newly allocated one. So we should not free entry if it is
from list and should free it when it is a new one to fix leak.
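The ownership rule can be sketched like this. How the real fix tells a
list entry from a fresh allocation is not described above; the is_new
out-parameter, the struct and all names ending in _sketch are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct link_sketch {
	char name[16];
	struct link_sketch *next;
};

static struct link_sketch *known_links; /* entries here are owned by the list */

/* Return a cached entry if present, else a fresh allocation; *is_new tells which. */
static struct link_sketch *get_link_sketch(const char *name, bool *is_new)
{
	struct link_sketch *l;

	assert(name != NULL && is_new != NULL);
	for (l = known_links; l; l = l->next) {
		if (strcmp(l->name, name) == 0) {
			*is_new = false;
			return l;
		}
	}
	l = calloc(1, sizeof(*l));
	if (l) {
		strncpy(l->name, name, sizeof(l->name) - 1);
		*is_new = true;
	}
	return l;
}

/* Use an entry and release it only if this call allocated it. */
static int use_link_sketch(const char *name)
{
	bool is_new = false;
	struct link_sketch *l = get_link_sketch(name, &is_new);

	if (!l)
		return -1;
	/* ... read the needed fields from l here ... */
	if (is_new)
		free(l); /* list entries must stay alive */
	return is_new ? 1 : 0;
}
```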
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389202:
54int ext_mount_add(char *key, char *val)
55{
56 char *e_str;
57
1. alloc_fn: Storage is returned from allocation function malloc.
2. var_assign: Assigning: ___p = storage returned from malloc(strlen(key) + strlen(val) + 8UL).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: e_str = ({...; ___p;}).
58 e_str = xmalloc(strlen(key) + strlen(val) + 8);
6. Condition !e_str, taking false branch.
59 if (!e_str)
60 return -1;
...
7. noescape: Resource e_str is not freed or pointed-to in sprintf.
73 sprintf(e_str, "mnt[%s]:%s", key, val);
8. noescape: Resource e_str is not freed or pointed-to in add_external. [show details]
CID 389202 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
74 return add_external(e_str);
75}
We need to free e_str after add_external used it.
v2: use cleanup_free attribute (@adrianreber)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During error injection tests there are random values loaded in some of
the registers. The kernel, however, has the following check:
if (mxcsr[0] & ~mxcsr_feature_mask)
return -EINVAL;
So depending on the random values loaded mxcsr might have values that
the kernel rejects with EINVAL. Setting mxcsr to zero during the tests
lets the error injection test pass.
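The quoted condition can be restated as a small predicate to show why
zero is always accepted. The function names are hypothetical and the
mask is passed in as a parameter, since its value depends on the CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the quoted kernel condition: any bit outside the mask is rejected. */
static int mxcsr_acceptable_sketch(uint32_t mxcsr, uint32_t feature_mask)
{
	return (mxcsr & ~feature_mask) == 0;
}

/* Zero has no bits set, so it passes whatever the mask is. */
static void mxcsr_zero_always_passes(uint32_t feature_mask)
{
	assert(mxcsr_acceptable_sketch(0, feature_mask));
}
```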
Signed-off-by: Adrian Reber <areber@redhat.com>
There is no 'err' argument for print(), it should be in grep_errors() in
the line below.
Fixes: bed670f62 ("zdtm: print tails of all logs if a test has failed")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Linux Kernel release 5.16 removed support for LOCK_MAND flock and so the
test to verify if LOCK_MAND works started to fail with 5.16.
The kernel also logs following message:
Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
This fixes CRIU CI using Fedora with 5.16.
See Linux Kernel commit 90f7d7a0d0d68623b5f7df5621a8d54d9518fcc4
"locks: remove LOCK_MAND flock lock support"
Signed-off-by: Adrian Reber <areber@redhat.com>
Starting with Linux Kernel release 5.16 the fdinfo proc entry contains
a map_extra field which breaks CRIU parsing of bpfmap entries.
This commit adds the map_extra as a possible field to CRIU. The value of
map_extra is not passed to the kernel on restore as it does not seem to
be evaluated in the code paths CRIU restore is using for BPF.
This fixes CRIU CI using Fedora with 5.16.
See Linux commit 9330986c03006ab1d33d243b7cfe598a7a3c1baa
"bpf: Add bloom filter map implementation"
Signed-off-by: Adrian Reber <areber@redhat.com>
Currently, hugetlb mappings are not premapped, so in the restore content phase
we skip reading these pages, enqueue the iovec for later reading in the restorer
and eventually close the page read. However, image-streamer expects the whole
image to be read, and the image is neither re-opened nor sent twice. As a
result, the MAP_HUGETLB test cases fail with EPIPE. Disable these test cases
for now.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
This commit adds a test for checkpoint/restore of MAP_HUGETLB memory mappings.
A new zdtm helper get_mapping_dev() is added to get the device number of
the memory mapping.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
As hugetlb mappings are not premapped, they are not registered with the uffd
service in the restorer code. We must not mark these mappings as PPB_LAZY in
generate_iovs(); otherwise, when restoring the content of these mappings, we
keep looking for them in uffd and get ENOENT because they are not registered.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
As mremap() cannot move hugetlb mappings around on Linux kernels before
version 5.16, we need to skip premapping hugetlb mappings.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
When memfd can be used with hugetlb, we use memfd to checkpoint/restore
anonymous shared memory. Otherwise, map_files symlinks are used to
checkpoint/restore anonymous shared memory.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Attach the System V shared memory segments to the address space via shmat() to
determine if they are backed by hugetlb and their page size. Use this
information to set the correct flags on restore.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
These numbers are used to determine whether a memory mapping is backed by
hugetlb and its page size.
As more hugepages can be allocated after kerndat is first collected, we need
to collect the missing device numbers every time we load the kerndat cache.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
When PTRACE_GET_THREAD_AREA fails on kernels with
!CONFIG_IA32_EMULATION because of missing support (-EIO), compel should
ignore such errors in native mode.
However, the check for the error type uses the return value of ptrace rather
than errno, so the error is always propagated.
Use errno to detect the type of error to fix this.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
As we call the mmap syscall directly, the value returned on error is the error
number, not -1 as with the libc wrapper. Use IS_ERR to check for errors
correctly.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
os.WEXITSTATUS() returns the process exit status and it should be used
only if WIFEXITED() is true, i.e., the process terminated normally.
os.waitstatus_to_exitcode() does the same as os.WEXITSTATUS() but it
also handles the case when the process has been terminated by a signal.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
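For the os.WEXITSTATUS() change above, the difference between the two calls
can be seen in a small standalone sketch. It is not code from the patch: it
assumes Linux and Python 3.9 or newer (where os.waitstatus_to_exitcode() is
available), and run_child() is a hypothetical helper used only here.

```python
import os
import signal


def run_child(action):
    """Fork a child that runs action() and return its raw wait status."""
    pid = os.fork()
    if pid == 0:
        action()
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return status


# One child exits normally with code 3, the other is killed by a signal.
exited = run_child(lambda: os._exit(3))
killed = run_child(lambda: os.kill(os.getpid(), signal.SIGKILL))

# WEXITSTATUS() is only meaningful when WIFEXITED() is true.
exit_code = os.waitstatus_to_exitcode(exited)
# For a signalled child the result is the negated signal number.
signal_code = os.waitstatus_to_exitcode(killed)
```

With os.waitstatus_to_exitcode() the caller does not need a separate
WIFEXITED()/WIFSIGNALED() branch to tell the two outcomes apart.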
If we replace old_sid with current_sid we should also do the same
replacement for matching pgid (=old_sid).
Reported in CRIU gitter by Younes Manton (@ymanton)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
autofs.c:66:17: error: pointer 'str' may be used after 'realloc' [-Werror=use-after-free]
autofs.c: In function 'check_automount':
../lib/zdtmtst.h:131:9: error: pointer 'mountpoint' may be used after 'free' [-Werror=use-after-free]
131 | test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, strerror(errno))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
autofs.c:277:17: note: in expansion of macro 'pr_perror'
277 | pr_perror("%s: failed to close fd %d", mountpoint, p->fd);
| ^~~~~~~~~
autofs.c:268:9: note: call to 'free' here
268 | free(mountpoint);
| ^~~~~~~~~~~~~~~~
Fixes: #1731
v2: (@Snorch) always update `str` after successful realloc()
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Running cross compile tests with Debian unstable sometimes
fails due to missing or outdated packages.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Parasite creation started to fail with GCC 12:
On x86_64 with:
./compel/compel-host hgen -f criu/pie/restorer.built-in.o -o criu/pie/restorer-blob.h
Error (compel/src/lib/handle-elf-host.c:337): Unexpected undefined symbol: `strlen'. External symbol in PIE?
On aarch64 with:
ld: criu/pie/restorer.o: in function `lsm_set_label':
/drone/src/criu/pie/restorer.c:174: undefined reference to `strlen'
Line 174 is: "for (len = 0; label[len]; len++)"
Adding '-ffreestanding' to parasite compilation fixes these errors
because, according to GCC developers:
"strlen is a standard C function, so I don't see any bug in that being used
unless you do a freestanding compilation (-nostdlib isn't that)."
Signed-off-by: Adrian Reber <areber@redhat.com>
This fixes:
criu/config.c: In function ‘parse_statement’:
criu/config.c:232:43: error: the comparison will always evaluate as ‘true’ for the pointer operand in ‘*(configuration + (sizetype)((long unsigned int)i * 8)) + ((sizetype)offset + 1)’ must not be NULL [-Werror=address]
232 | if (configuration[i] + offset + 1 != 0 && strchr(configuration[i] + offset, ' ')) {
| ^~
Signed-off-by: Adrian Reber <areber@redhat.com>
This is a confusing change as it seems the original code was just wrong.
GCC 12 complains with:
In function ‘__conv_val’,
inlined from ‘std_strtoul’ at compel/plugins/std/string.c:202:7:
compel/plugins/std/string.c:154:24: error: array subscript 97 is above array bounds of ‘const char[37]’ [-Werror=array-bounds]
154 | return &conv_tab[__tolower(c)] - conv_tab;
| ^~~~~~~~~~~~~~~~~~~~~~~
compel/plugins/std/string.c: In function ‘std_strtoul’:
compel/plugins/std/string.c:10:19: note: while referencing ‘conv_tab’
10 | static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz";
| ^~~~~~~~
cc1: all warnings being treated as errors
Which sounds correct. The array conv_tab has just 37 elements.
If I understand the code correctly we are trying to convert any character
in a-z and A-Z to a number for cases where the base is larger than 10.
For example, b|B should convert to 11 and z|Z to 35. This is all for a
strtoul() implementation.
The original code was:
static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz";
return &conv_tab[__tolower(c)] - conv_tab;
and that seems wrong. If conv_tab had been some kind of hash it could
have worked, but for a letter '__tolower()' always returns at least
97 ('a'), which always indexes past the end of the array.
But maybe I just don't get that part of the code.
I replaced it with
return __tolower(c) - 'a' + 10;
which does the right thing: 'A' = 10, 'B' = 11 ... 'Z' = 35
Signed-off-by: Adrian Reber <areber@redhat.com>
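The arithmetic of the conv_tab fix above can be modeled in a few lines of
Python. This is only a model of the C expression, not the compel code:
conv_val() is a hypothetical name, and the handling of decimal digits and
invalid characters is an assumption, since the message does not show it.

```python
CONV_TAB = "0123456789abcdefghijklmnopqrstuvwxyz"


def conv_val(c):
    """Return the numeric value of a single digit character, or -1."""
    if "0" <= c <= "9":
        return ord(c) - ord("0")
    if "a" <= c <= "z":
        return ord(c) - ord("a") + 10
    if "A" <= c <= "Z":
        return ord(c) - ord("A") + 10
    return -1


# The old expression used the character code itself as the table index.
old_index = ord("a")
# In C the array has one more element than the string: the trailing NUL.
table_size = len(CONV_TAB) + 1
```

Here old_index is already larger than table_size for the smallest letter,
which is what the GCC 12 array-bounds diagnostic points at.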
This case sometimes causes a SIGILL signal on the arm64 platform.
The "ARM Cortex-A Series Programmer's Guide for ARMv8-A" notes:
The ARM architecture does not require the hardware to ensure coherency
between instruction caches and memory, even for locations of shared
memory.
Therefore, we need to flush the dcache and icache for self-modifying code.
- https://developer.arm.com/documentation/den0024/a/Caches/Point-of-coherency-and-unification
Signed-off-by: fu.lin <fulin10@huawei.com>
When the requested iovs are huge, criu needs to invoke more than one
preadv(). In this situation criu truncates the memory image using the
offset of the first preadv() and the length of the last one, which
leaks part of the memory image. This patch fixes the truncation to use
the right offset and length.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
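The offset/length arithmetic behind the preadv() truncation fix above can be
sketched as follows. This is an illustration only: hole_range() and the
(offset, length) tuples are hypothetical, and the sketch assumes the chunks
read by consecutive preadv() calls are contiguous in the image.

```python
def hole_range(chunks):
    """Return (offset, length) covering every chunk, given in read order."""
    first_off, _ = chunks[0]
    last_off, last_len = chunks[-1]
    return first_off, last_off + last_len - first_off


# Three consecutive reads: two full 4 KiB chunks and a 2 KiB tail.
chunks = [(0, 4096), (4096, 4096), (8192, 2048)]

# Offset of the first read combined with the length of the last one:
# only the first 2048 bytes would be released.
buggy = (chunks[0][0], chunks[-1][1])

# Offset of the first read and the distance to the end of the last one.
fixed = hole_range(chunks)
```

In this example the buggy range leaves 8192 bytes of already-read image data
in place, which matches the leak described in the message.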
This commit adds feature check support to libcriu. It already exists in
the CLI and RPC and this just extends it to libcriu.
This commit provides one function to do all possible feature checks in
one call. The parameter to the feature check function is a structure and
the user can enable which features should be checked.
Using a structure makes the function extensible without the need to
break the API/ABI in the future.
Signed-off-by: Adrian Reber <areber@redhat.com>
A couple of months (or years) ago I looked into lgtm.com for CRIU. Today
on a pull request I saw result from lgtm.com for the first time and it
failed. Not sure what triggered the lgtm.com message into the CRIU
repository, but with the .lgtm.yml file in this commit lgtm.com can
actually build CRIU.
Signed-off-by: Adrian Reber <areber@redhat.com>
btrfs returns an anonymous device in stat instead of the real
superblock dev for volumes, thus no btrfs volume mount passes
check_mountpoint_fd due to a dev mismatch between stat and mountinfo. We
can use the special helper get_sdev_from_fd instead of stat to try to get
the real dev of the fd for btrfs.
We move check_mountpoint_fd from open_mountpoint into get_clean_fd and
ns_open_mountpoint to the point where temporary mount we open fd to is
still in mountinfo, thus get_sdev_from_fd would be able to find tmp
mount in mountinfo.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
New get_sdev_from_fd helper first gets mnt_id from fd using fdinfo and
then converts mnt_id to sdev using mountinfo.
By default mnt_id to sdev conversion only works for mounts in mntinfo.
If the parse_mountinfo argument is true, the helper also parses the current
process mountinfo when looking for the mount sdev; this should be used only
with temporary mounts just created by criu in the current mntns.
v3: add argument to parse self mountinfo for auxiliary mounts
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The only place where we used __open_mountpoint with non -1 mnt_fd is
open_mountpoint. Let's use check_mountpoint_fd for this case, so that we
now can remove mnt_id argument. Also now __open_mountpoint actually
always does open.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
GNUTLS_SHUT_RDWR sends an alert containing a close request and waits for
the peer to reply with the same message.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need to be sure that page-server doesn't wait for a new command when we
call gnutls_bye() that sends an alert containing a close request.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This commit simply makes copies of SOCK_STREAM unix socket tests and uses
SOCK_SEQPACKET instead.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
We have multiple options which are valid only on restore or only on dump
or in any other specific criu mode, so it would be useful to have info
about current mode in opts so that we can validate other options against
current mode.
The plan is to use it for the mount-v2 option, as it is only valid on restore,
and this would make handling of different types of mountpoints much easier.
The implementation is a bit different for the general code and RPC:
- When criu mode is set from main() we just parse mode from argv[optind]
just after parse_options() found optind of the command. Note that
opts.mode is available before check_options().
- For rpc service we reset opts.mode to CR_SWRK each time we restart
cr_service_work(), in the original service process we still have
CR_SERVICE to differentiate between them, and each request handling
function which does setup_opts_from_req sets opts.mode in accordance
with the processed request type. And it is also available before
check_options().
Now in check_options we can add filters for options valid in only one mode.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Several lines above, if (optind >= argc), we go to the usage label and fail,
thus we don't need to check (optind < argc) here as it is always true.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Looks like in commit [1] we've unintentionally added this tmp file to
git, let's remove it.
Fixes: 01ee29702 ("s390:zdtm: Enable zdtm for s390") [1]
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
zdtm_ct.c:44:12: error: function declaration isn’t a prototype [-Werror=strict-prototypes]
44 | static int create_timens()
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
In most cases we run tests as:
./test/zdtm.py run -a
But it's also possible to run tests from root makefile:
make test
In this case, if the criu tree has no ./test/umount2 binary
built we get an error like:
make[3]: *** No rule to make target 'umount2'. Stop.
It's worth mentioning this "3". That's because we have
a build process tree like this:
make -> make -> make -> zdtm.py -> make umount2
and also we have MAKEFLAGS variable set to:
build=-r -R -f ...
And that's bad because the "-r" option means no builtin
rules and -R means no builtin variables. That breaks
`make umount2`. Let's just clean up this
variable to make things work properly.
Fixes: #1699
https://github.com/checkpoint-restore/criu/issues/1699
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
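One way to keep an inherited MAKEFLAGS from leaking into a nested make, as
described in the message above, is to drop it from the child environment.
The sketch below is an assumption about the approach, not the zdtm.py code:
clean_make_env() is a hypothetical helper, and how the patch actually clears
the variable is not shown here.

```python
def clean_make_env(environ):
    """Return a copy of environ without the inherited MAKEFLAGS."""
    env = dict(environ)
    env.pop("MAKEFLAGS", None)
    return env


# Environment as a nested make would pass it down ("-r -R" disables the
# builtin rules and variables that `make umount2` relies on).
inherited = {"PATH": "/usr/bin", "MAKEFLAGS": "-r -R"}
child_env = clean_make_env(inherited)

# A caller could then run, for example:
#   subprocess.check_call(["make", "umount2"], env=clean_make_env(os.environ))
```

Copying the mapping first leaves the parent's own environment untouched.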
In contrast to the CLI it is not possible to do a single pre-dump via
RPC and thus libcriu. In cr-service.c pre-dump always goes into a
pre-dump loop followed by a final dump. runc already works around this
to only do a single pre-dump by killing the CRIU process waiting for the
message for the final dump.
Trying to implement pre-dump in crun via libcriu it is not as easy to
work around CRIU's pre-dump loop expectations as with runc that directly
talks to CRIU via RPC.
We know that LXC/LXD also does single pre-dumps using the CLI and runc
also only does single pre-dumps by misusing the pre-dump loop interface.
With this commit it is possible to trigger a single pre-dump via RPC and
libcriu without misusing the interface provided via cr-service.c. So
this commit basically updates CRIU to the existing use cases.
The existing pre-dump loop still sounds like a very good idea, but so
far most tools have decided to implement the pre-dump loop themselves.
With this change we can implement pre-dump in crun to match what is
currently implemented in runc.
Signed-off-by: Adrian Reber <areber@redhat.com>
We added cross-compile tests with testing debian release to be able to
replicate the error reported in #1653, however, installing build
dependencies in this release currently fails with the following error:
libc6-dev:armhf : Breaks: libc6-dev-armhf-cross (< 2.33~) but 2.32-1cross4 is to be installed
This is not something we can fix, therefore using the debian unstable
release (instead of testing) could be more reliable option for our CI.
This would still replicate the problem reported in #1653.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --timeout option was introduced in [1] to prevent criu dump from
being able to hang indefinitely and allow users to adjust the time limit
in seconds for collecting tasks during the dump operation.
[1] d0ff730
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Fixes: e2e8be37 ("x86/compel/fault-inject: Add a fault-injection for corrupting extended regset")
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Since
e2e8be37 ("x86/compel/fault-inject: Add a fault-injection for corrupting extended regset")
we do a fault-injection test for C/R of the threads' register set by filling
the tasks' xsave structures with garbage. But there are some features for
which that's not safe. It leads to failures like the one described in #1635.
In this particular case we hit a problem with the PKRU feature: after
corrupting the pkru registers we may restrict access to some vma areas, so
the process with the parasite injected gets a segfault and crashes.
Let's manually specify which features are safe to fill with garbage by
keeping a proper XFEATURE_MASK_FAULTINJ mask value.
Fixes: e2e8be37 ("x86/compel/fault-inject: Add a fault-injection for corrupting extended regset")
https://github.com/checkpoint-restore/criu/issues/1635
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
We try to disable time namespace based testing for kernels older than
5.11, but the if condition was wrong.
This changes (major <= 5) to (major < 5). There are no kernels with
major > 5, so currently the time namespace based tests are never run. This
should finally run time namespace based tests on kernel versions newer
than 5.10.
Signed-off-by: Adrian Reber <areber@redhat.com>
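The version check discussed above is easy to get wrong when major and minor
are compared separately. A minimal sketch, with hypothetical helper names and
not the condition used in the CI script, compares (major, minor) as a tuple:

```python
def timens_supported(major, minor):
    """Time namespace tests are wanted on kernels 5.11 and newer."""
    return (major, minor) >= (5, 11)


def major_le_5(major):
    """The predicate from the old check, shown in isolation."""
    return major <= 5


# A skip keyed on `major <= 5` fires for every 5.x kernel, including
# those that do support the tests.
skipped_on_5_15 = major_le_5(5)
wanted_on_5_15 = timens_supported(5, 15)
```

Tuple comparison orders by major first and minor second, so no separate
branch per major version is needed.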
The version of ps in Alpine image by default is very limited.
It is based on the one from busybox and doesn't support options
such as '-p'.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Since commit 83301b5367a98 ("af_unix: Set TCP_ESTABLISHED for datagram sockets
too") in Linux kernel, SOCK_DGRAM unix sockets can have TCP_ESTABLISHED state
when connected. So we need to fix checks that assume SOCK_DGRAM sockets cannot
have TCP_ESTABLISHED state.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The function run_tcp_server() was the last place CRIU was still using
the IPv4 only function inet_ntoa(). It was only used during a print, so
that it did not really break anything, but with this commit the output
is now no longer:
Accepted connection from 0.0.0.0:58396
but correctly displays the IPv6 address
Accepted connection from ::1:58398
if connecting via IPv6.
Signed-off-by: Adrian Reber <areber@redhat.com>
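The limitation of inet_ntoa() mentioned above also exists in Python's socket
module, where inet_ntoa() only accepts IPv4 addresses while inet_ntop() takes
the address family as an argument. The sketch below illustrates that
difference; format_peer() is a hypothetical helper, and whether the patch
itself switches to inet_ntop() is an assumption not stated in the message.

```python
import socket


def format_peer(family, packed_addr, port):
    """Render a peer address from its packed binary form."""
    return "%s:%d" % (socket.inet_ntop(family, packed_addr), port)


v4 = format_peer(socket.AF_INET,
                 socket.inet_pton(socket.AF_INET, "127.0.0.1"), 58396)
v6 = format_peer(socket.AF_INET6,
                 socket.inet_pton(socket.AF_INET6, "::1"), 58398)
```

The same helper handles both families because the family is passed along with
the address instead of being assumed to be AF_INET.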
An issue with dumping deleted reg files in overlayfs:
After deleting a file originated from lower layer in merged dir,
fstat() on the /proc/$pid/map_files symlink returns st_nlink=1, while
linkat() fails with errno ENOENT.
Signed-off-by: langyenan <ianlang@tencent.com>
Looking at CI logs there are often messages like:
"[WARNING] Option --keep-going is more useful when running multiple tests"
This commit removes '--keep-going' from single zdtm test runs.
Signed-off-by: Adrian Reber <areber@redhat.com>
Starting with gcc-11, Debian's armhf compiler no longer builds with
a default -mfpu= option. Instead it enables the FPU via an extension
to the -march flag (--with-arch=armv7-a+fp). criu's Makefile explicitly
passes its own -march=armv7-a setting, which overrides the +fp default,
so we end up with no FPU:
cc1: error: '-mfloat-abi=hard': selected architecture lacks an FPU
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Debian testing has newer compiler version and running
cross compilation tests would allow us to catch any compilation
errors early.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
The current debian stable release is Bullseye, not Buster. However, we
can use the 'stable' release instead. This would allow the CI to
automatically pick up updates in the future.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
When we declare struct and at the same time declare variable pointer of
this struct type, it looks like clang-format treats "*" as a
multiplication operator instead of indirection (pointer declaration)
operator and puts spaces on both sides, which looks wrong.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a test for "ghost/mount: allocate remounted_rw in shmem to get
info from other processes" patch, without the patch test fails with:
############# Test zdtm/static/mntns_ghost01 FAIL at result check ##############
Test output: ================================
16:15:19.607: 5: ERR: mntns_ghost01.c:95: open for write on rofs -> 7 (errno = 11 (Resource temporarily unavailable))
16:15:19.607: 4: FAIL: mntns_ghost01.c:121: Test died (errno = 11 (Resource temporarily unavailable))
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previously I didn't mention this case because we had bad error handling
in the ghost cleanup path.
Without this patch but with proper error handling for unlink we have an
error in the mntns_ghost01 test:
Error (criu/files-reg.c:2269): Failed to unlink the remap file:
Read-only file system
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/151c859e1
Changes: check lookup_mnt_id return for NULL
Fixes: fd0a3cd9ef ("mount: remount ro mounts writable before
ghost-file restore")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Previously remounted_rw was not shared between all processes on
restore, thus cleanup didn't get this info from rfi_remap and these
mounts were wrongly left writable after restore.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/3a1a592e7
Fixes: fd0a3cd9ef ("mount: remount ro mounts writable before
ghost-file restore")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If unlinkat fails it means that fs is in "corrupted" state - spoiled
with non-unlinked auxiliary directories.
While on it add fixme note as this function can be racy and BUG_ON if
path contains double slashes.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/b7b4e69fd
Changes: simplify while loop condition, remove confusing FIXME, remove
excess !count check in favour of while loop condition check
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
1) On error paths need to close fd and unlock mutex.
2) Make rfi_remap return special return code to identify EEXIST from
linkat_hard, all other errors should be reported up.
3) Report unlinkat error as criu should not corrupt fs.
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/fe1d0be14
Changes: use close_safe(), fix order in "Fake %s -> %s link" error
message.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Always wait() for forked child processes. This avoids zombie processes in
containers that don't have an init process reaping orphans.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
CentOS 8 goes EOL at the end of 2021. This switches our CentOS 8 based
tests to CentOS Stream 8 which should be supported until 2024.
Signed-off-by: Adrian Reber <areber@redhat.com>
Criu ignores SIGPIPE in most cases except swrk mode. In the
following situation criu gets killed by SIGPIPE and has no chance
to do cleanup: the connection to the page server is lost during disk-less
migration, and criu sends PS_IOV_FLUSH via the broken connection in
disconnect_from_page_server.
This patch makes criu ignore SIGPIPE in all paths.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
Now that we have fixed clang-format complaints in zdtm, let's switch to the
latest clang-format available. This is effectively a revert of commit 07a2f0265
("ci: use Fedora 34 for lint CI runs").
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The new freezer_state is a complete equivalent of the old freezer_thawed
except for the initial value. If the old freezer_thawed was not initialized
it was 0, and freezer_restore_state treated that as a need to freeze the
cgroup "back". Thus, before this patch, if criu dump failed before
freezing the dumpee, criu always froze the dumpee in cr_dump_finish, which is
wrong. Switching to freezer_state initialized with FREEZER_ERROR fixes
the problem.
v2: improve description, rename to origin_freezer_state
Signed-off-by: Liu Hua <weldonliu@tencent.com>
Clang-format v13 on my Fedora 35 complains about these hunks; moreover,
reading the formatting we had before is a pain:
} else /* comment */
if (smth) {
fail("")
return -1;
}
Let's make explicit {} braces for else, this way it looks much better.
Fixes: 93dd984ca ("Run 'make indent' on all C files")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Various I/O objects are unclosed when the object falls out of scope.
This can lead to non-deterministic behavior.
Also added a few missing list() calls, as the old code doesn't play well
with python3, e.g., `random.shuffle(filter(...))` doesn't work.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
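The `random.shuffle(filter(...))` problem named above can be reproduced in
isolation. The test names below are made up for the illustration and are not
taken from zdtm.py.

```python
import random

tests = ["env00", "pipe00", "sock00", "fifo00"]

# In Python 3 filter() returns a lazy iterator without len() or indexing,
# so random.shuffle() cannot shuffle it in place.
lazy = filter(lambda t: t != "fifo00", tests)
try:
    random.shuffle(lazy)
    shuffle_error = None
except TypeError as exc:
    shuffle_error = exc

# Materializing the iterator with list() gives shuffle() a real sequence.
shuffled = list(filter(lambda t: t != "fifo00", tests))
random.shuffle(shuffled)
```

Under Python 2 filter() returned a list, which is why the original code
worked there without the explicit list() call.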
We see that the tests mntns_ghost01 and unlink_fstat03 can run
simultaneously, and thus the former sees leftover link_remap.* files in
the test directory created by the latter. The latter is still
running, so it's ok to have link_remap.* at this point.
Let's implicitly make all --link-remap tests exclusive (not running in
parallel).
Fixes: #1633
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see error in centos8 ci on restore of socket-raw test:
inet: \tRestore: family AF_INET type SOCK_RAW proto 66
port 66 state TCP_CLOSE src_addr 0.0.0.0
Error (criu/sk-inet.c:834): inet: Can't create inet socket:
Protocol not supported
Centos 8 kernel replaces IPPROTO_MPTCP(262) with "in-kernel" value
IPPROTO_MPTCP_KERN(66) on inet_create(), but later shows this inkernel
value to criu when listing sockets info. Same code in inet_create()
returns EPROTONOSUPPORT on the attempt to create socket with
IPPROTO_MPTCP_KERN. So this ci error is completely rh8 kernel related.
Kernel should not show "in-kernel" value to userspace. But anyway this
is already changed in Centos 9 kernel, so we can just skip socket-raw
test on Centos 8.
v2: use cirrus.yml
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
clang-format has no option to stop it from merging as many binary operands
as fit in the column limit, but here we need each bit on a new line to make
it readable, so let's disable clang-format for x86_ins_capability_masks.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
A zombie process with 0 sid has a session leader in the
outer pidns and has ignored SIGHUP. Criu has no way
to restore this type of process, so fail the dump.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
Automatic AlignTrailingComments fails to make those comments look right,
so let's do it manually, so that they both satisfy AlignTrailingComments
and also are human-readable.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Just set all possible values 0-3 and check that each persists.
Reviewed-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
When one sets socket buffer sizes with setsockopt(SO_{SND,RCV}BUF*), the
kernel sets the corresponding SOCK_SNDBUF_LOCK or SOCK_RCVBUF_LOCK flags on
struct sock. It means that such a socket with an explicitly changed buffer
size cannot be auto-adjusted by the kernel (e.g. if there is free memory the
kernel can auto-increase default socket buffers to improve performance).
(see tcp_fixup_rcvbuf() and tcp_sndbuf_expand())
CRIU always changes buf sizes on restore, which means that all
sockets receive lock flags on struct sock and become non-auto-adjusted
after migration. In some cases it can decrease the performance of network
connections quite a lot.
So let's c/r socket buf locks (SO_BUF_LOCKS), so that sockets for which
auto-adjustment is available do not lose it.
Reviewed-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a new kernel feature to let criu restore sockets with kernel
auto-adjusted buffer sizes.
Reviewed-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We want to also c/r socket buf locks (SO_BUF_LOCKS) which are also
implicitly set by setsockopt(SO_{SND,RCV}BUF*), so we need to order
these two properly. That's why we need to wait for sk_setbufs to finish.
And there is not much point in setting buffer sizes asynchronously anyway.
Reviewed-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
When exceptions are raised during testing, the image streamer process
should be terminated as opposed to being left hanging.
Otherwise the whole test suite could be left hanging as it waits
for all child processes to exit.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
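The cleanup pattern described above, terminating a helper process even when
the test body raises, can be sketched with try/finally. This is a generic
illustration and not the zdtm.py code: the sleeping child stands in for the
image streamer, and failing_test() is a hypothetical test body.

```python
import subprocess
import sys


def failing_test():
    raise RuntimeError("simulated test failure")


# Long-running child standing in for the image streamer process.
helper = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"])

caught = None
try:
    try:
        failing_test()
    finally:
        # Runs on success and on failure, so the child never outlives us.
        helper.terminate()
        helper.wait()
except RuntimeError as exc:
    caught = exc
```

Calling wait() after terminate() also reaps the child, so a parent that waits
for all of its children does not block on it.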
Fedora 35 comes with clang 13 which provides different results for
clang-format than clang 12 in Fedora 34.
Signed-off-by: Adrian Reber <areber@redhat.com>
Newer kernels (5.11) require echo 1 > /proc/sys/vm/unprivileged_userfaultfd
Without the 'echo 1' the kernel prints a message like this:
uffd: Set unprivileged_userfaultfd sysctl knob to 1 if kernel faults must be handled without obtaining CAP_SYS_PTRACE capability
Signed-off-by: Adrian Reber <areber@redhat.com>
Previously, `open_image(CR_FD_RULE, O_RSTR, pid)` was called twice.
Opening an image file twice is not allowed when streaming the image.
This commit optimizes the code to only open the image file once.
Also improved the error path in restore_ip_dump().
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
On criu-ns dump/restore/dump of a process which initially
was not a session leader (with the --shell-job option) we see sid == 0 for
it and fail with something like:
Error (criu/cr-dump.c:1333): A session leader of 41585(41585) is outside of its pid namespace
Note: We should not dump processes with sid 0 (even with --shell-job) as
on restore we can put such processes from multiple sessions into
one, which is wrong.
Fixes: #232
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This simplifies the code by removing excess recursion and reusing
standard function to walk over file-tree instead of opencoding it.
This addresses problem mentioned in my review comment:
https://github.com/checkpoint-restore/criu/pull/1495#discussion_r677554523
Fixes: 0db135ac4 ("util: add rm -rf function")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
v2: split error checking from index variable initialization
v3: use PRIx64 for printing dev_t
Signed-off-by: fu.lin <fulin10@huawei.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We use the "%#x" printf specifier in pie code here, but the core pie
printing function sbuf_printf knows nothing about the '#' flag. Moreover, a
simple "%x" in pie does the same as "%#x" in stdio printf: the print_hex*
functions add "0x" before hex numbers.
We've got this error on vzt-cpt runs in Virtuozzo:
(04.750271) pie: 158: Adjust id
Error: Unknown printf format %#
So to fix it we can just remove '#'.
Fixes: ecd432fe2 ("timerfd: Implement c/r procedure")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
ShellCheck reports the following problems:
SC2086: Double quote to prevent globbing and word splitting.
SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.
SC1091: Not following: ../env.sh was not specified as input (see shellcheck -x).
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
The shebang line in this file was removed in a previous commit and the
file should be non-executable.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Previous commit added support for python3 in criu-coredump. For convenience,
add two files (coredump-python2 and coredump-python3) that start
criu-coredump with respective python version. Edit env.sh accordingly.
Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
Resolve the following python3 portability issues:
1) Python 3 needs explicit relative import path.
2) Coredumps are binary data, not unicode strings. Use byte strings
(b"" instead of "") and open files in binary format.
3) Some functions (for example: filter) return a list in python 2,
but an iterator in python 3. Port code to a common subset of python 2
and python 3 using itertools.
4) Division operator / changed meaning in Python 3. Use explicit
integer division (//) where appropriate.
Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
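Three of the portability points listed above can be shown in a few lines. The
values and names below (PAGE_SIZE, size, the ELF magic bytes) are chosen for
illustration only and are not taken from criu-coredump; the use of
itertools.islice() is an assumption about how itertools might be applied.

```python
import io
import itertools

PAGE_SIZE = 4096
size = 10000

# "/" is true division in Python 3; "//" is integer division in both.
pages_true_div = size / PAGE_SIZE
pages_floor_div = size // PAGE_SIZE

# Binary data is written as bytes, not as text strings.
buf = io.BytesIO()
buf.write(b"\x7fELF")
magic = buf.getvalue()

# filter() is lazy in Python 3: wrap it in list() when a list is needed,
# or consume it with itertools when only part of it is required.
even = list(filter(lambda n: n % 2 == 0, range(6)))
first_two = list(itertools.islice(filter(lambda n: n % 2 == 0, range(6)), 2))
```

Each of these forms behaves the same under Python 2 and Python 3, which is the
"common subset" the message refers to.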
The expected behavior of the --tcp-close option when dumping is to close
all established TCP connections, including connections that were once
established but are now closed. This adds an explicit description of
that behavior.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Since commit e42f5e0 ("tcp: allow to specify --tcp-close on dump"),
the --tcp-close option can be used when checkpointing. This option skips
checkpointing the state of established sockets (including sockets that
were once established but are now closed). However, when restoring, we
still try to restore the state of closed sockets. As a result, a
non-existent protobuf image is opened.
This commit skips TCP_CLOSE sockets when restoring established TCP
connections and removes the redundant check for TCP_LISTEN sockets, as a
TCP_LISTEN socket cannot reach this function.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The restore operation fails when we checkpoint/restore multiple
independent processes that have device files, because criu caches
the ids for device files with the same (mnt_id, inode) pair. This
change ensures that even when a cached id is found for a device, a
unique subid is generated and returned, and that subid is used for dumping.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
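The caching behavior described above can be sketched as follows. This is a simplified Python model, not CRIU's C implementation: the class `FileIdCache`, its `lookup` method, and the counters are hypothetical names chosen for illustration. The point is only that a cache hit for a device file still yields a fresh subid.

```python
import itertools


class FileIdCache:
    """Hypothetical model of an id cache keyed by (mnt_id, inode)."""

    def __init__(self):
        self._ids = {}
        self._next_id = itertools.count(1)
        self._next_subid = itertools.count(1)

    def lookup(self, mnt_id, inode, is_device=False):
        key = (mnt_id, inode)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        file_id = self._ids[key]
        if is_device:
            # Even on a cache hit, hand out a unique subid so that
            # independent users of the same device file stay distinct.
            return file_id, next(self._next_subid)
        # Regular files keep sharing the cached id with a fixed subid.
        return file_id, 0
```

In this model, two lookups of the same device file return the same id but different subids, while two lookups of the same regular file return identical results.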
This is just a placeholder dummy plugin and will be replaced by a proper
plugin that implements support for AMD GPU devices. This just
facilitates the initial pull request and CI build test trigger for early
code review of CRIU specific changes. Future PRs will bring in more
support for amdgpu_plugin to enable CRIU with AMD ROCm.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Currently, CRIU cannot handle checkpoint/restore operations when a device
file is involved in a process. CRIU allows flexible extensions via special
plugins, but for certain complex devices such as a GPU, the existing hooks
are not sufficient. This introduces a few new hooks that will be used to
support checkpoint/restore operations with AMD GPU devices and
potentially other similar devices too.
- HANDLE_DEVICE_VMA
- UPDATE_VMA_MAP
- RESUME_DEVICES_LATE
*HANDLE_DEVICE_VMA:
Hook to detect a suitable plugin to handle device file VMA with
PF | IO mappings.
*UPDATE_VMA_MAP:
Hook to handle VMAs during a device file restore.
When restoring VMAs for device files, criu runs sys_mmap in
the pie restore context, but the offsets and file path within a
device file may change during the restore operation, so they need
to be adjusted properly.
*RESUME_DEVICES_LATE:
Hook to do some special handling in late restore phase.
During the criu restore phase, when a device is restored with
the help of a plugin, some device-specific operations might need
to be delayed until criu finalizes the VMA placements in the address
space of the target process. But by the time criu finalizes this,
it's too late, since the pie phase is over and control is back in the
criu master process. This hook allows an external trigger for each
resuming task to check whether it has a device-specific operation
pending, such as issuing an ioctl call. Since this is called from
the criu master process context, supply the pid of the target process
and give each registered plugin a chance to run its device-specific
operation if the target pid is valid.
A future patch will add consumers for these plugin hooks to support AMD
GPUs.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
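The dispatch idea behind RESUME_DEVICES_LATE can be sketched as below. This is a Python model under stated assumptions, not CRIU's C plugin API: `PluginRegistry`, `register`, and `resume_devices_late` are hypothetical names, and the real return-value conventions of CRIU plugins are not modelled. Only the three hook names come from the message above.

```python
from collections import defaultdict

# Hook names taken from the commit message; everything else is illustrative.
HOOKS = ("HANDLE_DEVICE_VMA", "UPDATE_VMA_MAP", "RESUME_DEVICES_LATE")


class PluginRegistry:
    """Hypothetical registry mapping hook names to plugin callbacks."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, hook, callback):
        if hook not in HOOKS:
            raise ValueError("unknown hook: " + hook)
        self._callbacks[hook].append(callback)

    def resume_devices_late(self, pid):
        # Called from the master process after the pie phase: give every
        # registered plugin a chance to act, but only for a valid pid.
        if pid <= 0:
            return []
        return [cb(pid) for cb in self._callbacks["RESUME_DEVICES_LATE"]]
```

A plugin would register a callback for `"RESUME_DEVICES_LATE"`, and the master process would then call `resume_devices_late(pid)` once per resuming task.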
Support for external net namespaces has been introduced with
commit c2b21fbf (criu: add support for external net namespaces).
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
* Switch criu-ns from unversioned 'python' to 'python3'
for easier distribution packaging
* Add '--join-ns' interface to libcriu to allow joining
namespaces via libcriu like CLI and RPC already allow
Signed-off-by: Adrian Reber <areber@redhat.com>
run_test was trying to read criu logs on build failure
instead of runtime error.
This patch also removes the unnecessary subfolder named "i"
and resolves some of the issues reported by shellcheck.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
This test case aims to verify that CRIU correctly
restores a process in IPC, UTS and Time namespaces
with criu_join_ns_add() libcriu API.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
In runc we use the join-ns RPC API to enable checkpoint/restore of
containers with shared namespaces. Shared namespaces are often used
when containers run inside a Kubernetes Pod.
In crun we use libcriu to interface with CRIU, however it currently
doesn't provide an API for join-ns. This patch adds the necessary
libcriu API to enable checkpoint/restore of containers with shared
namespaces with crun.
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Python 2 has been deprecated since January 1, 2020, and Linux distributions
already support Python 3. Thus, to simplify maintenance and packaging,
we could support criu-ns as Python 3 only.
v2: Add a message for criu-ns installation
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
PEP 394 recommends changing python shebangs to python3 when Python 3.x
is supported. This is similar to `crit-python3`.
https://www.python.org/dev/peps/pep-0394/
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2021-10-12 12:58:43 -07:00
961 changed files with 48796 additions and 6915 deletions
CRIU project is (almost) the never-ending story, because we have to always keep up with the
@ -13,8 +8,8 @@ Here are some useful hints to get involved.
* We have both -- [very simple](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement) and [more sophisticated](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3A%22new+feature%22) coding tasks;
* CRIU does need [extensive testing](https://github.com/checkpoint-restore/criu/issues?q=is%3Aissue+is%3Aopen+label%3Atesting);
* Documentation is always hard, we have [some information](https://criu.org/Category:Empty_articles) that is to be extracted from people's heads into wiki pages as well as [some texts](https://criu.org/Category:Editor_help_needed) that all need to be converted into useful articles;
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lists.openvz.org/mailman/listinfo/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lists.openvz.org/mailman/listinfo/criu).
* Feedback is expected on the GitHub issues page and on the [mailing list](https://lore.kernel.org/criu);
* We accept GitHub pull requests and this is the preferred way to contribute to CRIU. If you prefer to send patches by email, you are welcome to send them to [CRIU development mailing list](https://lore.kernel.org/criu).
Below we describe in more detail the recommended practices for CRIU development.
* Spread the word about CRIU in [social networks](http://criu.org/Contacts);
* If you're giving a talk about CRIU -- let us know, we'll mention it on the [wiki main page](https://criu.org/News/events);
@ -32,54 +27,137 @@ The repository may contain multiple branches. Development happens in the **criu-
To clone CRIU repo and switch to the proper branch, run:
First, you need to install compile-time dependencies. Check [Installation dependencies](https://criu.org/Installation#Dependencies) for more info.
Follow these steps to compile CRIU from source code.
To compile CRIU, run:
#### Installing build dependencies
First, you need to install the required build dependencies. We provide scripts to simplify this process for several Linux distributions in [contrib/dependencies](contrib/dependencies). For a complete list of dependencies, please refer to the [installation guide](https://criu.org/Installation).
##### On Ubuntu/Debian-based systems:
```
make
./contrib/dependencies/apt-packages.sh
```
##### On Fedora/CentOS-based systems:
```
./contrib/dependencies/dnf-packages.sh
```
##### Using Nix:
```
nix develop
```
#### Compiling CRIU
Once the dependencies are installed, you can compile CRIU by running the `make` command from the root of the source directory:
```
make
```
This should create the `./criu/criu` executable.
## Edit the source code
If you use ctags, you can generate the ctags file by running
```
make tags
```
When you change the source code, please keep in mind the following code conventions:
* code is written to be read, so the code readability is the most important thing you need to have in mind when preparing patches
* we prefer tabs, with indentation 8 characters wide
* CRIU mostly follows [Linux kernel coding style](https://www.kernel.org/doc/Documentation/process/coding-style.rst), but we are less strict than the kernel community.
* we prefer line length of 80 characters or less, more is allowed if it helps with code readability
* CRIU mostly follows [Linux kernel coding style](https://www.kernel.org/doc/Documentation/process/coding-style.rst), but we are less strict than the kernel community
Other conventions can be learned from the source code itself. In short, make sure your new code
looks similar to what is already there.
Other conventions can be learned from the source code itself. In short, make sure your new code looks similar to what is already there.
## Automatic tools to fix coding-style
Important: These tools are there to advise you, but should not be considered as a "source of truth", as tools also make nasty mistakes from time to time which can completely break code readability.
The following command can be used to automatically run a code linter for Python files (ruff), Shell scripts (shellcheck),
text spelling (codespell), and a number of CRIU-specific checks (usage of print macros and EOL whitespace for C files).
```
make lint
```
In addition, we have adopted a [clang-format configuration file](https://www.kernel.org/doc/Documentation/process/clang-format.rst)
based on the kernel source tree. However, compliance with the clang-format autoformat rules is optional. If the automatic code formatting
results in decreased readability, we may choose to ignore these errors.
Run the following command to check if your changes are compliant with the clang-format rules:
```
make indent
```
This command is built upon the `git-clang-format` tool and supports two options `BASE` and `OPTS`. The `BASE` option allows you to
specify a range of commits to check for coding style issues. By default, it is set to `HEAD~1`, so that only the last commit is checked.
If you are developing on top of the criu-dev branch and want to check all your commits for compliance with the clang-format rules, you
can use `BASE=origin/criu-dev`. The `OPTS` option can be used to pass additional options to `git-clang-format`. For example, if you want
to check the last *N* commits for formatting errors, without applying the changes to the codebase you can use the following command.
```
make indent OPTS=--diff BASE=HEAD~N
```
Note that for pull requests, the "Run code linter" workflow runs these checks for all commits. If a clang-format error is detected
we need to review the suggested changes and decide if they should be fixed before merging.
Here are some bad examples of clang-format-ing:
* if clang-format tries to force 120 characters and breaks readability - it is wrong:
```
@@ -58,8 +59,7 @@ static int register_membarriers(void)
If you get tired of typing `--to=criu@openvz.org` all the time,
If you get tired of typing `--to=criu@lists.linux.dev` all the time,
you can configure that to be automatically handled as well:
```
git config sendemail.to criu@openvz.org
git config sendemail.to criu@lists.linux.dev
```
If a developer is sending another version of the patch (e.g. to address
@ -320,7 +398,7 @@ version if needed though).
### Mail patches
The patches should be sent to CRIU development mailing list, `criu AT openvz.org`. Note that you need to be subscribed first in order to post. The list web interface is available at https://openvz.org/mailman/listinfo/criu; you can also use standard mailman aliases to work with it.
The patches should be sent to CRIU development mailing list, `criu AT lists.linux.dev`. Note that you need to be subscribed first in order to post. The list web interface is available at https://lore.kernel.org/criu; you can also use standard mailman aliases to work with it.
Please make sure the email client you're using doesn't screw your patch (line wrapping and so on).
@ -337,5 +415,3 @@ sometimes a patch may fly around a week before it gets reviewed.
Wiki article: [Continuous integration](https://criu.org/Continuous_integration)
CRIU tests are run for each series sent to the mailing list. If you get a message from our patchwork that patches failed to pass the tests, you have to investigate what is wrong.
We also recommend you to [enable Travis CI for your repo](https://criu.org/Continuous_integration#Enable_Travis_CI_for_your_repo) to check patches in your git branch, before sending them to the mailing list.
- [A simple example of usage](http://criu.org/Simple_loop)
- [Examples of more advanced usage](https://criu.org/Category:HOWTO)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/FAQ)
- Troubleshooting can be hard, some help can be found [here](https://criu.org/When_C/R_fails), [here](https://criu.org/What_cannot_be_checkpointed) and [here](https://criu.org/index.php?title=FAQ)