This will be required for page-cache and page-proxy set.
Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A freezer cgroup can contain tasks which will be not dumped,
criu unfreezes the group, so we need to freeze all extra
task with ptrace like we do for target tasks.
Currently we attache and send an interrupt signals to these tasks,
but we don't call waitpid() for them, so then waitpid(-1, ...)
returns these tasks where we don't expect to see them.
v2: execute freezer_detach() only if opts.freeze_cgroup is set
calculate extra tasks in a freezer cgroup correctly
v3: s/frozen_processes/processes_to_wait/
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used to mount AutoFS, because context creation is required in
addition to actual mount operation.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch introduces three helpers:
1) pstree_item_by_real() - search for pstree item by real pid.
2) pstree_item_by_virt() - search for pstree item by virtual pid.
3) pid_to_virt() - return virtual pis by real one.
Note: pstree_item_by_virt() and pid_to_virt() will be used to migrate AutoFS.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we wait when a namespace will be restored to get its root.
We need to open a namespace root to open a file to restore a memory mapping.
A process restores mappings and only then forks children. So we can have
a situation, when we need to open a file from a namespace, which will be
"restored" by one of our children.
The root task restores all mount namespaces and opens a file descriptor
for each of them. In this patch we open root for each mntns in the root
task.
If we neeed to get root of a namespace which isn't populated, we can get
it from the root task. After the CR_STATE_FORKING stage, the root task
closes all namespace descriptors ane we know that all namespaces are
populated at this moment.
v2: don't close root_fd for root ns, because it was not opened
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to perform dirty page tracking when dumping shmem but there
we have only const vmas so we need pmc to work with them. Also pmc concept
implies that it won't change its vmas so it would be natural to declared
them as const.
Signed-off-by: Fyodor Bocharov <fbocharov@yandex.ru>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In LXD, we use the container name in the LSM profile. If the container name
is changed on migrate (on the host side), we want to use a different LSM
profile name (a. la. --cgroup-root). This flag adds that support.
v2: remove unused field, add comment about double detection in
kerndat_lsm()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we're restoring fsnotify watchees we need to resolve
path to a handle at some mountpoint referred by @s_dev
member (device ID) which is saved inside image. This
ID actually may be changed at the every mount (say
one restores container after machine reboot) or in
case of container's migration.
Thus the test for overmounting in __open_mountpoint
will fail and we get an error.
Lets do a trick: introduce @s_dev_rt member which
is supposed to carry run-time device ID. When dumping
this member simply equal to traditional @s_dev fetched
from the procfs, but when restoring we fetch it from
stat call once mountpoint become alive.
https://jira.sw.ru/browse/PSBM-41610
v2:
- predefine MOUNT_INVALID_DEV
- use fetch_rt_stat instead of assigning device in restore_shared_options
- copy @s_dev_rt in propagate_siblings and propagate_mount
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch implements checkpoint/restore functionality
for binfmt_misc mounts. Both magic and extension types
and "disabled" state are supported.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Due to security reasons the systemd-spawn mode is no longer
supported in service.
Also fix the default binding address to be in local cwd not
to start global service by chance.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We found that we want to know whether SIGSTOP is queue
in both or is in one of this queues.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Also define some constants for people who don't have them in their headers.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required to check the SIGSTOP signal, which can't be blocked.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So we keep it and dont close inside close_old_fds()
helper but pass into veth creation so the kernel
can fetch the net namespace of the veth peer.
v2 (by avagin@):
- don't forget to close opened descriptor
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: use a cached value to dump ipv6 interface addesses
call get_ipv6() from kerndat_init_rst too
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It returns EINTR, so we need to handle it.
$ bash test/zdtm.sh --restore-sibling ns/static/env00
...
futex(0x7fc20ec92010, FUTEX_WAIT, 1, {120, 0}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This commit adds basic support for dumping and restoring seccomp filters
via the new ptrace interface. There are two current known limitations with
this approach:
1. This approach doesn't support restoring tasks who first do a seccomp()
and then a setuid(); the test elaborates on this and I don't think it is
tough to do, but it is not done yet.
2. Filters are compared via memcmp(), so two tasks which have the same
parent task and install identical (via memory) filters will have those
filters considered to be the "same". Since we force all tasks to have
the same creds (including seccomp filters) right now, this isn't a
problem.
The approach used here is very similar to the cgroup approach: the actual
filters are stored in a seccomp.img, and each task has an id that points to
the part of the filter tree it needs to restore. This keeps us from dumping
the same filter multiple times, since filters are inherited on fork.
v2:
* remove unused seccomp_filters field from struct rst_info
* rework memory layout for passing filters to restorer blob
* add a sanity check when finding inherited filters
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: add comments and rename ns_created to ns_populated.
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
close_olds_fds() knows nothing about more than one set of service file
descriptros, so it's better to call it before forking children as it was
bedore 9d60724eca ("restore: restore mntns before creating private vma-s")
The root task restores all processes and pin them with file descriptors,
then a task restores a mount namespace by opening the file descriptor of
the root task via /proc/pid/fd/X.
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to open a file to restore a file mapping and this file
can be from a current mntns.
v2: All namespaces are resotred from the root task and then
other tasks calls setns() to set a proper mntns.
v3: fix comments from Pavel
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Grabbed from kernel. Probably worth to gather
all bits manipulators here in future.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Implementing c/r of bridges with slaves shouldn't be too hard (viz. the
comment), but this is all I need to for right now.
v2: remove extra debug statement
v3: * remember to close fd in dump_bridge
* use "known" buffer length and snprintf for spath in dump_bridge
* change brace style
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When live migrating a container with large amount of processes
inside the time to do page-server-ed dump may be up to 10 times
slower than for the local dump.
The delay is always introduced in the open_page_server_xfer()
when criu negotiates the has_parent bit on the 2nd task. This
likely happens because of the Nagel algo taking place -- after
the write() of the OPEN2 command happened kernel delays this
command sending waiting for more data.
v2:
Fix this by turning on CORK option on memory transfer sockets
on send side, and NODELAY one once on urgent data. Receive
side is always NODELAY-ed. According to Alexey Kuznetsov this
is the best mode ever for such type of transfers.
v3:
Push packets in pre-dump's check_parent_server_xfer too.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Pass function name into a helper instead of pointer
wich doesn't provide much useful info.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reading stops after an EOF or a specified charecter.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Contrary to a popular opinion, there is no need to check
an argument for being non-NULL before calling free().
>From free(3) man page:
> > If ptr is NULL, no operation is performed.
Let's change xfree macro to be a synonym for free().
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch(set) is inspired by similar from Andrey Vagin sent
sime time earlier.
The major idea is to artificially fail criu dump or restore at
specific places and let zdtm tests check whether failed dump
or restore resulted in anything bad.
This particular patch introduces the ability to tell criu "fail
at X point". Each point is specified with a integer constant
and with the next patches there will appear places over the
code checking for specific fail code being set and failing.
Two points are introduced -- early on dump, right after loading
the parasite and right after creation of the root task.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The kernel prior 4.3 is exporting FS_EVENT_ON_CHILD
bit via procfs fdinfo interface. This bit is kernel's
internal and should not be passed in inotify_add_watch
call. Thus simply filter it out when obtain from old
images for backward compatibility reason.
More details here https://lkml.org/lkml/2015/9/21/680
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This allows the user to perform actions before dumping or restoration
occurs.
Signed-off-by: Matthew Krafczyk <krafczyk.matthew@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In commit c2271198, Laurent Dufour kindly reunified the VDSO code
that had become duplicated between architectures. Unfortunately
this introduced a regression in AArch64 where apparently due to
the scope of vdso_symbols array of pointers to characters changing
from local to global, load-time relocations became necessary.
The following thread on the GCC mailing list discusses why
load-time relocations can be necessary when pointers are used,
although it doesn't mention the potential for locally scoped
arrays to be handled differently:
https://gcc.gnu.org/ml/gcc/2004-05/msg01016.html
Because the alternatives, such as porting piegen to AArch64, are
far more involved, simply revert the change in scope.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In the recent VDSO code reunification, some types were changed but
a pair of necessary corresponding changes was omitted. Fix that so
the AArch64 build succeeds without type-related
warnings-turned-errors. Also move the definition to the
AArch64-specific header since it's not currently being used by any
other architectures.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When in a userns, tasks can't write to certain sysctl files:
(00.009653) 1: Error (sysctl.c:142): Can't open sysctl kernel/hostname: Permission denied
See inline comments for details on affected namespaces.
Mostly for my own education in what is required to port something to be
userns restorable, I ported the sysctl stuff. A potential concern for this
patch is that copying structures with pointers around is kind of gory. I
did it ad-hoc here, but it may be worth inventing some mechanisms to make
it easier, although I'm not sure what exactly that would look like
(potentially re-using some of the protobuf bits; I'll investigate this more
if it looks helpful when doing the cgroup user namespaces port?).
Another issue is that there is not a great way to return non-fd stuff in
memory right now from userns_call; one of the little hacks in this code
would be "simplified" if we invented a way to do this.
v2: coalesce the individual struct sysctl_req requests into one big
sysctl_userns_req that is in a contiguous region of memory so that we
can pass it via userns_call. Hopefully nobody finds my little ascii
diagram too offensive :)
v3: use the fork/setns trick to change the syctl values in the right ns for
IPC/UTS nses; see inline comment for details
v4: only use sysctl_userns_req when actually doing a userns_call.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"ip route dump" dumps only ipv4 routes.
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: use struct irmap directly in irmap_path_opt
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Issue #18. When restore fails ghost files remain there. And
to remove them we have to know their list, paths to original
files (to construct the ghost name) and the namespace ghost
lives in.
For the latter we keep the restore task namespace at hands
till the final stage and setns into it to kill ghosts.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Info about ghosts presence and paths will be needed to
remove the ghosts itself and thus are needed in criu.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
First -- avoid two memory copies by printing ns root directly, and
second -- remove extra argument from create_ghost, the mnt_id value
we need there can be found on the ghost_file object.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>