This option is used to mark external resources on dump.
Currently it is going to be used to handle external tty-s,
but in the future it can be used for any type of resource.
There can be several ways to restore external resources, and
we will have separate options to say how to restore each type.
For example, we can use --inherit-fd to restore external
file descriptors.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We can't use only a terminal device, because we can not distinguish
two pty-s from different mounts in this case.
$ mount -t devpts -o newinstance xxx pts1
$ mount -t devpts -o newinstance xxx pts2
$ stat pts1/0
Device: 27h/39d Inode: 3 Links: 1 Device type: 88,0
$ stat pts2/0
Device: 28h/40d Inode: 3 Links: 1 Device type: 88,0
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
timer_t is (void *) in glibc, but timer_t is (int) in kernel.
When we call system calls, we need to use timer_t from the kernel.
https://github.com/xemul/criu/issues/98
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This value will differ on C/R:
- on checkpoint it means that it's possible to dump loginuid values;
- on restore it means that it's possible to unset loginuid and write
the saved value to the unset loginuid.
Signed-off-by: Dmitry Safonov <dsafonov@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We use the page frame number to detect a vDSO which has been remapped
in-place from the runtime vDSO during restore. In such a case, if the
kernel is older than 3.16, the "[vdso]" mark won't be reported
in procfs output.
Still, to address recently reported CVEs and to be able to run CRIU
in unprivileged mode, we need to handle vDSO without pagemap access,
and here is the deal -- when we find a VMA which "looks like" vDSO
we scan it for vDSO symbols, and if it matches we restore
its status without PFN access.
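A rough sketch of the "scan for vDSO symbols" idea, under loud assumptions: this is a byte-level heuristic rather than the ELF parsing CRIU performs, looks_like_vdso() is a hypothetical name, and the symbol list is just a handful of x86-64 vDSO symbol names picked for illustration.

```python
ELF_MAGIC = b"\x7fELF"

# Assumption: a few x86-64 vDSO symbol names; other architectures differ.
VDSO_SYMBOLS = (
    b"__vdso_clock_gettime",
    b"__vdso_gettimeofday",
    b"__vdso_time",
    b"__vdso_getcpu",
)

def looks_like_vdso(blob, symbols=VDSO_SYMBOLS):
    """Heuristic: the area starts with an ELF header and its string
    table contains at least one NUL-terminated vDSO symbol name."""
    if not blob.startswith(ELF_MAGIC):
        return False
    return any(sym + b"\0" in blob for sym in symbols)
```

The point is only that the decision can be made from the contents of the VMA, without reading PFNs from pagemap.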
Here are some details on @pagemap access in-kernel history:
- @pagemap was introduced in commit 85863e475e59, where anyone
who can attach to a task via ptrace is allowed to read
data from @pagemap (Feb 4 2008, v2.6.25-rc1)
- in commit 006ebb40d3d65 the ptrace attach rule was changed
into ptrace read permission (May 19 2008, v2.6.27-rc1)
- in commit ab676b7d6fbf4 opening of @pagemap became guarded
with CAP_SYS_ADMIN because of the leak of physical addresses
into userspace (Mar 9 2015, v4.0-rc5)
- in commit 1c90308e7a77a opening of @pagemap became available
for regular users again (with ptrace read permission), but
physical addresses of pages are hidden from non-privileged
users (Sep 8 2015, v4.3-rc1)
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When run by a regular user, criu will get EACCES/EPERM from
opening proc, but in some situations criu will know how to
deal with it. So this patch makes it possible not to print
an error message in logs for such cases.
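The behaviour described above could look roughly like the sketch below. The names open_proc_file(), quiet_perm and the opener parameter are hypothetical and exist only for this illustration; they are not the helper this patch touches.

```python
import errno
import logging

log = logging.getLogger("sketch")

def open_proc_file(path, quiet_perm=False, opener=open):
    """Open a proc file; return None on failure.

    With quiet_perm=True, EACCES/EPERM are not reported as errors,
    because the caller knows how to deal with a missing file.
    """
    try:
        return opener(path, "rb")
    except OSError as e:
        denied = e.errno in (errno.EACCES, errno.EPERM)
        if not (quiet_perm and denied):
            log.error("Can't open %s: %s", path, e)
        return None
```

Any other failure (ENOENT, for example) is still logged, so the quiet mode hides only the permission case.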
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
We no longer support root-mode service and suid binaries, so
any artificial restrictions no longer make sense.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
This, as well as restore, requires several steps to reach per-thread
support during the dump stage
- the @creds area to be fetched from the parasite is embedded into
parasite_dump_structure
- when testing whether a task is dumpable we no longer compare caps
because we now allow them to be different (and I renamed
proc_status_creds_eq to proc_status_creds_dumpable for this
sake)
- dump_thread_common has to be extended to support dumping of
creds (we call dump_thread_common in several places; in
particular, when we need to fetch misc params we don't
need creds, and here the @creds option comes into play)
- after this patch no creds-X.img file will be generated anymore,
I guess we might drop it from the descriptors with time
https://jira.sw.ru/browse/PSBM-41416
v2:
- In dump_task_creds() don't mangle the call for parasite_dump_creds
and collect_lsm_profile
- PARASITE_MAX_GROUPS takes parasite_dump_thread into account because
dump_thread_common now serves two cases: for plain misc parameters
fetching and for creds as well (depending on the context)
- when testing for dumpable we still require the seccomp filters
to match; they can be different and we need to support such a
configuration too, but not in this series
v3:
- Rip off dump_task_creds completely, together with PARASITE_CMD_DUMP_CREDS,
we dump creds unconditionally in dump_thread_common
- the group leader thread data is fetched via new
parasite_dump_thread_leader_seized helper
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Because the creds parameters are to be passed inside pie/restorer
code but are read before the thread_restore_args and task_restore_args
structures are allocated, we need a small trick and prepare
creds in several stages
- collect all creds data into separate private memory blobs
- once all memory needed for restorer is allocated we relocate
pointers in these blocks and set up
thread_restore_args::thread_creds_args to the appropriate
address
- restorer works as usual and sets up creds parameters as before
v2:
- fix addressing in positioning of rst_ memory (I accidentally
zapped pointers and, when sending the patches, forgot to merge the
changes back; so while my local series successfully restored containers
with different creds, the series as sent would not work if merged. Now
all changes are merged as appropriate)
- drop module's global @cap_last_cap from pie/restorer.c
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For easier comparison, which is going to be addressed in the next patch.
https://jira.sw.ru/PSBM-41416
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Similar to devtmpfs and devpts, skip binfmt_misc
mount if it's not virtual.
Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently criu dump may hang indefinitely, e.g. while waiting for a task
that is blocked in vfork() or is in D state for some other
reason. This patch adds a time limit on collecting tasks during the
dump operation. If collecting processes takes too long, the dump
process will be terminated. The timeout is 5 seconds by default, but
it can be changed via a parameter.
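A minimal sketch of a time limit around task collection. It assumes a polling loop with a deadline, which may well differ from the mechanism the patch uses; collect_tasks(), try_collect and poll_interval are hypothetical names, and only the 5 second default comes from the text above.

```python
import time

DEFAULT_TIMEOUT = 5.0  # seconds, the default stated above

def collect_tasks(try_collect, timeout=DEFAULT_TIMEOUT, poll_interval=0.01):
    """Call try_collect() until it reports success or the deadline passes."""
    deadline = time.monotonic() + timeout
    while not try_collect():
        if time.monotonic() >= deadline:
            raise TimeoutError("collecting tasks took longer than %.2f s" % timeout)
        time.sleep(poll_interval)
```

Note that a polling deadline like this cannot interrupt a single call that blocks forever; it only bounds a loop whose steps return.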
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch brings add_to_string() and construct_string() helpers.
They allow creating a string from a variable number of parameters in sprintf()
manner, with the string allocated (and reallocated if necessary) by the helper.
v2:
1) Helpers were renamed to xstrcat() and xsprintf() respectively.
2) Added printf attributes to force compiler check
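To illustrate the intended semantics only (the real helpers are C functions that manage memory, which Python does for us), a sketch could be as follows. The Python signatures are assumptions made for this example.

```python
def xsprintf(fmt, *args):
    """Build a new string from a printf-style format."""
    return fmt % args

def xstrcat(s, fmt, *args):
    """Append a formatted piece to s; s may be None for a fresh string."""
    return (s or "") + fmt % args

opts = xsprintf("%s=%d", "fd", 3)
opts = xstrcat(opts, ",%s=%d", "pgrp", 42)
print(opts)
```

In C the caller would also have to check for allocation failure, which has no counterpart here.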
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The patch restores the freezer cgroup state between finalize_restore stages.
It should be done after the first stage because we cannot unmap the restorer blob
from a frozen process, and before the second stage because we must freeze processes
before they continue to run.
We also need to move fini_cgroup between these stages to give the freezer
cgroup state restorer access to the cgroup mount directories.
Error handlers contain fini_cgroup, so we are sure that the fini_cgroup call
won't be missed.
Patch restores state only for one freezer cgroup from --freeze-cgroup option,
not all states from whole hierarchy, because CRIU supports checkpoint from
freezer cgroup hierarchy only with THAWED state, except root cgroup from
--freeze-cgroup option.
Signed-off-by: Evgeniy Akimov <geka666@gmail.com>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CRIU sets freezer.state to "THAWED" during process tree dumping. That's why
we can't simply save the freezer.state file contents to the cgroups image. The new
special function get_real_freezer_state() returns the freezer cgroup state
observed before CRIU dumping started. The patch puts its return value into the dump file.
Signed-off-by: Evgeniy Akimov <geka666@gmail.com>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This will be required for page-cache and page-proxy set.
Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A freezer cgroup can contain tasks which will not be dumped.
criu unfreezes the group, so we need to freeze all extra
tasks with ptrace like we do for target tasks.
Currently we attach and send interrupt signals to these tasks,
but we don't call waitpid() for them, so later waitpid(-1, ...)
returns these tasks where we don't expect to see them.
v2: execute freezer_detach() only if opts.freeze_cgroup is set
calculate extra tasks in a freezer cgroup correctly
v3: s/frozen_processes/processes_to_wait/
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used to mount AutoFS, because context creation is required in
addition to actual mount operation.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch introduces three helpers:
1) pstree_item_by_real() - search for pstree item by real pid.
2) pstree_item_by_virt() - search for pstree item by virtual pid.
3) pid_to_virt() - return virtual pid by real one.
Note: pstree_item_by_virt() and pid_to_virt() will be used to migrate AutoFS.
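The three lookups can be sketched as below. The helper names come from the list above, but the signatures, the PstreeItem class and the return value for an unknown pid are assumptions of this example, not the C interface.

```python
from dataclasses import dataclass

@dataclass
class PstreeItem:
    real: int  # pid as seen on the host
    virt: int  # pid as seen inside the dumped pid namespace

def pstree_item_by_real(items, real):
    """Find the item whose real pid matches, or None."""
    return next((i for i in items if i.real == real), None)

def pstree_item_by_virt(items, virt):
    """Find the item whose virtual pid matches, or None."""
    return next((i for i in items if i.virt == virt), None)

def pid_to_virt(items, real):
    """Translate a real pid into a virtual one; None if unknown (assumption)."""
    item = pstree_item_by_real(items, real)
    return item.virt if item else None

tree = [PstreeItem(real=4242, virt=1), PstreeItem(real=4250, virt=7)]
print(pid_to_virt(tree, 4250))
```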
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we wait until a namespace is restored to get its root.
We need to open a namespace root to open a file to restore a memory mapping.
A process restores mappings and only then forks children. So we can have
a situation, when we need to open a file from a namespace, which will be
"restored" by one of our children.
The root task restores all mount namespaces and opens a file descriptor
for each of them. In this patch we open root for each mntns in the root
task.
If we need to get the root of a namespace which isn't populated, we can get
it from the root task. After the CR_STATE_FORKING stage, the root task
closes all namespace descriptors and we know that all namespaces are
populated at this moment.
v2: don't close root_fd for root ns, because it was not opened
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to perform dirty page tracking when dumping shmem, but there
we have only const vmas, so we need pmc to work with them. Also, the pmc concept
implies that it won't change its vmas, so it would be natural to declare
them as const.
Signed-off-by: Fyodor Bocharov <fbocharov@yandex.ru>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In LXD, we use the container name in the LSM profile. If the container name
is changed on migrate (on the host side), we want to use a different LSM
profile name (à la --cgroup-root). This flag adds that support.
v2: remove unused field, add comment about double detection in
kerndat_lsm()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we're restoring fsnotify watchees we need to resolve
a path to a handle at some mountpoint referred to by the @s_dev
member (device ID) which is saved inside the image. This
ID may actually change on every mount (say,
one restores a container after a machine reboot) or in
case of the container's migration.
Thus the test for overmounting in __open_mountpoint
will fail and we get an error.
Let's do a trick: introduce an @s_dev_rt member which
is supposed to carry the run-time device ID. When dumping,
this member is simply equal to the traditional @s_dev fetched
from procfs, but when restoring we fetch it from
a stat call once the mountpoint becomes alive.
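The idea of keeping the saved and the run-time device IDs side by side can be sketched like this. MountInfo, on_dump() and the value chosen for MOUNT_INVALID_DEV are assumptions of the example; fetch_rt_stat is a name from this patch, but its signature here is made up.

```python
import os
from dataclasses import dataclass

MOUNT_INVALID_DEV = 0  # assumption: placeholder for "not known yet"

@dataclass
class MountInfo:
    mountpoint: str
    s_dev: int                         # device ID as saved in the image
    s_dev_rt: int = MOUNT_INVALID_DEV  # device ID valid on this run

def on_dump(mountpoint, s_dev_from_procfs):
    """On dump both IDs are the same value read from procfs."""
    return MountInfo(mountpoint, s_dev_from_procfs, s_dev_from_procfs)

def fetch_rt_stat(mi):
    """On restore, refresh the run-time ID once the mountpoint exists."""
    mi.s_dev_rt = os.stat(mi.mountpoint).st_dev
    return mi
```

Lookups that compare against a live stat() result would then use s_dev_rt, while s_dev keeps matching what other image files refer to.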
https://jira.sw.ru/browse/PSBM-41610
v2:
- predefine MOUNT_INVALID_DEV
- use fetch_rt_stat instead of assigning device in restore_shared_options
- copy @s_dev_rt in propagate_siblings and propagate_mount
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch implements checkpoint/restore functionality
for binfmt_misc mounts. Both magic and extension types
and "disabled" state are supported.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For security reasons the systemd-spawn mode is no longer
supported in service.
Also change the default binding address to be in the local cwd, so as
not to start a global service by chance.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We found that we want to know whether SIGSTOP is queued
in both queues or in only one of them.
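One way to see both queues from userspace is the SigPnd (per-thread) and ShdPnd (process-wide) masks in /proc/<pid>/status. The sketch below parses such text; sigstop_pending() is a hypothetical helper and not necessarily how the patch obtains the information.

```python
import signal

def sigstop_pending(status_text):
    """Return (private, shared): whether SIGSTOP is set in the
    SigPnd and ShdPnd masks of a /proc/<pid>/status dump."""
    masks = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("SigPnd", "ShdPnd"):
            masks[key] = int(value.strip(), 16)
    bit = 1 << (signal.SIGSTOP - 1)  # signal N occupies bit N-1 of the mask
    return (bool(masks.get("SigPnd", 0) & bit),
            bool(masks.get("ShdPnd", 0) & bit))
```

Usage: pass the contents of /proc/<pid>/status, for example read with open(...).read().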
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Also define some constants for people who don't have them in their headers.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required to check the SIGSTOP signal, which can't be blocked.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So we keep it and don't close it inside the close_old_fds()
helper, but pass it into veth creation so the kernel
can fetch the net namespace of the veth peer.
v2 (by avagin@):
- don't forget to close opened descriptor
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: use a cached value to dump ipv6 interface addresses
call get_ipv6() from kerndat_init_rst too
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It returns EINTR, so we need to handle it.
$ bash test/zdtm.sh --restore-sibling ns/static/env00
...
futex(0x7fc20ec92010, FUTEX_WAIT, 1, {120, 0}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
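Handling EINTR usually means retrying the interrupted call. A generic sketch follows; retry_on_eintr() is a hypothetical wrapper, and the real fix is in C around the futex wait.

```python
def retry_on_eintr(call, *args):
    """Repeat call(*args) for as long as it is interrupted by a signal."""
    while True:
        try:
            return call(*args)
        except InterruptedError:
            # EINTR: nothing went wrong, the wait was merely interrupted.
            continue
```

A wait with a timeout would additionally need to recompute the remaining time before each retry, which this sketch leaves out.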
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This commit adds basic support for dumping and restoring seccomp filters
via the new ptrace interface. There are two current known limitations with
this approach:
1. This approach doesn't support restoring tasks which first do a seccomp()
and then a setuid(); the test elaborates on this and I don't think it is
tough to do, but it is not done yet.
2. Filters are compared via memcmp(), so two tasks which have the same
parent task and install identical (via memory) filters will have those
filters considered to be the "same". Since we force all tasks to have
the same creds (including seccomp filters) right now, this isn't a
problem.
The approach used here is very similar to the cgroup approach: the actual
filters are stored in a seccomp.img, and each task has an id that points to
the part of the filter tree it needs to restore. This keeps us from dumping
the same filter multiple times, since filters are inherited on fork.
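The id-based layout can be sketched as a small store in which every entry is a filter program plus the id of its parent entry. FilterStore and its methods are hypothetical; byte equality stands in for the memcmp() mentioned in limitation 2.

```python
class FilterStore:
    """Deduplicating store: each entry is (parent_id or None, program bytes)."""

    def __init__(self):
        self.entries = []

    def add(self, parent_id, program):
        """Return the id of an identical entry, or append a new one."""
        for idx, entry in enumerate(self.entries):
            if entry == (parent_id, program):
                return idx
        self.entries.append((parent_id, program))
        return len(self.entries) - 1

    def chain(self, filter_id):
        """Programs a task must restore, newest first, following parents."""
        out = []
        while filter_id is not None:
            parent, program = self.entries[filter_id]
            out.append(program)
            filter_id = parent
        return out
```

Two children that inherit the same parent filter and install byte-identical programs get the same id, which is exactly the "considered the same" behaviour described above.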
v2:
* remove unused seccomp_filters field from struct rst_info
* rework memory layout for passing filters to restorer blob
* add a sanity check when finding inherited filters
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: add comments and rename ns_created to ns_populated.
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
close_old_fds() knows nothing about more than one set of service file
descriptors, so it's better to call it before forking children, as it was
before 9d60724eca ("restore: restore mntns before creating private vma-s")
The root task restores all processes and pins them with file descriptors,
then a task restores a mount namespace by opening the file descriptor of
the root task via /proc/pid/fd/X.
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to open a file to restore a file mapping and this file
can be from a current mntns.
v2: All namespaces are restored from the root task and then
other tasks call setns() to set a proper mntns.
v3: fix comments from Pavel
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Grabbed from the kernel. It is probably worth gathering
all bit manipulators here in the future.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Implementing c/r of bridges with slaves shouldn't be too hard (viz. the
comment), but this is all I need for right now.
v2: remove extra debug statement
v3: * remember to close fd in dump_bridge
* use "known" buffer length and snprintf for spath in dump_bridge
* change brace style
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When live migrating a container with a large number of processes
inside, a page-server-ed dump may be up to 10 times
slower than a local dump.
The delay is always introduced in open_page_server_xfer()
when criu negotiates the has_parent bit on the 2nd task. This
likely happens because of the Nagle algorithm taking place -- after
the write() of the OPEN2 command, the kernel delays sending this
command while waiting for more data.
v2:
Fix this by turning on the CORK option on memory transfer sockets
on the send side, and setting the NODELAY one when urgent data is sent.
The receive side is always NODELAY-ed. According to Alexey Kuznetsov this
is the best mode ever for such type of transfers.
v3:
Push packets in pre-dump's check_parent_server_xfer too.
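The socket setup described in v2 can be sketched as follows. The function names are hypothetical and the code is Linux-only, since TCP_CORK is a Linux option; it relies on the documented behaviour that setting TCP_NODELAY flushes pending output even while TCP_CORK is set.

```python
import socket

def setup_send_side(sk):
    """Cork the socket: small writes are batched instead of sent at once."""
    sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)

def push_urgent(sk):
    """Flush what is queued right now, e.g. after a command that needs a reply."""
    sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

def setup_recv_side(sk):
    """The receiving side never delays its (small) replies."""
    sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

The sender would call setup_send_side() once after connecting and push_urgent() after each write whose answer it is about to wait for.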
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Pass the function name into a helper instead of a pointer,
which doesn't provide much useful info.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>