We have a use-after-free in predump code:
first free_pstree() is called in pre_dump_tasks(), then we
go to irmap_predump_run(), which may call lookup_irmap(),
which in turn dereferences root_item to get the root
mount ns fd.
But the problem is bigger than that. After we've released the
tasks (done before freeing pstree on predump) we can no longer
access them by PIDs, so keeping the root item until after the irmap
scan is not a fix.
The fix is to get the root fd before releasing the tasks and use
it in the irmap scanner.
Caught recently on iterative inotify_irmap test.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
In cr_dump_tasks() we expect restore_root_task to return < 0 if
an error occurs.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we compare sets in cg_set_compare() we presume that controller
names are properly sorted, but because of the use of strcmp(cc->path, path)
that's not true. In particular, consider two sets with the same
controllers which differ in paths only:
(00.126812) cg: `- New css ID 2
(00.127051) cg: `- [memory] -> [/vz-1]
(00.127079) cg: `- [name=systemd] -> [/vz-1]
(00.127108) cg: `- [net_cls] -> [/vz-1]
(00.239829) cg: `- New css ID 3
(00.240067) cg: `- [memory] -> [/vz-1]
(00.240096) cg: `- [net_cls] -> [/vz-1]
(00.240154) cg: `- [name=systemd] -> [/vz-1/system.slice/dbus.service]
We currently refuse to dump such a configuration. Thus remove
the path comparison.
CC: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
$ cat /proc/self/mountinfo
...
1 1 0:2 / / rw - rootfs rootfs rw,size=373396k,nr_inodes=93349
...
You can see that mnt_id and parent_mnt_id are equal here.
This patch interprets this case as a root mount in a tree.
The 0'th mount is rootfs, which is mounted in init_mount_tree().
We don't see it in cases when the system does a chroot, because of
static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
...
/* mountpoints outside of chroot jail will give SEQ_SKIP on this */
err = seq_path_root(m, &mnt_path, &root, " \t\n\\");
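The root-mount rule above can be sketched in C as follows. This is
only an illustration, not the code from the patch: the helper name
mountinfo_line_is_root() is hypothetical, and it looks only at the
first two fields of a /proc/<pid>/mountinfo line.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not CRIU code): returns 1 if the mountinfo line
 * describes a root mount, i.e. mnt_id == parent_mnt_id, 0 if it does
 * not, and -1 if the first two fields cannot be parsed.
 */
int mountinfo_line_is_root(const char *line)
{
	int mnt_id, parent_mnt_id;

	/* The first two fields of a mountinfo line are mnt_id and parent_mnt_id. */
	if (sscanf(line, "%d %d", &mnt_id, &parent_mnt_id) != 2)
		return -1;

	return mnt_id == parent_mnt_id;
}
```

For the rootfs line shown above both fields are 1, so the helper
reports a root mount.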
Cc: beproject criu <beprojectcriu@gmail.com>
Cc: Christopher Covington <cov@codeaurora.org>
Reported-by: beproject criu <beprojectcriu@gmail.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If @ticks is zero the kernel returns an error,
because on creation @ticks is already zero,
so set up @ticks only if a real (non-zero) value is present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
My debian testing produces the following output for uname:
$ uname -r
3.14-2-amd64
and so:
$ set -- `uname -r | sed 's/\./ /g'`
$ echo $1
3
$ echo $2
14-2-amd64
this causes zdtm.sh to fail for me on line 293:
[ $1 -eq 3 -a $2 -ge 11 ] && return 0
because "14-2-amd64 -ge 11" is false.
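The same parsing problem can be illustrated in C. This is a sketch
for illustration only (zdtm.sh is a shell script and the actual fix
lives there); the function names are hypothetical. Reading the
release with "%d.%d" stops at the first non-digit, so a suffix such
as "-2-amd64" does not leak into the minor number.

```c
#include <stdio.h>

/*
 * Hypothetical helper: extracts major and minor from a `uname -r` style
 * string. Returns 0 on success, -1 if the string is malformed.
 */
int parse_kernel_release(const char *release, int *major, int *minor)
{
	/* %d stops at the first non-digit, so "14-2-amd64" yields 14. */
	if (sscanf(release, "%d.%d", major, minor) != 2)
		return -1;
	return 0;
}

/* Returns 1 if release >= want_major.want_minor, 0 if not, -1 on parse error. */
int kernel_at_least(const char *release, int want_major, int want_minor)
{
	int major, minor;

	if (parse_kernel_release(release, &major, &minor) < 0)
		return -1;
	if (major != want_major)
		return major > want_major;
	return minor >= want_minor;
}
```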
Signed-off-by: Matthias Neuer <matthias.neuer@uni-ulm.de>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If there is no separator in the first place we should
avoid the implicit + 1, which makes @name = 1 in the worst case.
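A minimal sketch of the pattern, under assumptions: the helper name,
the use of strrchr() and the fallback to the whole string are
illustrative choices, not taken from the patch. The point is that
the + 1 is applied only after the lookup is known to have succeeded,
so a NULL result never turns into the pointer value 1.

```c
#include <string.h>

/*
 * Hypothetical helper: returns the part of @s after the last @sep.
 * If @sep is absent, returns @s itself (an assumed fallback) instead
 * of computing NULL + 1.
 */
const char *name_after_separator(const char *s, char sep)
{
	const char *p = strrchr(s, sep);

	/* Only step past the separator if one was actually found. */
	return p ? p + 1 : s;
}
```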
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently there is a bug: when we see criu's mount namespace,
we go to the "out" label and don't validate mounts.
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
mntinfo contains mounts from all namespaces, so we can validate it only
once after collecting mounts.
v2: add a fake comment about goto
v3: add a real comment about goto
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a couple of problems missed in 1.3, so here's
the -stable .1 for that release.
First of all, we nail down the way CRIU decides whether
to restore the root task as a child or as a sibling with the
explicit API switch.
And there are two nasty issues in how CRIU dumps mountpoints.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we strip options from only one of the siblings, but
mount_equal() expects two siblings to have the same options.
Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Reported-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"continue" is called by mistake, so we skip a few checks for shared
mounts without siblings.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a slight mess with how criu restores root task.
Right now we have the following options.
1) CLI
a) Usually
task calling criu
`- criu
`- root restored task
b) when --restore-detached AND root has pdeath_sig
task calling criu
`- criu
`- root restored task
2) Library/SWRK
task using lib/swrk
`- criu
`- root restored task
3) Standalone service
a) Usually
service
`- service sub task
`- root restored task
b) when root has pdeath_sig
criu service
`- criu sub task
`- root restored task
It would be better if CRIU always restored the root task as a sibling,
but we have 3 constraints:
First, the case 1.a is kept for zdtm to run tests in pid namespaces
on 3.11, which in turn doesn't allow CLONE_PARENT | CLONE_NEWPID.
Second, CLI w/o --restore-detached waits for the restored task to die and
this behavior can be "expected" already.
Third, in case of standalone service tasks shouldn't become service's
children.
And I have one "plan". The p.haul project, while live migrating tasks,
starts a service on the destination node which uses library/swrk mode. In
this case the restored processes become the p.haul service's kids, which is
also not great.
That said, here's the option called --restore-child that pairs with
--restore-detached like this:
* detached AND child:
task
`- criu restore (exits at the end)
`- root task
The root task will become task's child.
This will be the default for library/swrk.
This is what LXC needs.
* detach AND !child
task
`- criu restore (exits at the end)
`- root task
The root task will get re-parented to init.
This will be compatible with 1.3.
This will be the default for the standalone service and,
as I wish, for the p.haul case.
* !detach AND child
task
`- criu restore (waits for root task to die)
`- root task
This should be deprecated, so that criu restore doesn't mess
with task <-> root task signalling.
* !detach AND !child
task
`- criu restore (waits for root task to die)
`- root task
This is how plain criu restore works now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
root_as_sibling was used in criu_signals_setup(), but was only defined later
(when forking the root task for the first time). This meant that the
SA_NOCLDSTOP was never masked off, which meant SIGCHLD was never delivered
after ptracing the root task. Thus, when a child of the root task died
(e.g. from cr_system), the root task sat in PTRACE_STOP, and the restore task
never PTRACE_CONT'd, resulting in a deadlock.
Instead, we only unmask SA_NOCLDSTOP right before we PTRACE_SEIZE, after the
value is defined.
v2: re-work the condition for CLONE_PARENT
v3: move unmasking of SA_NOCLDSTOP to restore_root_task
v4: keep all the comments in the original code
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's been a long delay since 1.2, but we did it :)
The greatest new achievement is, finally, support for Docker
and LXC on the CRIU side. Some work is still to be done on the
other side, but here in CRIU everything is ready.
Other notable things are AArch64 support and, of course,
a lot of bugfixes.
The further plan is to make releases less rare :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is really just the last bit of c32046c9; if restore_one_task() fails, we
need to do the same futex wakeup we do everywhere else in this function.
v2: use err instead of err_fini_mnt after mount has been finalized normally
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Once the task restore has failed, we can just abort, no need to restore the cg
props.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When in --restore-detached (i.e. root_as_sibling) mode, we ptrace(PTRACE_SEIZE)
the root task to receive its SIGCHLD in case one of its child tasks dies.
However, we don't receive a SIGCHLD if the root task itself dies, so we must
explicitly abort.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A kernel without that option configured does not have /dev/pts/ptmx, so
fall back to the previous way of creating it using mknod instead.
The previous code was trying to bind mount ptmx on top of a symlink, which does
not actually work... Keep only the symlink call and use a relative symlink
instead. Adjust the error message of the symlink case to mention symlink()
instead of mknod() and also /dev/ptmx instead of /dev/pts.
Tested:
- zdtm test suite runs on ^ns/static/.* before and after the change.
- Same on a kernel with CONFIG_DEVPTS_MULTIPLE_INSTANCES unset.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root. This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
rmdir is executed for non-existent directories, so we don't check
the exit code of this operation.
This patch executes rmdir only for existing directories and checks
its exit code.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this job tests are dumped and resumed. The cgroup02 test checks
that it is moved into another set of cgroups, but this is done on restore.
Output file: test/zdtm/live/static/cgroup02.out>
------------------------------------------------------------------------------
14:35:55.127: 85: found cgroup at cgroup02.test/zdtmtst>
14:35:55.127: 85: found cgroup at cgroup02.test/defaultroot>
14:35:55.127: 85: FAIL: cgroup02.c:132: oldroot not rewritten to zdtmtstroot!
v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use a single awk script to parse the ldd output. Filter out other cases that
are clearly not libraries, such as static builds ("not a dynamic executable")
and linux-gate.so. Make the grep for vdso more specific into linux-vdso.so.
Tested:
- sudo test/zdtm.sh '^ns/.*'
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Unfortunately, grep -P is not ubiquitous, so use awk with two regexps to
simulate the negative lookahead in the grep -P expression.
Using awk doesn't really make it too unreadable, as using boolean operators
such as && and || might actually make it more intuitive than the extended
regexp.
Tested:
- sudo make -C test zdtm_ns
- sudo make -C test zdtm_nons
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
From avagin@:
And here is one more problem. The newroot directory is created for all
controllers, but currently the test cleans it up only for the zdtmtst
controller. We need to find a way to clean up all other controllers.
Tests are executed on a node which is rebooted only for updating the
kernel, so if we do not clean up all other controllers, we can eat all
memory.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Vagin <avagin@openvz.org>
When "make test" is executed, CFLAGS is exported from the root Makefile.
These flags define _GNU_SOURCE, so we don't need to define it in the
source file.
In addition, unistd.h isn't included, so a few functions are shown as undeclared.
make zdtm_ns
make[3]: Entering directory `/root/criu/test'
gcc -O2 -Wall -Werror -DCONFIG_X86_64 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE zdtm_ct.c -o zdtm_ct
zdtm_ct.c:1:0: error: "_GNU_SOURCE" redefined [-Werror]
#define _GNU_SOURCE
^
<command-line>:0:0: note: this is the location of the previous definition
zdtm_ct.c: In function ‘main’:
zdtm_ct.c:21:2: error: implicit declaration of function ‘fork’ [-Werror=implicit-function-declaration]
pid = fork();
^
zdtm_ct.c:23:3: error: implicit declaration of function ‘setsid’ [-Werror=implicit-function-declaration]
if (setsid() == -1) {
^
zdtm_ct.c:49:3: error: implicit declaration of function ‘execv’ [-Werror=implicit-function-declaration]
execv(argv[1], argv + 1);
^
zdtm_ct.c:62:3: error: implicit declaration of function ‘getpid’ [-Werror=implicit-function-declaration]
kill(getpid(), WTERMSIG(status));
^
cc1: all warnings being treated as errors
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
fini_cgroup umounts a cgyard directory, which is mounted
in prepare_cgroup().
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Without this patch, we dump something like this:
{
cnames: "hugetlb"
dirs: {
dir_name: ""
children: {
dir_name: "ewroot"
children: <empty>
properties: <empty>
}
properties: <empty>
}
}
It's obvious that dir_name should be newroot.
The problem is reproduced if a task lives in "/" and has a subgroup.
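One plausible way such a truncated name can arise is skipping
"parent length + 1" characters even when the parent is "/", which
already ends in the separator. The sketch below is a guess at that
class of bug, not the code touched by this patch; the helper name
is hypothetical and it assumes @path really starts with @parent.

```c
#include <string.h>

/*
 * Hypothetical helper: returns the name of @path relative to @parent.
 * Assumes @path starts with @parent followed by a child component.
 */
const char *child_dir_name(const char *parent, const char *path)
{
	size_t off = strlen(parent);

	/*
	 * "/" already ends with the separator. For any other parent,
	 * skip the '/' that follows it. Skipping unconditionally would
	 * turn "/newroot" into "ewroot".
	 */
	if (strcmp(parent, "/") != 0)
		off++;

	return path + off;
}
```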
This issue was caught by chance. The cgroup02 test doesn't clean up
controllers and leaves the "newroot" there. So when we executed a cgroup
test after cgroup02, we could find many directories like "ewroot",
"wroot", etc. This patch fixes this issue.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The return values were getting dangerously close to the range of meaningful
values, in particular the next candidate 63 is equal to '?' which is the
typical return value in case of error.
The return values for long options may be any integer, so bump them up to
outside the ASCII range, starting above 1000. For ease of reviewing this patch,
keep the existing values (41-62) and increment each of them by 1000.
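A small sketch of the idea with getopt_long(). The value 1041 and
its pairing with --restore-detached are assumptions made for the
example (the text above only gives the range), and first_option()
is a hypothetical helper. Because the value is far from the ASCII
range, it cannot collide with '?' (63), which getopt_long() returns
for an unknown option.

```c
#include <getopt.h>
#include <stddef.h>

/* Assumed value for illustration; the real mapping is not given here. */
enum { OPT_RESTORE_DETACHED = 1041 };

/* Hypothetical helper: returns what getopt_long() yields for the first option. */
int first_option(int argc, char **argv)
{
	static const struct option long_opts[] = {
		{ "restore-detached", no_argument, NULL, OPT_RESTORE_DETACHED },
		{ NULL, 0, NULL, 0 },
	};

	optind = 1;	/* restart scanning from the first argument */
	opterr = 0;	/* keep getopt quiet on unknown options */

	return getopt_long(argc, argv, "h", long_opts, NULL);
}
```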
Tested:
- Ran "criu --help", works fine.
- Manual dump and restore with some of the options, worked fine.
- Ran the zdtm test suite, tests passed.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is to make it convenient for service to setup the same thing.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
cpuset.cpus and cpuset.mems can't be written to for the first time after they
have tasks, so the traditional mechanism of restoring properties after
restoring the tasks won't work here. Instead, we copy the parent values of the
properties into them, restore the tasks, and then restore via the traditional
mechanism the actual values of these properties.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In particular, cpuset.cpus and cpuset.mems can both be "lists" (strings), as
well as hex integers. We don't use the result of this parse, so it is fine to delete it.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Look at this strace output:
107 linkat(45, "", 1017, "./root/git/orig/criu/test/zdtm/live/static/unlink_fstat03.test (deleted)/link_remap.4", AT_EMPTY_PATH) = -1 ENOENT (No such file or director
It's obvious that we didn't cut the file name.
There is an error in the calculation of the offset of the last symbol.
The current version of the code sets this offset to strlen(),
but it's actually strlen() - 1.
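As an illustration of the offsets involved: for a C string s, the
last character is s[strlen(s) - 1] and s[strlen(s)] is the
terminating NUL. The sketch below cuts a trailing " (deleted)" with
that in mind. The helper name is hypothetical and this is not
necessarily how CRIU does it.

```c
#include <string.h>

/*
 * Hypothetical helper: strips a trailing " (deleted)" from @path in
 * place. Returns 1 if the suffix was removed, 0 if it was not present.
 */
int strip_deleted_suffix(char *path)
{
	static const char suffix[] = " (deleted)";
	size_t len = strlen(path);
	size_t slen = sizeof(suffix) - 1;	/* length without the NUL */

	if (len < slen || strcmp(path + len - slen, suffix) != 0)
		return 0;

	/* The suffix starts at len - slen; terminate the string there. */
	path[len - slen] = '\0';
	return 1;
}
```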
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We've moved siginfos onto the core entry, thus the bits with
the siginfos themselves cannot sit on the stack any longer.
Otherwise we would overwrite them with the next batch and
would hand a stack pointer to the caller, thus causing
stale data and garbage from the stack to be written into the image
instead of siginfo data.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The se variable is just an array of pointers to these
objects. We need to allocate the objects themselves.
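A minimal sketch of the allocation pattern, assuming a stand-in
struct entry (the real entry type is not shown here) and
hypothetical helper names. Allocating the pointer array alone gives
n pointers that point nowhere; each object needs its own allocation.

```c
#include <stdlib.h>

/* Stand-in for the real entry type, for illustration only. */
struct entry {
	int signo;
};

/* Frees the first @n objects and the pointer array itself. */
void free_entries(struct entry **se, size_t n)
{
	size_t i;

	if (!se)
		return;
	for (i = 0; i < n; i++)
		free(se[i]);
	free(se);
}

/* Allocates an array of @n pointers and one zeroed object per pointer. */
struct entry **alloc_entries(size_t n)
{
	struct entry **se = calloc(n, sizeof(*se));
	size_t i;

	if (!se)
		return NULL;

	for (i = 0; i < n; i++) {
		se[i] = calloc(1, sizeof(*se[i]));
		if (!se[i]) {
			/* Roll back the objects allocated so far. */
			free_entries(se, i);
			return NULL;
		}
	}
	return se;
}
```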
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>