Commit graph

1725 commits

Author SHA1 Message Date
Kirill Tkhai
c9afd17ad6 net: Add ip rule save/restore
Add support for save and restore of ip rules. It uses new
functionality of iproute which is already in iproute git:

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=2f4e171f7df22107b38fddcffa56c1ecb5e73359

v2: Use xstrdup() instead of strdup().
v3: Use open/close instead of helper.
v4: Return -1 on empty dump.

Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-27 22:56:33 +03:00
Andrew Vagin
1d8fcb6b94 bfd: add breadchr
Reading stops after an EOF or a specified charecter.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-27 22:51:09 +03:00
Cyrill Gorcunov
7a99e699ce mnt: Export __open_mountpoint
We gonna need it for inotify handle testing.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-21 15:08:03 +03:00
Kir Kolyshkin
5940e3d14c xfree(): simplify
Contrary to a popular opinion, there is no need to check
an argument for being non-NULL before calling free().

>From free(3) man page:

> > If ptr is NULL, no operation is performed.

Let's change xfree macro to be a synonym for free().

Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-21 14:58:39 +03:00
Pavel Emelyanov
68baf8e77d criu: Fault injection core
This patch(set) is inspired by similar from Andrey Vagin sent
sime time earlier.

The major idea is to artificially fail criu dump or restore at
specific places and let zdtm tests check whether failed dump
or restore resulted in anything bad.

This particular patch introduces the ability to tell criu "fail
at X point". Each point is specified with a integer constant
and with the next patches there will appear places over the
code checking for specific fail code being set and failing.

Two points are introduced -- early on dump, right after loading
the parasite and right after creation of the root task.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-19 12:42:29 +03:00
Cyrill Gorcunov
61859d1176 fsnotify: Filter out internal inotify bits when restoring marks
The kernel prior 4.3 is exporting FS_EVENT_ON_CHILD
bit via procfs fdinfo interface. This bit is kernel's
internal and should not be passed in inotify_add_watch
call. Thus simply filter it out when obtain from old
images for backward compatibility reason.

More details here https://lkml.org/lkml/2015/9/21/680

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-14 15:51:55 +03:00
Matthew Krafczyk
29c08d8672 Add pre-dump and pre-restore action scripts
This allows the user to perform actions before dumping or restoration
occurs.

Signed-off-by: Matthew Krafczyk <krafczyk.matthew@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-09 18:23:41 +03:00
Christopher Covington
871da9a111 pie: Give VDSO symbol table local scope
In commit c2271198, Laurent Dufour kindly reunified the VDSO code
that had become duplicated between architectures. Unfortunately
this introduced a regression in AArch64 where apparently due to
the scope of vdso_symbols array of pointers to characters changing
from local to global, load-time relocations became necessary.

The following thread on the GCC mailing list discusses why
load-time relocations can be necessary when pointers are used,
although it doesn't mention the potential for locally scoped
arrays to be handled differently:
https://gcc.gnu.org/ml/gcc/2004-05/msg01016.html

Because the alternatives, such as porting piegen to AArch64, are
far more involved, simply revert the change in scope.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:21:16 +03:00
Christopher Covington
627f9a9e5f aarch64: Fix write_intraprocedure_branch types
In the recent VDSO code reunification, some types were changed but
a pair of necessary corresponding changes was omitted. Fix that so
the AArch64 build succeeds without type-related
warnings-turned-errors. Also move the definition to the
AArch64-specific header since it's not currently being used by any
other architectures.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:20:01 +03:00
Tycho Andersen
f79f4546cf sysctl: move sysctl calls to usernsd
When in a userns, tasks can't write to certain sysctl files:

(00.009653)      1: Error (sysctl.c:142): Can't open sysctl kernel/hostname: Permission denied

See inline comments for details on affected namespaces.

Mostly for my own education in what is required to port something to be
userns restorable, I ported the sysctl stuff. A potential concern for this
patch is that copying structures with pointers around is kind of gory. I
did it ad-hoc here, but it may be worth inventing some mechanisms to make
it easier, although I'm not sure what exactly that would look like
(potentially re-using some of the protobuf bits; I'll investigate this more
if it looks helpful when doing the cgroup user namespaces port?).

Another issue is that there is not a great way to return non-fd stuff in
memory right now from userns_call; one of the little hacks in this code
would be "simplified" if we invented a way to do this.

v2: coalesce the individual struct sysctl_req requests into one big
    sysctl_userns_req that is in a contiguous region of memory so that we
    can pass it via userns_call. Hopefully nobody finds my little ascii
    diagram too offensive :)
v3: use the fork/setns trick to change the syctl values in the right ns for
    IPC/UTS nses; see inline comment for details
v4: only use sysctl_userns_req when actually doing a userns_call.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:16:14 +03:00
Andrew Vagin
a973e6fcb3 net: dump ipv6 routes
"ip route dump" dumps only ipv4 routes.

Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:11:31 +03:00
Tycho Andersen
97cb181cbc irmap: don't leak irmap objects in --irmap-scan-path
v2: use struct irmap directly in irmap_path_opt

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:02:51 +03:00
Pavel Emelyanov
efa7dcf7c2 ghost: Remove ghost files if restore fails
Issue #18. When restore fails ghost files remain there. And
to remove them we have to know their list, paths to original
files (to construct the ghost name) and the namespace ghost
lives in.

For the latter we keep the restore task namespace at hands
till the final stage and setns into it to kill ghosts.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:37 +03:00
Pavel Emelyanov
a7c9f3011d mnt: Read mount images early
Mappings from mount id to namespace will be required to
remove ghosts on restore failure.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:36 +03:00
Pavel Emelyanov
b0e23c3d4f files: Collect ghosts and regilfes early
Info about ghosts presence and paths will be needed to
remove the ghosts itself and thus are needed in criu.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:35 +03:00
Pavel Emelyanov
152222a6b7 remap: Sanitize ghost file path printing
First -- avoid two memory copies by printing ns root directly, and
second -- remove extra argument from create_ghost, the mnt_id value
we need there can be found on the ghost_file object.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:59:45 +03:00
Pavel Emelyanov
6cf77f6726 remap: Rename fields for easier grep
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:58:28 +03:00
Pavel Emelyanov
7ca6cc1eb2 mnt: Clean roots yard from criu process
So here it is. If root task dies on restore the roots yard
dir remains unrmdired :( Since we already know its name, we
can remove one from criu. By the time we get to this place
the sub mount namespace(s) are already dead and yard dir
is empty. But umounting should be done by tasks after
successfull restore, so keep depopulation there.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:57:35 +03:00
Pavel Emelyanov
3e7c92ed02 mnt: Renames around roots yard
Same thing as in previous patch -- we have too many generic
clean_ and fini_ prefixes over the code. And we need more (see
next patch), so let's specify what exactly we clean or fini.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:57:21 +03:00
Pavel Emelyanov
c5c65fe17a mnt: Create roots in criu context
In case root task restore failure we'll have to remove the
roots yard dir from criu, so we have to create one by
criu to at least have the dit name.

It's OK to do it in criu, since the yards is created in
the opts.root which is the same for any mnt ns we deal
with on restore.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:56:51 +03:00
Pavel Emelyanov
e3f5ba3c37 ns: Prepare namespaces before tasks
There's already two things we do in criu namespaces before
forking the init task (start unsd and keep netnsfd for back
reference). Next patches will introduce the 3rd action for
mount namespaces, so have a special pre-call for all this
stuff.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:56:26 +03:00
Pavel Emelyanov
9b3189fed1 util: Add make_yard helper
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 11:32:18 +03:00
Pavel Emelyanov
9353051ba7 ns: Check ns type with type field
Actually make use of the ns->type field and remove all getpid()'s
and other strange/inconsistent checks.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:15:28 +03:00
Pavel Emelyanov
22b7256612 ns: Introduce ns type
We (may) have 3 types of namespace objects in criu -- criu's one,
root task's one and others. All of them sometimes make sense and
we differentiate them in a weird way -- by checking the ns->pid
field against getpid() or by comparing with root_item's.

The proposal is to mark ns_id objects explicitly with type field.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:14:07 +03:00
Tycho Andersen
85ebf0a83b usernsd: also pass pid of process that made the req
We'll use this in the next patch to correctly write sysctls.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:01:01 +03:00
Tycho Andersen
72ff44d0dc usernsd: move MAX_MSG_SIZE to namespaces.h
We'll use this size in the next patch to avoid having to do some dynamic
allocation.

v2: call it MAX_UNSFD_MSG_SIZE instead
v3: fix all uses of MAX_MSG_SIZE :)

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:57:40 +03:00
Andrey Vagin
1174a2ad0f mount: handle mnt_flags and sb_flags separatly (v4)
They both can container the MS_READONLY flag. And in one case it will be
read-only bind-mount and in another case it will be read-only
super-block.

v2: set mnt and sb for one call of mount() when it's posiable
v3: return a comment which was deleted by mistake
v4: Fix the sentense about restoring mnt flags
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:55:17 +03:00
Tycho Andersen
4f2e4ab3be irmap: add --irmap-scan-path option
This option allows users to specify their own irmap paths to scan in the event
that they don't have a path in one of the hard coded hints.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:46:12 +03:00
Andrey Vagin
d3be641acd cgroups: get controllers from /proc/self/cgroups (v2)
Some controllers can be disabled in kernel options. In this case they
are shown in /proc/cgroups, but they could not be mounted.

All enabled controllers can be collected from /proc/self/cgroup.

https://github.com/xemul/criu/issues/28

v2: ',' is used to separate controllers

Cc: Tycho Andersen <tycho.andersen@canonical.com>
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-16 15:46:10 +03:00
Laurent Dufour
7f01d691c7 vdso: Rework vdso processing files
There were multiple copy of the same code spread over the different
architectures handling the vDSO.

This patch is merging the duplicated code in arch/*/vdso-pie.c and
arch/*/include/asm/vdso.h in the common files and let only the architecture
specific part in the arch/*/* files.

The file are now organized this way:

include/asm-generic/vdso.h
	contains basic definition which could be overwritten by
	architectures.

arch/*/include/asm/vdso.h
	contains per architecture definitions.
	It may includes include/asm-generic/vdso.h

pie/util-vdso.c
include/util-vdso.h
	These files contains code and definitions common to both criu and
	the parasite code.
	The file include/util-vdso.h includes arch/*/include/asm/vdso.h.

pie/parsite-vdso.c
include/parasite-vdso.h
	contains code and definition specific to the parasite code handling
	the vDSO.
	The file include/parasite-vdso.h includes include/util-vdso.h.

arch/*/vdso-pie.c
	contains the architecture specific code installing the vDSO
	trampoline.

vdso.c
include/vdso.h
	contains code and definition specific to the criu code handling the
	vDSO.
	The file include/vdso.h includes include/util-vdso.h.

CC: Christopher Covington <cov@codeaurora.org>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-10 14:07:22 +03:00
Cyrill Gorcunov
80ef8fd2fb mount: Handle deleted bindmounts
To handle deleted bindmounts we simply create
the former directory bindmount lived at, mount
the target and remove the directory back.

For this sake we add @deleted entry into the image.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-21 21:26:17 +03:00
Cyrill Gorcunov
60f6ec7dd6 files-reg: Rework strip_deleted helper
Make it handle both postfixes and return
non-zero code if stipping happened.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-21 21:26:01 +03:00
Cyrill Gorcunov
4f8f97e0bd mount: mount_info -- Drop unused @is_file
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-21 21:25:14 +03:00
Cyrill Gorcunov
40ed330c88 kcmp: Stop showing ids tree
Useless, at least in the form present
now it's unreadable anyway. So stop
welling out the logs.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-18 18:17:31 +03:00
Cyrill Gorcunov
3ae67d7b3a mount: Move mount_info and ext_mount to mount.h
It's quite unclean while this structure lives
in proc_parse.h, which only have to fill this
structure on procfs read, but real handling
is inside mount.c. Move it as appropriate.

Same time ext_mount structure should be moved
into a header as well with sane @list name
used instead of @l.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-13 17:16:47 +03:00
Cyrill Gorcunov
291e632e60 mount: Reorder declarations
- gather structs on top
 - then extern vars

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-13 17:16:05 +03:00
Cyrill Gorcunov
17dc1fea57 mount: Align members of mount_info
This a way easier for reading.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-13 17:16:04 +03:00
Cyrill Gorcunov
ced8f88401 opts: Allo to specify the maximum size of ghost files
For example we hit a case where systemd carries journal
file with 4M in size.

https://jira.sw.ru/browse/PSBM-38571

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-10 16:51:11 +03:00
Cyrill Gorcunov
1bdb8298d0 util: Fix mega/giga typos
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-10 16:49:43 +03:00
Andrey Vagin
8ea020388e dump: use freezer cgroup to seize processes (v4)
Without using a freezer cgroup, we need to do a few iterations to catch
all tasks, because a new tasks can be born. If new tasks appear faster
than criu collects them, criu fails. The freezer cgroup allows to
solve this problem.

We freeze the freezer group, then attaches to tasks with ptrace and thaw
the freezer cgroup. We suppose that all tasks which are going to be
dumped in a specified freezer group.

v2: fix comments from Christopher
Reviewed-by: Christopher Covington <cov@codeaurora.org>

v3: refactor task_seize

v4: fix comments from Pavel

Cc: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-10 16:47:51 +03:00
Andrey Vagin
011231af3b util: add ability to execute programs in a specified userns
It's required for dumping tmpfs, where we use tar to save content.
If we need to execute tar from a proper userns to get right uid-s and
gid-s for files.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-07 14:42:01 +03:00
Gabriel Guimaraes
dbaab31f31 Workaround for the OverlayFS bug present before Kernel 4.2
This is here only to support the Linux Kernel between versions
3.18 and 4.2. After that, this workaround is not needed anymore,
but it will work properly on both a kernel with and without the bug.

The bug is that when a process has a file open in an OverlayFS directory,
the information in /proc/<pid>/fd/<fd> and /proc/<pid>/fdinfo/<fd>
is wrong, so we grab that information from the mountinfo table instead.

This is done every time fill_fdlink is called.
We first check to see if the mnt_id and st_dev numbers currently match
some entry in the mountinfo table. If so, we already have the correct mnt_id
and no fixup is needed.

Then we proceed to see if there are any overlayFS mounted directories
in the mountinfo table. If so, we concatenate the mountpoint with the
name of the file, and stat the resulting path to check if we found the
correct device id and node number. If that is the case, we update the
mount id and link variables with the correct values.

Signed-off-by: Gabriel Guimaraes <gabriellimaguimaraes@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-07 14:30:41 +03:00
Andrey Vagin
b9b0730cb1 ptrace: split task_seize into seize_catch_task and seize_wait_task
It's preparation to use a freezer cgroup for freezing tasks.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-07 13:47:11 +03:00
Andrey Vagin
2f172c8b24 crtools: split cr-dump.c in two files
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-06 14:31:06 +03:00
Christopher Covington
1438f013a2 Pass task_size to vma_area_is_private()
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded. Since
vma_area_is_private() is used by both restorer blob code and non
restorer blob code, which must use different variables for recording
the task size, make task_size a function argument and modify the call
sites accordingly. This fixes the following error on AArch64 kernels
with CONFIG_ARM64_64K_PAGES=y.

  pie: Error (pie/restorer.c:929): Can't restore 0x3ffb7e70000 mapping w>
  pie: ith 0xfffffffffffffff7

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-03 17:14:18 +03:00
Christopher Covington
7451fc7d23 restorer: Replace most hard-coded TASK_SIZE use
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded.
This fixes the following error on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y.

  pie: Error (pie/restorer.c:772): Unable to unmap (-): -1211695104

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-03 17:14:17 +03:00
Christopher Covington
c0c0546c31 kerndat: Introduce task_size variable
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-03 17:14:15 +03:00
Andrew Vagin
f13ec96e58 restore: fix race in calculation of a number of zombies
Currently each task subtracts number of zombies from
task_entries->nr_threads without locks, so if two tasks will do this
operation concurrently, the result may be unpredictable.

https://github.com/xemul/criu/issues/13

Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-03 17:12:10 +03:00
Andrey Vagin
7e413b0771 socket: remove unused code
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-08-03 17:06:19 +03:00
Artem Kuzmitskiy
b790b586eb Add restoring of unnamed unix sockets.
Added functionality for restoring unnamed unix sockets
using already implemented feature - inherit fd and using same command line
option.
Usage example:
criu restore -d -D images -o restore.log --pidfile restore.pid -v4 \
     -x --inherit-fd fd[3]:socket:[9677263]

Signed-off-by: Artem Kuzmitskiy <artem.kuzmitskiy@lge.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-07-29 17:53:36 +03:00