When we're restoring fsnotify watchees we need to resolve
path to a handle at some mountpoint referred by @s_dev
member (device ID) which is saved inside image. This
ID actually may be changed at the every mount (say
one restores container after machine reboot) or in
case of container's migration.
Thus the test for overmounting in __open_mountpoint
will fail and we get an error.
Lets do a trick: introduce @s_dev_rt member which
is supposed to carry run-time device ID. When dumping
this member simply equal to traditional @s_dev fetched
from the procfs, but when restoring we fetch it from
stat call once mountpoint become alive.
https://jira.sw.ru/browse/PSBM-41610
v2:
- predefine MOUNT_INVALID_DEV
- use fetch_rt_stat instead of assigning device in restore_shared_options
- copy @s_dev_rt in propagate_siblings and propagate_mount
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
close_olds_fds() knows nothing about more than one set of service file
descriptros, so it's better to call it before forking children as it was
bedore 9d60724eca ("restore: restore mntns before creating private vma-s")
The root task restores all processes and pin them with file descriptors,
then a task restores a mount namespace by opening the file descriptor of
the root task via /proc/pid/fd/X.
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
First -- avoid two memory copies by printing ns root directly, and
second -- remove extra argument from create_ghost, the mnt_id value
we need there can be found on the ghost_file object.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So here it is. If root task dies on restore the roots yard
dir remains unrmdired :( Since we already know its name, we
can remove one from criu. By the time we get to this place
the sub mount namespace(s) are already dead and yard dir
is empty. But umounting should be done by tasks after
successfull restore, so keep depopulation there.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Same thing as in previous patch -- we have too many generic
clean_ and fini_ prefixes over the code. And we need more (see
next patch), so let's specify what exactly we clean or fini.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case root task restore failure we'll have to remove the
roots yard dir from criu, so we have to create one by
criu to at least have the dit name.
It's OK to do it in criu, since the yards is created in
the opts.root which is the same for any mnt ns we deal
with on restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They both can container the MS_READONLY flag. And in one case it will be
read-only bind-mount and in another case it will be read-only
super-block.
v2: set mnt and sb for one call of mount() when it's posiable
v3: return a comment which was deleted by mistake
v4: Fix the sentense about restoring mnt flags
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
To handle deleted bindmounts we simply create
the former directory bindmount lived at, mount
the target and remove the directory back.
For this sake we add @deleted entry into the image.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's quite unclean while this structure lives
in proc_parse.h, which only have to fill this
structure on procfs read, but real handling
is inside mount.c. Move it as appropriate.
Same time ext_mount structure should be moved
into a header as well with sane @list name
used instead of @l.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is here only to support the Linux Kernel between versions
3.18 and 4.2. After that, this workaround is not needed anymore,
but it will work properly on both a kernel with and without the bug.
The bug is that when a process has a file open in an OverlayFS directory,
the information in /proc/<pid>/fd/<fd> and /proc/<pid>/fdinfo/<fd>
is wrong, so we grab that information from the mountinfo table instead.
This is done every time fill_fdlink is called.
We first check to see if the mnt_id and st_dev numbers currently match
some entry in the mountinfo table. If so, we already have the correct mnt_id
and no fixup is needed.
Then we proceed to see if there are any overlayFS mounted directories
in the mountinfo table. If so, we concatenate the mountpoint with the
name of the file, and stat the resulting path to check if we found the
correct device id and node number. If that is the case, we update the
mount id and link variables with the correct values.
Signed-off-by: Gabriel Guimaraes <gabriellimaguimaraes@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Finally add --enable-fs option to specify the comma separated list of
filesystem names which should be treated as FSTYPE_AUTO.
Note: obviously this option is not safe, use at your own risk. "dump"
will always succeed if the mntpoint is auto, but "restore" can fail or
do something wrong if mount(src, mountpoint, flags, options) can not
actually "just work" as FSTYPE_AUTO logic expects.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preparation.
1. Add the new "bool for_dump" arg to collect/parse_mntinfo().
2. Introduce "struct collect_mntns_arg" to pass the additional
"bool for_dump" field to collect_mntinfo() and change it to
pass this boolean to collect_mntinfo()->parse_mountinfo() path.
3. Change other callers of collect_mntinfo() to pass "false".
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since we're going to switch from int-fd-s to class-image
soon the fdset name will not fit into the new terminology.
This patch is
sed -e 's/fdset/imgset/g' -i *
sed -e 's/imgset_fd/img_from_set/g' -i *
git mv include/fdset.h include/imgset.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
On dump one uses one or more --ext-mount-map option with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.
On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace the should be bind mounted into the mountpoint.
v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Nowadays this routine is mainly used for getting an
fd, rather than keeping one for future reference.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch removes the global mntinfo_tree and collect_mount_info where
it was constructed. The mntinfo list is filled from dump_mnt_ns,
rst_collect_local_mntns, collect_mnt_namespaces and read_mnt_ns_img.
A mountinfo entry contains a reference on a proper ns_id entry, so
we cau use mnt_id to look up a proper mount namespace.
v2: remove trash after rebasing.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Kernels before 3.15 doesn't show mnt_id and mnt_id isn't saved in
images, if mntns isn't dumped.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore all mount namespaces are restored in the root mntns and
sub-namecpeaces are restored in temorary places.
This function allows to get paths to these places.
It will be used in open_remap_ghost(), because it's called in the root
task, when other tasks are not forked yet.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mntns, so the global mntinfo_tree
variable are useless and information about tree should be connected
to a proper namespace.
But when we don't dump mntns, we need to collect mounts for the current
mntns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mount namespaces and each NS has own
tree. The mount tree is used for checking that a file is reachable.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Known issue:
* currently only namespaces with the same root is supported
* nested namespaces can be dumped and restored only if the root task
has own mount namespace.
All nested namespaces are restored in a root namespace in temporary
directories. All mount points restored in one tree and then they are
divided into namesaces.
The task with minimal pid for each namespaces unshared mntns and
then it makes pivot_root in a proper temporary directory. All other
tasks makes setns to enter into a mount namespace of the task with
minimal pid.
v2: clean up
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All non-root namespaces will be restored as sub-trees of the root tree.
This patch adds helpers to create a temporary directory and mount tmpfs
in it, then create directories for each non-root mount namespace.
tmpfs is quite useful here to simplify destroying this construction,
we don't need to unmount each namespace separately.
v2: add a comment why MNT_DETACH is not dangerous here
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's already used for dumping files and it will be used for restoring,
so it should be service fd to avoid intersection with restored
descriptors.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Instead of scanning btrfs subvolumes (which can be even unaccessbile
if mount point lays on directory instead of subvolume itself) we use
path resolving feature here -- once we need to figure out if some
device number need to be altered up to mount point (as we know stat()
called on subvolume returns st_dev for subvolume itself, but not
one that associated with a superblock and shown in /proc/self/mountinfo
output).
This as well implies that we need to check if device number for ghost
files are to be updated to match mountinfo, thus we use phys_stat_resolve_dev
helper here.
After this patch the previously merged btrfs engine is no longer needed
(at least it seems so) and can be dropped.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This routine is aimed to find a mount point on which
the path passed as argument is laying on. We walk over
all mount points and see which one is matching.
Once found (in worst case it will be a root mount point
so function is never failing) we're checking if this is
btrfs and then return subvolume0 device id.
See commit 921cf873f3
for details what the hell we're doing here.
v2: rewrite mount_resolve_path w/o recursion
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For paths resolution we will need mount tree to be parsed
and built, but it's not that simple -- the current code
implies that once parsed the tree must not be re-parsed
again, so we pass @parse argument from a caller: if a task
we're restoring do not use mount namespace, we should parse
mount tree early, otherwise defer this action until mount
tree is read from the image.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This helper serves to hide fs specifics (in particular
btrfs) thus the caller won't need the details.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We really have a mess of extern/non-extern declaration
of functions in our headers. Always use extern for
unification purpose.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to replace pid on id in names of image files. The id is
uniq for each namespace, so it's more convient, if image files are
opened per namespace.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have generic do_pb_show() call and tons of show_foo
routines, that just call one with proper args. Compact
the code by putting the args into array and calling
the do_pb_show() in one place.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Information about mount points is used for dumping fanotify.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These are structs that (now) tie together ns string
and the CLONE_ flag. It's nice to have one (some code
becomes simpler) and will help us with auto-namespaces
detection.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>