v2: add comments and rename ns_created to ns_populated.
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Pass the function name into the helper instead of a pointer,
which doesn't provide much useful info.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to use SIG_SETMASK instead of SIG_BLOCK.
  SIG_SETMASK
      The set of blocked signals is set to the argument set.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of root task restore failure we'll have to remove the
roots yard dir from criu, so criu has to create it itself to
at least know the dir name.
It's OK to do it in criu, since the yard is created in
opts.root, which is the same for any mnt ns we deal
with on restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are already two things we do in criu namespaces before
forking the init task (start unsd and keep netnsfd for back
reference). Next patches will introduce the 3rd action for
mount namespaces, so have a special pre-call for all this
stuff.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Actually make use of the ns->type field and remove all getpid()'s
and other strange/inconsistent checks.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We (may) have 3 types of namespace objects in criu -- criu's one,
root task's one and others. All of them sometimes make sense and
we differentiate them in a weird way -- by checking the ns->pid
field against getpid() or by comparing with root_item's.
The proposal is to mark ns_id objects explicitly with a type field.
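A rough sketch of the idea, under the assumption that the type is
assigned once when the object is created. All names below
(ns_type_sketch, ns_id_sketch, the NS_T_* values and the helpers) are
hypothetical and are not taken from the criu sources.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical enum: one value per kind of namespace object. */
enum ns_type_sketch {
	NS_T_UNKNOWN = 0,
	NS_T_CRIU,	/* the namespace criu itself lives in */
	NS_T_ROOT,	/* the root task's namespace */
	NS_T_OTHER,	/* any other namespace */
};

/* Hypothetical, stripped-down ns_id: only the fields used here. */
struct ns_id_sketch {
	pid_t pid;
	enum ns_type_sketch type;
};

/* The type is set explicitly at creation time. */
void ns_id_init(struct ns_id_sketch *ns, pid_t pid, enum ns_type_sketch type)
{
	ns->pid = pid;
	ns->type = type;
}

/* Later checks look at the type, not at getpid() or root_item. */
int ns_is_criu(const struct ns_id_sketch *ns)
{
	return ns->type == NS_T_CRIU;
}

int ns_is_root(const struct ns_id_sketch *ns)
{
	return ns->type == NS_T_ROOT;
}
```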
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this in the next patch to correctly write sysctls.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this size in the next patch to avoid having to do some dynamic
allocation.
v2: call it MAX_UNSFD_MSG_SIZE instead
v3: fix all uses of MAX_MSG_SIZE :)
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required to dump uid-s and gid-s from this userns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When using pr_perror(), format string should not end with \n,
as it is added by the macro itself.
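To illustrate why the trailing \n is redundant, here is a hypothetical
stand-in for a perror-style macro. It is not the criu definition:
my_perror() and format_open_error() are made-up names, and the macro
formats into a buffer instead of writing to the log.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical perror-style macro: appends ": <strerror(errno)>" and
 * the newline itself, so the format string passed in must not end
 * with '\n'. Uses the GNU ##__VA_ARGS__ extension.
 */
#define my_perror(buf, size, fmt, ...) \
	snprintf(buf, size, fmt ": %s\n", ##__VA_ARGS__, strerror(errno))

/* Example caller; errno is set by hand to keep the sketch self-contained. */
int format_open_error(char *buf, size_t size, const char *path)
{
	errno = ENOENT;
	/* Correct usage: no '\n' at the end of the format string. */
	return my_perror(buf, size, "Can't open %s", path);
}
```

If the caller added its own \n, the error text would land on the next
line, followed by a second newline.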
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Acked-by: Andrew Vagin <avagin@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We no longer need to populate ext_ns->mnt.mntinfo_list until
resolve_external_mounts(). We can rely on find_ext_ns_id() which
does collect_mntinfo() on demand.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In the rest of this series we need to walk all the namespaces to autodetect
which mounts are master/shared/private bind mounts, so we need the information
from criu's namespace in the case when the namespaces are not the same.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Current code makes no difference between OPT and no-OPT
except for whether the message is printed in open_image().
So this particular change changes nothing but the availability of
this message.
In the next patches I will introduce "empty images" to deal with
the ENOENT situation in a more graceful manner.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have collected a good set of calls that cannot be done inside
user namespaces, but we need to [1]. Some of them have already
been addressed, like prctl mm bits restore, but some have not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to an unprivileged user), so
we need some way to call those things even in user namespaces.
The good news is that all the calls I've found operate
on file descriptors one way or another. So if we had a process
that lived outside of the user namespace, we could ask it to do the
privileged operation we need and pass the affected file
descriptor over a unix socket.
So the usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via unix socket.
Clients (the processes we restore) send it the functions they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
In the non-userns case the daemon is not started and the calls
are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the privileged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
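The descriptor exchange itself can be sketched with SCM_RIGHTS
ancillary data on a unix socket. This is only the fd-passing part and
not the usernsd protocol (no function id, no arguments blob). The
names send_fd_sketch() and recv_fd_sketch() are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one fd, aligned for struct cmsghdr. */
union fd_ctl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send 'fd' over unix socket 'sk' together with one dummy data byte. */
int send_fd_sketch(int sk, int fd)
{
	char data = 0;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sk, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from 'sk'; returns the new descriptor or -1. */
int recv_fd_sketch(int sk)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sk, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file,
which is what lets a process outside the user namespace operate on a
file or socket that belongs to a restored task.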
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
We enter the target userns and try to enter the other namespaces.
The "enter" operation requires CAP_SYS_ADMIN in the user namespace
where the target namespace was created.
Now if one or more namespaces were created in another userns,
criu stops dumping and returns an error. I want to find someone who uses
this configuration. In this case restore will be more complicated.
The current version covers the needs of containers.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It is cleared when a process is forked in a new userns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this patch we fill /proc/PID/uid_map and /proc/PID/gid_map for the
root task.
v2: initialize groups in a new namespace.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
v3: add a helper to initialize creds in a new userns
v4: initialize userns creds in prepare_namespaces()
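For reference, each line of uid_map/gid_map has the form
"<id inside ns> <id outside ns> <count>". Below is a small sketch of
building such a buffer so that it can be written with a single
write(); the struct and function names are hypothetical and this is
not the code from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one mapping extent. */
struct id_extent_sketch {
	unsigned int first;		/* first id inside the namespace */
	unsigned int lower_first;	/* first id outside the namespace */
	unsigned int count;		/* length of the range */
};

/*
 * Format 'n' extents into 'buf', one "inside outside count" line each.
 * Returns the number of bytes produced or -1 if 'buf' is too small.
 * The result is meant to be written to /proc/PID/uid_map (or gid_map)
 * in one write() call.
 */
int format_id_map_sketch(char *buf, size_t size,
			 const struct id_extent_sketch *ext, int n)
{
	size_t off = 0;
	int i, len;

	for (i = 0; i < n; i++) {
		len = snprintf(buf + off, size - off, "%u %u %u\n",
			       ext[i].first, ext[i].lower_first, ext[i].count);
		if (len < 0 || (size_t)len >= size - off)
			return -1;
		off += len;
	}

	return (int)off;
}
```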
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For that we need to save per-namespace mappings of user and group IDs.
And all id-s for tasks and files are saved from the target user
namespace.
v2: move code into collect_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support user namespaces and uid-s will be converted
according to the userns mappings.
v2: convert id-s for sockets too
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
and return an error if a process lives in another userns,
because criu doesn't support it.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CRIU reads /proc/pid/ns/[NS] and fails if a link does not exist.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is for two reasons. First, validation can meet an external mount
and will call plugins, which is not correct on pre-dump and actually
crashes on uninitialized plugin lists. Second, even if on pre-dump
the mount tree is not "supported" this can be a temporary situation (yes,
yes, unlikely, but still).
On the other hand, it's better to fail earlier, but that's another
story.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
On pre-dump we collect only two namespaces -- the mnt one
for criu and mnt one again for root task.
This is not correct. We need all mount namespaces to make
the irmap generation work properly and we need all net
namespaces to have parasite sockets created.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We want to have buffered images to speed up dump and,
slightly, restore. Right now we use plain file descriptors
to write and read images to/from. Making them buffered
cannot be gracefully done on plain fds, so introduce
a new class.
This will also help if (when?) we want to do more
complex changes with images, e.g. store them all in one
file or send them directly to the network.
For now the cr_img just contains one int _fd variable.
This patch changes the prototype of open_image() to
return struct cr_img *, pb_(read|write)* to accept one
and fixes the compilation of the rest of the code :)
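A sketch of what such a wrapper looks like at this stage. Only the
struct with its single _fd member comes from the description above;
the helper names and signatures are hypothetical and much simpler than
the real open_image().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* For now the image object holds just one file descriptor. */
struct cr_img {
	int _fd;
};

/* Hypothetical opener: returns an image object or NULL on failure. */
struct cr_img *img_open_sketch(const char *path, int flags)
{
	struct cr_img *img = malloc(sizeof(*img));

	if (!img)
		return NULL;

	img->_fd = open(path, flags, 0600);
	if (img->_fd < 0) {
		free(img);
		return NULL;
	}

	return img;
}

/* Callers that still need the plain fd ask for it explicitly. */
int img_raw_fd_sketch(const struct cr_img *img)
{
	return img->_fd;
}

/* Close the descriptor and release the object. */
void img_close_sketch(struct cr_img *img)
{
	if (!img)
		return;
	close(img->_fd);
	free(img);
}
```

Hiding the descriptor behind an object leaves room to add a buffer or
a different backend later without touching every caller again.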
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Since we're going to switch from int-fd-s to class-image
soon the fdset name will not fit into the new terminology.
This patch is
sed -e 's/fdset/imgset/g' -i *
sed -e 's/imgset_fd/img_from_set/g' -i *
git mv include/fdset.h include/imgset.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
The main reason for this is that dumping a namespace has a lot of
points where the process just waits for something. At the same
time the criu process waits for the ns dumper and doesn't dump
others.
A great example of waiting for something is the setns syscall.
Very often it calls synchronize_rcu(), which can take quite long.
Let other processes do something useful in the meantime.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Nowadays this routine is mainly used for getting an
fd, rather than keeping one for future reference.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>