Currently on dump we generate too many image files; effectively
everything from the GLOB set is created. Some of the created
images can be empty (containing just the magic number at the
head). Those images are useless and just waste space.
When applied after the "empty images" set, this introduces
lazy images: when we call open_image(), the actual file is
created (and the magic number written into it) only when the
very first object goes into it.
For example, for the simplest test we have, the static/env00
one, the created image files are:
core-7290.img
creds-7290.img
fdinfo-2.img
fs-7290.img
ids-7290.img
inventory.img
mm-7290.img
pagemap-7290.img
pages-1.img
pstree.img
reg-files.img
sigacts-7290.img
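The lazy creation described above can be sketched roughly as follows.
This is a minimal illustration, not the actual CRIU code: struct
lazy_img and the lazy_img_*() helpers are hypothetical names, and the
real open_image() deals with directories, flags and image types that
are left out here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical lazy image handle; not CRIU's real image structure. */
struct lazy_img {
	char path[256];
	uint32_t magic;
	int fd; /* stays -1 until the first object is written */
};

/* "Opening" only records the path and magic; nothing touches the disk. */
static void lazy_img_open(struct lazy_img *img, const char *path, uint32_t magic)
{
	strncpy(img->path, path, sizeof(img->path) - 1);
	img->path[sizeof(img->path) - 1] = '\0';
	img->magic = magic;
	img->fd = -1;
}

/* Create the file and write the magic on the first write, then append. */
static int lazy_img_write(struct lazy_img *img, const void *obj, size_t len)
{
	if (img->fd < 0) {
		img->fd = open(img->path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
		if (img->fd < 0)
			return -1;
		if (write(img->fd, &img->magic, sizeof(img->magic)) !=
		    (ssize_t)sizeof(img->magic))
			return -1;
	}

	return write(img->fd, obj, len) == (ssize_t)len ? 0 : -1;
}

/* Closing an image that never received an object leaves no file behind. */
static void lazy_img_close(struct lazy_img *img)
{
	if (img->fd >= 0)
		close(img->fd);
	img->fd = -1;
}
```

With this shape an image that is opened and closed without any write
never shows up in the images directory at all.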
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When an image of a certain type is not found, CRIU sometimes
fails and sometimes ignores this fact. I propose to always
ignore it and treat absent images as those containing no
objects (i.e. empty). If the later code flow actually
_needs_ objects, then criu will fail later.
Why would an object be explicitly required? For example, the
restoring code may read the image with pb_read_one, w/o the
_eof suffix, thus requiring the object to be in the image.
Another example is object dependencies. E.g. fdinfo objects
require various file objects, so missing image files will
result in unresolved searches later.
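A rough sketch of the proposed behaviour, under simplifying
assumptions: struct img and the img_*() helpers below are hypothetical
stand-ins (they are not pb_read_one itself), objects are fixed-size
blobs, and the magic number is ignored.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical image handle: fd == -1 marks an "empty" (absent) image. */
struct img {
	int fd;
};

/* A missing file yields an empty image; any other failure is an error. */
static int img_open(struct img *img, const char *path)
{
	img->fd = open(path, O_RDONLY);
	if (img->fd < 0 && errno != ENOENT)
		return -1;
	return 0;
}

static bool img_is_empty(const struct img *img)
{
	return img->fd < 0;
}

/* Returns 1 if an object was read, 0 on end of image, -1 on error. */
static int img_read_one_eof(struct img *img, void *buf, size_t len)
{
	ssize_t n;

	if (img_is_empty(img))
		return 0; /* an absent image behaves like one with no objects */

	n = read(img->fd, buf, len);
	if (n == 0)
		return 0;
	return n == (ssize_t)len ? 1 : -1;
}

/* Variant for callers that require an object: end of image is an error. */
static int img_read_one(struct img *img, void *buf, size_t len)
{
	return img_read_one_eof(img, buf, len) == 1 ? 0 : -1;
}
```

Opening never fails on ENOENT here; the failure is deferred to the
first caller that really needs an object.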
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When page-read fails to open the pagemap image it reports an error.
One place (stacked page-reads) needs to handle the absent-image
case gracefully, so fix the return codes to make this check
work.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The current code makes no distinction between OPT and no-OPT
except for whether the message is printed in open_image().
So this particular change changes nothing but the availability of
this message.
In the next patches I will introduce "empty images" to deal with
the ENOENT situation in a more graceful manner.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Force-read came from the very first dev version of CRIU (even before the
1.0 release) and has never actually been used in images.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The plan is to replace tons of if (type == TTY_TYPE_FOO) checks
with type->something dereferences.
To do this, start with replacing int type with struct tty_type *
in relevant places and fixing compilation.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
/dev/tty stands for the current terminal, which we have not yet
implemented support for.
This is a bugfix for the upcoming stable version; proper
support of /dev/tty will be implemented separately.
Reported-by: Saied Kazemi <saied@google.com>
CC: Andrew Vagin <avagin@parallels.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For an established TCP connection, the send queue is restored in two
steps: in step (1), we retransmit the data that was sent before but not
yet acknowledged, and in step (2), we transmit the data that was never
sent outside before. The TCP_REPAIR option is disabled before step (2)
and re-enabled after step (2) (without this patch).
If the amount of data to be sent in step (2) is large, the TCP_REPAIR
flag on the socket can remain off for some time (O(milliseconds)). If a
listen() is called on another socket bound to the same port during this
time window, it fails. This is because turning TCP_REPAIR off clears
the SO_REUSEADDR flag on the socket.
This patch adds a mutex (reuseaddr_lock) per port number, so that a
listen() on a port number does not happen while SO_REUSEADDR for another
socket on the same port is off.
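As an illustration of the per-port locking idea only: the sketch below
uses a C11 atomic_flag as a stand-in for the real reuseaddr_lock, and
struct port_lock plus the port_lock_*() helpers are hypothetical
names. It also assumes a single address space; for locking between
restored processes the table would have to live in shared memory.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 64 /* small fixed table, enough for a sketch */

/* One lock per port number in use. */
struct port_lock {
	unsigned short port;
	bool used;
	atomic_flag busy;
};

static struct port_lock port_locks[MAX_PORTS];

/* Find or create the lock for a port (not thread-safe: call during setup). */
static struct port_lock *port_lock_get(unsigned short port)
{
	for (size_t i = 0; i < MAX_PORTS; i++) {
		struct port_lock *pl = &port_locks[i];

		if (pl->used && pl->port == port)
			return pl;
		if (!pl->used) {
			pl->used = true;
			pl->port = port;
			atomic_flag_clear(&pl->busy);
			return pl;
		}
	}
	return NULL; /* table is full */
}

/* Returns true if the lock was taken, false if someone else holds it. */
static bool port_lock_try(struct port_lock *pl)
{
	return !atomic_flag_test_and_set(&pl->busy);
}

/* Spin until the lock is taken. */
static void port_lock_acquire(struct port_lock *pl)
{
	while (atomic_flag_test_and_set(&pl->busy))
		;
}

static void port_lock_release(struct port_lock *pl)
{
	atomic_flag_clear(&pl->busy);
}
```

The restorer of the send queue would hold the lock from the moment
TCP_REPAIR is turned off until it is turned back on, and the code
calling listen() would take the same lock around the listen() call.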
Thanks to Amey Deshpande <ameyd@google.com> for debugging.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We link files to each other at restore time to restore
unlinked paths. The kernel has strange security restrictions
on the linkat we use. If the fsuid of the caller doesn't
equal the uid of the file and the file is not a "safe"
one, then only global CAP_FOWNER will be allowed to link().
This brings problems in user namespaces -- uns root is
not allowed to linkat any file, unlike global root.
Fortunately, we can change the fsuid temporarily and
still linkat the file we want. Hopefully this hack will
go away some day soon, when the kernel has saner
checks for linkat capabilities.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We have collected a good set of calls that cannot be done inside
user namespaces, but that we need [1]. Some of them have already
been addressed, like the prctl mm bits restore, but some have not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to an unprivileged user), so
we need some way to call those things even in user namespaces.
The good news is that all the calls I've found operate
on file descriptors in one way or another. So if we had a process
that lived outside of the user namespace, we could ask it to do the
privileged operation we need and exchange the affected file
descriptor via a unix socket.
So usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via a unix socket.
Clients (the processes we restore) send it the function they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
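The descriptor exchange this relies on is the standard SCM_RIGHTS
mechanism of unix sockets. A minimal sketch of that part only is
shown below; send_fd() and recv_fd() are illustrative helpers, and
the request format (function, arguments blob) is left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one descriptor over a unix socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sk, int fd)
{
	char data = 'F'; /* at least one byte of payload must be sent */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sk, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd or -1 on failure. */
static int recv_fd(int sk)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sk, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor is a new number in the receiver that refers to
the same open file, so the daemon can operate on it and the effect is
visible to the client.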
In the non-user-namespace case the daemon is not started and the calls
are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the privileged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
sockets.c: In function ‘preload_socket_modules’:
sockets.c:153:36: error: ‘NETLINK_SOCK_DIAG’ undeclared (first use in this function)
sockets.c:153:36: note: each undeclared identifier is reported only once for each function it appears in
Reported-by: Mr Travis
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Right now we state that CRIU works on 3.11 and above kernels and, at the
same time, have support for a couple of new features like aio, tun, timerfd
etc. available in later kernels. Since these new features do not break
generic operations we do not require them in the kernel strictly.
However, in the zdtm tests it's very important to know exactly what can
and what cannot be tested. Right now this is done in a tough manner -- if
the kernel is not 3.11 or criu check fails for _any_ reason we treat the
kernel as being "bad" and throw out a set of tests.
I propose to test some individual features and form the list of tests
in a more fine-grained manner.
This patch only fixes the AIO, mnt_id, tun and posix-timers tests. Next
I will add checks and fixes for user-namespaces tests.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
In the next patch we will need to care about the exact error reported by
the kernel, so add the error callback for this.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When restoring a pair of veth devices that had one end inside a namespace
or container and the other end outside, CRIU creates a new veth pair,
puts one end in the namespace/container, and names the other end from
what's specified in the --veth-pair IN=OUT command line option.
This patch allows for appending a bridge name to the OUT string in the
form of OUT@<BRIDGE-NAME> in order for CRIU to move the outside veth to
the named bridge. For example, --veth-pair eth0=veth1@br0 tells CRIU
to name the peer of eth0 veth1 and move it to bridge br0.
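Parsing of the extended argument could look roughly like the sketch
below. struct veth_pair and parse_veth_pair() are hypothetical names
used for illustration; the 16-byte limit mirrors IFNAMSIZ on Linux.

```c
#include <string.h>

#define IFNAME_MAX 16 /* same value as IFNAMSIZ on Linux */

struct veth_pair {
	char inside[IFNAME_MAX];
	char outside[IFNAME_MAX];
	char bridge[IFNAME_MAX]; /* empty string when no bridge is given */
};

/* Parse "IN=OUT" or "IN=OUT@BRIDGE"; returns 0 on success, -1 on bad input. */
static int parse_veth_pair(const char *arg, struct veth_pair *vp)
{
	const char *eq = strchr(arg, '=');
	const char *at;
	size_t in_len, out_len;

	if (!eq)
		return -1;

	at = strchr(eq + 1, '@');
	in_len = (size_t)(eq - arg);
	out_len = at ? (size_t)(at - eq - 1) : strlen(eq + 1);

	if (!in_len || !out_len || in_len >= IFNAME_MAX || out_len >= IFNAME_MAX)
		return -1;
	if (at && (at[1] == '\0' || strlen(at + 1) >= IFNAME_MAX))
		return -1;

	memset(vp, 0, sizeof(*vp));
	memcpy(vp->inside, arg, in_len);
	memcpy(vp->outside, eq + 1, out_len);
	if (at)
		strcpy(vp->bridge, at + 1);

	return 0;
}
```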
This is a simple and handy extension of the --veth-pair option that
obviates the need for an action script although one can still do the same
(and possibly more) if they prefer to use action scripts.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a nasty issue with it. The current code allocates these
entries in the shremap area one by one. We do NOT allocate any
OTHER entries in this region, but if we ever do, this array will
be spoiled.
Fortunately we no longer need shmem-infos as a plain array,
nor do we need one in the restorer. So just turn these into plain
shared objects and collect them in a list.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The struct and the find routine used to be used by restorer code. Now
the former fully uses vmas and fd opened, so we can move the code
into the .c file so as not to pollute the global namespace.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have two places where we look up the inherited-fd list
by name and dup() the descriptor found. I propose to factor
this piece out into a single inherited_fd() call. When
we want to support inheritance for sockets or any
other files, we'll simply add the inherited_fd() call
there.
I'm also thinking about moving the call to inherited_fd
into the generic level, but the open_path() routine doesn't
allow doing it in a simple manner.
Also, we have not yet resolved the issue with the files-vs-inodes
mapping. Keeping all the logic in one function should
make the solution simpler.
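The lookup-and-dup step being factored out can be sketched as below.
The signature is an assumption made for illustration (a plain array
keyed by name); the real inherited_fd() works on CRIU's own file
descriptor structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical entry of the inherited-fd list: a name and criu's own fd. */
struct inherit_fd {
	const char *id;
	int fd;
};

/*
 * Returns false if id is not in the list. Otherwise returns true and
 * stores the dup()ed descriptor in *new_fd (-1 if dup() failed).
 */
static bool inherited_fd(const struct inherit_fd *list, size_t n,
			 const char *id, int *new_fd)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(list[i].id, id) != 0)
			continue;
		*new_fd = dup(list[i].fd);
		return true;
	}
	return false;
}
```

A caller opening a file would first try inherited_fd() and fall back
to the normal open path only when it returns false.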
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this mode we test whether the target cpu has all the features
present in the image file but do not require a bit-to-bit match: the
target cpu may be a newer one with more features present.
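Expressed on feature bitmaps, the check is a subset test. A minimal
sketch, assuming the features are kept as arrays of 32-bit words (the
function name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every feature bit set in the image is also set on the target. */
static bool cpu_has_image_features(const uint32_t *image,
				   const uint32_t *target, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		if (image[i] & ~target[i])
			return false; /* the image needs a bit the target lacks */
	return true;
}
```

Extra bits on the target side never make the check fail, which is the
difference from a bit-to-bit comparison.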
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Restoring AIO is quite simple. Once all VMAs are put in
their places we can call io_setup() to let the kernel create
the context back and then move the ring into its proper place.
Another thing we should "restore" is the context ID. But
the thing is, upon ring creation the kernel reports the ring
start address as this ID. And there's a patch in the -next
tree that changes the ID when we remap the ring. That
said, after AIO context creation and ring remap we need
to check that the new ID is seen by the kernel.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When an AIO context is set up the kernel does two things:
1. creates an in-kernel aioctx object
2. maps a ring into process memory
The 2nd thing gives us all the needed information
about how the AIO was set up. So, in order to dump
one we need to pick the ring in memory and get all
the information we need from it.
One thing to note -- we cannot dump tasks if there
are any AIO requests pending. So we also need to
go to the parasite and check that the ring is empty.
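A sketch of what the ring check might look like. The header layout
below is copied from the kernel's struct aio_ring (fs/aio.c) as an
assumption, since it is not exported in a uapi header, and reading
"empty" as head == tail is also an assumption of this sketch.

```c
#include <stdbool.h>

#define AIO_RING_MAGIC 0xa10a10a1u

/* Assumed layout of the header at the start of the mapped ring. */
struct aio_ring_hdr {
	unsigned id;   /* kernel internal index */
	unsigned nr;   /* number of io_events in the ring */
	unsigned head; /* consumer position */
	unsigned tail; /* producer position */
	unsigned magic;
	unsigned compat_features;
	unsigned incompat_features;
	unsigned header_length; /* size of this header */
};

/* The mapping looks like an AIO ring we know how to interpret. */
static bool aio_ring_is_valid(const struct aio_ring_hdr *ring)
{
	return ring->magic == AIO_RING_MAGIC &&
	       ring->header_length == sizeof(*ring);
}

/* No events are sitting in the ring. */
static bool aio_ring_is_empty(const struct aio_ring_hdr *ring)
{
	return ring->head == ring->tail;
}
```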
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
/dev/ttyN are the virtual terminals, which are provided
by the system with major 4 and minor 1..63.
You can run some program on ttyN by pressing ctrl+alt+FN
and running it manually or by using open (openvt nowadays).
This patch also allows us to run all our tests from a vt.
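Recognising such a device by its number can be sketched as follows,
taking the constants from linux/major.h and linux/vt.h. The helper
name is illustrative, not the one used in the patch.

```c
#include <stdbool.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <linux/major.h>
#include <linux/vt.h>

/* True for /dev/ttyN devices: major 4, minor within the console range. */
static bool is_virtual_terminal(dev_t rdev)
{
	return major(rdev) == TTY_MAJOR &&
	       minor(rdev) >= MIN_NR_CONSOLES &&
	       minor(rdev) <= MAX_NR_CONSOLES;
}
```

The rdev value would normally come from the st_rdev field of a stat()
on the terminal's file descriptor.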
v2, style fix + using linux/vt.h for constants
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a very common error when using criu.
The problem here is that we need to somehow transfer cr_errno
from one process to another. I suggest using a pipe, giving
one end to the children and reading cr_errno on the other after
restore is finished.
v2: Pavel suggested putting errno into shared task_entries.
v3: he also suggested using cmpxchg.
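The cmpxchg idea is that only the first reported error is kept. A
minimal sketch with C11 atomics, using hypothetical names; in criu the
variable would sit in the shared task_entries area so that all
restored processes see it, while here it is a plain global.

```c
#include <stdatomic.h>

/* Stand-in for the shared error slot; 0 means "no error recorded yet". */
static _Atomic int cr_errno_shared;

/* Record err only if nothing was recorded before: first reporter wins. */
static void set_cr_errno(int err)
{
	int expected = 0;

	atomic_compare_exchange_strong(&cr_errno_shared, &expected, err);
}

static int get_cr_errno(void)
{
	return atomic_load(&cr_errno_shared);
}
```

Later failures, which are often just consequences of the first one, do
not overwrite the original cause.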
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are cases where a process's file descriptor cannot be restored
from the checkpoint images. For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.
There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time. For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.
In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option. The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.
As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d. This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.
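Parsing of the two argument formats could be sketched like this.
struct inherit_arg and parse_inherit_fd() are hypothetical names, and
the 256-byte key limit is an arbitrary choice for the sketch.

```c
#include <stdio.h>
#include <string.h>

#define KEY_MAX 256

struct inherit_arg {
	int is_debug;      /* 1 for "debug[%d]:%s", 0 for "fd[%d]:%s" */
	int fd;            /* criu's own descriptor number */
	char key[KEY_MAX]; /* everything after the colon */
};

/* Returns 0 on success, -1 if the argument matches neither format. */
static int parse_inherit_fd(const char *arg, struct inherit_arg *out)
{
	int consumed = 0;

	/* %n is only stored when the whole prefix, including "]:", matched. */
	if (sscanf(arg, "fd[%d]:%n", &out->fd, &consumed) == 1 && consumed > 0) {
		out->is_debug = 0;
	} else {
		consumed = 0;
		if (sscanf(arg, "debug[%d]:%n", &out->fd, &consumed) != 1 ||
		    consumed == 0)
			return -1;
		out->is_debug = 1;
	}

	if (out->fd < 0 || arg[consumed] == '\0' ||
	    strlen(arg + consumed) >= KEY_MAX)
		return -1;

	strcpy(out->key, arg + consumed);
	return 0;
}
```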
It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited. So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.
For examples please visit http://criu.org/Category:HOWTO.
v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
Also, as an extra measure of caution, added checks in the inherit fd
look up functions to make sure that the inherit fd hasn't been reused.
The patch also includes minor cosmetic changes.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The kernel can do it better. The problem exists only for recv queues.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we are doing pre-dump, we splice pages in pipes and only then open
images and dump pages. But when we are splicing pages, we need to know
about the existence of parent images. This patch adds a new call to
determine the existence of parent images.
In addition this patch fixes the following issue:
CID 83244 (#1 of 1): Uninitialized pointer read (UNINIT)
14. uninit_use: Using uninitialized value xfer.parent.
v2: initialize the unused field of struct page_server_iov, because it is
sent over the network.
CID 83451 (#1 of 1): Uninitialized scalar variable (UNINIT)
2. uninit_use_in_call: Using uninitialized value pi. Field pi.nr_pages
is uninitialized when calling write.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Right now we push all the auxiliary arguments to parasite_infect_seized
while 2 of them are only required to calculate the size of args area.
Let's better keep track of required args size and get rid of excessive
arguments to parasite_infect_seized().
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
ispathsub("/foo", "/") reports false. This is a corner case,
as the 2nd argument is not expected to end with /. Fix this and
add a comment about the assumptions on ispathsub() arguments.
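For reference, a sketch of a sub-path check with the "/" corner case
handled. This is an independent illustration written under the stated
assumption that the 2nd argument does not end with a slash unless it
is "/" itself; it is not the code from this patch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * True if path equals sub or lies below it. sub must not end with '/'
 * unless it is exactly "/".
 */
static bool ispathsub(const char *path, const char *sub)
{
	size_t len = strlen(sub);

	if (strncmp(path, sub, len) != 0)
		return false;

	/* "/" already ends with a separator, so any match is a sub-path. */
	if (len == 1 && sub[0] == '/')
		return true;

	/* The match must end on a component boundary: "/foobar" is not in "/foo". */
	return path[len] == '\0' || path[len] == '/';
}
```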
Reported-by: Andrey Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we validate the mount tree not to have overmounts we need to
check one path to be the sub-path of another. Here's a helper for
this.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
For that we need to save per-namespace mappings of user and group IDs.
And all id-s for tasks and files are saved from the target user
namespace.
v2: move code into collect_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>