1725 commits

Author SHA1 Message Date
Pavel Emelyanov
8ce37e676a img: Don't create empty images
Currently on dump we generate too many image files, effectively
all the stuff from the GLOB set is created. The thing is that
sometimes some of created images can be empty (just contain the
magic number at the head). Those images are useless and just
waste the space.

When applied after the "empty images" set, this introduces the
lazy images -- when we call open_image() the actual file is
only created (and the magic number is written into it) when the
very first object goes into it.

For example, for the simplest test we have, the static/env00
one, the created image files are

   core-7290.img
   creds-7290.img
   fdinfo-2.img
   fs-7290.img
   ids-7290.img
   inventory.img
   mm-7290.img
   pagemap-7290.img
   pages-1.img
   pstree.img
   reg-files.img
   sigacts-7290.img

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-16 15:58:32 +03:00
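
The lazy creation described in 8ce37e676a can be sketched as follows. This is a minimal Python illustration of the idea, not CRIU's C code: the LazyImage class, the length-prefixed record framing and the magic value used in the usage note are all assumptions made for the sketch.

```python
import struct


class LazyImage:
    """Image file that is created only when the first object goes into it."""

    def __init__(self, path, magic):
        self.path = path
        self.magic = magic
        self._file = None  # nothing exists on disk yet

    def write_object(self, payload):
        if self._file is None:
            # First object: create the file and write the magic header now.
            self._file = open(self.path, "wb")
            self._file.write(struct.pack("<I", self.magic))
        # Length-prefixed record (framing is an assumption of this sketch).
        self._file.write(struct.pack("<I", len(payload)) + payload)

    def close(self):
        # Closing an image that never received an object leaves no file behind.
        if self._file is not None:
            self._file.close()
            self._file = None
```

Opening two such images and writing to only one of them leaves a single file in the directory, which is the effect the commit message describes for the static/env00 test.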
Pavel Emelyanov
7ede4697cf bfd: Don't leak image-open flags into bfdopen
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-16 15:58:14 +03:00
Pavel Emelyanov
f7f76d6ba6 img: Introduce empty images
When an image of a certain type is not found, CRIU sometimes
fails and sometimes ignores this fact. I propose to always ignore
it and treat absent images as those containing no
objects inside (i.e. empty). If the later code flow
_needs_ objects, then criu will fail later.

Why would an object be explicitly required? For example, due to
the restoring code reading the image with pb_read_one, w/o the
_eof suffix, thus requiring the object to be in the image.

Another example is objects dependencies. E.g. fdinfo objects
require various files objects. So missing image files will
result in non-resolved searches later.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-13 14:42:54 +03:00
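
The behaviour proposed in f7f76d6ba6 can be sketched as follows. This is a hedged Python illustration, not CRIU's implementation: Image, open_image(), read_one() and read_one_eof() are hypothetical stand-ins for the image handle, open_image() and the pb_read_one / _eof readers named in the message, and magic-number handling is left out.

```python
import struct


class Image:
    """Read-side handle; stream is None when the file was absent (empty image)."""

    def __init__(self, stream=None):
        self.stream = stream

    def read_one_eof(self):
        """Return the next object, or None when there are no more objects."""
        if self.stream is None:
            return None  # absent image behaves like one with no objects
        header = self.stream.read(4)
        if not header:
            return None
        (size,) = struct.unpack("<I", header)
        return self.stream.read(size)

    def read_one(self):
        """Return the next object; fail if the caller needs one and none exists."""
        obj = self.read_one_eof()
        if obj is None:
            raise EOFError("object required but image has none")
        return obj


def open_image(path):
    """Open an image for reading; a missing file yields an empty image."""
    try:
        return Image(open(path, "rb"))
    except FileNotFoundError:
        return Image()
```

With this shape, opening never fails on ENOENT; the failure moves to the first reader that actually requires an object.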
Pavel Emelyanov
45a0cc4234 page-read: Explicitly mark ENOENT with return code
When page-read fails to open the pagemap image it reports an error.
One place (stacked page-reads) needs to handle the absent images
case gracefully, so fix the return codes to make this check
work.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-13 14:42:11 +03:00
Pavel Emelyanov
e29c9daec2 img: Remove O_OPT and COLLECT_OPTIONAL
Current code doesn't make any difference between OPT and no-OPT
except for whether the message is printed in open_image().
So this particular change changes nothing but the availability of
this message.

In the next patches I will introduce "empty images" to deal with
the ENOENT situation in a more graceful manner.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-13 14:42:01 +03:00
Cyrill Gorcunov
19948472d9 tty: Rename tty_type to tty_driver
There are too many "type" in code.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-10 21:16:22 +03:00
Cyrill Gorcunov
652fbf3bd1 tty: Drop redundant constants
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-10 21:16:10 +03:00
Pavel Emelyanov
f32f4ffa76 img: Open images for dump in O_WRONLY mode
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-09 22:21:15 +03:00
Pavel Emelyanov
618c17b6f8 img: Simplify the open_image() macro
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-09 22:21:08 +03:00
Pavel Emelyanov
dceb6633c7 page-read: Introduce custom flags for opening
Instead of open flags and boolean is_shmem argument.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-04 17:50:32 +03:00
Cyrill Gorcunov
3bd6d9d7b0 image: Add comments about VMA_AREA constants and drop FORCE_READ flag
Force-read came from the very first dev version of CRIU (even before the 1.0 release)
and has never actually been used in the image.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-04 17:48:47 +03:00
Pavel Emelyanov
057f00ce92 tty: Make tty type be object rather than integer
The plan is to replace tons of if (type == TTY_TYPE_FOO) checks
with type->something dereferences.

To do this, start with replacing int type with struct tty_type *
in relevant places and fixing compilation.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-04 17:47:04 +03:00
Pavel Emelyanov
a7601d6a50 tty: Move tty_type() and is_pty() to tty.c
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-03-04 17:46:16 +03:00
Cyrill Gorcunov
bec5a023d1 tty: Fix mistyping of /dev/tty
/dev/tty stands for the current terminal, which we have not yet
implemented support for.

This is a bugfix for upcoming stable version, the proper
support of /dev/tty is gonna be implemented separately.

Reported-by: Saied Kazemi <saied@google.com>
CC: Andrew Vagin <avagin@parallels.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-02-20 00:11:38 +03:00
Saied Kazemi
1b4e9058e8 Do not call listen() when SO_REUSEADDR is off
For an established TCP connection, the send queue is restored in two
steps: in step (1), we retransmit the data that was sent before but not
yet acknowledged, and in step (2), we transmit the data that was never
sent outside before.  The TCP_REPAIR option is disabled before step (2)
and re-enabled after step (2) (without this patch).

If the amount of data to be sent in step (2) is large, the TCP_REPAIR
flag on the socket can remain off for some time (O(milliseconds)).  If a
listen() is called on another socket bound to the same port during this
time window, it fails. This is because turning TCP_REPAIR off clears
the SO_REUSEADDR flag on the socket.

This patch adds a mutex (reuseaddr_lock) per port number, so that a
listen() on a port number does not happen while SO_REUSEADDR for another
socket on the same port is off.

Thanks to Amey Deshpande <ameyd@google.com> for debugging.

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-02-16 13:18:32 +03:00
Andrey Vagin
3f23bde548 criu: print correct errno messages from pr_perror()
"%m" can't be used to print strerror(errno), because print_on_level()
calls gettimeofday() which can overwrite errno.

For example:
13486 connect(4, {sa_family=AF_INET, sin_port=htons(8880), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENETUNREACH (Network is unreachable)
13486 gettimeofday({1423756664, 717423}, NULL) = 0
13486 open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
13486 write(2, "15:57:44.717:     4: ERR: socket_udp.c:73: Can't connect (errno = 101 (Permission denied))\n", 91) = 91

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-02-13 15:14:44 +03:00
Pavel Emelyanov
9a392dff3a reg-files: Do not try to linkat with wrong user
We link files to each other at restore time to restore
unlinked paths. Kernel has strange security restrictions
on the linkat we use. If the fsuid of the caller doesn't
equal the uid of the file and the file is not a "safe"
one, then only global CAP_CHOWN will be allowed to link().

This brings problems in user namespaces -- uns root is
not allowed to linkat any file, unlike global root.

Fortunately, we can change the fsuid temporarily and
still linkat the file we want. Hopefully this hack will
go away some day soon, when the kernel will have saner
checks for linkat capabilities.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2015-02-13 16:11:38 +04:00
Pavel Emelyanov
b8556e8084 usernsd: The way to restore privileged stuff in userns
We have collected a good set of calls that cannot be done inside
user namespaces, but that we need to make [1]. Some of them have already
been addressed, like prctl mm bits restore, but some have not.

I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to an unprivileged user), so
we need some way to call those things even in user namespaces.

The good news is that all the calls I've found operate
on file descriptors one way or another. So if we had a process
that lived outside of the user namespace, we could ask it to do the
high priority operation we need and exchange the affected file
descriptor via a unix socket.

So the usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via a unix socket.
Clients (the processes we restore) send it the functions they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.

In non usernamespace case the daemon is not started and the calls
are done right in the requestor's process environment.

In the next patch there's an example of how to use this daemon
to do the privileged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.

[1] http://criu.org/UserNamespace

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
2015-02-13 16:11:38 +04:00
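
The request flow described in b8556e8084 (a client passes a function name, a file descriptor and an arguments blob over a unix socket, and a helper outside the user namespace performs the call) can be sketched as follows. This is a hedged Python illustration, not CRIU's usernsd: request(), serve_one(), read_reply(), the HANDLERS table, the "write_marker" operation and the NUL-separated wire format are all assumptions of the sketch, and a harmless write stands in for a privileged call. Descriptor passing uses SCM_RIGHTS ancillary data.

```python
import array
import os
import socket


def write_marker(fd, arg):
    """Stand-in for a privileged operation on the received descriptor."""
    return os.write(fd, arg)


HANDLERS = {"write_marker": write_marker}  # hypothetical call table


def request(sock, name, fd, arg):
    """Client side: send the call name, the argument blob and the descriptor."""
    rights = (socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))
    sock.sendmsg([name.encode() + b"\0" + arg], [rights])


def serve_one(sock):
    """Daemon side: receive one request, run the handler, reply with its result."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(4096, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise ValueError("request carried no file descriptor")
    name, _, arg = msg.partition(b"\0")
    try:
        result = HANDLERS[name.decode()](fds[0], arg)
    finally:
        os.close(fds[0])  # the daemon's copy of the descriptor
    sock.sendall(str(result).encode())


def read_reply(sock):
    """Client side: read the handler's integer result."""
    return int(sock.recv(64))
```

In a real setup the two socket ends would live in different processes, with the serving side started before the user namespace is created; the sketch keeps both ends in one process only to stay short.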
Ruslan Kuprieiev
09c3f5d0c7 security: add cr_fchown
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-02-10 16:54:31 +03:00
Ruslan Kuprieiev
df301b7eb7 security: create separate security.h header
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-02-10 16:53:54 +03:00
Pavel Emelyanov
1bbc994ccf sysctl: Remove dead CTL_PRINT|_SHOW code
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-27 16:18:27 +03:00
Andrey Vagin
4dbc3f093a sockets: define NETLINK_SOCK_DIAG in sockets.h
sockets.c: In function ‘preload_socket_modules’:
sockets.c:153:36: error: ‘NETLINK_SOCK_DIAG’ undeclared (first use in this function)
sockets.c:153:36: note: each undeclared identifier is reported only once for each function it appears in

Reported-by: Mr Travis
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-23 15:40:02 +03:00
Pavel Emelyanov
0749ef23e9 check/zdtm: Introduce fine-grained feature testing
Right now we state that CRIU works on 3.11 and above kernels and, at the
same time, have support for a couple of new features like aio, tun, timerfd
etc. available in later kernels. Since these new features do not break
generic operations we do not require them in the kernel strictly.

However, in the zdtm tests it's very important to know exactly what can
and what cannot be tested. Right now this is done in a tough manner -- if
the kernel is not 3.11 or criu check fails for _any_ reason we treat the
kernel as being "bad" and throw out a set of tests.

I propose to test some individual features and form the list of tests
in a more fine-grained manner.

This patch only fixes the AIO, mnt_id, tun and posix-timers tests. Next
I will add checks and fixes for user-namespaces tests.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2015-01-22 18:55:34 +03:00
Pavel Emelyanov
674df19a34 nlk: Add error callback to do_rtnl_req
In the next patch we will need to care about the exact error reported by
the kernel, so add the error callback for this.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-22 18:54:37 +03:00
Saied Kazemi
296129295a Allow the veth-pair option to specify a bridge
When restoring a pair of veth devices that had one end inside a namespace
or container and the other end outside, CRIU creates a new veth pair,
puts one end in the namespace/container, and names the other end from
what's specified in the --veth-pair IN=OUT command line option.

This patch allows for appending a bridge name to the OUT string in the
form of OUT@<BRIDGE-NAME> in order for CRIU to move the outside veth to
the named bridge.  For example, --veth-pair eth0=veth1@br0 tells CRIU
to name the peer of eth0 veth1 and move it to bridge br0.

This is a simple and handy extension of the --veth-pair option that
obviates the need for an action script although one can still do the same
(and possibly more) if they prefer to use action scripts.

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-12 14:54:18 +03:00
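
The IN=OUT@<BRIDGE-NAME> form described in 296129295a can be parsed as in the sketch below. This is an illustrative Python helper, not CRIU's option parser; the VethPair type and parse_veth_pair() name are assumptions.

```python
from typing import NamedTuple, Optional


class VethPair(NamedTuple):
    inside: str             # device name inside the namespace/container
    outside: str            # name to give the peer outside
    bridge: Optional[str]   # bridge to move the outside peer to, if any


def parse_veth_pair(arg):
    """Parse 'IN=OUT' or 'IN=OUT@BRIDGE' into its components."""
    inside, eq, rest = arg.partition("=")
    if not eq or not inside or not rest:
        raise ValueError("expected IN=OUT[@BRIDGE], got %r" % arg)
    outside, at, bridge = rest.partition("@")
    if not outside or (at and not bridge):
        raise ValueError("expected IN=OUT[@BRIDGE], got %r" % arg)
    return VethPair(inside, outside, bridge or None)
```

For the example from the message, parse_veth_pair("eth0=veth1@br0") yields inside "eth0", outside "veth1" and bridge "br0"; without the @ suffix the bridge is None.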
Pavel Emelyanov
a1b1959dd1 shmem: Turn shmem-info into shared objects from shremap ones
We have a nasty issue with it. Current code allocates these
entries in the shremap area one by one. We do NOT allocate any
OTHER entries in this region, but if we ever do, this array will
be corrupted.

Fortunately we no longer need shmem-infos as a plain array,
nor do we need one in the restorer. So just turn this into plain
shared objects and collect them in a list.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-12 14:47:24 +03:00
Pavel Emelyanov
b246ccb181 shmem: Move some code to shmem.c file
The struct and find routine used to be used by restorer code. Now
the former fully uses vmas and fd opened, so we can move the code
into .c file not to spoil global namespace.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-12 14:47:17 +03:00
Pavel Emelyanov
455f9b564e fd: Factor out inheriting FDs code
We have two places where we lookup the inherited-fd list
by name and dup() the descriptor found. I propose to factor
out this piece in a single inherited_fd() call. When
we will want to support inheritance for sockets or any
other files we'll simply add the inherited_fd() call
there.

I'm also thinking about moving the call to inherited_fd
to the generic level, but the open_path() routine doesn't
allow doing it in a simple manner.

Also, we have not yet resolved the issue with files-vs-inodes
mapping. Keeping all the logic in one function should
make the solution simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-12 14:46:51 +03:00
Pavel Emelyanov
8f691c40d5 fd: Mark inherit_fd_lookup_fd static
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-01-12 14:46:42 +03:00
Cyrill Gorcunov
fd07bc7791 cpu: Add 'ins' mode to --cpu-cap option
In this mode we test if target cpu has all features present
in image file but do not require bit to bit match: target cpu
may be a new one with more features present.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-26 18:15:46 +03:00
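
The difference between a bit-for-bit match and the 'ins' mode described in fd07bc7791 can be sketched with feature bitmasks. This is a hedged Python illustration; the function names and the representation of CPU features as a single integer mask are assumptions, not CRIU's data layout.

```python
def cpu_cap_exact(image_features, target_features):
    """Strict check: the target must have exactly the features in the image."""
    return image_features == target_features


def cpu_cap_ins(image_features, target_features):
    """'ins'-style check: every feature bit set in the image must also be set
    on the target; extra bits on the target are allowed."""
    return (image_features & ~target_features) == 0
```

A newer target CPU with a superset of the image's features passes cpu_cap_ins() but not cpu_cap_exact().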
Pavel Emelyanov
2694a74a00 aio: Restore AIO contexts
Restoring AIO is quite simple. Once all VMAs are put in
their places we can call io_setup() to let kernel create
the context back and then move the ring into proper place.

Another thing we should "restore" is the context ID. But
the thing is, upon ring creation the kernel reports the ring
start address as this ID. And there's a patch in the -next
tree that changes the ID when we remap the ring. That
said, after AIO context creation and ring remap we need
to check that the new ID is seen by the kernel.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-26 18:13:40 +03:00
Pavel Emelyanov
08c204820f aio: Dump AIO rings
When AIO context is set up kernel does two things:

1. creates an in-kernel aioctx object
2. maps a ring into process memory

The 2nd thing gives us all the needed information
about how the AIO was set up. So, in order to dump
one we need to pick the ring in memory and get all
the information we need from it.

One thing to note -- we cannot dump tasks if there
are any AIO requests pending. So we also need to
go to parasite and check the ring to be empty.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-26 18:13:36 +03:00
Pavel Emelyanov
80cf042695 x86: Add io syscalls
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-26 18:13:33 +03:00
Pavel Emelyanov
6a6cdb8d4a proc: Drop always true last argument of parse_smaps()
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-12-22 13:52:03 +03:00
Ruslan Kuprieiev
b30940eee2 cr_errno: move cr_err helpers into cr_errno.h
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-22 13:50:45 +03:00
Ruslan Kuprieiev
1ace257022 tty: add vt support, v2
/dev/ttyN are the virtual terminals which are provided
by the system with major 4 and minor 1..63.
You can run some program on ttyN by pressing alt+ctrl+FN
and running it manually or by using open (openvt nowadays).

This patch also allows us to run all our tests from a vt.

v2, style fix + using linux/vt.h for constants

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-22 13:48:31 +03:00
Ruslan Kuprieiev
8eaf0142ab cr-service: set cr_errno to EBADRQC if set_opts_from_req fails
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-19 18:59:28 +03:00
Ruslan Kuprieiev
e76749b790 cr-restore: set cr_error to EEXIST if such pid already exists, v3
This is a very common error when using criu.

The problem here is that we need to somehow transfer cr_errno
from one process to another. I suggest using pipe to give
one end to children and read cr_errno on other after restore
is finished.

v2, Pavel suggested putting errno into shared task_entries.
v3. and he also suggested using cmpxchg

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-19 18:59:17 +03:00
Ruslan Kuprieiev
b09a88b5f9 util: set cr_errno to ESRCH if no PID dir in proc
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-19 18:59:14 +03:00
Ruslan Kuprieiev
ef283e505c cr-errno: initial commit
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-19 18:58:46 +03:00
Saied Kazemi
0412152fc5 Add inherit fd support
There are cases where a process's file descriptor cannot be restored
from the checkpoint images.  For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.

There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time.  For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.

In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option.  The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.

As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d.  This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.

It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited.  So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.

For examples please visit http://criu.org/Category:HOWTO.

v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
    Also, as an extra measure of caution, added checks in the inherit fd
    look up functions to make sure that the inherit fd hasn't been reused.
    The patch also includes minor cosmetic changes.

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-10 12:48:30 +03:00
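
The two argument formats described in 0412152fc5, fd[%d]:%s and debug[%d]:%s, can be parsed as in the sketch below. This is an illustrative Python helper, not CRIU's parser; InheritFdArg and parse_inherit_fd() are assumed names, and the text after the colon is treated as an opaque string.

```python
import re
from typing import NamedTuple


class InheritFdArg(NamedTuple):
    kind: str   # "fd" or "debug"
    fd: int     # one of the caller's own file descriptors
    text: str   # file identifier for "fd", string to write for "debug"


_ARG_RE = re.compile(r"(fd|debug)\[(\d+)\]:(.+)", re.DOTALL)


def parse_inherit_fd(arg):
    """Split an --inherit-fd style argument into kind, descriptor and text."""
    m = _ARG_RE.fullmatch(arg)
    if m is None:
        raise ValueError("expected fd[N]:ID or debug[N]:TEXT, got %r" % arg)
    return InheritFdArg(m.group(1), int(m.group(2)), m.group(3))
```

For instance, parse_inherit_fd("debug[3]:restore marker") returns kind "debug", descriptor 3 and the text "restore marker".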
Andrey Vagin
4bca68ba49 tcp: don't split packets for restoring a send queue
The kernel can do it better. The problem exists only for recv queues.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-12-08 15:46:44 +03:00
Andrey Vagin
71a0b5dc31 mem: check existence of parent images before dumping pages (v2)
When we are doing pre-dump, we splice pages in pipes and only then open
images and dump pages. But when we are splicing pages, we need to know
about existence of parent images. This patch adds a new call to determine
existence of parent images.

In addition this patch fixes the following issue:
CID 83244 (#1 of 1): Uninitialized pointer read (UNINIT)
14. uninit_use: Using uninitialized value xfer.parent.

v2: initialize the unused field of struct page_server_iov, because it is sent
over the network.

CID 83451 (#1 of 1): Uninitialized scalar variable (UNINIT)
2. uninit_use_in_call: Using uninitialized value pi. Field pi.nr_pages
is uninitialized when calling write.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-29 19:32:40 +03:00
Pavel Emelyanov
69bffe26d3 kerndat: Make fs-virtualized check report yes/no
Right now it returns the whole struct stat which is excessive.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-11 20:15:09 +04:00
Pavel Emelyanov
19a76494a9 kerndat: Collect all global variables on one struct
Not to spoil the global namespace and unify the kerndat
data names.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-11 20:14:53 +04:00
Pavel Emelyanov
f33908a897 ns: Rename "created" futex and comment what it is
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-11 20:11:58 +04:00
Pavel Emelyanov
ee2e8e5bb9 parasite: Cleanup args size fetching
Right now we push all the auxiliary arguments to parasite_infect_seized
while 2 of them are only required to calculate the size of args area.

Let's better keep track of required args size and get rid of excessive
arguments to parasite_infect_seized().

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-11 20:11:34 +04:00
Pavel Emelyanov
1cad9b1049 util: Fix the ispathsub corner case
ispathsub("/foo", "/") reports false. This is a corner case,
as 2nd argument is not expected to end with /. Fix this and
add comment about ispathsub() arguments assumptions.

Reported-by: Andrey Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-09 23:26:56 +04:00
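
The corner case fixed in 1cad9b1049 can be illustrated with a small sub-path check. This is a hedged Python sketch, not the C helper itself: is_sub_path() is an assumed name, and it relies on the same assumption the message states, namely that the second argument does not end with a slash unless it is exactly "/".

```python
def is_sub_path(path, sub):
    """Return True if sub is a leading part of path that ends on a path
    component boundary. sub must not end with '/' unless it is '/' itself."""
    if sub == "/":
        # Corner case: every absolute path lies under the root.
        return path.startswith("/")
    if not path.startswith(sub):
        return False
    rest = path[len(sub):]
    # The match must end at the end of path or at a component separator.
    return rest == "" or rest.startswith("/")
```

Without the explicit "/" branch, is_sub_path("/foo", "/") would compare the remainder "foo" against a separator and report false, which is the behaviour the commit describes as the bug.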
Pavel Emelyanov
32f58742ca mnt: Introduce and use issubpath helper
When we validate the mount tree not to have overmounts we need to
check one path to be the sub-path of another. Here's a helper for
this.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-11-07 17:39:23 +04:00
Andrey Vagin
cb2f9223a0 dump: dump user namespaces (v2)
For that we need to save per-namespace mappings of user and group IDs.

And all id-s for tasks and files are saved from the target user
namespace.

v2: move code into collect_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-11-07 17:16:16 +04:00
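
The per-namespace ID mappings mentioned in cb2f9223a0 can be illustrated with the three-column "inside outside count" format that /proc/<pid>/uid_map and gid_map use. This is a hedged Python sketch, not CRIU's dump code: parse_id_map() and to_namespace_id() are assumed names, and the outside column is taken to be the ID as seen from the reader's namespace.

```python
def parse_id_map(text):
    """Parse 'inside outside count' lines into a list of extents."""
    extents = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, count = (int(field) for field in line.split())
        extents.append((inside, outside, count))
    return extents


def to_namespace_id(extents, outside_id):
    """Translate an ID seen from outside into the ID inside the namespace.
    Returns None when the ID is not covered by any extent."""
    for inside, outside, count in extents:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None
```

With a single extent "0 100000 65536", outside ID 101000 corresponds to ID 1000 inside the namespace, while an unmapped ID such as 5 has no translation.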