This commit finalizes AArch64 Guarded Control Stack (GCS)
support by wiring the full dump and restore flow.
The restore path adds the following steps:
- Define shared AArch64 GCS types and constants in a dedicated header
for both compel and CRIU inclusion
- compel: add get/set NT_ARM_GCS via ptrace, enabling user-space
GCS state save and restore.
- During restore switch to the new GCS (via GCSSTR) to place capability
token sa_restorer address
- arch_shstk_trampoline() — We enable GCS in a trampoline that using
prctl(PR_SET_SHADOW_STACK_STATUS, ...) via inline SVC. The trampoline
ineeded because we can’t RET without a valid GCS.
- restorer: map the recorded GCS VMA, populate contents top-down with
GCSSTR, write the signal capability at GCSPR_EL0 and the valid token at
GCSPR_EL0-8, then switch to the rebuilt GCS (GCSSS1)
- Save and restore registers via ptrace
- Extend restorer argument structures to carry GCS state
into post-restore execution
- Add shstk_set_restorer_stack(): sets tmp_gcs to temporary restorer
shadow stack start
- Add gcs_vma_restore implementation (required for mremap of the GCS VMA)
Tested with:
GCS_ENABLE=1 ./zdtm.py run -t zdtm/static/env00
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
Add basic prerequisites for Guarded Control Stack (GCS) state on AArch64.
This adds a gcs_context to the signal frame and extends user_fpregs_struct_t to
carry GCS metadata, preparing the groundwork for GCS in the parasite.
For now, the GCS fields are zeroed during compel_get_task_regs(), technically
ignoring GCS since it does not reach the control logic yet; that will be
introduced in the next commit.
The code path is gated and does not affect normal tests. Can be explicitly
enabled and tested via:
make -C infect GCS_ENABLE=1 && make -C infect run
Signed-off-by: Igor Svilenkov Bozic <svilenkov@gmail.com>
[ alex: clean up fixes ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Mike Rapoport <rppt@kernel.org>
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.
This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All this properties have to be C/R-ed.
As they are per-process protperties, we can save/restore them just for
one thread.
Signed-off-by: Andrei Vagin <avagin@google.com>
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.
libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).
Signed-off-by: Adrian Reber <areber@redhat.com>
1. os auto assignment vma addr maybe conflict with vma in gpu living migrate scene;
2. so, we should give choice to user;
Signed-off-by: haozi007 <liuhao27@huawei.com>
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:
compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
276 | if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
| ^~~~~~~~~~~~~~~~~
While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:
We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
that most definitions come from either one or another. Alas, on
Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
there is no way they can be used together. Let's rely on libc one.
Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.
Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.
Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.
Let's just use %s + strerror(errno) instead of %m with pr_* functions to
be explicit that errno to string transformation happens before calling
anything else.
Note: tcp_repair_off is called from pie with no pr_perror defined due to
CR_NOGLIBC set and if I use errno variable there I get "Unexpected
undefined symbol: `__errno_location'. External symbol in PIE?", so it
seems there is no way to print errno there, so let's just skip it.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Starting the daemon is the first time we run code in the victim
using the parasite stack.
It's useful for testing to be able to infect the victim without starting
the daemon so that we can inspect the victim's state, set up stack
guards, and so on before stack-related corruption can happen.
Add compel_infect_no_daemon() to infect the victim but not start the
daemon and compel_start_daemon() to start the daemon after the victim
is infected.
Add compel_get_stack() to get the victim's main and thread parasite
stacks.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
When delivering system call traps, set bit 7 in the signal number (i.e.,
deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image, save it only if a task is in the stopped state.
Restore task state by sending desired stop signal if it is present
in the image. Fallback to SIGSTOP if it's absent.
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Add "get_rseq_conf" feature corresponding to the
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
CRIU has a few places where it creates unix sockets and their names have to be
unique for each criu run.
Fixes: #1798
Signed-off-by: Andrei Vagin <avagin@google.com>
This is the workaround for #1429.
The parasite code contains instructions that trigger SIGTRAP to stop at
certain points. In such cases, the kernel sends a force SIGTRAP that
can't be ignore and if it is blocked, the kernel resets its signal
handler to a default one and unblocks it. It means that if we want to
save the origin signal handle
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Arch-dependend way to restore extended registers set.
Use it straight-away to restore per-thread registers.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Extended registers set for task is restored with rt_sigreturn() through
prepared sigframe. For threads it's currently lost.
Preserve it inside thread context to restore on thread curing.
Signed-off-by: Dmitry Safonov <dima@arista.com>
With pseudo-random garbage, the seed is printed with pr_err().
get_task_regs() is called during seizing the task and also for each
thread.
At this moment only for x86.
Signed-off-by: Dmitry Safonov <dima@arista.com>
To minimize things done in parasite, PTRACE_GET_THREAD_AREA can be
used to get remote tls. That also removes an additional compat stack
(de)allocation in the parasite (also asm-coded syscall).
In order to use PTRACE_GET_THREAD_AREA, the dumpee should be stopped.
So, let's move this from criu to compel to non-seized state and put tls
into thread info on x86.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Some kernels have W^X mitigation, which means they won't execute memory
blocks if that memory block is also writable or ever was writable. This
patch enables CRIU to run on such kernels.
1. Align .data section to a page.
2. mmap a memory block for parasite as RX.
3. mprotect everything after .text as RW.
Signed-off-by: Michał Cłapiński <mclapinski@google.com>
Previously, the GOT table was using the same memory location as the
args region, leading to difficult to debug memory corruption bugs.
We allocate the GOT table between the parasite blob and the args region.
The reason this is a good placement is:
1) Putting it after the args region is possible but a bit combersome as
the args region has a variable size
2) The cr-restore.c code maps the parasite code without the args region,
as it does not do RPC.
Another option is to rely on the linker to generate a GOT section, but I
failed to do so despite my best attempts.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Previously, __export_parasite_cmd was located in parasite-head.S, and
__export_parasite_args located exactly at the end of the parasite blob.
This is not ideal for various reasons:
1) These two variables work together. It would be preferrable to have
them in the same location
2) This prevent us from allocating another section betweeen the parasite
blob and the args area. We'll need this to allocate a GOT table
This commit changes the allocation of these symbols from assembly/linker
script to a C file.
Moreover, the assembly entry points that invoke parasite_service()
prepares arguments with hand crafted assembly. This is unecessary.
This commit rewrite this logic with regular C code.
Note: if it wasn't for the x86 compat mode, we could remove all
parasite-head.S files and directly jump to parasite_service() via
ptrace. An int3 architecture specific equivalent could be called at the
end of parasite_service() with an inline asm statement.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
It removes the potential confusion when it comes to virtual address vs
offsets. Further, doing so makes naming more consistent with the rest of
the parasite_blob_desc struct.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
compel_relocs_apply() was taking arguments mostly from the struct
parasite_blob_desc. Instead of passing all the arguments, we pass a
pointer to the struct itself.
This makes the code safer, as cr-restore.c calls compel_relocs_apply().
It previously needed to poke into what can be considered private
variables of the restorer-pie.h file.
To allow the parasite_blob_desc struct to be populated without a
parasite_ctl struct, we expand the compel API to export a
parasite_setup_c_header_desc() in the generated pie.h.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
The file only includes other headers (which may be not needed).
If we aim for one-include-for-compel, we could instead paste all
subheaders into "compel.h".
Rather, I think it's worth to migrate to more fine-grained compel
headers than follow the strategy 'one header to rule them all'.
Further, the header creates problems for cross-compilation: it's
included in files, those are used by host-compel. Which rightfully
confuses compiler/linker as host's definitions for fpu regs/other
platform details get drained into host's compel.
Signed-off-by: Dmitry Safonov <dima@arista.com>
All those compel functions can fail by various reasons.
It may be status of the system, interruption by user or anything else.
It's really desired to handle as many PIE related errors as possible
otherwise it's hard to analyze statuses of parasite/restorer
and the C/R process.
At least warning for logs should be produced or even C/R stopped.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
There are a few places where spaces have been used instead of tabs for
indentation. This patch converts the spaces to tabs for consistency
with the rest of the code base.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
with Android P's Clang versoin: 6.0.2, and Android NDK's Clang version 8.0.2
Clang will report below error:
criu/compel/include/uapi/compel/sigframe-common.h:55:34: error: expected member name or ';' after declaration specifiers
int __unused[32 - (sizeof (k_rtsigset_t) / sizeof (int))];
~~~ ^
it takes __unused as an attribute, not a varible, chang to _unused, pass compile.
Cc: Chen Hu <hu1.chen@intel.com>
Signed-off-by: Zhang Ning <ning.a.zhang@intel.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need to know what are stack pointers of every thread to ensure that the
current stack page will not be treated as lazy.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Alice Frosi <alice@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Will need them to mask some of the features from
command line options.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On fedora rawhide seccomp_metadata for some
reason is not defined (while in kernel it introduced
together with PTRACE_SECCOMP_GET_METADATA). So
lets do a trick for a while -- define own alias.
Once system headers get settled down we might find
more suitable solution. Because it's a part of kernel
API we're on the safe side.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will use it to figure out if filter log target is used.
Metadata associated with seccomp filter is relatively new
feature which allows userspace to get and set it back.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
For architectures like aarch64/ppc64 it's needed to propagate the size
of page inside PIEs. For the parasite page size will be defined during
seizing, and for restorer during early initialization.
Afterward we can use PAGE_SIZE in PIEs like we did before.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Ugh, I've spent 25 mins at 4 A.M. to figure out why the tests are failing.
And the reason is stupied me, who defined a new flag after 0x8
as 0x16, not as 0x10. Simplify those definitions for such simple-minded
living creatures like Dima.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On Skylake processors and kernel older than v4.14
ptrace(PTRACE_GETREGSET, pid, NT_X86_XSTATE, iov)
may return not full xstate, ommiting FP part (that is XFEATURE_MASK_FP).
There is a patch which describes this bug:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1318800.html
Anyway, it's fixed in v4.14 kernel by (what we believe with Andrey) this:
https://patchwork.kernel.org/patch/9567939/
As we still support kernels from v3.10 and newer, we need to have a
workaround for this kernel bug on Skylake CPUs.
Big thanks to Shlomi for the reports, the effort and for providing an
Amazon VM to test this. I wish more bug reporters were like you.
Reported-by: Shlomi Matichin <shlomi@binaris.com>
Provided-test-env: Shlomi Matichin <shlomi@binaris.com>
Investigated-with: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
get_task_regs() needs to know if it needs to use workaround
for a Skylake ptrace() bug. The next patch will introduce a
new flag for that.
I also thought about making 3 versions of get_task_regs() and
adding them to ictx->get_task_regs() depending on the flags..
But get_task_regs() is a private function and infect_ctx is
a uapi.. So, let's just pass context flags to get_task_regs().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>