mirror of
https://github.com/checkpoint-restore/criu.git
synced 2026-01-23 10:16:41 +00:00
criu/plugin: Add whitepaper document
Adding whitepaper document Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
This commit is contained in:
parent
99a2380fc0
commit
ecdf740fa3
1 changed files with 274 additions and 0 deletions
274
plugins/amdgpu/README.md
Normal file
274
plugins/amdgpu/README.md
Normal file
|
|
@ -0,0 +1,274 @@
|
|||
Supporting ROCm with CRIU
|
||||
=========================
|
||||
|
||||
_Felix Kuehling <Felix.Kuehling@amd.com>_<br>
|
||||
_Rajneesh Bardwaj <Rajneesh.Bhardwaj@amd.com>_<br>
|
||||
_David Yat Sin <David.YatSin@amd.com>_
|
||||
|
||||
# Introduction
|
||||
|
||||
ROCm is the Radeon Open Compute Platform developed by AMD to support
|
||||
high-performance computing and machine learning on AMD GPUs. It is a nearly
|
||||
fully open-source software stack starting from the kernel mode GPU driver,
|
||||
including compilers and language runtimes, all the way up to optimized
|
||||
mathematics libraries, machine learning frameworks and communication libraries.
|
||||
|
||||
Documentation for the ROCm platform can be found here:
|
||||
https://rocmdocs.amd.com/en/latest/
|
||||
|
||||
CRIU is a tool for freezing and checkpointing running applications or
|
||||
containers and later restoring them on the same or a different system. The
|
||||
process is transparent to the application being checkpointed. It is mostly
|
||||
implemented in user mode and relies heavily on Linux kernel features, e.g.
|
||||
cgroups, ptrace, vmsplice, and more. It can checkpoint and restore most
|
||||
applications relying on standard libraries. However, it is not able to
|
||||
checkpoint and restore applications using device drivers, with their own
|
||||
per-application kernel mode state, out of the box. This includes ROCm
|
||||
applications using the KFD device driver to access GPU hardware resources. CRIU
|
||||
includes some plugin hooks to allow extending it to add such support in the
|
||||
future.
|
||||
|
||||
A common environment for ROCm applications is in data centers and compute
|
||||
clusters. In this environment, migrating applications using CRIU would be
|
||||
beneficial and desirable. This paper outlines AMDs plans for adding ROCm
|
||||
support to CRIU.
|
||||
|
||||
# State associated with ROCm applications
|
||||
|
||||
ROCm applications communicate with the kernel mode driver “amdgpu.ko” through
|
||||
the Thunk library “libhsakmt.so” to enumerate available GPUs, manage
|
||||
GPU-accessible memory, user mode queues for submitting work to the GPUs, and
|
||||
events for synchronizing with GPUs. Many of those APIs create and manipulate
|
||||
state maintained in the kernel mode driver that would need to be saved and
|
||||
restored by CRIU.
|
||||
|
||||
## Memory
|
||||
|
||||
ROCm manages memory in the form of buffer objects (BOs). We are also working on
|
||||
a new memory management API that will be based on virtual address ranges. For
|
||||
now, we are focusing on the buffer-object based memory management.
|
||||
|
||||
There are different types of buffer objects supported:
|
||||
|
||||
* VRAM (device memory managed by the kernel mode driver)
|
||||
* GTT (system memory managed by the kernel mode driver)
|
||||
* Userptr (normal system memory managed by user mode driver or application)
|
||||
* Doorbell (special aperture for sending signals to the GPU for user mode command submissions)
|
||||
* MMIO (special aperture for accessing GPU control registers, used for certain cache flushing operations)
|
||||
|
||||
All these BOs are typically mapped into the GPU page tables for access by GPUs.
|
||||
Most of them are also mapped for CPU access. The following BO properties need
|
||||
to be saved and restored for CRIU to work with ROCm applications:
|
||||
|
||||
* Buffer type
|
||||
* Buffer handle
|
||||
* Buffer size (page aligned)
|
||||
* Virtual address for GPU mapping (page aligned)
|
||||
* Device file offset for CPU mapping (for VRAM and GTT BOs)
|
||||
* Memory contents (for VRAM and GTT BOs)
|
||||
|
||||
## Queues
|
||||
|
||||
ROCm uses user mode queues to submit work to the GPUs. There are several memory
|
||||
buffers associated with queues. At the language runtime or application level,
|
||||
they expose the ring buffer as well as a signal object to tell the GPU about
|
||||
new commands added to the queue. The signal is mapped to a doorbell (a 64-bit
|
||||
entry in the doorbell aperture mapped by the doorbell BO). Internally there are
|
||||
other buffers needed for dispatch completion tracking, shader state saving
|
||||
during queue preemption and the queue state itself. Some of these buffers are
|
||||
managed in user mode, others are managed in kernel mode.
|
||||
|
||||
When an application is checkpointed, we need to preempt all user mode queues
|
||||
belonging to the process, and then save their state, including:
|
||||
|
||||
* Queue type (compute or DMA)
|
||||
* MQD (memory queue descriptor managed in kernel mode), with state such as
|
||||
* ring buffer address
|
||||
* read and write pointers
|
||||
* doorbell offset
|
||||
* pointer to AQL queue data structure
|
||||
* Control stack (kernel-managed piece of state needed for resuming preempted queue)
|
||||
|
||||
The rest of the queue state is contained in user-managed buffer objects that
|
||||
will be saved by the memory state handling described above:
|
||||
|
||||
* Ring buffer (userptr BO containing commands sent to the GPU)
|
||||
* AQL queue data structure (userptr BO containing `struct hsa_queue_t`)
|
||||
* EOP buffer (VRAM BO used for dispatch completion tracking by the command processor)
|
||||
* Context save area (userptr BO for saving shader state of preempted wavefronts)
|
||||
|
||||
## Events
|
||||
|
||||
Events are used to implement interrupt-based sleeping/waiting for signals sent
|
||||
from the GPU to the host. Signals are represented by some data structures in
|
||||
KFD and an entry in a user-allocated, GPU-accessible BO with event slots. We
|
||||
need to save the allocated set of event IDs and each event’s signaling state.
|
||||
The contents of the event slots will be saved by the memory state handling
|
||||
described above.
|
||||
|
||||
## Topology
|
||||
|
||||
When ROCm applications are started, they enumerate the device topology to find
|
||||
available GPUs, their capabilities and connectivity. An application can be
|
||||
checkpointed at any time, so it will not be at a safe place to re-enumerate the
|
||||
topology when it is restored. Therefore, we can only support restoring
|
||||
applications on systems with a very similar topology:
|
||||
|
||||
* Same number of GPUs
|
||||
* Same type of GPUs (i.e. instruction set, cache sizes, number of compute units, etc.)
|
||||
* Same or larger memory size
|
||||
* Same VRAM accessibility by the host
|
||||
* Same connectivity and P2P memory support between GPUs
|
||||
|
||||
At the KFD ioctl level, GPUs are identified by GPUIDs, which are unique
|
||||
identifiers created by hashing various GPU properties. That way a GPUID will
|
||||
not change during the lifetime of a process, even in a future where GPUs may be
|
||||
added or removed dynamically. When restoring a process on a different system,
|
||||
the GPUID may have changed. Or it may be desirable to restore a process using a
|
||||
different subset of GPUs on the same system (using cgroups). Therefore, we will
|
||||
need a translation of GPUIDs for restored processes that applies to all KFD
|
||||
ioctl calls after an application was restored.
|
||||
|
||||
# CRIU plugins
|
||||
|
||||
CRIU provides plugin hooks for device files:
|
||||
|
||||
int cr_plugin_dump_file(int fd, int id);
|
||||
int cr_plugin_restore_file(int id);
|
||||
|
||||
In a ROCm process, it will be invoked for `/dev/kfd` and `/dev/dri/renderD*`
|
||||
device nodes. `/dev/kfd` is used for KFD ioctl calls to manage memory, queues,
|
||||
signals and other functionality for all GPUs through a single device file
|
||||
descriptor. `/dev/dri/renderD*` are per GPU device files, called render nodes,
|
||||
that are used mostly for CPU mapping of VRAM and GTT BOs. Each BO is given a
|
||||
unique offset in the render node of the corresponding GPU at allocation time.
|
||||
|
||||
Render nodes are also used for memory management and command submission by the
|
||||
Mesa user mode driver for video decoding and post processing. These use cases
|
||||
are relevant even in data centers. Support for this is not an immediate
|
||||
priority but planned for the future. This will require saving additional state
|
||||
as well as synchronization with any outstanding jobs. For now, there is no
|
||||
kernel-mode state associated with `/dev/renderD*`.
|
||||
|
||||
The two existing plugins can be used for saving and restoring most state
|
||||
associated with ROCm applications. We are planning to add new ioctl calls to
|
||||
`/dev/kfd` to help with this.
|
||||
|
||||
## Dumping
|
||||
|
||||
At the “dump” stage, the ioctl will execute in the context of the CRIU dumper
|
||||
process. But the file descriptor (fd) is “drained” from the process being saved
|
||||
by the parasite code that CRIU injects into its target. This allows the plugin
|
||||
to make an ioctl call with enough context to allow KFD to access all the kernel
|
||||
mode state associated with the target process. CRIU is ptrace attached to the
|
||||
target process. KFD can use that fact to authorize access to the target
|
||||
process' information.
|
||||
|
||||
The contents of GTT and VRAM BOs are not automatically saved by CRIU. CRIU can
|
||||
only support saving the contents of normal pageable mappings. GTT and VRAM BOs
|
||||
are special device file IO mappings. Therefore, our dumper plugin will need to
|
||||
save the contents of these BOs. In the initial implementation they can be
|
||||
accessed through `/proc/<pid>/mem`. For better performance we can use a DMA
|
||||
engine in the GPU to copy the data to system memory.
|
||||
|
||||
## Restoring
|
||||
|
||||
At the “restore” stage we first need to ensure that the topology of visible
|
||||
devices (in the cgroup) is compatible with the topology that was saved. Once
|
||||
this is confirmed, we can use a new ioctl to load the saved state back into
|
||||
KFD. This ioctl will run in the context of the process being restored, so no
|
||||
special authorization is needed. However, some of the data being copied back
|
||||
into kernel mode could have been tampered with. MQDs and control stacks provide
|
||||
access to privileged GPU registers. Therefore, the restore ioctl will only be
|
||||
allowed to run with root privileges.
|
||||
|
||||
## Remapping render nodes and mmap offsets
|
||||
|
||||
BOs are mapped for CPU access by mmapping the GPU's render node at a specific
|
||||
offset. The offset within the render node device file identifies the BO.
|
||||
However, when we recreate the BOs, we cannot guarantee that they will be
|
||||
restored with the same mmap offset that was saved, because the mmap offset
|
||||
address space per device is shared system wide.
|
||||
|
||||
When a process is restored on a different GPU, it will need to map the BOs from
|
||||
a different render node device file altogether.
|
||||
|
||||
A new plugin call will be needed to translate device file names and mmap
|
||||
offsets to the newly allocated ones, before CRIU's PIE code restores the VMA
|
||||
mappings. Fortunately, ROCm user mode does not remember the file names and mmap
|
||||
offsets after establishing the mappings, so changing the device files and mmap
|
||||
offsets under the hood will not be noticed by ROCm user mode.
|
||||
|
||||
*This new plugin is enabled by the new hook `__UPDATE_VMA_MAP` in our RFC patch
|
||||
series.*
|
||||
|
||||
## Resuming GPU execution
|
||||
|
||||
At the time of running the `cr_plugin_restore_file` plugin, it is too early to
|
||||
restore userptr GPU page table mappings and their MMU notifiers. These mappings
|
||||
mirror CPU page tables into GPU page tables using the HMM mirror API in the
|
||||
kernel. The MMU notifiers notify the driver when the virtual address mapping
|
||||
changes so that the GPU mapping can be updated.
|
||||
|
||||
This needs to happen after the restorer PIE code has restored all the VMAs at
|
||||
their correct virtual addresses. Otherwise, the HMM mirroring will simply fail.
|
||||
Before all the GPU memory mappings are in place, it is also too early to resume
|
||||
the user mode queue execution on the GPUs.
|
||||
|
||||
Therefore, a new plugin is needed that runs in the context of the master
|
||||
restore process after the restorer PIE code has restored all the VMAs and
|
||||
returned control to all the restored processes via sigreturn. It needs to be
|
||||
called once for each restored target process to finalize userptr mappings and
|
||||
to resume execution on the GPUs.
|
||||
|
||||
*This new plugin is enabled by the new hook `__RESUME_DEVICES_LATE` in our RFC
|
||||
patch series.*
|
||||
|
||||
## Other CRIU changes
|
||||
|
||||
In addition to the new plugins, we need to make some changes to CRIU itself to
|
||||
support device file VMAs. Currently CRIU will simply fail to dump a process
|
||||
that has such PFN or IO memory mappings. While CRIU will not need to save the
|
||||
contents of those VMAs, we do need CRIU to save and restore the VMAs
|
||||
themselves, with translated mmap offsets (see “Remapping mmap offsets” above).
|
||||
|
||||
## Security considerations
|
||||
|
||||
The new “dump” ioctl we are adding to `/dev/kfd` will expose information about
|
||||
remote processes. This is a potential security threat. CRIU will be
|
||||
ptrace-attached to the target process, which gives it full access to the state
|
||||
of the process being dumped. KFD can use ptrace attachment to authorize the use
|
||||
of the new ioctl on a specific target process.
|
||||
|
||||
The new “restore” ioctl will load privileged information from user mode back
|
||||
into the kernel driver and the hardware. This includes MQD contents, which will
|
||||
eventually be loaded into HQD registers, as well as a control stack, which is a
|
||||
series of low-level commands that will be executed by the command processor.
|
||||
Therefore, we are limiting this ioctl to the root user. If CRIU restore must be
|
||||
possible for non-root users, we need to sanitize the privileged state to ensure
|
||||
it cannot be used to circumvent system security policies (e.g. arbitrary code
|
||||
execution in privileged contexts with access to page tables etc.).
|
||||
|
||||
Modified mmap offsets could potentially be used to access BOs belonging to
|
||||
different processes. This potential threat is not new with CRIU. `amdgpu.ko`
|
||||
already implements checking of mmap offsets to ensure a context (represented by
|
||||
a render node file descriptor) is only allowed access to its own BOs.
|
||||
|
||||
# Glossary
|
||||
|
||||
Term | Definition
|
||||
--- | ---
|
||||
CRIU | Checkpoint/Restore In Userspace
|
||||
ROCm | Radeon Open Compute Platform
|
||||
Thunk | User-mode API interface to interact with amdgpu.ko
|
||||
KFD | AMD Kernel Fusion Driver
|
||||
Mesa | Open source OpenGL implementation
|
||||
GTT | Graphis Translation Table, also used to denote kernel-managed system memory for GPU access
|
||||
VRAM | Video RAM
|
||||
BO | Buffer Object
|
||||
HMM | Heterogenous Memory Management
|
||||
AQL | Architected Queueing Language
|
||||
EOP | End of pipe (event indicating shader dispatch completion)
|
||||
MQD | Memory Queue Descriptors
|
||||
HQD | Hardware Queue Descriptors
|
||||
PIE | Position Independent Executable
|
||||
Loading…
Add table
Add a link
Reference in a new issue