From ecdf740fa3815392202eb32ff10e95aec98e9732 Mon Sep 17 00:00:00 2001
From: Felix Kuehling
Date: Fri, 30 Apr 2021 04:20:48 -0400
Subject: [PATCH] criu/plugin: Add whitepaper document

Adding whitepaper document

Signed-off-by: Felix Kuehling
Signed-off-by: Rajneesh Bhardwaj
Signed-off-by: David Yat Sin
---
 plugins/amdgpu/README.md | 274 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 274 insertions(+)
 create mode 100644 plugins/amdgpu/README.md

diff --git a/plugins/amdgpu/README.md b/plugins/amdgpu/README.md
new file mode 100644
index 000000000..030029e7a
--- /dev/null
+++ b/plugins/amdgpu/README.md
@@ -0,0 +1,274 @@

Supporting ROCm with CRIU
=========================

_Felix Kuehling_<br>
_Rajneesh Bhardwaj_<br>
_David Yat Sin_

# Introduction

ROCm is the Radeon Open Compute Platform developed by AMD to support
high-performance computing and machine learning on AMD GPUs. It is a nearly
fully open-source software stack, starting from the kernel mode GPU driver,
including compilers and language runtimes, all the way up to optimized
mathematics libraries, machine learning frameworks, and communication
libraries.

Documentation for the ROCm platform can be found here:
https://rocmdocs.amd.com/en/latest/

CRIU is a tool for freezing and checkpointing running applications or
containers and later restoring them on the same or a different system. The
process is transparent to the application being checkpointed. It is mostly
implemented in user mode and relies heavily on Linux kernel features, e.g.
cgroups, ptrace, and vmsplice. It can checkpoint and restore most applications
that rely on standard libraries. However, it cannot checkpoint and restore,
out of the box, applications that use device drivers with their own
per-application kernel mode state. This includes ROCm applications, which use
the KFD device driver to access GPU hardware resources. CRIU includes some
plugin hooks that allow extending it to add such support in the future.

A common environment for ROCm applications is in data centers and compute
clusters. In this environment, migrating applications using CRIU would be
beneficial and desirable. This paper outlines AMD's plans for adding ROCm
support to CRIU.

# State associated with ROCm applications

ROCm applications communicate with the kernel mode driver “amdgpu.ko” through
the Thunk library “libhsakmt.so” to enumerate available GPUs, manage
GPU-accessible memory, manage user mode queues for submitting work to the
GPUs, and manage events for synchronizing with the GPUs. Many of these APIs
create and manipulate state maintained in the kernel mode driver that would
need to be saved and restored by CRIU.

## Memory

ROCm manages memory in the form of buffer objects (BOs). We are also working on
a new memory management API that will be based on virtual address ranges. For
now, we are focusing on the buffer-object based memory management.

Several types of buffer objects are supported:

* VRAM (device memory managed by the kernel mode driver)
* GTT (system memory managed by the kernel mode driver)
* Userptr (normal system memory managed by the user mode driver or application)
* Doorbell (special aperture for sending signals to the GPU for user mode command submissions)
* MMIO (special aperture for accessing GPU control registers, used for certain cache flushing operations)

All these BOs are typically mapped into the GPU page tables for access by GPUs.
Most of them are also mapped for CPU access. The following BO properties need
to be saved and restored for CRIU to work with ROCm applications (a sketch of
a corresponding record follows the list):

* Buffer type
* Buffer handle
* Buffer size (page aligned)
* Virtual address for GPU mapping (page aligned)
* Device file offset for CPU mapping (for VRAM and GTT BOs)
* Memory contents (for VRAM and GTT BOs)
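To make this list concrete, below is a minimal sketch of a per-BO record that
a checkpoint could store. All names are hypothetical illustrations; the actual
ABI will be defined by the new KFD CRIU ioctls.

    /*
     * Hypothetical sketch of the per-BO state a checkpoint would record.
     * Names are illustrative only; the real ABI is defined by the KFD
     * driver's new CRIU ioctls.
     */
    #include <stdint.h>

    enum criu_bo_type {
        CRIU_BO_VRAM,     /* device memory managed by the kernel mode driver */
        CRIU_BO_GTT,      /* system memory managed by the kernel mode driver */
        CRIU_BO_USERPTR,  /* normal system memory managed in user mode */
        CRIU_BO_DOORBELL, /* doorbell aperture for user mode submissions */
        CRIU_BO_MMIO,     /* aperture for GPU control registers */
    };

    struct criu_bo_entry {
        uint32_t type;        /* enum criu_bo_type */
        uint32_t handle;      /* buffer handle */
        uint64_t size;        /* buffer size, page aligned */
        uint64_t gpu_vaddr;   /* virtual address of GPU mapping, page aligned */
        uint64_t mmap_offset; /* device file offset for CPU mapping
                               * (VRAM and GTT BOs only) */
        /* VRAM and GTT memory contents are dumped separately as raw data */
    };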
## Queues

ROCm uses user mode queues to submit work to the GPUs. There are several memory
buffers associated with queues. At the language runtime or application level,
they expose the ring buffer as well as a signal object for telling the GPU
about new commands added to the queue. The signal is mapped to a doorbell (a
64-bit entry in the doorbell aperture mapped by the doorbell BO). Internally
there are other buffers needed for dispatch completion tracking, shader state
saving during queue preemption, and the queue state itself. Some of these
buffers are managed in user mode, others are managed in kernel mode.

When an application is checkpointed, we need to preempt all user mode queues
belonging to the process and then save their state (see the sketch after the
lists below), including:

* Queue type (compute or DMA)
* MQD (memory queue descriptor managed in kernel mode), with state such as
  * ring buffer address
  * read and write pointers
  * doorbell offset
  * pointer to the AQL queue data structure
* Control stack (kernel-managed piece of state needed for resuming a preempted queue)

The rest of the queue state is contained in user-managed buffer objects that
will be saved by the memory state handling described above:

* Ring buffer (userptr BO containing commands sent to the GPU)
* AQL queue data structure (userptr BO containing `struct hsa_queue_t`)
* EOP buffer (VRAM BO used for dispatch completion tracking by the command processor)
* Context save area (userptr BO for saving shader state of preempted wavefronts)
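As a rough illustration, the kernel-managed portion of this queue state could
be captured per queue in a record like the following sketch. Field names and
the handling of the MQD and control stack as opaque blobs are assumptions, not
the actual KFD ABI.

    /*
     * Hypothetical per-queue record for checkpointing the kernel-managed
     * queue state listed above. The MQD and control stack are opaque to
     * user mode and are assumed to be appended as raw blobs; all names
     * are illustrative only.
     */
    #include <stdint.h>

    enum criu_queue_type {
        CRIU_QUEUE_COMPUTE, /* user mode compute queue */
        CRIU_QUEUE_SDMA,    /* DMA queue */
    };

    struct criu_queue_entry {
        uint32_t type;            /* enum criu_queue_type */
        uint64_t ring_base;       /* ring buffer address */
        uint64_t rptr;            /* read pointer */
        uint64_t wptr;            /* write pointer */
        uint64_t doorbell_offset; /* doorbell offset */
        uint64_t aql_queue_addr;  /* pointer to AQL queue data structure */
        uint64_t mqd_size;        /* size of the opaque MQD blob */
        uint64_t ctl_stack_size;  /* size of the control stack blob */
        /* followed by mqd_size + ctl_stack_size bytes of opaque data */
    };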
## Events

Events are used to implement interrupt-based sleeping/waiting for signals sent
from the GPU to the host. Signals are represented by data structures in KFD
and by an entry in a user-allocated, GPU-accessible BO with event slots. We
need to save the allocated set of event IDs and each event's signaling state.
The contents of the event slots will be saved by the memory state handling
described above.

## Topology

When ROCm applications start, they enumerate the device topology to find the
available GPUs, their capabilities, and their connectivity. An application can
be checkpointed at any time, so it will not be at a safe point to re-enumerate
the topology when it is restored. Therefore, we can only support restoring
applications on systems with a very similar topology:

* Same number of GPUs
* Same type of GPUs (i.e. instruction set, cache sizes, number of compute units, etc.)
* Same or larger memory size
* Same VRAM accessibility by the host
* Same connectivity and P2P memory support between GPUs

At the KFD ioctl level, GPUs are identified by GPUIDs, which are unique
identifiers created by hashing various GPU properties. That way a GPUID will
not change during the lifetime of a process, even in a future where GPUs may
be added or removed dynamically. When restoring a process on a different
system, the GPUIDs may have changed. Or it may be desirable to restore a
process using a different subset of GPUs on the same system (using cgroups).
Therefore, we will need a translation of GPUIDs for restored processes that
applies to all KFD ioctl calls after an application was restored.
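A minimal sketch of such a GPUID translation is shown below, assuming the
plugin builds a per-process table when it matches the saved GPUs against the
destination topology. The helper and its types are hypothetical, not part of
the actual plugin or KFD interface.

    /*
     * Minimal sketch of GPUID translation on restore. Assumes the plugin
     * builds a table mapping checkpoint-time GPUIDs to GPUIDs on the
     * destination system after validating the topology.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct gpu_id_map {
        uint32_t src_gpu_id; /* GPUID recorded at checkpoint time */
        uint32_t dst_gpu_id; /* matching GPUID on the restore system */
    };

    /* Returns the translated GPUID, or 0 if the saved GPUID has no match
     * in the restore topology. */
    static uint32_t translate_gpu_id(const struct gpu_id_map *map, size_t n,
                                     uint32_t src_gpu_id)
    {
        for (size_t i = 0; i < n; i++)
            if (map[i].src_gpu_id == src_gpu_id)
                return map[i].dst_gpu_id;
        return 0;
    }

Every GPUID argument of every KFD ioctl made after the restore would pass
through a lookup of this kind.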
# CRIU plugins

CRIU provides plugin hooks for device files:

    int cr_plugin_dump_file(int fd, int id);
    int cr_plugin_restore_file(int id);

In a ROCm process, these hooks will be invoked for the `/dev/kfd` and
`/dev/dri/renderD*` device nodes. `/dev/kfd` is used for KFD ioctl calls that
manage memory, queues, signals, and other functionality for all GPUs through a
single device file descriptor. `/dev/dri/renderD*` are per-GPU device files,
called render nodes, that are used mostly for CPU mapping of VRAM and GTT BOs.
Each BO is given a unique offset in the render node of the corresponding GPU
at allocation time.

Render nodes are also used for memory management and command submission by the
Mesa user mode driver for video decoding and post-processing. These use cases
are relevant even in data centers. Supporting them is not an immediate
priority but is planned for the future; it will require saving additional
state as well as synchronizing with any outstanding jobs. For now, there is no
kernel-mode state associated with `/dev/dri/renderD*`.

The two existing plugin hooks can be used for saving and restoring most state
associated with ROCm applications. We are planning to add new ioctl calls to
`/dev/kfd` to help with this.

## Dumping

At the “dump” stage, the ioctl executes in the context of the CRIU dumper
process, but the file descriptor (fd) is “drained” from the process being
saved by the parasite code that CRIU injects into its target. This allows the
plugin to make an ioctl call with enough context for KFD to access all the
kernel mode state associated with the target process. CRIU is ptrace-attached
to the target process. KFD can use that fact to authorize access to the target
process' information.

The contents of GTT and VRAM BOs are not automatically saved by CRIU, which
can only save the contents of normal pageable mappings. GTT and VRAM BOs are
special device file IO mappings. Therefore, our dumper plugin will need to
save the contents of these BOs itself. In the initial implementation they can
be accessed through `/proc/<pid>/mem`, as sketched below. For better
performance we can later use a DMA engine in the GPU to copy the data to
system memory.
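The following sketch shows how a dumper could read a BO's contents through
`/proc/<pid>/mem`, assuming the BO is CPU-mapped at a known virtual address in
the target process. Names and error handling are illustrative only.

    /*
     * Sketch of dumping a BO's contents via /proc/<pid>/mem, assuming
     * the BO is CPU-mapped at vaddr in the target process.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int dump_bo_contents(pid_t pid, uint64_t vaddr, size_t size,
                                void *buf)
    {
        char path[64];
        ssize_t ret;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* Read at the BO's virtual address in the target's address space */
        ret = pread(fd, buf, size, (off_t)vaddr);
        close(fd);

        return ret == (ssize_t)size ? 0 : -1;
    }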
## Restoring

At the “restore” stage we first need to ensure that the topology of the
visible devices (in the cgroup) is compatible with the topology that was
saved. Once this is confirmed, we can use a new ioctl to load the saved state
back into KFD. This ioctl will run in the context of the process being
restored, so no special authorization is needed. However, some of the data
being copied back into kernel mode could have been tampered with. MQDs and
control stacks provide access to privileged GPU registers. Therefore, the
restore ioctl will only be allowed to run with root privileges.

## Remapping render nodes and mmap offsets

BOs are mapped for CPU access by mmapping the GPU's render node at a specific
offset. The offset within the render node device file identifies the BO.
However, when we recreate the BOs, we cannot guarantee that they will be
restored with the same mmap offset that was saved, because the mmap offset
address space per device is shared system-wide.

When a process is restored on a different GPU, it will need to map the BOs
from a different render node device file altogether.

A new plugin hook will be needed to translate device file names and mmap
offsets to the newly allocated ones before CRIU's PIE code restores the VMA
mappings. Fortunately, ROCm user mode does not remember the file names and
mmap offsets after establishing the mappings, so changing the device files and
mmap offsets under the hood will not be noticed by ROCm user mode.

*This new plugin hook is enabled by the new hook `__UPDATE_VMA_MAP` in our RFC
patch series.*
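Conceptually, such a hook performs a lookup like the one sketched below. The
`__UPDATE_VMA_MAP` hook name comes from our RFC series, but the signature and
table here are hypothetical illustrations of the translation, not the actual
interface.

    /*
     * Conceptual sketch of the translation an __UPDATE_VMA_MAP-style
     * hook performs: map the device path and mmap offset recorded in a
     * VMA entry to the path and offset of the recreated BO.
     */
    #include <stdint.h>
    #include <string.h>

    struct vma_remap_entry {
        char     old_path[64];  /* render node path at dump time */
        uint64_t old_offset;    /* mmap offset recorded in the image */
        char     new_path[64];  /* render node on the restore system */
        uint64_t new_offset;    /* offset of the recreated BO */
    };

    /* Returns 1 and fills in the new mapping if the VMA belongs to one
     * of our BOs, 0 to leave the VMA untouched. */
    static int remap_vma(const struct vma_remap_entry *tbl, size_t n,
                         const char *path, uint64_t offset,
                         char *new_path, uint64_t *new_offset)
    {
        for (size_t i = 0; i < n; i++) {
            if (!strcmp(tbl[i].old_path, path) &&
                tbl[i].old_offset == offset) {
                strcpy(new_path, tbl[i].new_path);
                *new_offset = tbl[i].new_offset;
                return 1;
            }
        }
        return 0;
    }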
## Resuming GPU execution

At the time the `cr_plugin_restore_file` hook runs, it is too early to restore
userptr GPU page table mappings and their MMU notifiers. These mappings mirror
CPU page tables into GPU page tables using the HMM mirror API in the kernel.
The MMU notifiers notify the driver when a virtual address mapping changes so
that the GPU mapping can be updated. This must happen after the restorer PIE
code has restored all the VMAs at their correct virtual addresses; otherwise,
the HMM mirroring will simply fail. Before all the GPU memory mappings are in
place, it is also too early to resume user mode queue execution on the GPUs.

Therefore, a new plugin hook is needed that runs in the context of the master
restore process after the restorer PIE code has restored all the VMAs and
returned control to the restored processes via sigreturn. It needs to be
called once for each restored target process to finalize userptr mappings and
to resume execution on the GPUs.

*This new plugin hook is enabled by the new hook `__RESUME_DEVICES_LATE` in
our RFC patch series.*

## Other CRIU changes

In addition to the new plugin hooks, we need to make some changes to CRIU
itself to support device file VMAs. Currently CRIU simply fails to dump a
process that has such PFN or IO memory mappings. While CRIU will not need to
save the contents of those VMAs, we do need CRIU to save and restore the VMAs
themselves, with translated mmap offsets (see “Remapping render nodes and mmap
offsets” above).

## Security considerations

The new “dump” ioctl we are adding to `/dev/kfd` will expose information about
remote processes, which is a potential security threat. CRIU will be
ptrace-attached to the target process, which gives it full access to the state
of the process being dumped. KFD can use the ptrace attachment to authorize
the use of the new ioctl on a specific target process.

The new “restore” ioctl will load privileged information from user mode back
into the kernel driver and the hardware. This includes MQD contents, which
will eventually be loaded into HQD registers, as well as the control stack,
which is a series of low-level commands that will be executed by the command
processor. Therefore, we are limiting this ioctl to the root user. If CRIU
restore must be possible for non-root users, we will need to sanitize this
privileged state to ensure it cannot be used to circumvent system security
policies (e.g. arbitrary code execution in privileged contexts with access to
page tables, etc.).

Modified mmap offsets could potentially be used to access BOs belonging to
different processes. This potential threat is not new with CRIU: `amdgpu.ko`
already checks mmap offsets to ensure a context (represented by a render node
file descriptor) is only allowed access to its own BOs.

# Glossary

Term | Definition
--- | ---
CRIU | Checkpoint/Restore In Userspace
ROCm | Radeon Open Compute Platform
Thunk | User mode API interface to interact with amdgpu.ko
KFD | AMD Kernel Fusion Driver
Mesa | Open-source OpenGL implementation
GTT | Graphics Translation Table, also used to denote kernel-managed system memory for GPU access
VRAM | Video RAM
BO | Buffer Object
HMM | Heterogeneous Memory Management
AQL | Architected Queuing Language
EOP | End Of Pipe (event indicating shader dispatch completion)
MQD | Memory Queue Descriptor
HQD | Hardware Queue Descriptor
PIE | Position Independent Executable