From ecdf740fa3815392202eb32ff10e95aec98e9732 Mon Sep 17 00:00:00 2001
From: Felix Kuehling
Date: Fri, 30 Apr 2021 04:20:48 -0400
Subject: [PATCH] criu/plugin: Add whitepaper document

Adding whitepaper document

Signed-off-by: Felix Kuehling
Signed-off-by: Rajneesh Bhardwaj
Signed-off-by: David Yat Sin
---
 plugins/amdgpu/README.md | 274 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 274 insertions(+)
 create mode 100644 plugins/amdgpu/README.md

diff --git a/plugins/amdgpu/README.md b/plugins/amdgpu/README.md
new file mode 100644
index 000000000..030029e7a
--- /dev/null
+++ b/plugins/amdgpu/README.md
@@ -0,0 +1,274 @@

Supporting ROCm with CRIU
=========================

_Felix Kuehling_<br>
_Rajneesh Bhardwaj_<br>
_David Yat Sin_

# Introduction

ROCm is the Radeon Open Compute Platform developed by AMD to support
high-performance computing and machine learning on AMD GPUs. It is a nearly
fully open-source software stack, starting from the kernel mode GPU driver,
including compilers and language runtimes, all the way up to optimized
mathematics libraries, machine learning frameworks, and communication
libraries.

Documentation for the ROCm platform can be found here:
https://rocmdocs.amd.com/en/latest/

CRIU is a tool for freezing and checkpointing running applications or
containers and later restoring them on the same or a different system. The
process is transparent to the application being checkpointed. It is mostly
implemented in user mode and relies heavily on Linux kernel features, e.g.
cgroups, ptrace, and vmsplice. It can checkpoint and restore most applications
that rely on standard libraries. However, it cannot checkpoint and restore,
out of the box, applications that use device drivers with their own
per-application kernel mode state. This includes ROCm applications, which use
the KFD device driver to access GPU hardware resources. CRIU includes some
plugin hooks that allow extending it to add such support in the future.

A common environment for ROCm applications is in data centers and compute
clusters. In this environment, migrating applications using CRIU would be
beneficial and desirable. This paper outlines AMD's plans for adding ROCm
support to CRIU.

# State associated with ROCm applications

ROCm applications communicate with the kernel mode driver “amdgpu.ko” through
the Thunk library “libhsakmt.so” to enumerate available GPUs, manage
GPU-accessible memory, manage user mode queues for submitting work to the
GPUs, and manage events for synchronizing with the GPUs. Many of these APIs
create and manipulate state maintained in the kernel mode driver that would
need to be saved and restored by CRIU.

## Memory

ROCm manages memory in the form of buffer objects (BOs). We are also working on
a new memory management API that will be based on virtual address ranges. For
now, we are focusing on the buffer-object based memory management.

Several types of buffer objects are supported:

* VRAM (device memory managed by the kernel mode driver)
* GTT (system memory managed by the kernel mode driver)
* Userptr (normal system memory managed by the user mode driver or application)
* Doorbell (special aperture for sending signals to the GPU for user mode command submissions)
* MMIO (special aperture for accessing GPU control registers, used for certain cache flushing operations)

All these BOs are typically mapped into the GPU page tables for access by GPUs.
Most of them are also mapped for CPU access. The following BO properties need
to be saved and restored for CRIU to work with ROCm applications (a sketch of
a corresponding record follows the list):

* Buffer type
* Buffer handle
* Buffer size (page aligned)
* Virtual address for GPU mapping (page aligned)
* Device file offset for CPU mapping (for VRAM and GTT BOs)
* Memory contents (for VRAM and GTT BOs)
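To make this list concrete, below is a minimal sketch of a per-BO record that
a checkpoint could store. All names are hypothetical illustrations; the actual
ABI will be defined by the new KFD CRIU ioctls.

    /*
     * Hypothetical sketch of the per-BO state a checkpoint would record.
     * Names are illustrative only; the real ABI is defined by the KFD
     * driver's new CRIU ioctls.
     */
    #include <stdint.h>

    enum criu_bo_type {
        CRIU_BO_VRAM,     /* device memory managed by the kernel mode driver */
        CRIU_BO_GTT,      /* system memory managed by the kernel mode driver */
        CRIU_BO_USERPTR,  /* normal system memory managed in user mode */
        CRIU_BO_DOORBELL, /* doorbell aperture for user mode submissions */
        CRIU_BO_MMIO,     /* aperture for GPU control registers */
    };

    struct criu_bo_entry {
        uint32_t type;        /* enum criu_bo_type */
        uint32_t handle;      /* buffer handle */
        uint64_t size;        /* buffer size, page aligned */
        uint64_t gpu_vaddr;   /* virtual address of GPU mapping, page aligned */
        uint64_t mmap_offset; /* device file offset for CPU mapping
                               * (VRAM and GTT BOs only) */
        /* VRAM and GTT memory contents are dumped separately as raw data */
    };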
## Queues

ROCm uses user mode queues to submit work to the GPUs. There are several memory
buffers associated with queues. At the language runtime or application level,
they expose the ring buffer as well as a signal object for telling the GPU
about new commands added to the queue. The signal is mapped to a doorbell (a
64-bit entry in the doorbell aperture mapped by the doorbell BO). Internally
there are other buffers needed for dispatch completion tracking, shader state
saving during queue preemption, and the queue state itself. Some of these
buffers are managed in user mode, others are managed in kernel mode.

When an application is checkpointed, we need to preempt all user mode queues
belonging to the process and then save their state (see the sketch after the
lists below), including:

* Queue type (compute or DMA)
* MQD (memory queue descriptor managed in kernel mode), with state such as
  * ring buffer address
  * read and write pointers
  * doorbell offset
  * pointer to the AQL queue data structure
* Control stack (kernel-managed piece of state needed for resuming a preempted queue)

The rest of the queue state is contained in user-managed buffer objects that
will be saved by the memory state handling described above:

* Ring buffer (userptr BO containing commands sent to the GPU)
* AQL queue data structure (userptr BO containing `struct hsa_queue_t`)
* EOP buffer (VRAM BO used for dispatch completion tracking by the command processor)
* Context save area (userptr BO for saving shader state of preempted wavefronts)
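As a rough illustration, the kernel-managed portion of this queue state could
be captured per queue in a record like the following sketch. Field names and
the handling of the MQD and control stack as opaque blobs are assumptions, not
the actual KFD ABI.

    /*
     * Hypothetical per-queue record for checkpointing the kernel-managed
     * queue state listed above. The MQD and control stack are opaque to
     * user mode and are assumed to be appended as raw blobs; all names
     * are illustrative only.
     */
    #include <stdint.h>

    enum criu_queue_type {
        CRIU_QUEUE_COMPUTE, /* user mode compute queue */
        CRIU_QUEUE_SDMA,    /* DMA queue */
    };

    struct criu_queue_entry {
        uint32_t type;            /* enum criu_queue_type */
        uint64_t ring_base;       /* ring buffer address */
        uint64_t rptr;            /* read pointer */
        uint64_t wptr;            /* write pointer */
        uint64_t doorbell_offset; /* doorbell offset */
        uint64_t aql_queue_addr;  /* pointer to AQL queue data structure */
        uint64_t mqd_size;        /* size of the opaque MQD blob */
        uint64_t ctl_stack_size;  /* size of the control stack blob */
        /* followed by mqd_size + ctl_stack_size bytes of opaque data */
    };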
## Events

Events are used to implement interrupt-based sleeping/waiting for signals sent
from the GPU to the host. Signals are represented by data structures in KFD
and by an entry in a user-allocated, GPU-accessible BO with event slots. We
need to save the allocated set of event IDs and each event's signaling state.
The contents of the event slots will be saved by the memory state handling
described above.

## Topology

When ROCm applications start, they enumerate the device topology to find the
available GPUs, their capabilities, and their connectivity. An application can
be checkpointed at any time, so it will not be at a safe point to re-enumerate
the topology when it is restored. Therefore, we can only support restoring
applications on systems with a very similar topology:

* Same number of GPUs
* Same type of GPUs (i.e. instruction set, cache sizes, number of compute units, etc.)
* Same or larger memory size
* Same VRAM accessibility by the host
* Same connectivity and P2P memory support between GPUs

At the KFD ioctl level, GPUs are identified by GPUIDs, which are unique
identifiers created by hashing various GPU properties. That way a GPUID will
not change during the lifetime of a process, even in a future where GPUs may
be added or removed dynamically. When restoring a process on a different
system, the GPUIDs may have changed. Or it may be desirable to restore a
process using a different subset of GPUs on the same system (using cgroups).
Therefore, we will need a translation of GPUIDs for restored processes that
applies to all KFD ioctl calls after an application was restored.
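A minimal sketch of such a GPUID translation is shown below, assuming the
plugin builds a per-process table when it matches the saved GPUs against the
destination topology. The helper and its types are hypothetical, not part of
the actual plugin or KFD interface.

    /*
     * Minimal sketch of GPUID translation on restore. Assumes the plugin
     * builds a table mapping checkpoint-time GPUIDs to GPUIDs on the
     * destination system after validating the topology.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct gpu_id_map {
        uint32_t src_gpu_id; /* GPUID recorded at checkpoint time */
        uint32_t dst_gpu_id; /* matching GPUID on the restore system */
    };

    /* Returns the translated GPUID, or 0 if the saved GPUID has no match
     * in the restore topology. */
    static uint32_t translate_gpu_id(const struct gpu_id_map *map, size_t n,
                                     uint32_t src_gpu_id)
    {
        for (size_t i = 0; i < n; i++)
            if (map[i].src_gpu_id == src_gpu_id)
                return map[i].dst_gpu_id;
        return 0;
    }

Every GPUID argument of every KFD ioctl made after the restore would pass
through a lookup of this kind.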
# CRIU plugins

CRIU provides plugin hooks for device files:

    int cr_plugin_dump_file(int fd, int id);
    int cr_plugin_restore_file(int id);

In a ROCm process, these hooks will be invoked for the `/dev/kfd` and
`/dev/dri/renderD*` device nodes. `/dev/kfd` is used for KFD ioctl calls that
manage memory, queues, signals, and other functionality for all GPUs through a
single device file descriptor. `/dev/dri/renderD*` are per-GPU device files,
called render nodes, that are used mostly for CPU mapping of VRAM and GTT BOs.
Each BO is given a unique offset in the render node of the corresponding GPU
at allocation time.

Render nodes are also used for memory management and command submission by the
Mesa user mode driver for video decoding and post-processing. These use cases
are relevant even in data centers. Supporting them is not an immediate
priority but is planned for the future; it will require saving additional
state as well as synchronizing with any outstanding jobs. For now, there is no
kernel-mode state associated with `/dev/dri/renderD*`.

The two existing plugin hooks can be used for saving and restoring most state
associated with ROCm applications. We are planning to add new ioctl calls to
`/dev/kfd` to help with this.

## Dumping

At the “dump” stage, the ioctl executes in the context of the CRIU dumper
process, but the file descriptor (fd) is “drained” from the process being
saved by the parasite code that CRIU injects into its target. This allows the
plugin to make an ioctl call with enough context for KFD to access all the
kernel mode state associated with the target process. CRIU is ptrace-attached
to the target process. KFD can use that fact to authorize access to the target
process' information.

The contents of GTT and VRAM BOs are not automatically saved by CRIU, which
can only save the contents of normal pageable mappings. GTT and VRAM BOs are
special device file IO mappings. Therefore, our dumper plugin will need to
save the contents of these BOs itself. In the initial implementation they can
be accessed through `/proc/<pid>/mem`, as sketched below. For better
performance we can later use a DMA engine in the GPU to copy the data to
system memory.
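The following sketch shows how a dumper could read a BO's contents through
`/proc/<pid>/mem`, assuming the BO is CPU-mapped at a known virtual address in
the target process. Names and error handling are illustrative only.

    /*
     * Sketch of dumping a BO's contents via /proc/<pid>/mem, assuming
     * the BO is CPU-mapped at vaddr in the target process.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int dump_bo_contents(pid_t pid, uint64_t vaddr, size_t size,
                                void *buf)
    {
        char path[64];
        ssize_t ret;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* Read at the BO's virtual address in the target's address space */
        ret = pread(fd, buf, size, (off_t)vaddr);
        close(fd);

        return ret == (ssize_t)size ? 0 : -1;
    }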
## Restoring

At the “restore” stage we first need to ensure that the topology of the
visible devices (in the cgroup) is compatible with the topology that was
saved. Once this is confirmed, we can use a new ioctl to load the saved state
back into KFD. This ioctl will run in the context of the process being
restored, so no special authorization is needed. However, some of the data
being copied back into kernel mode could have been tampered with. MQDs and
control stacks provide access to privileged GPU registers. Therefore, the
restore ioctl will only be allowed to run with root privileges.

## Remapping render nodes and mmap offsets

BOs are mapped for CPU access by mmapping the GPU's render node at a specific
offset. The offset within the render node device file identifies the BO.
However, when we recreate the BOs, we cannot guarantee that they will be
restored with the same mmap offset that was saved, because the mmap offset
address space per device is shared system-wide.

When a process is restored on a different GPU, it will need to map the BOs
from a different render node device file altogether.

A new plugin hook will be needed to translate device file names and mmap
offsets to the newly allocated ones before CRIU's PIE code restores the VMA
mappings. Fortunately, ROCm user mode does not remember the file names and
mmap offsets after establishing the mappings, so changing the device files and
mmap offsets under the hood will not be noticed by ROCm user mode.

*This new plugin hook is enabled by the new hook `__UPDATE_VMA_MAP` in our RFC
patch series.*
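Conceptually, such a hook performs a lookup like the one sketched below. The
`__UPDATE_VMA_MAP` hook name comes from our RFC series, but the signature and
table here are hypothetical illustrations of the translation, not the actual
interface.

    /*
     * Conceptual sketch of the translation an __UPDATE_VMA_MAP-style
     * hook performs: map the device path and mmap offset recorded in a
     * VMA entry to the path and offset of the recreated BO.
     */
    #include <stdint.h>
    #include <string.h>

    struct vma_remap_entry {
        char     old_path[64];  /* render node path at dump time */
        uint64_t old_offset;    /* mmap offset recorded in the image */
        char     new_path[64];  /* render node on the restore system */
        uint64_t new_offset;    /* offset of the recreated BO */
    };

    /* Returns 1 and fills in the new mapping if the VMA belongs to one
     * of our BOs, 0 to leave the VMA untouched. */
    static int remap_vma(const struct vma_remap_entry *tbl, size_t n,
                         const char *path, uint64_t offset,
                         char *new_path, uint64_t *new_offset)
    {
        for (size_t i = 0; i < n; i++) {
            if (!strcmp(tbl[i].old_path, path) &&
                tbl[i].old_offset == offset) {
                strcpy(new_path, tbl[i].new_path);
                *new_offset = tbl[i].new_offset;
                return 1;
            }
        }
        return 0;
    }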
## Resuming GPU execution

At the time the `cr_plugin_restore_file` hook runs, it is too early to restore
userptr GPU page table mappings and their MMU notifiers. These mappings mirror
CPU page tables into GPU page tables using the HMM mirror API in the kernel.
The MMU notifiers notify the driver when a virtual address mapping changes so
that the GPU mapping can be updated. This must happen after the restorer PIE
code has restored all the VMAs at their correct virtual addresses; otherwise,
the HMM mirroring will simply fail. Before all the GPU memory mappings are in
place, it is also too early to resume user mode queue execution on the GPUs.

Therefore, a new plugin hook is needed that runs in the context of the master
restore process after the restorer PIE code has restored all the VMAs and
returned control to the restored processes via sigreturn. It needs to be
called once for each restored target process to finalize userptr mappings and
to resume execution on the GPUs.

*This new plugin hook is enabled by the new hook `__RESUME_DEVICES_LATE` in
our RFC patch series.*

## Other CRIU changes

In addition to the new plugin hooks, we need to make some changes to CRIU
itself to support device file VMAs. Currently CRIU simply fails to dump a
process that has such PFN or IO memory mappings. While CRIU will not need to
save the contents of those VMAs, we do need CRIU to save and restore the VMAs
themselves, with translated mmap offsets (see “Remapping render nodes and mmap
offsets” above).

## Security considerations

The new “dump” ioctl we are adding to `/dev/kfd` will expose information about
remote processes, which is a potential security threat. CRIU will be
ptrace-attached to the target process, which gives it full access to the state
of the process being dumped. KFD can use the ptrace attachment to authorize
the use of the new ioctl on a specific target process.

The new “restore” ioctl will load privileged information from user mode back
into the kernel driver and the hardware. This includes MQD contents, which
will eventually be loaded into HQD registers, as well as the control stack,
which is a series of low-level commands that will be executed by the command
processor. Therefore, we are limiting this ioctl to the root user. If CRIU
restore must be possible for non-root users, we will need to sanitize this
privileged state to ensure it cannot be used to circumvent system security
policies (e.g. arbitrary code execution in privileged contexts with access to
page tables, etc.).

Modified mmap offsets could potentially be used to access BOs belonging to
different processes. This potential threat is not new with CRIU: `amdgpu.ko`
already checks mmap offsets to ensure a context (represented by a render node
file descriptor) is only allowed access to its own BOs.

# Glossary

Term | Definition
--- | ---
CRIU | Checkpoint/Restore In Userspace
ROCm | Radeon Open Compute Platform
Thunk | User mode API interface to interact with amdgpu.ko
KFD | AMD Kernel Fusion Driver
Mesa | Open-source OpenGL implementation
GTT | Graphics Translation Table, also used to denote kernel-managed system memory for GPU access
VRAM | Video RAM
BO | Buffer Object
HMM | Heterogeneous Memory Management
AQL | Architected Queuing Language
EOP | End Of Pipe (event indicating shader dispatch completion)
MQD | Memory Queue Descriptor
HQD | Hardware Queue Descriptor
PIE | Position Independent Executable