Supporting ROCm with CRIU
=========================

_Felix Kuehling <Felix.Kuehling@amd.com>_<br>
_Rajneesh Bhardwaj <Rajneesh.Bhardwaj@amd.com>_<br>
_David Yat Sin <David.YatSin@amd.com>_<br>
_Yanning Yang <yangyanning@sjtu.edu.cn>_

# Introduction

ROCm is the Radeon Open Compute Platform developed by AMD to support
high-performance computing and machine learning on AMD GPUs. It is a nearly
fully open-source software stack starting from the kernel mode GPU driver,
including compilers and language runtimes, all the way up to optimized
mathematics libraries, machine learning frameworks and communication libraries.

Documentation for the ROCm platform can be found here:
https://rocmdocs.amd.com/en/latest/

CRIU is a tool for freezing and checkpointing running applications or
containers and later restoring them on the same or a different system. The
process is transparent to the application being checkpointed. It is mostly
implemented in user mode and relies heavily on Linux kernel features such as
cgroups, ptrace and vmsplice. It can checkpoint and restore most applications
that rely on standard libraries. However, it cannot checkpoint and restore
applications that use device drivers with their own per-application kernel
mode state out of the box. This includes ROCm applications, which use the KFD
device driver to access GPU hardware resources. CRIU includes plugin hooks
that allow extending it to add such support in the future.

A common environment for ROCm applications is in data centers and compute
clusters. In this environment, migrating applications using CRIU would be
beneficial and desirable. This paper outlines AMD's plans for adding ROCm
support to CRIU.

# State associated with ROCm applications

ROCm applications communicate with the kernel mode driver “amdgpu.ko” through
the Thunk library “libhsakmt.so” to enumerate available GPUs and to manage
GPU-accessible memory, user mode queues for submitting work to the GPUs, and
events for synchronizing with the GPUs. Many of these APIs create and
manipulate state maintained in the kernel mode driver that would need to be
saved and restored by CRIU.

## Memory

ROCm manages memory in the form of buffer objects (BOs). We are also working on
a new memory management API that will be based on virtual address ranges. For
now, we are focusing on the buffer-object based memory management.

There are different types of buffer objects supported:

* VRAM (device memory managed by the kernel mode driver)
* GTT (system memory managed by the kernel mode driver)
* Userptr (normal system memory managed by user mode driver or application)
* Doorbell (special aperture for sending signals to the GPU for user mode command submissions)
* MMIO (special aperture for accessing GPU control registers, used for certain cache flushing operations)

All these BOs are typically mapped into the GPU page tables for access by GPUs.
Most of them are also mapped for CPU access. The following BO properties need
to be saved and restored for CRIU to work with ROCm applications (see the
sketch after this list):

* Buffer type
* Buffer handle
* Buffer size (page aligned)
* Virtual address for GPU mapping (page aligned)
* Device file offset for CPU mapping (for VRAM and GTT BOs)
* Memory contents (for VRAM and GTT BOs)
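
To make the record concrete, here is a rough C sketch of the state saved per
BO. The struct and field names are hypothetical, chosen to mirror the list
above; they are not the actual KFD ioctl definition:

    #include <stdint.h>

    /* Hypothetical per-BO checkpoint record; names are illustrative and
     * mirror the property list above, not an actual KFD interface. */
    struct bo_checkpoint_entry {
        uint32_t type;          /* VRAM, GTT, userptr, doorbell or MMIO */
        uint32_t handle;        /* buffer handle in the kernel mode driver */
        uint64_t size;          /* page-aligned buffer size in bytes */
        uint64_t gpu_addr;      /* page-aligned virtual address of the GPU mapping */
        uint64_t mmap_offset;   /* device file offset for the CPU mapping
                                 * (VRAM and GTT BOs only) */
        /* The memory contents of VRAM and GTT BOs are dumped separately. */
    };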

## Queues

ROCm uses user mode queues to submit work to the GPUs. There are several memory
buffers associated with queues. At the language runtime or application level,
they expose the ring buffer as well as a signal object to tell the GPU about
new commands added to the queue. The signal is mapped to a doorbell (a 64-bit
entry in the doorbell aperture mapped by the doorbell BO). Internally there are
other buffers needed for dispatch completion tracking, shader state saving
during queue preemption and the queue state itself. Some of these buffers are
managed in user mode, others are managed in kernel mode.

When an application is checkpointed, we need to preempt all user mode queues
belonging to the process, and then save their state (see the sketch after this
list), including:

* Queue type (compute or DMA)
* MQD (memory queue descriptor managed in kernel mode), with state such as
  * ring buffer address
  * read and write pointers
  * doorbell offset
  * pointer to AQL queue data structure
* Control stack (kernel-managed piece of state needed for resuming preempted queue)
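
A hypothetical C sketch of such a per-queue record follows. The MQD and
control stack are opaque, hardware-specific blobs provided by the kernel at
checkpoint time; all names here are illustrative, not the actual KFD
interface:

    #include <stdint.h>

    /* Hypothetical per-queue checkpoint record (illustrative names only). */
    struct queue_checkpoint_entry {
        uint32_t type;            /* compute or DMA */
        uint64_t ring_buf_addr;   /* virtual address of the ring buffer */
        uint64_t doorbell_offset; /* doorbell used to signal new commands */
        uint32_t mqd_size;        /* size of the MQD blob that follows */
        uint32_t ctl_stack_size;  /* size of the control stack blob */
        /* Followed by mqd_size bytes of MQD contents (which include the
         * read/write pointers and the AQL queue pointer) and
         * ctl_stack_size bytes of control stack data. */
    };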

The rest of the queue state is contained in user-managed buffer objects that
will be saved by the memory state handling described above:

* Ring buffer (userptr BO containing commands sent to the GPU)
* AQL queue data structure (userptr BO containing `struct hsa_queue_t`)
* EOP buffer (VRAM BO used for dispatch completion tracking by the command processor)
* Context save area (userptr BO for saving shader state of preempted wavefronts)

## Events

Events are used to implement interrupt-based sleeping/waiting for signals sent
from the GPU to the host. Signals are represented by data structures in KFD
and an entry in a user-allocated, GPU-accessible BO with event slots. We need
to save the allocated set of event IDs and each event’s signaling state (see
the sketch below). The contents of the event slots will be saved by the memory
state handling described above.
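
Since only the event IDs and signaling state live in the kernel, the saved
record per event can be small. A minimal sketch, again with purely
illustrative names:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-event checkpoint record (illustrative names). */
    struct event_checkpoint_entry {
        uint32_t event_id;  /* ID allocated by KFD; indexes an event slot */
        uint32_t type;      /* e.g. signal or memory exception event */
        bool     signaled;  /* current signaling state to restore */
    };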

## Topology

When ROCm applications are started, they enumerate the device topology to find
available GPUs, their capabilities and connectivity. An application can be
checkpointed at any time, so when it is restored it will not be at a point
where it could safely re-enumerate the topology. Therefore, we can only
support restoring applications on systems with a very similar topology:

* Same number of GPUs
* Same type of GPUs (i.e. instruction set, cache sizes, number of compute units, etc.)
* Same or larger memory size
* Same VRAM accessibility by the host
* Same connectivity and P2P memory support between GPUs

At the KFD ioctl level, GPUs are identified by GPUIDs, which are unique
identifiers created by hashing various GPU properties. That way a GPUID will
not change during the lifetime of a process, even in a future where GPUs may be
added or removed dynamically. When restoring a process on a different system,
the GPUID may have changed. It may also be desirable to restore a process using
a different subset of GPUs on the same system (using cgroups). Therefore, we
will need a translation of GPUIDs for restored processes that applies to all
KFD ioctl calls after an application was restored, as sketched below.
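
A minimal sketch of such a translation, assuming a table populated at restore
time that maps each saved GPUID to the GPUID of the device chosen on the
destination system (all names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per GPU in the checkpoint: saved ID -> ID on this system. */
    struct gpu_id_map {
        uint32_t saved_gpu_id;    /* GPUID recorded at checkpoint time */
        uint32_t restored_gpu_id; /* GPUID of the matching local device */
    };

    /* Translate a GPUID found in an ioctl argument of a restored process.
     * Returns 0 if the saved GPUID has no match in the local topology. */
    static uint32_t translate_gpu_id(const struct gpu_id_map *map,
                                     size_t n, uint32_t saved_id)
    {
        for (size_t i = 0; i < n; i++)
            if (map[i].saved_gpu_id == saved_id)
                return map[i].restored_gpu_id;
        return 0;
    }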

# CRIU plugins

CRIU provides plugin hooks for device files:

    int cr_plugin_dump_file(int fd, int id);
    int cr_plugin_restore_file(int id);

In a ROCm process, these hooks will be invoked for the `/dev/kfd` and
`/dev/dri/renderD*` device nodes. `/dev/kfd` is used for KFD ioctl calls to
manage memory, queues, signals and other functionality for all GPUs through a
single device file descriptor. `/dev/dri/renderD*` are per GPU device files,
called render nodes, that are used mostly for CPU mapping of VRAM and GTT BOs.
Each BO is given a unique offset in the render node of the corresponding GPU
at allocation time. A plugin can tell these device files apart by the path
behind the file descriptor, as sketched below.
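
A dump plugin must first determine which device a given file descriptor refers
to. A minimal sketch, assuming the plugin resolves the path through
`/proc/self/fd` (the helper name and return convention are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Classify an fd drained from the target process: 1 = /dev/kfd,
     * 2 = render node, 0 = not handled by this plugin. Sketch only. */
    static int classify_device_fd(int fd)
    {
        char link[64], path[256];
        ssize_t len;

        snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
        len = readlink(link, path, sizeof(path) - 1);
        if (len < 0)
            return 0;
        path[len] = '\0';

        if (!strcmp(path, "/dev/kfd"))
            return 1;
        if (!strncmp(path, "/dev/dri/renderD", strlen("/dev/dri/renderD")))
            return 2;
        return 0;
    }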

Render nodes are also used for memory management and command submission by the
Mesa user mode driver for video decoding and post processing. These use cases
are relevant even in data centers. Support for this is not an immediate
priority but planned for the future. This will require saving additional state
as well as synchronization with any outstanding jobs. For now, there is no
kernel-mode state associated with `/dev/dri/renderD*`.

The two existing plugin hooks can be used for saving and restoring most state
associated with ROCm applications. We are planning to add new ioctl calls to
`/dev/kfd` to help with this.

## Dumping

At the “dump” stage, the ioctl will execute in the context of the CRIU dumper
process, but the file descriptor (fd) is “drained” from the process being saved
by the parasite code that CRIU injects into its target. This allows the plugin
to make an ioctl call with enough context to allow KFD to access all the kernel
mode state associated with the target process. CRIU is ptrace-attached to the
target process. KFD can use that fact to authorize access to the target
process' information.

The contents of GTT and VRAM BOs are not automatically saved by CRIU. CRIU can
only support saving the contents of normal pageable mappings. GTT and VRAM BOs
are special device file IO mappings. Therefore, our dumper plugin will need to
save the contents of these BOs. In the initial implementation they can be
accessed through `/proc/<pid>/mem`, as sketched below. For better performance
we can use a DMA engine in the GPU to copy the data to system memory.
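
A minimal sketch of reading a BO's contents through the target's
`/proc/<pid>/mem`, assuming the BO is CPU-mapped at `addr` in the target
process (error handling trimmed for brevity; names are illustrative):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Copy 'size' bytes of a BO mapped at 'addr' in process 'pid'. */
    static ssize_t dump_bo_contents(pid_t pid, uint64_t addr,
                                    void *buf, size_t size)
    {
        char path[64];
        ssize_t ret;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* pread at the BO's virtual address in the target process. */
        ret = pread(fd, buf, size, (off_t)addr);
        close(fd);
        return ret;
    }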

## Restoring

At the “restore” stage we first need to ensure that the topology of visible
devices (in the cgroup) is compatible with the topology that was saved. Once
this is confirmed, we can use a new ioctl to load the saved state back into
KFD, as sketched below. This ioctl will run in the context of the process
being restored, so no special authorization is needed. However, some of the
data being copied back into kernel mode could have been tampered with. MQDs
and control stacks provide access to privileged GPU registers. Therefore, the
restore ioctl will only be allowed to run with root privileges.
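
The general shape of that call, with a placeholder ioctl name and argument
struct (the actual ioctl is still being defined; nothing here is the final
KFD interface):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Placeholder for the new restore ioctl; not a real ioctl number. */
    #define KFD_IOC_CRIU_RESTORE_SKETCH 0

    /* Placeholder argument struct bundling the saved state records. */
    struct kfd_criu_restore_args_sketch {
        uint64_t bos_ptr;     /* user pointer to saved BO records */
        uint64_t queues_ptr;  /* user pointer to saved queue records */
        uint64_t events_ptr;  /* user pointer to saved event records */
        uint32_t num_bos, num_queues, num_events;
    };

    static int restore_kfd_state(struct kfd_criu_restore_args_sketch *args)
    {
        /* Runs in the restored process; the fd intentionally stays open,
         * since it becomes the restored process's /dev/kfd descriptor. */
        int fd = open("/dev/kfd", O_RDWR);
        if (fd < 0)
            return -1;
        /* Load saved BO, queue and event state back into KFD. */
        return ioctl(fd, KFD_IOC_CRIU_RESTORE_SKETCH, args);
    }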

## Remapping render nodes and mmap offsets

BOs are mapped for CPU access by mmapping the GPU's render node at a specific
offset. The offset within the render node device file identifies the BO.
However, when we recreate the BOs, we cannot guarantee that they will be
restored with the same mmap offset that was saved, because the mmap offset
address space per device is shared system wide.

When a process is restored on a different GPU, it will need to map the BOs from
a different render node device file altogether.

A new plugin call will be needed to translate device file names and mmap
offsets to the newly allocated ones, before CRIU's PIE code restores the VMA
mappings (see the sketch below). Fortunately, ROCm user mode does not remember
the file names and mmap offsets after establishing the mappings, so changing
the device files and mmap offsets under the hood will not be noticed by ROCm
user mode.
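
One plausible shape for this translation, assuming a lookup table populated
while the BOs are recreated. The struct, names and signature are illustrative;
the real hook interface is defined by the RFC patch series:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One entry per recreated BO mapping (illustrative names). */
    struct vma_remap_entry {
        char     old_path[64];  /* e.g. "/dev/dri/renderD128" at dump time */
        uint64_t old_offset;    /* mmap offset recorded at dump time */
        char     new_path[64];  /* render node chosen at restore time */
        uint64_t new_offset;    /* mmap offset assigned at restore time */
    };

    /* Translate a saved (path, offset) pair to the one to use on the
     * destination system, before the PIE code restores the VMAs. */
    static int update_vma_map_sketch(const struct vma_remap_entry *tbl,
                                     size_t n, const char *path,
                                     uint64_t offset, char *new_path,
                                     size_t len, uint64_t *new_offset)
    {
        for (size_t i = 0; i < n; i++) {
            if (!strcmp(tbl[i].old_path, path) &&
                tbl[i].old_offset == offset) {
                strncpy(new_path, tbl[i].new_path, len - 1);
                new_path[len - 1] = '\0';
                *new_offset = tbl[i].new_offset;
                return 1; /* translated */
            }
        }
        return 0; /* not one of ours */
    }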

*This new plugin is enabled by the new hook `__UPDATE_VMA_MAP` in our RFC patch
series.*

## Resuming GPU execution

At the time of running the `cr_plugin_restore_file` plugin, it is too early to
restore userptr GPU page table mappings and their MMU notifiers. These mappings
mirror CPU page tables into GPU page tables using the HMM mirror API in the
kernel. The MMU notifiers notify the driver when the virtual address mapping
changes so that the GPU mapping can be updated.

This needs to happen after the restorer PIE code has restored all the VMAs at
their correct virtual addresses. Otherwise, the HMM mirroring will simply fail.
Before all the GPU memory mappings are in place, it is also too early to resume
the user mode queue execution on the GPUs.

Therefore, a new plugin is needed that runs in the context of the master
restore process after the restorer PIE code has restored all the VMAs and
returned control to all the restored processes via sigreturn. It needs to be
called once for each restored target process to finalize userptr mappings and
to resume execution on the GPUs.

*This new plugin is enabled by the new hook `__RESUME_DEVICES_LATE` in our RFC
patch series.*
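
A sketch of how the master restore process might drive this step, once per
restored target. Both function names are illustrative placeholders, not the
RFC interface; the per-target work is stubbed out:

    #include <sys/types.h>

    /* Placeholder for the late-resume operation on one restored target:
     * finalize its userptr mappings and restart its user mode queues.
     * In the RFC this work is done through a new KFD ioctl. */
    static int resume_devices_late_sketch(pid_t target_pid)
    {
        (void)target_pid;
        return 0;
    }

    /* Illustrative driver loop in the master restore process: the hook
     * must run once per restored target, after all VMAs are in place. */
    static int resume_all_targets(const pid_t *pids, int n)
    {
        for (int i = 0; i < n; i++)
            if (resume_devices_late_sketch(pids[i]) < 0)
                return -1;
        return 0;
    }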

## Restoring BO content in parallel

Restoring the BO content is an important part of restoring GPU state and
usually takes a significant amount of time. A possible location for this
procedure is the `cr_plugin_restore_file` hook. However, restoring in this hook
blocks the target process from performing other restore operations, which
hinders further optimization of the restore process.

Therefore, a new plugin hook that runs in the master restore process is
introduced, and it interacts with the `cr_plugin_restore_file` hook to complete
the restore of BO content. Specifically, the target process only needs to send
the relevant BOs to the master restore process, while the new hook handles the
entire restore of the buffer objects. This way, the target process can perform
other restore operations while the BO content is being restored, accelerating
the restore procedure. This is an implementation of the gCROP method proposed
in the ACM SoCC'24 paper: [On-demand and Parallel Checkpoint/Restore for GPU
Applications](https://dl.acm.org/doi/10.1145/3698038.3698510).

*This optimization technique is enabled by the `__POST_FORKING` hook.*
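
One standard mechanism for handing a BO to another process is passing its file
descriptor over a Unix domain socket with `SCM_RIGHTS`. The sketch below shows
the sending side, assuming each BO can be exported as a file descriptor; it
illustrates the mechanism only, not the plugin's actual protocol:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one BO file descriptor to the master restore process over a
     * connected Unix domain socket using SCM_RIGHTS (sketch only). */
    static int send_bo_fd(int sock, int bo_fd)
    {
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(cbuf, 0, sizeof(cbuf));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &bo_fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }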

## Other CRIU changes

In addition to the new plugins, we need to make some changes to CRIU itself to
support device file VMAs. Currently CRIU will simply fail to dump a process
that has such PFN or IO memory mappings. While CRIU will not need to save the
contents of those VMAs, we do need CRIU to save and restore the VMAs
themselves, with translated mmap offsets (see “Remapping render nodes and mmap
offsets” above).

## Security considerations

The new “dump” ioctl we are adding to `/dev/kfd` will expose information about
remote processes. This is a potential security threat. CRIU will be
ptrace-attached to the target process, which gives it full access to the state
of the process being dumped. KFD can use ptrace attachment to authorize the use
of the new ioctl on a specific target process.

The new “restore” ioctl will load privileged information from user mode back
into the kernel driver and the hardware. This includes MQD contents, which will
eventually be loaded into HQD registers, as well as a control stack, which is a
series of low-level commands that will be executed by the command processor.
Therefore, we are limiting this ioctl to the root user. If CRIU restore must be
possible for non-root users, we need to sanitize the privileged state to ensure
it cannot be used to circumvent system security policies (e.g. arbitrary code
execution in privileged contexts with access to page tables etc.).

Modified mmap offsets could potentially be used to access BOs belonging to
different processes. This potential threat is not new with CRIU. `amdgpu.ko`
already implements checking of mmap offsets to ensure a context (represented by
a render node file descriptor) is only allowed access to its own BOs.

# Glossary

Term | Definition
--- | ---
CRIU | Checkpoint/Restore In Userspace
ROCm | Radeon Open Compute Platform
Thunk | User-mode API interface to interact with amdgpu.ko
KFD | AMD Kernel Fusion Driver
Mesa | Open source OpenGL implementation
GTT | Graphics Translation Table, also used to denote kernel-managed system memory for GPU access
VRAM | Video RAM
BO | Buffer Object
HMM | Heterogeneous Memory Management
AQL | Architected Queueing Language
EOP | End of pipe (event indicating shader dispatch completion)
MQD | Memory Queue Descriptors
HQD | Hardware Queue Descriptors
PIE | Position Independent Executable