amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable]
167 | int num_bos = 0;
|
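For reference, this class of warning fires when a local variable is written but its value is never read. The sketch below is a hypothetical illustration, not the code from amdgpu_plugin_drm.c; the function name and the array-of-ints representation are assumptions made for the example. The fix is either to delete the variable or to actually consume it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration; not the code from amdgpu_plugin_drm.c.
 *
 * The pattern that draws the warning looks like this:
 *
 *	int num_bos = 0;
 *	num_bos = (int)n;	// written here, never read afterwards
 *
 * One fix is to delete the variable. The other, shown below, is to
 * actually use the value (here by returning it to the caller).
 */
static int count_valid_bos(const int *bos, size_t n)
{
	int num_bos = 0;

	assert(bos != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (bos[i] != 0)	/* treat non-zero entries as valid */
			num_bos++;
	}
	return num_bos;
}
```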
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The pr_info lines beginning with "CC3" and "TWI" were not meant to be
included in the patch. Remove them.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.
The checkpoint of the dmabuf fd requires the device file
it was exported from to have already been dumped.
To identify which device this dmabuf fd was exported from, attempt
to import it on each device, then record the handle the dmabuf
imports as. This handle can be used to restore it.
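The device probing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: the struct, the function names, and the "first successful import wins" policy are all hypothetical. The import hook stands in for a real prime import call such as libdrm's drmPrimeFDToHandle(), and fake_import() exists only so the sketch can be exercised without a GPU.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical import hook. In real code this role could be played by
 * libdrm's drmPrimeFDToHandle(dev_fd, dmabuf_fd, &handle), which
 * returns 0 on success. */
typedef int (*prime_import_fn)(int dev_fd, int dmabuf_fd, uint32_t *handle);

struct dmabuf_origin {
	int dev_index;		/* index into dev_fds, or -1 if no device matched */
	uint32_t handle;	/* handle the dmabuf imported as on that device */
};

/* Try each device in turn and record the first successful import.
 * Assumption: a failed import is enough to rule a device out. */
static struct dmabuf_origin find_dmabuf_origin(const int *dev_fds, int n_devs,
					       int dmabuf_fd, prime_import_fn import)
{
	struct dmabuf_origin origin = { .dev_index = -1, .handle = 0 };

	assert(dev_fds != NULL && import != NULL);

	for (int i = 0; i < n_devs; i++) {
		uint32_t handle;

		if (import(dev_fds[i], dmabuf_fd, &handle) == 0) {
			origin.dev_index = i;
			origin.handle = handle;
			break;
		}
	}
	return origin;
}

/* Stand-in for testing only: pretends that the device with fd 11
 * exported the buffer and that it imports as handle 42. */
static int fake_import(int dev_fd, int dmabuf_fd, uint32_t *handle)
{
	(void)dmabuf_fd;
	if (dev_fd != 11)
		return -1;
	*handle = 42;
	return 0;
}
```

The recorded (device index, handle) pair is what would be written to the image so that restore can find the same buffer again.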
Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h.
Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These transfers occur over fdstore, with
the handle-to-fdstore id relationships kept in shared memory. There is a
new plugin callback, RESTORE_INIT, to create the shared memory.
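A table of this kind, mapping a buffer object handle to the fdstore id under which its dmabuf fd was stored, could look roughly like the sketch below. Everything here is an assumption made for illustration: the struct layout, the fixed capacity, and the function names are hypothetical, and the sketch omits the locking that concurrent restorer processes would need.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define MAX_SHARED_BOS 64	/* hypothetical capacity */

struct shared_bo_entry {
	uint32_t handle;	/* buffer object handle */
	int fdstore_id;		/* id the dmabuf fd was stored under */
	int used;		/* 0 while the slot is free */
};

struct shared_bo_table {
	struct shared_bo_entry entries[MAX_SHARED_BOS];
};

/* Create the table in anonymous shared memory, so that processes
 * forked afterwards all see the same entries. The mapping starts
 * zero-filled, so every slot begins as unused. */
static struct shared_bo_table *shared_table_create(void)
{
	void *mem = mmap(NULL, sizeof(struct shared_bo_table),
			 PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return mem == MAP_FAILED ? NULL : mem;
}

/* Record handle -> fdstore id. Returns 0 on success, -1 if full.
 * No locking: a real multi-process user would need a lock here. */
static int shared_table_put(struct shared_bo_table *t, uint32_t handle, int fdstore_id)
{
	assert(t != NULL);

	for (size_t i = 0; i < MAX_SHARED_BOS; i++) {
		if (!t->entries[i].used) {
			t->entries[i].handle = handle;
			t->entries[i].fdstore_id = fdstore_id;
			t->entries[i].used = 1;
			return 0;
		}
	}
	return -1;
}

/* Return the fdstore id for a handle, or -1 if it is not recorded. */
static int shared_table_lookup(const struct shared_bo_table *t, uint32_t handle)
{
	assert(t != NULL);

	for (size_t i = 0; i < MAX_SHARED_BOS; i++) {
		if (t->entries[i].used && t->entries[i].handle == handle)
			return t->entries[i].fdstore_id;
	}
	return -1;
}
```

Creating the mapping once, before the restored processes fork, is what lets an importing process look up an fd that a different process exported.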
During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.
During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.
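The retry behaviour on restore can be illustrated with a simplified, single-process model. This is a sketch under assumptions, not the plugin's code: the struct, the state enum, and the function names are hypothetical, and the real retry happens per drm file across processes rather than over one array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum bo_state { BO_PENDING, BO_DONE };

struct bo {
	int imported_from;	/* index of the original BO, or -1 if this BO is the original */
	enum bo_state state;
};

/* One pass: restore what can be restored now, skip the rest.
 * Returns the number of BOs completed in this pass. */
static int restore_pass(struct bo *bos, size_t n)
{
	int restored = 0;

	assert(bos != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct bo *b = &bos[i];

		if (b->state == BO_DONE)
			continue;	/* already completed */
		if (b->imported_from >= 0 &&
		    bos[b->imported_from].state != BO_DONE)
			continue;	/* original not exported yet, retry later */

		b->state = BO_DONE;	/* stand-in for the real restore work */
		restored++;
	}
	return restored;
}

static size_t pending_count(const struct bo *bos, size_t n)
{
	size_t pending = 0;

	for (size_t i = 0; i < n; i++)
		if (bos[i].state == BO_PENDING)
			pending++;
	return pending;
}

/* Retry passes until everything is restored. Gives up and returns
 * false if a pass makes no progress (for example, a dependency that
 * can never be satisfied). */
static bool restore_all(struct bo *bos, size_t n)
{
	while (pending_count(bos, n) > 0) {
		if (restore_pass(bos, n) == 0)
			return false;
	}
	return true;
}
```

With an imported BO listed before its original, the first pass completes only the original and the second pass completes the import.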
So that drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs.
Update the protobuf messages with new amdgpu drm information.
Signed-off-by: David Francis <David.Francis@amd.com>
Refactor the code used to checkpoint DRM devices. The code is moved
into the amdgpu_plugin_drm.c file, which hosts various methods to
checkpoint and restore a workload.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>