Commit graph

8 commits

Author SHA1 Message Date
Radostin Stoyanov
0038ba8431 amdgpu: use local kernel headers instead of libdrm
Use local copies of amdgpu and DRM headers for consistency.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:36:39 -08:00
Radostin Stoyanov
486513d8af plugins/amdgpu: remove unused variable
amdgpu_plugin_drm.c:167:6: error: variable 'num_bos' set but not used [-Werror,-Wunused-but-set-variable]
  167 |         int num_bos = 0;
      |

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:36:39 -08:00
David Francis
87677ff1e7 plugins/amdgpu: remove excessive debug messages
These pr_info lines beginning with "CC3" and "TWI" were not meant to be
included in the patch.

Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-05 15:36:39 -08:00
David Francis
11f09175e2 plugin/amdgpu: Support for checkpoint of dmabuf fds
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.

The checkpoint of the dmabuf fd requires the device file
it was exported from to have already been dumped.

To identify which device this dmabuf fd was exported from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.
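
The identification step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's code: `import_fn`, `find_exporting_device`, `fake_import` and the device fd array are hypothetical names, and the callback stands in for whatever PRIME import call the plugin really uses. Stopping at the first device that accepts the import is also an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: try to import dmabuf_fd on the device behind
 * dev_fd. Returns 0 and fills *handle on success, non-zero on failure. */
typedef int (*import_fn)(int dev_fd, int dmabuf_fd, uint32_t *handle);

/* Try the import on each device in turn. On the first success, store the
 * resulting handle and return the index of that device. Return -1 if no
 * device accepts the dmabuf fd. */
static int find_exporting_device(const int *dev_fds, size_t n_devs,
				 int dmabuf_fd, import_fn try_import,
				 uint32_t *handle_out)
{
	assert(dev_fds != NULL && try_import != NULL && handle_out != NULL);

	for (size_t i = 0; i < n_devs; i++) {
		uint32_t handle = 0;

		if (try_import(dev_fds[i], dmabuf_fd, &handle) == 0) {
			*handle_out = handle; /* recorded for use at restore */
			return (int)i;
		}
	}
	return -1;
}

/* Stand-in importer for demonstration only: device fd 12 is the one
 * that accepts the import, and the handle value is made up. */
static int fake_import(int dev_fd, int dmabuf_fd, uint32_t *handle)
{
	if (dev_fd != 12)
		return -1;
	*handle = (uint32_t)dmabuf_fd + 100;
	return 0;
}
```

Passing the importer as a callback keeps the loop testable without a GPU. A real caller would supply a function that wraps the actual import call on an open device file.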

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-02 16:21:38 +00:00
David Francis
26549aeefa plugin/amdgpu: Add handling for amdgpu drm buffer objects
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h.

Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These occur over fdstore, with the
handle-fdstore id relationships kept in shared memory. There is a
new plugin callback: RESTORE_INIT to create the shared memory.

During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.

During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.
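
The retry behaviour described above can be pictured with a small model. This is a simplified sketch, not the plugin's implementation: it tracks dependencies per file rather than per buffer object, and `struct drm_file`, `needs_export_from` and `restore_with_retry` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one drm file to restore. */
struct drm_file {
	int needs_export_from; /* index of the file that must export first, or -1 */
	bool restored;
};

/* Sweep over the files repeatedly. A file is restored once it has no
 * dependency or its dependency has already been restored. Otherwise it
 * is left for a later pass. The loop stops when everything is restored
 * or when a full pass makes no progress. Returns the number of files
 * restored. */
static size_t restore_with_retry(struct drm_file *files, size_t n)
{
	size_t done = 0;
	bool progress = true;

	assert(files != NULL || n == 0);

	while (done < n && progress) {
		progress = false;
		for (size_t i = 0; i < n; i++) {
			int dep = files[i].needs_export_from;

			if (files[i].restored)
				continue; /* already completed, skip */
			if (dep >= 0 && ((size_t)dep >= n || !files[dep].restored))
				continue; /* original not exported yet, retry later */
			files[i].restored = true;
			done++;
			progress = true;
		}
	}
	return done;
}
```

Stopping when a pass makes no progress keeps the loop from spinning forever if two entries wait on each other.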

So that drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs.

Update the protobuf messages with new amdgpu drm information.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-02 16:21:38 +00:00
Radostin Stoyanov
a808f09bea amdgpu_plugin: fix lint errors
$ make lint
 ...
 # Do not append \n to pr_perror, pr_pwarn or fail
 ! git --no-pager grep -E '^\s*\<(pr_perror|pr_pwarn|fail)\>.*\\n"'
 plugins/amdgpu/amdgpu_plugin.c:		pr_perror("%s(), Can't handle VMAs of input device\n", __func__);

 ! git --no-pager grep -En '^\s*\<pr_(err|warn|msg|info|debug)\>.*);$' | grep -v '\\n'
 plugins/amdgpu/amdgpu_plugin_drm.c:45:		pr_err("Error in getting stat for: %s", path);
 plugins/amdgpu/amdgpu_plugin_util.c:77:		pr_err("Unable to read file (read:%ld buf_len:%ld)", len_read, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:89:		pr_err("Unable to write file (wrote:%ld buf_len:%ld)", len_write, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:120:		pr_err("%s: Failed to open for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:126:		pr_err("%s: Failed get pointer for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:136:		pr_err("%s:Failed to access file size", path);
 plugins/amdgpu/amdgpu_plugin_util.c:152:		pr_err("Cannot fopen %s", file_path);

 make: *** [Makefile:470: lint] Error 1
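
The two rules in the lint output above reduce to a convention about trailing newlines: pr_perror-style messages must not end with "\n", while the other pr_* messages must. The sketch below restates that convention as a check on a format string. It is a simplification of the grep patterns, and `ends_with_newline` and `message_passes_lint` are hypothetical helpers, not part of the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if the format string ends with a newline character. */
static bool ends_with_newline(const char *fmt)
{
	size_t len;

	assert(fmt != NULL);
	len = strlen(fmt);
	return len > 0 && fmt[len - 1] == '\n';
}

/* Hypothetical helper mirroring the two lint rules: perror-style
 * messages must not end with a newline, all other messages must. */
static bool message_passes_lint(bool is_perror_style, const char *fmt)
{
	return is_perror_style ? !ends_with_newline(fmt)
			       : ends_with_newline(fmt);
}
```

For example, the reported string "Cannot fopen %s" fails the check as a pr_err message and passes once "\n" is appended.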

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
0d5923c95e amdgpu_plugin: Refactor code used to implement Checkpoint
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
733ef96315 amdgpu_plugin: Refactor code in preparation to support C&R for DRM devices
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00