The pr_info lines beginning with "CC3" and "TWI" were not meant to be
included in the patch.
Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.
The checkpoint of the dmabuf fd requires the device file
it was exported from to have already been dumped.
To identify which device this dmabuf fd was exported from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.
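The probing step described above can be sketched roughly as follows.
This is only an illustration, not the plugin's code: import_fn,
find_exporting_device and fake_import are hypothetical names, and the
import itself is abstracted behind a callback. In a real build, a call
such as libdrm's drmPrimeFDToHandle() could plausibly fill that role,
but that is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical import hook: returns 0 and fills *handle on success. */
typedef int (*import_fn)(int device_fd, int dmabuf_fd, uint32_t *handle);

/*
 * Try to import dmabuf_fd on each device in turn. Returns the index of
 * the first device on which the import succeeds (with *handle set to
 * the handle it imported as), or -1 if no device accepts it.
 */
int find_exporting_device(const int *device_fds, size_t n, int dmabuf_fd,
			  import_fn try_import, uint32_t *handle)
{
	assert(device_fds != NULL && try_import != NULL && handle != NULL);

	for (size_t i = 0; i < n; i++) {
		if (try_import(device_fds[i], dmabuf_fd, handle) == 0)
			return (int)i;
	}
	return -1;
}

/* Demo stand-in for a real import: only device fd 7 accepts the dmabuf. */
int fake_import(int device_fd, int dmabuf_fd, uint32_t *handle)
{
	(void)dmabuf_fd;
	if (device_fd != 7)
		return -1;
	*handle = 0x77;
	return 0;
}
```

The returned index and handle are what would be recorded in the image
so that restore can recreate the dmabuf fd from the right device.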
Signed-off-by: David Francis <David.Francis@amd.com>
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.
The number of device files is not consistent; a process may
have multiple copies of the drm device files open.
Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.
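A minimal sketch of the idea, assuming a simple hook table: the names
below (register_post_dump, run_post_dump, plugin_state, plugin_cleanup)
are hypothetical and do not reflect the actual CRIU plugin API. The
point is only that cleanup is driven by one explicit "all files are
dumped" notification rather than by counting files.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOOKS 8

/* Hypothetical callback type invoked once all files are checkpointed. */
typedef void (*post_dump_fn)(void *ctx);

struct hook {
	post_dump_fn fn;
	void *ctx;
};

static struct hook hooks[MAX_HOOKS];
static size_t nr_hooks;

/* Register a cleanup callback; returns 0 on success, -1 if the table is full. */
int register_post_dump(post_dump_fn fn, void *ctx)
{
	assert(fn != NULL);
	if (nr_hooks == MAX_HOOKS)
		return -1;
	hooks[nr_hooks].fn = fn;
	hooks[nr_hooks].ctx = ctx;
	nr_hooks++;
	return 0;
}

/* Called by the core after the last file has been dumped. */
void run_post_dump(void)
{
	for (size_t i = 0; i < nr_hooks; i++)
		hooks[i].fn(hooks[i].ctx);
}

/* Example plugin state: number of device files still held open. */
struct plugin_state {
	int device_fds_open;
};

/* Example cleanup: release every device file, however many there were. */
void plugin_cleanup(void *ctx)
{
	struct plugin_state *st = ctx;

	st->device_fds_open = 0;
}
```

With this shape the plugin does not need to know how many copies of the
drm device files the process had open.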
Signed-off-by: David Francis <David.Francis@amd.com>
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h.
Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These transfers occur over fdstore, with
the handle-to-fdstore id relationships kept in shared memory. There is a
new plugin callback, RESTORE_INIT, to create the shared memory.
During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.
During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.
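The retry behaviour can be illustrated with a small model. This is a
sketch under stated assumptions, not the plugin's implementation:
struct bo, its fields, and restore_all are hypothetical, and "restoring"
a buffer object is reduced to setting a flag. Each pass skips objects
that are already done, defers imported objects whose original has not
been restored yet, and gives up if a pass makes no progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a buffer object to restore. */
struct bo {
	int original; /* index of the exporting BO, or -1 if not imported */
	int restored; /* non-zero once this BO has been restored */
};

/*
 * Restore all BOs, retrying deferred ones in later passes.
 * Returns the number of passes used, or -1 if some BOs can never be
 * completed (for example, a dependency cycle).
 */
int restore_all(struct bo *bos, size_t n)
{
	size_t remaining = 0;
	int passes = 0;

	assert(bos != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (!bos[i].restored)
			remaining++;

	while (remaining > 0) {
		size_t progress = 0;

		for (size_t i = 0; i < n; i++) {
			if (bos[i].restored)
				continue; /* already completed */
			if (bos[i].original >= 0 && !bos[bos[i].original].restored)
				continue; /* original not exported yet, retry later */
			bos[i].restored = 1;
			progress++;
		}
		passes++;
		if (progress == 0)
			return -1;
		remaining -= progress;
	}
	return passes;
}
```

In the real restore the dependency may live in another process, which
is why the handle-to-fdstore relationships are kept in shared memory.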
So that drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs.
Update the protobuf messages with new amdgpu drm information.
Signed-off-by: David Francis <David.Francis@amd.com>
This patch fixes the following warnings that appear
when building an RPM package:
+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Errors on aarch64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
| ^~~~~~~
cc1: all warnings being treated as errors
Errors on ppc64:
In file included from amdgpu_plugin_drm.h:10,
from amdgpu_plugin.c:33:
amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
1236 | pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
| ^~~~~~~
cc1: all warnings being treated as errors
In file included from amdgpu_plugin_util.c:38:
amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
196 | pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
197 | pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
198 | pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
| ^~~~~~~
amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
24 | #define LOG_PREFIX "amdgpu_plugin: "
| ^~~~~~~~~~~~~~~~~
../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
| ^~~~~~~~~~
amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
199 | pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
| ^~~~~~~
cc1: all warnings being treated as errors
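For illustration, two common ways to print a 64-bit value without
tripping -Werror=format= on any architecture are sketched below. This
is not necessarily the change made in this patch. The helper names are
hypothetical, and uint64_t stands in for __u64; since __u64 may be
'unsigned long' or 'unsigned long long' depending on the architecture,
the explicit cast is the option that does not depend on that choice.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Option 1: use the PRIx64 macro, which always matches uint64_t. */
int format_bo_addr(char *buf, size_t len, uint64_t addr)
{
	assert(buf != NULL);
	return snprintf(buf, len, "KFD BO Addr: %" PRIx64, addr);
}

/* Option 2: cast to unsigned long long so that %llx always matches. */
int format_bo_size(char *buf, size_t len, uint64_t size)
{
	assert(buf != NULL);
	return snprintf(buf, len, "KFD BO Size: %llx", (unsigned long long)size);
}
```

Both helpers return the number of characters snprintf() would have
written, so callers can detect truncation in the usual way.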
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor the code that indicates support for
C&R and that checkpoints KFD and DRM devices.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>