Commit graph

62 commits

Author SHA1 Message Date
David Francis
ff35a9126e plugins/amdgpu: remove excessive debug messages
These pr_info lines begin with "CC3" and "TWI" were not meant to be
included in the patch.

Co-authored-by: Andrei Vagin <avagin@google.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-11-14 18:31:37 +00:00
David Francis
9e404e2083 plugin/amdgpu: Support for checkpoint of dmabuf fds
amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.

The checkpoint of the dmabuf fd does require the device file
it was exported from to have already been dumped

To identify which device this dmabuf fd was exprted from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:37 +00:00
David Francis
d43217dadb plugin: Add DUMP_DEVICES_LATE callback
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.

The number of device files is not consistent; a process may
have multiple copies of the drm device files open.

Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:37 +00:00
David Francis
db0ec806d1 plugin/amdgpu: Add handling for amdgpu drm buffer objects
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h

Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These occur over fdstore, with the
handle-fstore id relationships kept in shread memory. There is a
new plugin callback: RESTORE_INIT to create the shared memory.

During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.

During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.

So drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs

Update the protobuf messages with new amdgpu drm information.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
5eb61e1b14 plugin/amdgpu: Add drm header
The amdgpu plugin usually calls drm ioctls through the libdrm
wrappers. However, amdgpu restore requires dealing with dmabufs
and gem handles directly, which means drm ioctls must be
called directly.

Add the drm.h header (from the kernel's uapi).

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
0b7ca29c19 plugin/amdgpu: Add amdgpu drm header
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
David Francis
fb02dbf685 files-ext: Allow plugin files to retry
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.

Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.

Signed-off-by: David Francis <David.Francis@amd.com>
2025-11-14 18:31:36 +00:00
Yanning Yang
920437205c plugins/amdgpu: Update README.md and criu-amdgpu-plugin.txt
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
4a3a695dfb plugins/amdgpu: Implement parallel restore
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
33ed774c8d plugins/amdgpu: Add parallel restore command
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Yanning Yang
6386140754 plugins/amdgpu: Add socket operations
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-14 18:27:31 +00:00
Andrei Vagin
ce680fc6c7 Revert "plugins/amdgpu: Implement parallel restore"
This functionality (#2527) is being reverted and excluded from this
release due to issue #2812.

It will be included in a subsequent release once all associated issues
are resolved.

Signed-off-by: Andrei Vagin <avagin@google.com>
2025-11-13 08:40:46 -08:00
Andrei Vagin
2b8951a9cf image: use protoc instead of protoc-c
The new protoc 1.5.2 reports warnings:
`protoc-c` is deprecated. Please use `protoc` instead!

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-11-02 07:48:22 -08:00
Yanning Yang
7a5b3d1f41 plugins/amdgpu: Update README.md and criu-amdgpu-plugin.txt
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
a61116fd93 plugins/amdgpu: Implement parallel restore
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
e8ba7c103a plugins/amdgpu: Add parallel restore command
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Yanning Yang
1fd1b670c4 plugins/amdgpu: Add socket operations
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
2025-11-02 07:48:22 -08:00
Radostin Stoyanov
5335b35f72 images/inventory: add field for enabled plugins
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.

We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:

- When the field is missing, it indicates that the checkpoint was
  created with a previous version of CRIU, and all plugins should be
  *enabled* during restore.

- When the field is empty, it indicates that no plugins were used during
  checkpointing. Thus, all plugins can be *disabled* during restore.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
4f8f6f2883 Makefile.config: set CR_PLUGIN_DEFAULT variable
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
Radostin Stoyanov
f1d465448f amdgpu: remove exec permissions on source files
This patch fixes the following warnings that appear
when building an RPM package:

+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-03-21 12:40:31 -07:00
David Francis
096c1f7a4d plugins/amdgpu - Increase maximum parameter length
The topology parsing assumed that all parameter names were
30 characters or fewer, but

recommended_sdma_engine_id_mask

is 31 characters.

Make the maximum length a macro, and set it to 64.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-19 15:23:42 -07:00
David Francis
60ee5ebd9d plugins/amdgpu: Zero ib_info on initialization
This struct was being used un-initialized, meaning it
was filled with random garbage.

Mea culpa.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
21ea718f9f plugins/amdgpu: fix printf format specifiers
Errors on aarch64:

	In file included from amdgpu_plugin_drm.h:10,
			 from amdgpu_plugin.c:33:
	amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
	amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
	 1236 |         pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
	      |         ^~~~~~~
	cc1: all warnings being treated as errors

Errors on ppc64:

	In file included from amdgpu_plugin_drm.h:10,
			 from amdgpu_plugin.c:33:
	amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
	amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
	 1236 |         pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
	      |         ^~~~~~~
	cc1: all warnings being treated as errors
	In file included from amdgpu_plugin_util.c:38:
	amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
	  196 |                 pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
	  197 |                 pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
	  198 |                 pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
	      |                 ^~~~~~~
	amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
	   24 | #define LOG_PREFIX "amdgpu_plugin: "
	      |                    ^~~~~~~~~~~~~~~~~
	../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
	   47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
	      |                                                    ^~~~~~~~~~
	amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
	  199 |                 pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
	      |                 ^~~~~~~
	cc1: all warnings being treated as errors

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
3e2ed18790 plugins/amdgpu: use C99-standard types
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
2ee5844411 plugins/amdgpu: fix cross-compilation
To enable cross-compile we need to use the CC definition from
criu/scripts/nmk/scripts/tools.mk:

CC := $(CROSS_COMPILE)$(HOSTCC)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
c42b58f4fb plugin: enable multiple plugins for the same hook
CRIU provides two plugins for checkpoint/restore of GPU applications:
amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to
enable restore:

    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late)
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)

However, CRIU currently does not support running more than one plugin
for the same hook. As a result, when both plugins are installed, the
resume function for CUDA applications is not executed. To fix this,
we need to make sure that both `plugin_resume_devices_late()` functions
return `-ENOTSUP` when restore is not supported.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
fcbadfbdbf plugins: set executable bit on .so files
For historical reasons, some tools like rpm [1] or ldd [2,3]
may expect the executable bit to be present for the correct
identification of shared libraries. The executable bit on .so
files is set by default by compilers (e.g., GCC). It is not
strictly necessary but primarily a convention.

[1] https://docs.fedoraproject.org/en-US/package-maintainers/CommonRpmlintIssues/#unstripped_binary_or_object
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/ldd.bash.in;h=d6b640df;hb=HEAD#l154

[3] $ sudo ldd /usr/lib/criu/*.so
/usr/lib/criu/amdgpu_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/amdgpu_plugin.so'
	linux-vdso.so.1 (0x00007fd0a2a3e000)
	libdrm.so.2 => /lib64/libdrm.so.2 (0x00007fd0a29eb000)
	libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007fd0a29de000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fd0a27fc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fd0a2a40000)
/usr/lib/criu/cuda_plugin.so:
ldd: warning: you do not have execution permission for `/usr/lib/criu/cuda_plugin.so'
	linux-vdso.so.1 (0x00007f1806e13000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f1806c08000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f1806e15000)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
a808f09bea amdgpu_plugin: fix lint errors
$ make lint
 ...
 # Do not append \n to pr_perror, pr_pwarn or fail
 ! git --no-pager grep -E '^\s*\<(pr_perror|pr_pwarn|fail)\>.*\\n"'
 plugins/amdgpu/amdgpu_plugin.c:		pr_perror("%s(), Can't handle VMAs of input device\n", __func__);

 ! git --no-pager grep -En '^\s*\<pr_(err|warn|msg|info|debug)\>.*);$' | grep -v '\\n'
 plugins/amdgpu/amdgpu_plugin_drm.c:45:		pr_err("Error in getting stat for: %s", path);
 plugins/amdgpu/amdgpu_plugin_util.c:77:		pr_err("Unable to read file (read:%ld buf_len:%ld)", len_read, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:89:		pr_err("Unable to write file (wrote:%ld buf_len:%ld)", len_write, buf_len);
 plugins/amdgpu/amdgpu_plugin_util.c:120:		pr_err("%s: Failed to open for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:126:		pr_err("%s: Failed get pointer for %s", path, write ? "write" : "read");
 plugins/amdgpu/amdgpu_plugin_util.c:136:		pr_err("%s:Failed to access file size", path);
 plugins/amdgpu/amdgpu_plugin_util.c:152:		pr_err("Cannot fopen %s", file_path);

 make: *** [Makefile:470: lint] Error 1

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
0d5923c95e amdgpu_plugin: Refactor code used to implement Checkpoint
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00
Ramesh Errabolu
733ef96315 amdgpu_plugin: Refactor code in preparation to support C&R for DRM devices
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
2024-09-11 16:02:11 -07:00
Pavel Tikhomirov
b689a6710c plugin/amdgpu: Also don't print 'plugin failed' in criu
We already don't treat it as error in the plugin itself, but after
returning -1 from RESUME_DEVICES_LATE hook we print debug message in
criu about failed plugin, let's return 0 instead.

While on it let's replace ret to exit_code.

Fixes: a9cbdad76 ("plugin/amdgpu: Don't print error for "No such process" during resume")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2024-09-11 16:02:11 -07:00
David Francis
59599dacdd plugin/amdgpu: Don't print error for "No such process" during resume
During the late stages of restore, each process being resumed gets
an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd
process info, this call with fail with -ESRCH. This is normal
behaviour, so we shouldn't print an error message for it.

Signed-off-by: David Francis <David.Francis@amd.com>
2024-09-11 16:02:11 -07:00
Andrei Vagin
e076c11e22 ci: fix codespell errors
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-11-27 16:47:16 -08:00
Radostin Stoyanov
28e854d662 amdgpu: fix clang warnings
amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
        if (ret) {
            ^~~
amdgpu_plugin.c:988:8: note: uninitialized use occurs here
        xfree(buffer);

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2023-10-22 13:29:25 -07:00
Radostin Stoyanov
ba168ab78c ci: enable build with amdgpu plugin
This patch adds the `libdrm-dev` package to the list of CRIU
dependencies installed in CI to build CRIU with amdgpu plugin.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2023-10-22 13:29:25 -07:00
Andrei Vagin
aa38a59899 amdgpu: print an error if the dup syscall fails
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-22 13:29:25 -07:00
Andrei Vagin
940a05c0ba amdgpu: don't leak fd on an error path in open_img_file
Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-22 13:29:25 -07:00
Andrei Vagin
a68975c06d plugins: the UPDATE_VMA_MAP callback returns fd with the full control
It means CRIU has to close it when it is not needed.

It looks more logically correct and matches the behaviour of
the RESTORE_EXT_FILE callback.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-22 13:29:25 -07:00
David Francis
d06c9b5cda criu/plugin: Add environment variable to cap size of buffers.
The amdgpu plugin would create a memory buffer at the size
of the largest VRAM bo (buffer object). On some systems, VRAM
size exceeds RAM size, so the largest bo might be larger than
the available memory.

Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the
size of this buffer. By default, it is set to 0, and has no
effect. When active, any bo larger than its value will be
saved to/restored from file in multiple passes.

Signed-off-by: David Francis <David.Francis@amd.com>
2023-10-22 13:29:25 -07:00
Radostin Stoyanov
a4b49c46fe amdgpu_plugin: remove duplicated log prefix
The log prefix "amdgpu_plugin:" is defined with `LOG_PREFIX` in
`amdgpu_plugin.c`.  However, the prefix is also included in each
log message. As a result it appears duplicated in the log messages:

(00.044324) amdgpu_plugin: amdgpu_plugin: devices:1 bos:58 objects:148 priv_data:45696
(00.045376) amdgpu_plugin: amdgpu_plugin: Thread[0x5589] started
(00.167172) amdgpu_plugin: amdgpu_plugin: img_path = amdgpu-kfd-62.img
(00.083739) amdgpu_plugin: amdgpu_plugin : amdgpu_plugin_dump_file() called for fd = 235

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2023-10-22 13:29:25 -07:00
Adrian Reber
df7b897a22 ci: fix new codespell errors
Signed-off-by: Adrian Reber <areber@redhat.com>
2023-10-22 13:29:25 -07:00
Radostin Stoyanov
fa2c585c22 amdgpu: define __nmk_dir if missing
This patch adds a missing definition for `__nmk_dir` in the Makefile
for the amdgpu plugin. This definition is required, for example, when
building the `test_topology_remap` target:

	make -C plugins/amdgpu/ test_topology_remap

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2023-04-15 21:17:21 -07:00
Radostin Stoyanov
dd0217976c amdgpu: Add gitignore
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2023-04-15 21:17:21 -07:00
Radostin Stoyanov
46ec6749fa ci: Fix code indent
This patch contains auto-generated changes from `make indent`

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-06-22 10:20:33 -07:00
Radostin Stoyanov
a90a1d4827 amdgpu: Set PLUGINDIR to /usr/lib/criu
Building the criu packages for Ubuntu/Debian fails with:

	mkdir: cannot create directory '/var/lib/criu': Permission denied

This patch updates PLUGINDIR with the value /usr/lib/criu

Fixes: #1877

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-06-22 10:20:33 -07:00
Radostin Stoyanov
e6f292cb38 amdgpu/Makefile: Fix include path
When building packages for CRIU the source directory might have a
name different than 'criu'.

Fixes: #1877

Reported-by: @siris
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2022-06-22 10:20:33 -07:00
Kir Kolyshkin
0194ed392f Fix some codespell warnings
Brought to you by

	codespell -w

(using codespell v2.1.0).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
87d3735145 criu/plugin: Add support for criu image streamer
Modifications to support criu image streamer when using amdgpu_plugin.
When running with criu image streamer, fseek/lseek is not available so
we store the file size in the first 8-bytes of the actual file.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
55370b720e criu/plugin: Store BO contents directly to file
Store BO contents directly to file (1 per GPU) instead of using
protobuf.

Bug Fix:
Fixes an issue where we could not handle BOs bigger than 4GB because
protobuf has an internal limit of 4GB for the Bytes structure.

Performance Improvements:
This significantly reduces CR duration on multi-GPU systems as it allows
reading and writing to disk in parallel. During checkpoint, instead of
waiting for all the BO contents to be read from the one protobuf file,
we can now start writing the BO contents as soon as the first BO is read
from disk. During restore, we can start writing BO contents to disk
after the first BO from VRAM. This also reduces the peak amount of
system memory used as we only need to keep 1 BO content in memory per
GPU at a time instead of all the BO contents.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
Felix Kuehling
ecdf740fa3 criu/plugin: Add whitepaper document
Adding whitepaper document

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00