![VulkanMod](https://media.forgecdn.net/avatars/thumbnails/561/294/256/256/637913373178716740.png)
Catastrophic GPU reset of host system on Linux (Dynamic graphics)
ZechariahB opened this issue · 14 comments
Yes, this title is not clickbait; this is indeed a real bug, and I have to make that clear. I use a Lenovo Legion 5 set to dynamic graphics rather than discrete graphics. Booting up Minecraft causes unstable stuttering under default settings. Disabling VSync works, but the stuttering remains. Setting Maximum Framerate below the display refresh rate works, but setting Maximum Framerate to Unlimited first causes the game to freeze and then the integrated graphics card to crash. Both my AMD CPU's integrated graphics and my NVIDIA GPU support Vulkan.
FYI: if I use my discrete graphics card (`prime-run` or `__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia`), I instead get a nicer Minecraft crash log, which could possibly be made a separate issue.
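For anyone trying to reproduce the discrete-GPU case from a script rather than a shell, here is a minimal sketch that builds the environment with the two variables quoted above. The helper name `prime_offload_env` is made up for illustration, and whether your launcher passes the environment through to the game is an assumption you would need to check.

```python
import os

# The two variables quoted in the report above.
PRIME_OFFLOAD_VARS = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
}


def prime_offload_env(base=None):
    """Return a copy of the environment with the PRIME offload variables set.

    Hypothetical helper: `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env.update(PRIME_OFFLOAD_VARS)
    return env
```

The result can be passed as `env=` to `subprocess.run` together with whatever command starts your launcher.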
crash-2023-04-09_12.59.19-client.txt
Operating System: Kubuntu 22.10
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kernel Version: 5.19.0-38-generic (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 5800H with Radeon Graphics
Memory: 13.5 GiB of RAM
Graphics Processor: RENOIR
Manufacturer: LENOVO
Product Name: 82JW
System Version: Legion 5 15ACH6
Graphics Card: GeForce RTX 3050 Ti Mobile (Nvidia 525.105.17 driver)
As far as I know, this is an unrecoverable fatal crash that does not produce a Minecraft crash log. I do not know what files you need, but here is a start.
glxinfo.txt
vulkaninfo.txt
I found the error before all hell broke loose. This might be important for you.
/var/log/syslog:Apr 9 12:58:26 konqi-Legion-5-15ACH6 kernel: [213149.657188] NVRM: API mismatch: the client has the version 525.105.17, but
/var/log/syslog:Apr 9 12:58:26 konqi-Legion-5-15ACH6 kernel: [213149.657188] NVRM: this kernel module has the version 525.89.02. Please
/var/log/syslog:Apr 9 12:58:26 konqi-Legion-5-15ACH6 kernel: [213149.657188] NVRM: make sure that this kernel module and all NVIDIA driver
/var/log/syslog:Apr 9 12:58:26 konqi-Legion-5-15ACH6 kernel: [213149.657188] NVRM: components have the same version.
/var/log/syslog:Apr 9 12:58:30 konqi-Legion-5-15ACH6 plasmashell[196784]: Installing breakpad exception handler for appid(steam)/version(1679680416)/tid(197932)
/var/log/syslog:Apr 9 12:58:30 konqi-Legion-5-15ACH6 plasmashell[196784]: Installing breakpad exception handler for appid(steam)/version(1679680416)/tid(197933)
/var/log/syslog:Apr 9 12:58:33 konqi-Legion-5-15ACH6 xdg-desktop-portal-kde[196968]: xdp-kde-background: GetAppState called: no parameters
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.642585] [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.652982] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=67439730, emitted seq=67439732
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.653395] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 197775 thread Render thread pid 197777
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.653739] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.688893] [drm] free PSP TMR buffer
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717002] CPU: 13 PID: 195147 Comm: kworker/u32:0 Tainted: P O 5.19.0-38-generic #39-Ubuntu
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717014] Hardware name: LENOVO 82JW/LNVNB161216, BIOS HHCN24WW 11/24/2021
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717018] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717034] Call Trace:
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717037] <TASK>
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717041] show_stack+0x4e/0x61
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717048] dump_stack_lvl+0x4a/0x6f
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717054] dump_stack+0x10/0x18
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717058] amdgpu_do_asic_reset+0x2b/0x45e [amdgpu]
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.717640] amdgpu_device_gpu_recover_imp.cold+0x748/0x7f0 [amdgpu]
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718006] amdgpu_job_timedout+0x196/0x1d0 [amdgpu]
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718201] ? finish_task_switch.isra.0+0x85/0x290
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718206] drm_sched_job_timedout+0x70/0x120 [gpu_sched]
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718210] process_one_work+0x225/0x400
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718213] worker_thread+0x50/0x3e0
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718215] ? rescuer_thread+0x3c0/0x3c0
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718217] kthread+0xe9/0x110
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718219] ? kthread_complete_and_exit+0x20/0x20
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718222] ret_from_fork+0x22/0x30
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718226] </TASK>
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718230] amdgpu 0000:05:00.0: amdgpu: MODE2 reset
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718299] amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718444] [drm] PCIE GART of 1024M enabled.
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718446] [drm] PTB located at 0x000000F400900000
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718460] [drm] VRAM is lost due to GPU reset!
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.718461] [drm] PSP is resuming...
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213201.738320] [drm] reserve 0x400000 from 0xf47fb00000 for PSP TMR
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.006775] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.016683] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.016690] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.016696] amdgpu 0000:05:00.0: amdgpu: SMU is resuming...
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.017513] amdgpu 0000:05:00.0: amdgpu: SMU is resumed successfully!
/var/log/syslog:Apr 9 12:59:18 konqi-Legion-5-15ACH6 kernel: [213202.018156] [drm] DMUB hardware initialized: version=0x0101001F
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.570094] [drm] kiq ring mec 2 pipe 1 q 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572827] [drm] VCN decode and encode initialized successfully(under DPG Mode).
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572869] [drm] JPEG decode initialized successfully.
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572874] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572877] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572880] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572882] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572883] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572885] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572886] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572887] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572889] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572891] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572892] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572894] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572896] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572897] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.572899] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574867] amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow start
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574870] amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow done
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574873] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574875] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574920] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574924] amdgpu 0000:05:00.0: amdgpu: GPU reset(4) succeeded!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574930] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574937] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.574946] [drm] Skip scheduling IBs!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.575011] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.575408] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kernel: [213202.575692] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
/var/log/syslog:Apr 9 12:59:03 konqi-Legion-5-15ACH6 xdg-desktop-portal-kde[196968]: xdp-kde-background: GetAppState called: no parameters
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kwin_wayland_wrapper[196336]: amdgpu: The CS has been cancelled because the context is lost.
/var/log/syslog:Apr 9 12:59:19 konqi-Legion-5-15ACH6 kwin_wayland_wrapper[196336]: amdgpu: The CS has been cancelled because the context is lost.
/var/log/syslog:Apr 9 12:59:21 konqi-Legion-5-15ACH6 plasmashell[196381]: amdgpu: The CS has been cancelled because the context is lost.
/var/log/syslog:Apr 9 12:59:21 konqi-Legion-5-15ACH6 kernel: [213204.751114] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: Steam: An X Error occurred
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: X Error of failed request: BadWindow (invalid Window parameter)
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: Major opcode of failed request: 20 (X_GetProperty)
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: Resource id in failed request: 0x1
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: Serial number of failed request: 9
/var/log/syslog:Apr 9 12:59:22 konqi-Legion-5-15ACH6 plasmashell[196784]: xerror_handler: X failed, continuing
/var/log/syslog:Apr 9 12:59:23 konqi-Legion-5-15ACH6 kwin_wayland_wrapper[196336]: amdgpu: The CS has been cancelled because the context is lost.
/var/log/syslog:Apr 9 12:59:23 konqi-Legion-5-15ACH6 kernel: [213206.415230] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
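Since the excerpt above is long, a rough sketch of how one might pull the key events out of a syslog like it may help others compare their own logs. The marker strings are copied from the excerpt; the labels and the `summarize` function are illustrative only, not part of any tool.

```python
import re

# Substrings copied from the log excerpt above, each mapped to a short label.
MARKERS = [
    ("NVRM: API mismatch", "nvidia userspace/kernel module version mismatch"),
    ("Illegal register access in command stream", "amdgpu illegal register access"),
    ("ring gfx timeout", "amdgpu gfx ring timeout"),
    ("GPU reset begin", "amdgpu GPU reset started"),
    ("VRAM is lost due to GPU reset", "VRAM contents lost"),
    ("context is lost", "client lost its GPU context"),
]

# Matches the "process java pid 197775" part of the amdgpu timeout message.
PROCESS_RE = re.compile(r"process (\S+) pid (\d+)")


def summarize(lines):
    """Return (labels in first-seen order, (process name, pid) or None)."""
    seen, process = [], None
    for line in lines:
        for needle, label in MARKERS:
            if needle in line and label not in seen:
                seen.append(label)
        if process is None and "Process information" in line:
            match = PROCESS_RE.search(line)
            if match:
                process = (match.group(1), int(match.group(2)))
    return seen, process
```

Something like `summarize(open("/var/log/syslog", errors="replace"))` would run it over a full log; on the excerpt above it should report `java` as the process blamed for the ring timeout.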
I still think there's something illegal/out of spec being done with the way this mod is either handling the swapchain or something else, and I have a feeling that it has something to do with the reason the frames are rendered out of order.
I don't know enough vulkan to fix this though :(
afaik RADV has always had issues with the swapchain with this mod. Unfortunately, imho Swapchain/V-Sync in Vulkan sucks, as it is difficult to debug when issues like this manifest.
I also have a V-Sync stutter bug, however that's Nvidia specific and is likely unrelated. (Found a fix for it but won't bother opening a PR unless it's a common problem.)
If Switchable Graphics is changing the GPU, that is the most likely cause imho, as VulkanMod only loads the selected GPU once and cannot reload it (it cannot sort or switch between multiple GPUs and can only use one at a time).
By default, VulkanMod selects the first GPU it finds that is a discrete GPU and supports Vulkan. If Switchable Graphics unloads that GPU and it becomes inaccessible, the mod still tries to send the frame to it and fails, which could explain the crash.
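To make the selection rule described above concrete, here is a small model of it. This is only a sketch of the rule as stated in this comment, not VulkanMod's actual code; the `Gpu` record and both function names are invented, and the fallback to any Vulkan-capable device is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gpu:
    # Hypothetical record; a real implementation would query the Vulkan API.
    name: str
    discrete: bool
    supports_vulkan: bool


def pick_first_discrete(gpus: Sequence[Gpu]) -> Optional[Gpu]:
    """First device that is both discrete and Vulkan-capable, as described above."""
    for gpu in gpus:
        if gpu.discrete and gpu.supports_vulkan:
            return gpu
    return None


def pick_with_fallback(gpus: Sequence[Gpu]) -> Optional[Gpu]:
    """Assumed fallback: if no discrete device qualifies, take any Vulkan device."""
    chosen = pick_first_discrete(gpus)
    if chosen is not None:
        return chosen
    return next((gpu for gpu in gpus if gpu.supports_vulkan), None)
```

With the reporter's hardware (`RENOIR` integrated, `GeForce RTX 3050 Ti Mobile` discrete), this rule would pick the NVIDIA card whenever it is enumerated, and the selection is made only once.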
It should not be caused by just the iGPU driver: I have the same specs (minus eGPU, running Arch) and the mod works fine at around 60 fps, with the expected minor chunk see-through bugs (probably still some issue with the iGPU, but it works).
In that case it looks like I was wrong and Switchable Graphics works OK. Sorry about that; I just wasn't 100% sure that switching the GPU like that works correctly with this mod.
The logs don't give much information, as VK_ERROR_DEVICE_LOST is a very generic error that can mean a number of different things and can be a massive pain to troubleshoot.
What might work is using vkconfig from the Vulkan SDK (vkconfig.exe on Windows), which can inject validation layers when running VulkanMod and may give more information. (Sorry for being unhelpful.)
What might also help is setting the debug flags RADV_DEBUG=nocompute, nofastclears, or syncshaders, in case one of them somehow corrects the issue or changes the nature of the crash, which may give useful information. However, I can't test RADV myself as I don't have an AMD GPU.
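If someone wants to try the flags above methodically, a sketch like the following sets up one run per flag so that a change in behaviour can be attributed to a single option. `RADV_DEBUG` accepts a comma-separated list, so flags can also be combined. The function names are made up, and how the environment reaches the game depends on your launcher.

```python
import os

# Flags suggested in the comment above.
RADV_FLAGS = ["nocompute", "nofastclears", "syncshaders"]


def radv_debug_runs(base=None, flags=RADV_FLAGS):
    """Yield (flag, env) pairs, one RADV_DEBUG flag per run."""
    base = dict(os.environ if base is None else base)
    for flag in flags:
        env = dict(base)  # fresh copy so runs do not affect each other
        env["RADV_DEBUG"] = flag
        yield flag, env


def combined(flags):
    """Join several flags into one RADV_DEBUG value."""
    return ",".join(flags)
```

Each `env` can be passed to `subprocess.run(..., env=env)`; noting which flag, if any, changes the hang would already be useful information for this issue.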
RADV_DEBUG=hang (which is sync shaders with hang protection) still causes hangs, vulkanmod is simply written that well
Can verify that the same happens for me. I think it's a Linux issue in general, because I have an Intel i7 11700K and an AMD Radeon RX 6700 with Mesa drivers in a custom-built desktop, so a different manufacturer for both CPU and GPU.
EDIT: Forgot to mention that it only happens if I join a world or server.
Also, turning vsync on with unlimited max framerate fixes it for some reason?
yes, all of this is known, and it's a (probably) swapchain bug
vulkanmod (still) can't handle > 2 swapchain images (causing hangs and crashes), and presents them in the wrong order as well!
let me be very clear here, none of this is a driver issue, it's an issue with vulkanmod not respecting the vulkan spec
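The comment above does not say where in the code the problem is. One general way an application ends up touching the wrong swapchain image is to assume images come back in round-robin order, when Vulkan only guarantees that `vkAcquireNextImageKHR` returns *some* available image index. The toy model below illustrates that mismatch; it is not VulkanMod's code, and the acquisition sequences are made up.

```python
def frames_with_wrong_assumption(acquired_indices, image_count):
    """Frames where assuming `frame % image_count` picks the wrong image.

    `acquired_indices[frame]` stands in for the index the driver returned
    for that frame; the round-robin guess is the (incorrect) assumption.
    """
    return [
        frame
        for frame, acquired in enumerate(acquired_indices)
        if frame % image_count != acquired
    ]
```

With a strictly cyclic sequence the guess happens to hold, which would be consistent with a bug that only shows up on some drivers or with more than two images; any other order produces frames that render to, or present, an image other than the one acquired.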
@Attemial not sure if this matters that much, but are you running under Wayland too?
Just out of curiosity, does this issue also occur with any Nvidia drivers on Linux, or is it RADV only?
If it is in fact RADV specific, does this bug also occur with any other AMD drivers on Linux (e.g. AMDVLK or the official drivers)?
(Just to rule out with 100% certainty whether it is in fact driver related or not.)
I might open a new PR that seems to fix an odd Nvidia-specific issue with V-Sync in fullscreen that I have run into, just in case it somehow magically fixes this issue. However, that's unlikely, as I am using Win10, it isn't an issue that has been reported AFAIK, and it happens in fullscreen only, so I may just be using a buggy driver.
This is the PR in question, in case it actually does anything to fix this issue. If it doesn't, please ignore me (as it is likely a separate bug).
I can confirm this issue: my game first started rendering out of order before everything locked up. I am using AMD integrated graphics.
As #180