Graphical corruption and memory page faults on Vega 56/64 under Linux

Question

CodingTwist opened this issue a year ago · 27 comments

CodingTwist commented a year ago

Version information

mc1.19.4-0.4.10+build.24

Expected Behavior

Game renders

Actual Behavior

The game doesn't render correctly. It produces huge artifacts while driving the GPU to 100% usage.

Reproduction Steps

Launch the game
Join a world and wait a few seconds

Java version

Java 17.0.7 & Java 20.0.1

CPU

Intel i7-8700

GPU

AMD ATI Radeon RX Vega 56/64

Additional information

I am running Arch Linux on 6.3.5-arch1-1 with an AMD GPU.

I was asked to launch the mod with Fabric API, which had no effect. Vanilla Minecraft runs fine, and OptiFine works.

This is the log from launching the game and then force-killing it once it began lagging.
https://paste.ee/p/yqLZu

The only sort of error I am getting is in the kernel buffer.

[  191.917437] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[  191.920212] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32778, for process java pid 2986 thread java:cs0 pid 3064)
[  191.920233] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080011a86c000 from IH client 0x1b (UTCL2)
[  191.920246] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601030
[  191.920253] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  191.920259] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  191.920264] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  191.920270] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  191.920274] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  191.920279] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
[  201.943945] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
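
The per-field lines in the log above can be recovered from the raw VM_L2_PROTECTION_FAULT_STATUS value. The following is a minimal Python sketch of that decoding. The bit layout used here is an assumption (it is not stated anywhere in this report) and has only been checked against the values printed in this log; `decode_fault_status` is a hypothetical helper name.

```python
def decode_fault_status(status: int) -> dict:
    """Split a VM_L2_PROTECTION_FAULT_STATUS value into named fields.

    ASSUMPTION: the bit positions below are inferred for this GPU
    generation and only verified against the log in this report.
    """
    return {
        "MORE_FAULTS": status & 0x1,               # bit 0
        "WALKER_ERROR": (status >> 1) & 0x7,       # bits 1-3
        "PERMISSION_FAULTS": (status >> 4) & 0xF,  # bits 4-7
        "MAPPING_ERROR": (status >> 8) & 0x1,      # bit 8
        "CID": (status >> 9) & 0x1FF,              # bits 9-17, UTCL2 client ID
        "RW": (status >> 18) & 0x1,                # bit 18, 0 = read
        "VMID": (status >> 20) & 0xF,              # bits 20-23
    }


if __name__ == "__main__":
    # Value taken from the kernel log above.
    for name, value in decode_fault_status(0x00601030).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, 0x00601030 decodes to PERMISSION_FAULTS 0x3, client ID 0x8 and vmid 6, which agrees with the lines the kernel printed.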

GPU driver info:

OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX Vega (vega10, LLVM 15.0.7, DRM 3.52, 6.3.5-arch1-1)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.1.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.1.1
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.1.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

Please just ask if you need more info about my system

0-x-2-2 commented 7 months ago

very nice

goeiecool9999 · Answer 1 · 2024-04-01T14:36:05.000Z

I am on kernel 6.8.1 and mesa 24.0.4. The issue seems to be gone!

pajicadvance · Answer 2 · 2024-04-02T03:56:35.000Z

A Mesa issue with an identical crash and the same GPU architecture as this one was listed as fixed in the Mesa 24.0.4 release notes, so I assume that is what fixed it.

Motschen · Answer 3 · 2023-06-04T09:43:50.000Z

I'm also encountering the same issue, but instead of just crashing the game, it crashes the whole compositor for me, both on Hyprland using Wayland and on KDE using X11.
Seems to be caused by a recent mesa update, as this just started happening after a system update.

pr1nt-is-not-available · Answer 4 · 2023-06-05T03:11:36.000Z

setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

Regular-Baf · Answer 5 · 2023-06-16T16:56:58.000Z

> setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

Pretty sure I'm having the exact same issue on Vega 64 (Mesa 23.1.2 on Fedora 38 Plasma Wayland). Changing Async to Swap does resolve it, as does running Minecraft through Zink. I've had nothing but stability issues with Vega across OpenGL/OpenCL for years, so maybe this is a Mesa or amdgpu issue more than a Sodium issue.

ZtereoHYPE · Answer 6 · 2023-06-30T12:00:30.000Z

Encountered the same issue on a friend's system, and joining a world brought the entire system down to a screen-flickering state. AMD Vega 64, Mesa 23.1.3, plasma X11. Switching to swap also seems to fix it.

RedMaster13 · Answer 7 · 2023-06-30T16:28:40.000Z

I'm having the same issue here. AMD Vega 56, Arch Linux. Downgrading to mesa 23.0.3 fixed my issue.

jellysquid3 · Answer 8 · 2023-07-01T16:25:42.000Z

Hm. I haven't been able to reproduce any of these issues on my system (RX 6900 XT, Mesa 23.1.2, Linux 6.3.8), but it also seems that this problem exclusively affects the Vega 56/64 (which are a known problem child on Linux...)

The problem seems to be related to persistently mapped memory under OpenGL, hence the reason why switching the "Chunk Memory Allocator" strategy to "Swap" fixes the crashes. Both the corruption and hardware page faults would seem to agree with this.

I am going to see if we can bisect where the problem appeared in Mesa, and look into filing a bug. They've been helpful in the past with these things, so I think we have a good chance at fixing this.

To be clear, I don't think there is any bug with Sodium here, rather this is a regression in the Mesa graphics stack.

jellysquid3 · Answer 9 · 2023-07-01T16:29:20.000Z

For the time being, the solutions we've seen solve this problem are:

* Using the Zink driver (set the environment variable MESA_LOADER_DRIVER_OVERRIDE=zink for Minecraft; this might not perform well.)
* Changing the setting at Video Settings > Advanced > Chunk Memory Allocator to "SWAP" (will likely degrade performance severely.)
* Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)
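
For the Zink workaround, the environment variable only needs to be set for the game process, not system-wide. A minimal Python sketch of one way to do that follows; `zink_env` is a hypothetical helper, and the launcher command in the comment is a placeholder for whatever command starts Minecraft on your system.

```python
import os


def zink_env(base=None):
    """Return a copy of the environment with the Zink override set.

    The input mapping is not modified; os.environ is used when no
    base mapping is given.
    """
    env = dict(os.environ if base is None else base)
    env["MESA_LOADER_DRIVER_OVERRIDE"] = "zink"
    return env


# Usage (the launcher name is a placeholder, not a real command):
#
#   import subprocess
#   subprocess.run(["your-minecraft-launcher"], env=zink_env())
```

Setting the variable per process keeps the rest of the desktop on the regular OpenGL driver.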

jellysquid3 · Answer 10 · 2023-07-20T00:40:25.000Z

We do not have any way to debug or fix this. The problem seems exclusively limited to the Vega 56/64 (and professional cards of that series) and we do not have any such graphics cards on hand. That said, I'm almost certain this problem has nothing to do with Sodium, as there's no good explanation for what could be going wrong on our side.

The only option here would be to make a bug report to Mesa about this problem. I suspect it would help them a lot if you could provide an API trace.

wingedseahorse · Answer 11 · 2023-07-20T02:41:46.000Z

> Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

This is working for me as well.

jellysquid3 · Answer 12 · 2023-08-14T04:56:30.000Z

This might be accidentally fixed with Sodium 0.5.1 since we now use a 16-byte alignment on vertex data.
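
For readers unfamiliar with the term, a 16-byte alignment means every block of vertex data starts at an offset that is a multiple of 16. A minimal sketch of the rounding involved, where `align_up` is a hypothetical helper for illustration and not Sodium code:

```python
def align_up(offset: int, alignment: int = 16) -> int:
    """Round offset up to the next multiple of alignment.

    alignment must be a power of two for the bit trick to work.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    for offset in (0, 20, 32, 33):
        print(offset, "->", align_up(offset))
```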

electron271 · Answer 13 · 2023-08-10T14:31:10.000Z

> * Downgrading to Mesa 23.0.3 (unverified, but one other user said it worked.)

Working as well

Bettehem · Answer 14 · 2023-08-11T02:41:26.000Z

I'm using the Zink workaround as downgrading Mesa isn't a viable option for me. Works nicely without shaders but when using shaders, Zink's performance isn't very good

Regular-Baf · Answer 15 · 2023-08-27T06:49:59.000Z

I've just tested Sodium 0.5.2 and unfortunately the system freeze still occurs.

goeiecool9999 · Answer 16 · 2023-10-02T20:29:56.000Z

Bisected to this commit. Unfortunately it's not cleanly reversible on later versions.

goeiecool9999 · Answer 17 · 2023-10-02T21:40:42.000Z

I have opened an issue on the mesa repo.

BIGFAAT · Answer 18 · 2023-10-05T16:36:56.000Z

> setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)

This option is no longer available in newer versions, forcing Vega users to start the game with MESA_LOADER_DRIVER_OVERRIDE=zink.
Please roll this back.

KnownDimension · Answer 19 · 2023-11-23T19:19:56.000Z

> > setting Chunk Memory Allocator to Swap (the default being Async) fixed this on my system (AMD Vega 56, Mesa 23.1.1, Wayland)
>
> This option is no longer available in newer versions, forcing Vega users to start the game with MESA_LOADER_DRIVER_OVERRIDE=zink. Please roll this back.

I tried that a couple of weeks ago. The current version of Zink is broken globally on Vega 56 under Linux right now, so that workaround is out the window.

(NixOS, for reference)

an0nfunc · Answer 20 · 2023-11-23T20:27:44.000Z

Works fine for me on Arch with zink.

jellysquid3 · Answer 21 · 2023-11-23T21:00:10.000Z

Sorry. We are not going to re-implement the option people were using to work around this problem. If it is useful, a technical explanation is provided below for why the option ever existed, and why it was removed.

Technical explanation...

The problem

Normally, Sodium uses asynchronous transfers (buffer copies which are put into the GPU's command stream) and a staging buffer (mapped persistently within host memory) to upload geometry data to the GPU. We heavily rely on this functionality for good performance, and most other games will do something similar.

While OpenGL does have alternative ways to upload data to the GPU (i.e. glBufferSubData), they have very poor performance when updating only certain parts of a buffer, and they require additional memory copies. This is a problem, because we use very large shared buffers for our geometry, and implement a custom memory allocator on top of them.

(As an aside, it's worth mentioning that DirectX 12 and Vulkan only provide you with this option for uploading data to the GPU -- the driver does not hold your hand.)

More importantly: Our memory management strategy in Sodium directly relates to how we can optimize rendering. Using fewer buffer objects means we can switch between resource sets much less frequently, which in turn allows us to pack hundreds of draw commands into a single draw call.
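
As a rough illustration of the idea described above (sub-allocating many chunk meshes out of one large shared buffer), here is a small Python sketch. It is not Sodium's actual allocator: the class name, the first-fit policy and the sizes are assumptions chosen purely for illustration, and no real GPU buffer is involved.

```python
class SharedBufferArena:
    """First-fit sub-allocator over one large buffer (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        # Sorted list of free (offset, length) ranges.
        self.free = [(0, capacity)]

    def alloc(self, size: int):
        """Return the offset of a free range of `size` bytes, or None."""
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        return None

    def release(self, offset: int, size: int):
        """Return a range to the free list and merge adjacent ranges."""
        self.free.append((offset, size))
        self.free.sort()
        merged = []
        for off, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((off, length))
        self.free = merged


if __name__ == "__main__":
    arena = SharedBufferArena(1024)
    a = arena.alloc(256)   # first chunk mesh
    b = arena.alloc(512)   # second chunk mesh
    arena.release(a, 256)  # first chunk is rebuilt, its old range is freed
    print(a, b, arena.free)
```

Because every mesh lives at an offset inside the same buffer, a renderer built this way does not need to rebind buffers between chunks, which is the property the paragraph above relies on for batching draw commands.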

Why the option ever existed in the first place

To work around the broken support for asynchronous transfers on Apple's M1 hardware, we implemented an alternative approach which we called "swapping" (for disambiguation's sake.)

Essentially, that approach involved keeping a copy of all chunk geometry in the CPU's memory, and each time a chunk was updated, we would allocate a new geometry buffer, and re-upload all the chunks into it. Hence the name "swap" -- it was swapping the geometry buffer each time.

Obviously, this is a very slow thing to do, and it meant updating chunks (such as when placing or breaking blocks) would cause significant lag, since it needs to constantly re-allocate and transfer huge amounts of memory. Another consequence was that we needed two copies of the geometry data, which doubled the memory requirements of the game.
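
To make the cost difference concrete, here is a toy Python model of how many bytes have to be transferred when a single chunk changes. It is only a sketch of the behaviour described above, not Sodium code; the function names and chunk sizes are made up for illustration.

```python
def swap_upload_bytes(chunk_sizes: dict, updated_chunk: str) -> int:
    """'Swap' strategy: a new buffer is allocated and every chunk is
    re-uploaded into it, regardless of which chunk changed."""
    if updated_chunk not in chunk_sizes:
        raise KeyError(updated_chunk)
    return sum(chunk_sizes.values())


def in_place_upload_bytes(chunk_sizes: dict, updated_chunk: str) -> int:
    """In-place strategy: only the changed chunk's data is copied
    through the staging buffer."""
    return chunk_sizes[updated_chunk]


if __name__ == "__main__":
    # Hypothetical mesh sizes in bytes for three chunks in one region.
    sizes = {"chunk_a": 4096, "chunk_b": 8192, "chunk_c": 2048}
    print("swap:", swap_upload_bytes(sizes, "chunk_c"))
    print("in place:", in_place_upload_bytes(sizes, "chunk_c"))
```

In this model the swap cost grows with the total amount of geometry in the buffer, while the in-place cost depends only on the chunk that changed.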

Why the option was removed

When our hardware support policy changed (to require OpenGL 4.5 support), none of Apple's computers met this requirement any longer, so we dropped support for this workaround. We then took advantage of that to refactor the code for better performance and to fix a number of long-standing issues.

Because of this, I don't think there's any chance we could restore the workaround without undoing a lot of technical changes, and introducing a lot of technical debt back into the project. And I really don't want to implement more workarounds for critical functionality (asynchronous transfers) being plainly broken.

Anyways. There's really not much more point to keeping this issue open, because the only remaining actionable part here would be to implement more workarounds, which we are not willing to do (see above reasoning.)

The Mesa developers are already aware of this issue and the cause of the regression has been bisected. There is not much else that can be done to help them (at least to my knowledge) other than to provide them with an apitrace file. They have a lot of things to do, and I am not going to push for users to nag them.

electron271 · Answer 22 · 2023-12-13T23:46:19.000Z

Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

BIGFAAT · Answer 23 · 2023-12-14T08:14:51.000Z

Sadly not, but it looks like someone has been assigned to the bug on the Mesa issue mentioned above, so keep an eye on it there.

wingedseahorse · Answer 24 · 2023-12-14T14:47:54.000Z

> Sorry to bother but is there any workaround that does not involve zink or downgrading? Zink heavily impacts shader performance, and downgrading breaks a lot of stuff.

At this point I'm having to accept that the best solution is just to switch back to Forge until Mesa resolves this, since downgrading no longer works for me.

electron271 · Answer 25 · 2023-12-14T18:04:43.000Z

> Sadly not, but it looks like someone has been assigned to the bug on the Mesa issue mentioned above, so keep an eye on it there.

Hopefully it gets fixed soon

Jaggwagg · Answer 26 · 2024-01-12T20:46:15.000Z

For anyone experiencing issues with loading Zink drivers, this article helped me fix it https://www.supergoodcode.com/preemptive/.
