Image object allocation memory leak/inefficient allocation/render queue backup bug
kev626 opened this issue ยท 14 comments
There's been much talk of a "memory leak" type of bug in squaremap. This issue intends to prove the existence of such a bug, explain the exact cause of the bug, and suggest a few ways to address the bug. I'm going to make this as thorough as possible for the avoidance of any doubt that this is the true issue.
The proof
On a buddy's server running squaremap, some information redacted for privacy reasons:
Before unloading squaremap:
Heap dump (here, Image objects account for 5GB of memory, the details of which will be explained below, this is on a totally different server from the above, but caused by the exact same thing):
The heap dump and pterodactyl screenshots were both observed on machines with high-end NVMe SSDs -- for the avoidance of any doubt that the problem is related to I/O -- it is not I/O related.
What's actually going on?
The allocation of the object we are interested in happens on Line 61 of BackgroundRenderer. This will be the focus of much of the bug report.
See also the allocation triggered here, which calls this, which then allocates the large 2D int array, before sticking the whole thing back in the queue, where it may wait for several hours.
So what's actually going on here? Prior to the job being submitted to the queue, an Image object is allocated. This object is cheap, but only temporarily. After the first chunk is rendered, it allocates a massive 2D int array to store the raw image data. Immediately following this, it sticks the whole thing BACK AT THE END OF THE QUEUE, where it waits in line all over again for the next chunk to be ready.
This becomes problematic as the background render queue grows in size. In testing, this is not an issue as the render queue can easily keep up with any actions a small number of players can make. However, on a large server, there are many many more actions being taken on chunks in the world, which means many more background render submissions are required. If these submissions are made faster than the tiles can be rendered, then the queue grows in size. This is expected behavior. The problem here is that the 2D int array is allocated in its entirety PRIOR TO all of the data actually being ready, and it's left there re-entering the end of the queue. This process repeats for hours on large servers with lots of activity.
But this wasn't an issue on Pl3xMap!
You're right, that's because I patched it in Pl3xMap just over a year ago. The way this patch works is by blocking the thread calling the render function until the render has actually finished. By blocking the thread calling the render function, no new allocations will be made as the render function won't be entered again until the previous invocation has fully completed. Additionally, back then, this render function rendered the entire tile, all in one go, rather than re-submitting it to the queue to render it one piece at a time. Admittedly this wasn't the best patch, but given that the software was not being well-maintained and I wanted a patch that was as simple to implement and as low-touch as possible, this was the method of patching I decided upon.
On a side note: this is actually a little funny because my original patch did not wait for I/O to complete. If the problem were that the I/O queue were backed up instead of the render queue, then this patch wouldn't have worked.
Squaremap has code that looks very similar to the code in the patch. So why doesn't the patch work anymore? Well, at some point in the development of squaremap, the threading model behind the BackgroundRenderer changed. Blocking the render function is no longer effective in preventing reentry into the function, which means that the image object allocation can proceed.
So how can this be fixed?
There are several ways that this issue can be resolved in Squaremap. Any of the below should do the trick:
- Revert the threading changes involved in calling the BackgroundRenderer. This will allow the original patch, which still exists in Squaremap, to function correctly.
- Re-prioritize the queueing somehow so that tiles mid-render don't have to get through the entire queue again
There are probably several more ways to fix this, but those are just a few that I came up with.
Are you sure this not io related? ik the fast NVME ssd but when I see the analysis shot it looks maybe
in the analyze image there are 5403 Image
instances, and also 5403 MapWorldInternal$$Lambda$...
(eyes not seeing the rest) instances. the only lambda with image in the class is not for the renders : https://github.com/jpenilla/squaremap/blob/master/common/src/main/java/xyz/jpenilla/squaremap/common/data/MapWorldInternal.java#L176
it is only once the render completes this method is called. this is my doubt.
I use squaremaps on my srv because ple3map is dead, so this is major concern for me. i only having 55 players at once so not seen it yet but I do not want this for my server. when I look Im not sure maybe
Are you sure this not io related? ik the fast NVME ssd but when I see the analysis shot it looks maybe
in the analyze image there are 5403
Image
instances, and also 5403MapWorldInternal$$Lambda$...
(eyes not seeing the rest) instances. the only lambda with image in the class is not for the renders : https://github.com/jpenilla/squaremap/blob/master/common/src/main/java/xyz/jpenilla/squaremap/common/data/MapWorldInternal.java#L176it is only once the render completes this method is called. this is my doubt.
I use squaremaps on my srv because pl3xmap is dead, so this is major concern for me. i only having 55 players at once so not seen it yet but I do not want this for my server. when I look Im not sure maybe
You are mostly right, but not quite. This is definitely not I/O related, at all:
it is only once the render completes this method is called.
That is incorrect. This happens when the lambda is allocated, not when it is invoked. Due to the function likely being inlined, the lambda is allocated at the same time as the lamdba for the future, which means that the lambda is allocated prior to the save function being called. This is why this leak appears I/O related, when it in fact has nothing to do with I/O at all.
If it was I/O related, the exact same leak, which is what is observed, would not have been resolved by my original patch in Pl3xMap, as noted in the original issue. It cannot possibly be I/O related, especially when a prior working patch would only work if it weren't I/O related, and a perfectly logical explanation for this leak not to be I/O related exists.
I apologise for my previous message and it being marked as spam, I have written a little more to circumvent this mistake:
I fully agree with the statements made by @kev626 and have seen this issue first hand on multiple occasions. I believe it should be considered a high priority to he fixed and should not just be dismissed
(plus)1 from me! :D xoxo
Would be happy to PR a fix for you pending you're not already working on it (or don't already have plans to do it yourself)
Edit this comment, react, or just straight up reply if you'd prefer that someone else did it because you clearly don't want people using alternatives whilst they wait.
I just want everyone to chill out and take a step back and stop the speculation until the original author has time to make a proper reply.
It is already an incredibly bad taste to advertise and hotlink another resource just because jmp does not have time to make a proper reply on Memorial Day's weekend (mind you, a historically period of the time where American takes their first holidays and marks the official start of the summer). It is within his full right to not wanting the unnecessary replies on the threads especially clearly a majority of replies have not taken their time to review the code themselves. We already have enough of drama in Minecraft community. Please stop being an enabler to this toxic behavior and entitlement.
Thank you very much.
It is already an incredibly bad taste to advertise and hotlink another resource just because jmp does not have time to make a proper reply on Memorial Day's weekend (mind you, a historically period of the time where American takes their first holidays and marks the official start of the summer).
Correction -- It is an incredibly bad taste to moderate an issue, removing a link to an open source fork of a project that resolves the bug, while choosing not to address or acknowledge the bug's existence. That directly contradicts the spirit of open source.
It is within his full right to not wanting the unnecessary replies on the threads especially clearly a majority of replies have not taken their time to review the code themselves.
Fair.
We already have enough of drama in Minecraft community.
Frankly, agreed.
Please stop being an enabler to this toxic behavior and entitlement.
May I ask how defending the action of removing a link to a fork which literally solves the problem isn't directly hypocritical to this statement? I don't want any statements to be misconstrued here. I have no issues whatsoever with whatever timeframe it takes to resolve this problem. Nor do I have a problem with the lack of acknowledgement thus far. I am owed nothing. But what I do have an issue with is the suppression of resources that can help the dozens of server owners who are having this problem, in the place of actually acknowledging the issue. Had my reply not been moderated to remove the link, I would not have cared at all and this whole blow-up would never have happened.
Speaking as someone who's involved in many major projects, and also as someone who doesn't have much free time very often, actually responding to issues/PRs is a lot more time consuming compared to something like a quick pass-by.
As others have said, jmp has been far less active lately compared to normal, so it's obvious that things are more busy currently - it's taken me 2-3 pings to get a simple "fyi" message seen lately.
That's not the part of this I have an issue with though. I can't speak for anybody else, but you won't find one message from me complaining about the timeframe that this is being handled by. What I do have a problem with is the unexplained silent censorship of solutions to people's problems. (Now, as you know, and probably a few others as well, that explanation was recently given a few minutes ago and I can agree with it, however it was not given publicly. I should also note that, at least the immediately-apparent reason had nothing to do with the code whatsoever.)
To reiterate; I am not expecting a fast resolution. This is not a simple issue and although I did try to be as clear and concise as possible in my bug report, I understand that this will take a while to understand and figure out a good solution for. It's not a trivial bug at all and the diagnostics you see in heap dumps are extremely misleading and confusing. Not to mention this issue only pops up under pretty heavy load. When I originally solved the issue in Pl3xMap it took me several days to figure out what was happening, and the issue squaremap is seeing, while similar, is much more complicated due to that section of the code being more complicated. This is going to take some time to figure out how to fix, even if reverting the all of the threading/scheduling changes ends up being the right answer.
The plugin is riddled with memory leaks. Why do you remove comments saying so?
Thanks for the report, this has been fixed in the latest snapshot as of the merge of #62.
Although this is a "memory leak" type issue, most if not all of the analysis and statements presented in the original issue and subsequent comments sadly missed the mark (with the exception of the comment from @Oharass, which was pretty close to the actual problem) about the root cause.
Both from the image of a heap dump in your issue, and from a heap dump I've examined personally, it's clear the objects are being retained due to waiting in the image IO executor's internal queue. Save tasks aren't submitted until rendering is finished for a region. Why exactly the image IO executor gets backed up and stops executing tasks in a timely manner, I don't know, if you have a fast disk it's hopefully not related to that, so it could be thread starvation, or many other possibilities. Regardless, the aforementioned changes will cause renders to throttle when the save queue gets too full, which effectively solves this issue.