ConcurrentModificationException at EnergyGridCache.java:316
fluffle opened this issue · 33 comments
Hi! This is a dupe of #3800, which is locked, but please don't close this without reading it! I managed to trigger this exception repeatedly in a very short space of time, so I think I can provide some useful information on a way to reliably reproduce the crash and thus narrow down the root cause.
Describe the bug
Background
I was scaling out my AE system by adding a large number of quantum bridges. I had what I thought was a neat solution to the "both sides of the bridge must be powered" problem: I put an AE energy cell on the remote side of each bridge link. This was charged while the link was active, and contained enough of a buffer to kick the bridge back into life when the world was reloaded. It would then recharge itself.
Problem
Quantum bridges are power hungry. The crashes started when the number of bridges on my network exceeded the power I was supplying (5400 RF/t from a nuclearcraft reactor, FYI), causing parts of the network to switch on and off rapidly. If we look at what the code is doing:
Line 316 is in the middle of extractProviderPower. It's iterating over the LinkedHashSet of energy providers in the grid cache, and removing these providers from the grid cache if they don't have enough power. The set of providers is modified in storagePowerChangeHandler, removeNode and addNode:
https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L151
https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L509
https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L557
And iterated over (read-only) here in refreshPower:
There are a few things that could be happening here, but given how I triggered this I expect I'm running into race conditions between extractProviderPower and addNode/removeNode. When a quantum bridge powers down, I expect this causes the energy cell on the far side of the bridge to be removed from the cache via removeNode. If the cache was providing power to the network at the time via extractProviderPower, this will result in the ConcurrentModificationException you see here.
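For anyone unfamiliar with the exception itself, the underlying mechanism is easy to show in isolation: if a LinkedHashSet is modified through anything other than the live iterator while that iterator is still being used, the iterator fails fast on its next step. A minimal, standalone sketch (the names here are illustrative, not AE2 code):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class CmeDemo {
    public static void main(String[] args) {
        Set<String> providers = new LinkedHashSet<>();
        providers.add("cellA");
        providers.add("cellB");
        providers.add("cellC");

        Iterator<String> it = providers.iterator();
        while (it.hasNext()) {
            String provider = it.next(); // fails fast once the set has been modified elsewhere
            if (provider.equals("cellB")) {
                // Removing via the set rather than via it.remove() invalidates the iterator,
                // so the next call to it.next() throws ConcurrentModificationException.
                providers.remove("cellB");
            }
        }
    }
}
```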
To Reproduce
Build a large network that has 7-8 quantum bridges with energy cells keeping their remote side powered. You don't need anything more than a bridge and a cell on the remote side. Limit the power to the network such that it only has 90-95% of the power it needs. This ought to result in crashes pretty quickly as individual bridges power on and off.
Expected Behaviour
Not crashing?
It ought to be pretty simple to fix the race: you need a mutex protecting your hash sets when you're modifying them, so wrap addNode/removeNode/extractProviderPower/storagePowerChangeHandler in synchronized blocks. Or there appear to be ConcurrentLinkedHashSet implementations around in Guava etc. It's your code :-)
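To make that suggestion concrete, the locking variant would look roughly like the sketch below. This is a simplified stand-in with placeholder field and method names, not the real EnergyGridCache:

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified stand-in for the grid cache; not the actual AE2 code.
class LockedGridCacheSketch {
    private final Set<Object> providers = new LinkedHashSet<>();

    void addProvider(Object provider) {
        synchronized (providers) {
            providers.add(provider);
        }
    }

    void removeProvider(Object provider) {
        synchronized (providers) {
            providers.remove(provider);
        }
    }

    double extractProviderPower(double amount) {
        // Hold the same lock while iterating so no other thread can
        // add or remove providers mid-iteration.
        synchronized (providers) {
            double extracted = 0;
            Iterator<Object> it = providers.iterator();
            while (it.hasNext() && extracted < amount) {
                Object provider = it.next();
                // ... extract from the provider, and it.remove() it if it is empty ...
            }
            return extracted;
        }
    }
}
```

Note that, as it turns out later in this thread, a lock like this only helps if the competing modification really comes from another thread.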
Crash logs
There are three consecutive crash reports in this .zip:
Environment
I'm playing GregBlock 1.4.6 on Minecraft 1.12.2, Forge 14.23.5.2768, with ae2-rv6-stable-4.jar (1.12.2-3.3.16). The full mod list is in the attached crash reports.
Same problem here with Enigmatica2Expert 1.58 (ae2-rv6-stable6)
1 Quantum Link, which is in the same Chunk as the AE2 Controller. Adding dense cells didn't solve the issue, even though the remote side of the Quantum Link is powered by an energy acceptor.
As reports about self-chunkloading AE2 systems increase, this bug, which was never fixed, might come up again.
Duplicate of #3800
add/removeNode is never called concurrently. Maybe the event could cause it, but I'm currently not sure about a potential solution, as it can easily degrade the performance.
For anyone following this bug: please ignore the below! I was wrong: this is not about multiple threads, as @yueh rightly points out.
I don't think adding synchronization around a data structure that is clearly being modified from multiple threads concurrently is going to make matters worse.
If you're really worried about performance, (a) you should benchmark to prove or disprove your hypothesis rather than falling back on conjecture, and (b) using a concurrent hash set instead of broad locks would alleviate at least some of these concerns. But IMO the concerns are overblown anyway. If addNode and removeNode really aren't called concurrently, they can never contend for the lock, so it won't hurt performance there: taking a free lock is a very cheap operation. And if there is lock contention, then that's fine, because the lock would be preventing concurrent modification of your non-thread-safe data structure, and a slight performance hit is preferable to a crash-to-desktop.
If I created a PR, what kind of evidence that it didn't have a performance impact would you be looking for?
I'm just guessing, as it's hard to wrap your head around the complexity of the AE2 code, but is it possible that stuff like this could cause it when the grid of the quantum bridge gets joined to the main AE grid?
Anyway, closing all bugs related to this won't result in it being solved at any point in time, no matter what the reason for this is.
You could use a ConcurrentSkipListSet/Map to avoid getting the CME, but you'd have to make sure that changes are propagated properly.
From what I understand of fluffle's report, his assumption is most probably based on an async chunk loading or unloading event ...
The challenge I see in using concurrency-safe lists/sets/maps is making sure that, in case of a modification, the still-running iterator (which, at least for the classes I proposed, only reflects the collection as it was when the iterator was created and does not necessarily include any concurrent changes) gets aborted and repeats its work. That means the work already done has to be rolled back, or the iterator's work has to be interruptible, or it ultimately has to be redone once the first cycle of iteration finishes.
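For what it's worth, the iterators of those classes are documented as "weakly consistent" rather than snapshot-based: they never throw a CME, but modifications made after the iterator was created may or may not be reflected. A small standalone illustration:

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

public class WeaklyConsistentDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> providers = new ConcurrentSkipListSet<>();
        for (int i = 0; i < 5; i++) {
            providers.add(i);
        }

        Iterator<Integer> it = providers.iterator();
        while (it.hasNext()) {
            int p = it.next();
            if (p == 1) {
                providers.remove(3); // no CME; whether 3 still shows up depends on timing
                providers.add(10);   // may or may not be visited by this iterator
            }
            System.out.println(p);
        }
    }
}
```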
@yueh: synchronized is explicitly for preventing concurrent access to data structures that are not safe for use by multiple threads:
https://docs.oracle.com/javase/tutorial/essential/concurrency/locksync.html
@mindforger: I'm trying not to assume anything about where the concurrent accesses are happening from. Really, it doesn't matter where they're coming from. We have evidence they are occurring, so the right thing to do is to (a) put a mutex around the data structure to prevent concurrent access, or (b) use a different data structure that is safe for use by concurrent threads.
Performance, basically: every action like inserting/extracting items will extract a bit of energy, and there can be hundreds of them each tick. So not having to check empty ones each time adds up over time, especially with large builds. The same issues apply to concurrent collections. Even if it's fine with a single network, some cases do not scale well, and I've already refactored/optimized them quite a bit with 1.12, so it no longer is such a hassle with some builds or runs into other issues like stack overflows. So I'm pretty reluctant about making it worse again.
but given how I triggered this I expect I'm running into race conditions between extractProviderPower and addNode/removeNode. When a quantum bridge powers down, I expect this causes the energy cell on the far side of the bridge to be removed from the cache via removeNode. If the cache was providing power to the network at the time via extractProviderPower, this will result in the ConcurrentModificationException you see here.
Well, that is technically an assumption, and based on that I added that I suspect a mechanism (from whatever source; to name one possibility, "Sponge") for async chunk loading/unloading causing this update while an iterator is running over the list.
And for (b) I proposed using existing data structures like ConcurrentSkipListSet/Map; the performance impact should not be that bad, and refactoring should make this a straightforward task for simple testing, but it bears the risk of issues with the weak link between iterators and the actual collections (what I described in my earlier post).
I just wanted to suggest what I believe to be an easy-to-test solution to produce some results.
edit2: synchronized causes locks and can heavily impact performance or cause deadlocking; my solution would only prevent the actual CME, and additional measures would ultimately have to be taken to keep everything "synchronized" from a data point of view.
But why did it happen to me at all with an AE2 system where everything is in the same chunk?
So, from what I understood from your post, one of two QNBs in total is standing in the same chunk as the controller ... or do you mean one single unlinked QNB in the same chunk as the controller?
Is EVERYTHING in the same chunk? I mean the cables etc. as well.
When unloading is handled in a separate thread, this can still happen while an iterator is used in the server tick, for example. (Edit: after a quick check I have no evidence yet of where Sponge is still doing async chunk loading/unloading, and I did not find anything in the configs so far, but it would be funny to disable it for testing purposes.)
Yes, both ends of the QNB are in the same chunk. Nothing of the AE2 system is connected to anything outside of that chunk.
But I'm pretty sure the server I'm playing on is using Sponge, and I'm not sure if that uses something like vertical async loading.
Please stop guessing about stuff, if you do not understand it. It simply does not help anyone and just wastes time to debunk the mess it causes.
Maybe the event could cause it
So that isn't guessing, too?
I'm just trying to contribute to solving the problem.
Edit: Please ignore this too. It's not incorrect, but it's also based on faulty assumptions.
@mindforger: That's true, I'd forgotten I wrote that earlier. My apologies.
I agree that a concurrent data structure is probably the way to go here, but I don't know where all this fear of synchronized being slow comes from. Java's own java.util.concurrent.ConcurrentHashMap uses synchronized blocks internally, for example. Guava's ConcurrentHashMultiset uses atomics instead. At some point, if you have concurrent access to a data structure by multiple threads, you either need to protect it with a mutex lock or use compare-and-swap based atomic operations (which are what mutexes are usually implemented with anyway).
Locking isn't inherently slow, you can Google for "java synchronized performance benchmarks" or similar to find plenty of evidence for this. I accept that there are some overheads, even when there's no contention, but they're of a different order of magnitude (i.e. measured in clock cycles rather than milliseconds). Deadlocks are logic bugs, they're not a reason to avoid locking completely. Java's synchronized blocks are re-entrant anyway which makes accidentally deadlocking somewhat harder.
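If a concurrent collection is preferred over explicit locking, the JDK already ships a concurrent set view that could serve as a candidate for this kind of field; the sketch below uses placeholder names rather than the real AE2 ones:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ConcurrentProviderSetSketch {
    // A thread-safe Set view backed by ConcurrentHashMap (Java 8+).
    // Its iterators are weakly consistent, so concurrent add/remove
    // never throws ConcurrentModificationException.
    private final Set<Object> providers = ConcurrentHashMap.newKeySet();

    void addProvider(Object p)    { providers.add(p); }
    void removeProvider(Object p) { providers.remove(p); }

    void drainProviders() {
        for (Object p : providers) {
            // It is safe to call providers.remove(p) in here without a CME.
            // The trade-off: unlike LinkedHashSet, insertion order is not preserved.
        }
    }
}
```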
@yueh: As I said earlier, I'm trying to avoid guessing; at least for now, please disregard the conjecture in my initial bug report. You have plenty of evidence of concurrent access to a data structure that does not support it, and there are things you can and should do to prevent that.
@ben-mkiv Your idea isn't that far off; just the cause is a different one. It's actually pretty obvious what causes it. There are just a couple of different approaches, and neither is ideal.
But otherwise it's just plain wrong. E.g. not understanding concurrency and parallelism and proposing a deadlock to fix it.
@yueh Sorry to bring it up again, and maybe you did not mean what I proposed, but a skip list is lock-free and fast to implement for testing. The only issue remaining is keeping track of concurrent access by monitoring or queueing updates, like the old markDirty.
Why do providers have to be removed from the grid if they run out of energy? At some point they're going to get added back again anyway?! (To avoid powered-down networks doing stuff, if that's not handled at some other place, the method could check globalAvailablePower before iterating over every provider, which it could do anyway before even starting to iterate.)
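A tiny sketch of that early-exit idea; the field and method names below are hypothetical simplifications, not the actual EnergyGridCache members:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of checking the aggregate power before iterating.
class EarlyExitSketch {
    private double globalAvailablePower;
    private final Set<Object> providers = new LinkedHashSet<>();

    double extractProviderPower(double amount) {
        if (globalAvailablePower <= 0) {
            return 0; // nothing stored anywhere, so skip the iteration entirely
        }
        double extracted = 0;
        // ... iterate this.providers and drain them as before ...
        return extracted;
    }
}
```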
Any solution in mind?
Maybe a global hash set in the grid cache which stores which providers are currently active and trying to extract, and which they remove themselves from when done extracting? That should prevent any recursive loops and deadlocks.
edit: actually, a simple boolean mutex might be sufficient
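A minimal sketch of the boolean-guard idea, assuming (hypothetically) that everything runs on the server thread so a plain field rather than a real lock is enough; the names are illustrative:

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative re-entrancy guard around the extraction loop; not AE2 code.
class GuardedExtractionSketch {
    private final Set<Object> providers = new LinkedHashSet<>();
    private boolean extracting = false;

    double extractProviderPower(double amount) {
        if (extracting) {
            return 0; // already inside the loop further up the stack; bail out
        }
        extracting = true;
        try {
            double extracted = 0;
            Iterator<Object> it = providers.iterator();
            while (it.hasNext() && extracted < amount) {
                Object provider = it.next();
                // ... drain the provider; anything that re-enters this method
                // from here now returns immediately instead of iterating again ...
            }
            return extracted;
        } finally {
            extracting = false;
        }
    }
}
```

A guard like this only stops re-entry into the extraction itself; removals triggered through other paths (addNode/removeNode) would still need separate handling.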
I decided to experiment a bit with this, and I concede that I clearly don't understand things fully :-)
I figured there were two dumb, easy options that ought to DTRT and fix the bug:
- The docs for LinkedHashSet say that if you expect concurrent access to the LinkedHashSet, you should wrap its construction in a call to Collections.synchronizedSet(). So I did that for both this.providers and this.requesters, and managed to trigger the same ConcurrentModificationException.
- Even dumber, I wrapped all the accesses (both reads and writes) to this.providers in synchronized(this.providers) blocks, and the same for this.requesters. This should prevent concurrent access to those sets, and yet I can still trigger the same exception.
FWIW the performance impact of either of these failed approaches appears to be negligible. I don't have better benchmarks than anecdata: repeatedly running /forge tps before and after, and observing broadly similar numbers.
Also FWIW these attempts did not result in deadlocks. But if they don't actually fix the problem that's not so valuable either ;-)
Edit: Drinking means you miss obvious things, like providers being a LinkedHashSet of IAEPowerStorage, not IEnergyGridProvider, and the visited set tracking and preventing this kind of recursion. Ignore this post, if you expanded it.
I continued staring at this and trying things while drinking :-)
Eventually I noticed that IEnergyGrid extends IEnergyGridProvider, which means that an EnergyGridCache can have itself as an EnergyGridProvider, either directly or indirectly. This, coupled with Java's re-entrant locks, would explain the ineffectiveness of synchronization: this isn't actually a concurrency problem at all, it's a recursion problem.
If an EnergyGridCache ends up with itself as a child provider (directly or indirectly), we end up with co-recursion between extractAEPower and extractProviderPower. If one of the inner levels of recursion removes something from the set of providers, the outer level of recursion will end up triggering the ConcurrentModificationException, because by the time its call to extractAEPower returns, the child has called it.remove() on its own iterator.
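This is easy to reproduce outside AE2: synchronized is re-entrant, so a nested call from the same thread sails straight through the lock and can still modify the set underneath the outer iterator. A toy model (deliberately not AE2 code) of a "grid" that contains itself as a provider:

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class RecursiveCmeDemo {
    private final Set<Object> providers = new LinkedHashSet<>();

    double extractProviderPower(double amount, int depth) {
        synchronized (providers) { // re-entrant: the nested call acquires it again just fine
            double extracted = 0;
            Iterator<Object> it = providers.iterator();
            while (it.hasNext()) {
                Object p = it.next(); // the outer call blows up here with a CME
                if (p == this && depth == 0) {
                    // The "child" provider is really ourselves: recurse on the same thread.
                    extracted += extractProviderPower(amount, depth + 1);
                } else if (depth > 0) {
                    it.remove(); // the inner level removes via *its own* iterator,
                                 // which invalidates the outer level's iterator
                }
            }
            return extracted;
        }
    }

    public static void main(String[] args) {
        RecursiveCmeDemo grid = new RecursiveCmeDemo();
        grid.providers.add(grid);     // the grid ends up as its own provider
        grid.providers.add("cellA");
        grid.extractProviderPower(100, 0); // throws ConcurrentModificationException
    }
}
```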
So, yeah. Sorry for the misdirection here, it's not really a concurrency problem at all, despite the exception.
Consider the above fix suggested with little confidence. It fixes my particular test case, though!
Just adding my crash reports for reference here:
https://pastebin.com/SJjRrnRH
https://pastebin.com/B5EzuHg6
I'd like to apologise for some of my conduct upthread. Hopefully I've not irritated people too much with my attitude. I made some assumptions that were most definitely not true, and ran my mouth off before gathering enough evidence.
On the subject of evidence, I did Terrible Things to get a backtrace which shows one path by which an event can cause this ConcurrentModificationException.
- EnergyGridCache begins its update, and tries to extract power from its local storage
- Local storage is empty, so it calls publicPowerState(false, ...)
- This triggers an MENetworkPowerStatusChange event across the connected grid
- A TileController on the grid and in the EnergyGridCache receives this event, updates its metaState, and updates its block state
- Because the controller is directly next to a quantum bridge, this triggers a neighborUpdate for that bridge.
- The quantum bridge decides it must shut down, possibly because the other side has no power (see below).
- Shutting down the bridge destroys the grid, which causes the TileController to be removed from the EnergyGridCache's set of providers, while the original iterator is still live (a toy model of this chain is sketched below).
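Here's a toy model of that shape of failure: a listener reacting to a synchronous power-status event removes a provider while the extraction loop's iterator is still live. Everything is illustrative; none of these names are the real AE2 classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EventPathCmeDemo {
    interface PowerListener { void onPowerChange(boolean powered); }

    private final Set<String> providers = new LinkedHashSet<>();
    private final List<PowerListener> listeners = new ArrayList<>();

    void extract() {
        Iterator<String> it = providers.iterator();
        while (it.hasNext()) {
            String provider = it.next(); // throws CME once a listener has removed a provider
            if (provider.equals("localStorage")) {
                // Local storage turned out to be empty: announce the power change
                // synchronously, analogous to the publicPowerState -> event chain above.
                listeners.forEach(l -> l.onPowerChange(false));
            }
        }
    }

    public static void main(String[] args) {
        EventPathCmeDemo grid = new EventPathCmeDemo();
        grid.providers.add("localStorage");
        grid.providers.add("remoteCell");
        // Stand-in for the controller/quantum-bridge chain: handling the event
        // ends up removing a provider from the very set being iterated.
        grid.listeners.add(powered -> grid.providers.remove("remoteCell"));
        grid.extract(); // ConcurrentModificationException
    }
}
```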
My repeatable test case involves creating a connection between my main network and a p2p-only subnet which causes there to be two controllers present on the main network. The path out from the main controller goes into a p2p tunnel on the subnet, through the subnet's controller which sits next to a quantum bridge, then emerges from the p2p tunnel somewhere down-cable on the other side of the bridge.
If I connect the main network to the p2p-only network, the path then continues back across the bridge and hits the subnet controller. This forces a controller reboot when the duplicate controller is discovered, so my entire network reinitializes every ~5 seconds. If I repeatedly break + replace a couple of energy cells that are directly connected to the quantum bridge, I can trigger this CME within ~30 seconds or so.
I've spent a while now poring over the code (something I should have done before commenting!) and that's convinced me it'd take more time than I can spare to learn enough about the codebase to suggest fixes with confidence. I hope this evidence is helpful :-)
This has happened twice in less than an hour after setting up a Quantum bridge.
Reliably?
So I had the same problem as the OP and searched for hours for a fix. I fixed my world the last time it got corrupted for another reason, so I assumed there should also be a way to fix it for this sort of crash report. So far I have fixed 3 worlds from corruption, from different mods. Here are the steps I did for this one:
1. Go to and open config/AppliedEnergistics2/AppliedEnergistics2.cfg
2. Scroll down or use ctrl+f to find:
networkfeatures {
B:Channels=true
B:QuantumNetworkBridge=false
B:Security=true
B:SpatialIO=true
and
energy {
B:DenseEnergyCells=false
B:EnergyAcceptor=false
B:EnergyCells=false
edit: the purpose of this is to bypass energy passthrough checks and to clear any block that would cause the power-check loop to crash
The default is to have these all true; set them as I have shown.
3. Start your server up and, when prompted, type /fml confirm or use the -Dfml line in your launch .bat/.sh. It is very important to wait the 10 minutes for FML to load everything back up; you cannot see this, just wait for the output to start flowing again.
4. Your server should start; check bases if you want, and notice that the items listed as false no longer exist in the game.
5. Properly stop your server, go back to config/AppliedEnergistics2/AppliedEnergistics2.cfg, and set the previous settings all back to true. This will re-enable them on the server.
6. Your server should start; check bases if you want, and notice the items are still gone.
7. Ping your players that they will be required to recraft and replace any and all locations where they were placed or stored. I neglected to check the effect on crafting patterns, so you may want to test and ping them about that as well.
edit: Also tell your players not to directly connect quantum rings to a me controller as this causes the crash mentioned from one of these closed threads.
Yep. Same here. As soon as I placed two Quantum Rings next to my controller and gave each a dense energy cell on the other side I crashed.
I came here from Google, so just for anyone else encountering this: Add more energy cells on the controller side of the network.
I have two total "remote" energy cells on the other side of Quantum Rings. So I added 4 Energy cells to the "controller" side of the network and that seems to work around the issue.
Edit: It would seem removing and placing an energy cell on the remote side of the Quantum Ring still causes a crash. :(
Maybe just remove the remote cells all together and find another way to power the ring?