Applied Energistics 2

ConcurrentModificationException at EnergyGridCache.java:316

fluffle opened this issue · 33 comments

commented

Hi! This is a dupe of #3800, which is locked, but please don't close this without reading it! I managed to trigger this exception repeatedly in a very short space of time, so I think I can provide some useful information on a way to reliably reproduce the crash and thus narrow down the root cause.

Describe the bug

Background

I was scaling out my AE system by adding a large number of quantum bridges. I had what I thought was a neat solution to the "both sides of the bridge must be powered" problem: I put an AE energy cell on the remote side of each bridge link. This was charged while the link was active, and contained enough of a buffer to kick the bridge back into life when the world was reloaded. It would then recharge itself.

Problem

Quantum bridges are power-hungry. The crashes started when the number of bridges on my network exceeded the power I was supplying (5400 RF/t from a NuclearCraft reactor, FYI), causing parts of the network to switch on and off rapidly. If we look at what the code is doing:

https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L316

Line 316 is in the middle of extractProviderPower. It's iterating over the LinkedHashSet of energy providers in the grid cache, and removing these providers from the grid cache if they don't have enough power. The set of providers is modified in storagePowerChangeHandler, removeNode and addNode:

https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L151
https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L509
https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L557

And iterated over (read-only) here in refreshPower:

https://github.com/AppliedEnergistics/Applied-Energistics-2/blob/rv6-1.12/src/main/java/appeng/me/cache/EnergyGridCache.java#L287

There are a few things that could be happening here, but given how I triggered this, I expect I'm running into race conditions between extractProviderPower and addNode/removeNode. When a quantum bridge powers down, I expect this causes the energy cell on the far side of the bridge to be removed from the cache via removeNode. If the cache was providing power to the network at the time via extractProviderPower, this would result in the ConcurrentModificationException you see here.
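For readers less familiar with Java's fail-fast collections, here is a minimal standalone sketch (toy code, not AE2's) of the failure mode I'm describing: any structural modification of a LinkedHashSet that doesn't go through the live iterator blows up the next call to next().

    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Toy reproduction of the crash mechanism, not AE2 code: mutating a
    // LinkedHashSet behind a live iterator's back is detected on the next
    // call to next(), which throws ConcurrentModificationException.
    public class CmeDemo {
        public static void main(String[] args) {
            Set<String> providers = new LinkedHashSet<>();
            providers.add("cellA");
            providers.add("cellB");

            Iterator<String> it = providers.iterator();
            while (it.hasNext()) {
                it.next();
                providers.remove("cellB"); // structural change outside the iterator
            }                              // -> CME on the following it.next()
        }
    }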

To Reproduce

Build a large network that has 7-8 quantum bridges with energy cells keeping their remote side powered. You don't need anything more than a bridge and a cell on the remote side. Limit the power to the network such that it only has 90-95% of the power it needs. This ought to result in crashes pretty quickly as individual bridges power on and off.

Expected Behaviour

Not crashing?

It ought to be pretty simple to fix the race: you need a mutex protecting your hash sets while you're modifying them, so wrap addNode/removeNode/extractProviderPower/storagePowerChangeHandler in synchronized blocks. Or there appear to be concurrent hash set implementations around in Guava etc. It's your code :-)
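To be concrete, this is the shape of locking I have in mind; a sketch only, with invented names rather than AE2's real fields and signatures:

    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Sketch of the suggested fix; the class is invented, the method names
    // only mirror the AE2 methods mentioned above.
    class GuardedProviders {
        private final Set<Object> providers = new LinkedHashSet<>();

        void addNode(Object provider) {
            synchronized (providers) {   // all writers take the same monitor
                providers.add(provider);
            }
        }

        void removeNode(Object provider) {
            synchronized (providers) {
                providers.remove(provider);
            }
        }

        void extractProviderPower() {
            synchronized (providers) {   // iteration holds the monitor throughout
                for (Iterator<Object> it = providers.iterator(); it.hasNext(); ) {
                    it.next();  // ... drain the provider here ...
                    it.remove(); // removal via the iterator itself is safe (the
                }                // real code would only do this for empty providers)
            }
        }
    }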

Crash logs

There are three consecutive crash reports in this .zip:

crash-reports.zip

Environment

I'm playing GregBlock 1.4.6 on Minecraft 1.12.2 with Forge 14.23.5.2768; it has ae2-rv6-stable-4.jar (1.12.2-3.3.16). The full mod list is in the attached crash reports.

commented

Same problem here with Enigmatica2Expert 1.58 (ae2-rv6-stable6)

1 Quantum Link, which is in the same chunk as the AE2 Controller. Adding dense cells didn't solve the issue, even though the remote side of the Quantum Link is powered by an energy acceptor.

As reports about self-chunkloading AE2 systems increase, this bug, which was never fixed, may come up again.

commented

Duplicate of #3800

commented

add/removeNode is never called concurrently. Maybe the event could cause it, but I'm currently not sure about a potential solution, as it could easily degrade performance.

commented

For anyone following this bug: please ignore the below! I was wrong: this is not about multiple threads, as @yueh rightly points out.

I don't think adding synchronization around a data structure that is clearly being modified from multiple threads concurrently is going to make matters worse.

If you're really worried about performance, (a) you should benchmark to prove or disprove your hypothesis rather than falling back on conjecture, and (b) using a concurrent hash set instead of broad locks would alleviate at least some of these concerns. But IMO the concerns are overblown anyway. If addNode and removeNode really aren't called concurrently, they can never contend for the lock, so it won't hurt performance there: taking a free lock is a very cheap operation. And if there is lock contention, then that's fine, because the lock would be preventing concurrent modification of your non-thread-safe data structure, and a slight performance hit is preferable to a crash-to-desktop.

If I created a PR, what kind of evidence that it didn't have a performance impact would you be looking for?

commented

This is about concurrency, not threads. synchronized won't do a thing here.

commented

I'm just guessing, as it's hard to wrap one's head around the complexity of the AE2 code, but is it possible that stuff like this

this.getProxy().getGrid().postEvent( new MENetworkPowerStorage( this, PowerEventType.REQUEST_POWER ) );

Could cause it when the grid of the quantum bridge gets joined to the main AE grid?

Anyway, closing all bugs related to this won't result in it being solved at any point, no matter what the reason for it is.

commented

You could use ConcurrentSkipListSet/Map to avoid getting the CME, but you'd have to make sure that changes are propagated properly.

From what I understand from fluffle, his assumption is most probably based on an async chunk loading or unloading event ...

The upcoming challenge I see in using concurrency-safe lists/sets/maps is handling a modification while an iterator is still running. The iterators of the classes I proposed are only weakly consistent: they never throw a CME, but they may or may not reflect concurrent changes. So on a modification, the running iterator has to be aborted and repeat its work, which means the work already done has to be rolled back, or the iterator's work has to be interruptible, or it ultimately has to be redone once the first cycle of iteration finishes.
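A small demo of that iterator contract, using plain JDK classes and toy data: the loop survives a mid-iteration removal without any exception, but whether it still observes the removed element is deliberately left unspecified.

    import java.util.concurrent.ConcurrentSkipListSet;

    // Demonstrates the JDK's "weakly consistent" iterator guarantee:
    // mutating the set mid-iteration never throws a
    // ConcurrentModificationException, but the live iterator may or
    // may not reflect the change.
    public class WeaklyConsistentDemo {
        public static void main(String[] args) {
            ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>();
            set.add(1);
            set.add(2);
            set.add(3);

            for (int i : set) {
                if (i == 1) {
                    set.remove(2);     // legal mid-iteration, no exception
                }
                System.out.println(i); // whether 2 prints is unspecified
            }
        }
    }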

commented

@yueh: synchronized is explicitly for preventing concurrent access to data structures that are not safe for use by multiple threads:

https://docs.oracle.com/javase/tutorial/essential/concurrency/locksync.html

@mindforger: I'm trying not to assume anything about where the concurrent accesses are happening from. Really, it doesn't matter where they're coming from. We have evidence they are occurring, so the right thing to do is to (a) put a mutex around the data structure to prevent concurrent access, or (b) use a different data structure that is safe for use by concurrent threads.

commented

Performance, basically: every action like inserting/extracting items will extract a bit of energy, and there can be hundreds of them each tick. So not having to check empty ones each time adds up, especially with large builds. Same issue with concurrent collections. Even if it's fine with a single network, some cases do not scale well, and I've already refactored/optimized them quite a bit for 1.12, so it's no longer such a hassle with some builds. Or we run into other issues like stack overflows. So I'm pretty reluctant to make it worse again.

commented

> but given how I triggered this I expect I'm running into race conditions between extractProviderPower and addNode/removeNode. When a quantum bridge powers down, I expect this causes the energy cell on the far side of the bridge to be removed from the cache via removeNode. If the cache was providing power to the network at the time via extractProviderPower, this will result in the ConcurrentModificationException you see here.

Well, that is technically an assumption, and based on it I added that I suspect some mechanism for async chunkloading/unloading (from whatever source; to name one possibility, Sponge) is causing this update while an iterator is running over the list.

And for (b) I proposed using existing data structures like ConcurrentSkipListSet/Map. The performance impact should not be that bad, and refactoring should make this a straightforward task for simple testing, but it bears the risk of issues with the weak link between iterators and the actual collections (what I described in my earlier post).

I just wanted to offer what I think is an easy-to-test solution to produce some results.

edit2: synchronized causes locks and can heavily impact performance or cause deadlocking. My solution would only prevent the actual CME; additional measures would ultimately have to be taken to keep everything "synchronized" from a data point of view.

commented

edit: my problem wasn't the same issue

commented

> but why did it happen to me at all with an AE2 system where everything is in the same chunk?

So from what I understood from your post, one of two QNBs in total is standing in the same chunk as the controller ... or do you mean one single unlinked QNB in the same chunk as the controller?

Is EVERYTHING in the same chunk? I mean the cables etc. as well.

When unloading is handled in a separate thread, this can still happen while an iterator is in use during the server tick, for example. (Edit: after a quick check I have no evidence yet that Sponge is still doing async chunkloading/unloading, and I did not find anything in the configs so far, but it would be funny to disable it for testing purposes.)

commented

Yes, both ends of the QNB are in the same chunk. Nothing of the AE2 system is connected to anything outside of that chunk.
But I'm pretty sure the server I'm playing on is using Sponge, and I'm not sure if that uses something like vertical async loading.

commented

Please stop guessing about stuff, if you do not understand it. It simply does not help anyone and just wastes time to debunk the mess it causes.

commented

> Maybe the event could cause it

So that ain't guessing, too?

I'm just trying to contribute to solving the problem.

commented

Edit: Please ignore this too. It's not incorrect, but it's also based on faulty assumptions.

@mindforger: That's true, I'd forgotten I wrote that earlier. My apologies.

I agree that a concurrent data structure is probably the way to go here, but I don't know where all this fear of synchronized being slow comes from. Java's own java.util.concurrent.ConcurrentHashMap uses synchronized blocks internally, for example. Guava's ConcurrentHashMultiset uses atomics instead. At some point, if you have concurrent access to a data structure by multiple threads, you either need to protect it with a mutex lock or use compare-and-swap based atomic operations (which are what mutexes are usually implemented with anyway).

Locking isn't inherently slow; you can Google for "java synchronized performance benchmarks" or similar to find plenty of evidence of this. I accept that there are some overheads, even when there's no contention, but they're of a different order of magnitude (i.e. measured in clock cycles rather than milliseconds). Deadlocks are logic bugs; they're not a reason to avoid locking completely. Java's synchronized blocks are re-entrant anyway, which makes accidentally deadlocking somewhat harder.
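For what it's worth, a crude probe of uncontended lock cost is easy to write. A proper comparison belongs in a harness like JMH, so treat the number this prints as an order of magnitude at best:

    // Crude probe of uncontended synchronized cost; single-threaded, so the
    // lock is never contended. JMH would be the right tool for a real
    // benchmark; this only illustrates the order of magnitude.
    public class LockCostProbe {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) {
            final int iterations = 100_000_000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                synchronized (LOCK) {
                    counter++;
                }
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("~%.1f ns per lock/unlock pair%n",
                    (double) elapsed / iterations);
            System.out.println(counter); // keeps the JIT from eliding the loop
        }
    }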

@yueh: As I said earlier, I'm trying to avoid guessing – at least for now, please disregard the conjecture in my initial bug report. You have plenty of evidence of concurrent access to a data structure that does not support it, and there are things you can and should do to prevent that.

commented

@ben-mkiv Your idea isn't that far off; just the cause is a different one. It's actually pretty obvious what causes it. There are just a couple of different approaches, and neither is ideal.

But otherwise it's just plain wrong, e.g. not understanding concurrency and parallelism and proposing a deadlock to fix it.

commented

@yueh Sorry to bring it up again, and maybe you did not mean what I proposed, but the skip list is lock-free and fast to implement for testing. The only remaining issue is keeping track of concurrent access by monitoring or queueing updates, like the old markDirty.

commented

Why do providers have to be removed from the grid if they run out of energy? At some point they're going to get added back again anyway. (If that's to stop powered-down networks from doing stuff, and it isn't handled somewhere else, the method could check globalAvailablePower before iterating over every provider, which it could do anyway before even starting to iterate; see the sketch below.)
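Something like this hypothetical early-out; the field name globalAvailablePower is the one mentioned above, everything else is invented for illustration:

    // Hypothetical sketch of the proposed early-out; the field name
    // globalAvailablePower comes from the comment, the rest is invented.
    class EnergyCacheSketch {
        private double globalAvailablePower;

        double extractProviderPower(double amountNeeded) {
            if (globalAvailablePower < amountNeeded) {
                return 0; // not enough power anywhere: skip iterating providers
            }
            // ... iterate the providers and drain them here ...
            globalAvailablePower -= amountNeeded;
            return amountNeeded;
        }
    }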

commented

any solution in mind?

Maybe a global hash set in the grid cache which stores the providers that are currently active and trying to extract, and from which they remove themselves when done extracting? That should prevent any recursive loops and deadlocks.

edit: actually, a simple boolean mutex might be sufficient (see the sketch below)
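A sketch of what that guard could look like, with invented names. Note that a plain boolean field is only enough because everything here runs on the single server thread, so a "re-entry" is recursion rather than a second thread:

    // Sketch of the boolean guard suggested above; names are invented.
    // A plain field works only because the grid runs on one thread, so the
    // re-entry being guarded against is recursion, not a second thread.
    class ReentrancyGuardedCache {
        private boolean extracting = false;

        double extractProviderPower(double amount) {
            if (extracting) {
                return 0; // re-entered via an event: bail out rather than
            }             // mutate the set a live iterator depends on
            extracting = true;
            try {
                // ... iterate providers; events posted here may re-enter ...
                return 0;
            } finally {
                extracting = false; // always clear the guard
            }
        }
    }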

commented

I decided to experiment a bit with this, and I concede that I clearly don't understand things fully :-)

I figured there were two dumb, easy options that ought to DTRT and fix the bug:

  1. The docs for LinkedHashSet say that if you expect concurrent access to the LinkedHashSet, you should wrap its construction in a call to Collections.synchronizedSet(). So I did that for both this.providers and this.requesters, and managed to trigger the same ConcurrentModificationException.
  2. Even dumber, I wrapped all the accesses (both reads and writes) to this.providers in synchronized(this.providers) blocks, and the same for this.requesters. This should prevent concurrent access to those sets, and yet I can still trigger the same exception. A distilled toy version combining both attempts is sketched below.
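Here's that distilled toy version: even with a synchronizedSet wrapper and the monitor explicitly held, the owning thread can re-acquire the re-entrant lock, mutate the set mid-iteration, and the fail-fast iterator objects exactly as before.

    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Toy version of both attempts above: a synchronizedSet wrapper plus an
    // explicit synchronized block. The monitor is re-entrant, so the very
    // thread holding it can still mutate the set mid-iteration, and the
    // fail-fast iterator throws exactly as before.
    public class SynchronizedButStillCrashes {
        public static void main(String[] args) {
            Set<String> providers =
                    Collections.synchronizedSet(new LinkedHashSet<>());
            providers.add("a");
            providers.add("b");

            synchronized (providers) { // required by the synchronizedSet docs
                for (String p : providers) {
                    providers.remove("b"); // same thread re-acquires the lock fine
                }                          // -> ConcurrentModificationException
            }
        }
    }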

FWIW the performance impact of either of these failed approaches appears to be negligible. I don't have better benchmarks than anecdata: repeatedly running /forge tps before and after, and observing broadly similar numbers.

Also FWIW these attempts did not result in deadlocks. But if they don't actually fix the problem that's not so valuable either ;-)

commented

Edit: Drinking means you miss obvious things, like providers being a LinkedHashSet of IAEPowerStorage not IEnergyGridProvider, and the visited set tracking and preventing this kind of recursion. Ignore this post, if you expanded it.

I continued staring at this and trying things while drinking :-)

Eventually I noticed that IEnergyGrid extends IEnergyGridProvider, which means that an EnergyGridCache can have itself as an EnergyGridProvider, either directly or indirectly. This, coupled with Java's re-entrant locks, would explain the ineffectiveness of synchronization: this isn't actually a concurrency problem at all, it's a recursion problem.

If an EnergyGridCache ends up with itself as a child provider (directly or indirectly), we end up with co-recursion between extractAEPower and extractProviderPower. If one of the inner levels of recursion removes something from the set of providers, the outer level will end up triggering the ConcurrentModificationException: by the time its call to extractAEPower returns, the child has called it.remove() on its own iterator, invalidating the outer one.
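A toy model of that shape (the method names mirror the AE2 ones above, everything else is invented). One thread, no locks involved, and it throws the identical exception:

    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Toy model of the co-recursion described above; only the method names
    // mirror AE2's. One thread, no races: the inner call mutates the set
    // the outer call's iterator still depends on.
    public class RecursiveGrid {
        private final Set<Object> providers = new LinkedHashSet<>();

        RecursiveGrid() {
            providers.add(this);   // the grid ends up among its own providers
            providers.add("cell");
        }

        void extractProviderPower() {
            for (Iterator<Object> it = providers.iterator(); it.hasNext(); ) {
                Object p = it.next();
                if (p instanceof RecursiveGrid) {
                    ((RecursiveGrid) p).extractAEPower(); // recurse into ourselves
                }
            }
        }

        void extractAEPower() {
            providers.remove("cell"); // inner level mutates the outer level's set
        }

        public static void main(String[] args) {
            new RecursiveGrid().extractProviderPower(); // CME on the next it.next()
        }
    }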

So, yeah. Sorry for the misdirection here, it's not really a concurrency problem at all, despite the exception.

commented

Consider the above fix suggested with little confidence. It fixes my particular test case, though!

commented

Just adding my crash reports for reference here:
https://pastebin.com/SJjRrnRH
https://pastebin.com/B5EzuHg6

commented

I'd like to apologise for some of my conduct upthread. Hopefully I've not irritated people too much with my attitude. I made some assumptions that were most definitely not true, and ran my mouth off before gathering enough evidence.

On the subject of evidence, I did Terrible Things to get a backtrace which shows one path by which an event can cause this ConcurrentModificationException.

  1. EnergyGridCache begins its update and tries to extract power from its local storage
  2. Local storage is empty, so it calls publicPowerState(false, ...)
  3. This triggers an MENetworkPowerStatusChange event across the connected grid
  4. A TileController on the grid and in the EnergyGridCache receives this event, updates its metaState, and updates its block state
  5. Because the controller is directly next to a quantum bridge, this triggers a neighborUpdate for that bridge.
  6. The quantum bridge decides it must shut down, possibly because the other side has no power (see below).
  7. Shutting down the bridge destroys the grid, which causes the TileController to be removed from the EnergyGridCache set of providers, while the original iterator is still live.

My repeatable test case involves creating a connection between my main network and a p2p-only subnet which causes there to be two controllers present on the main network. The path out from the main controller goes into a p2p tunnel on the subnet, through the subnet's controller which sits next to a quantum bridge, then emerges from the p2p tunnel somewhere down-cable on the other side of the bridge.

If I connect the main network to the p2p-only network, the path then continues back across the bridge and hits the subnet controller. This forces a controller reboot when the duplicate controller is discovered, so my entire network reinitializes every ~5 seconds. If I repeatedly break and replace a couple of energy cells that are directly connected to the quantum bridge, I can trigger this CME within ~30 seconds or so.

I've spent a while now poring over the code (something I should have done before commenting!) and that's convinced me it'd take more time than I can spare to learn enough about the codebase to suggest fixes with confidence. I hope this evidence is helpful :-)
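For completeness, the one defensive pattern I kept coming back to while reading is snapshot iteration: walk a private copy so that re-entrant add/removeNode calls fired by events can't invalidate the loop. This is a sketch with invented names, not a claim about the right fix; the per-call copy is exactly the kind of overhead yueh is worried about.

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Sketch (invented names) of snapshot iteration: events fired from
    // inside the loop may freely mutate the real set, because the loop
    // walks a private copy. The cost is one list allocation per call.
    class SnapshotIterationSketch {
        private final Set<Object> providers = new LinkedHashSet<>();

        void extractProviderPower() {
            for (Object p : new ArrayList<>(providers)) {
                if (!providers.contains(p)) {
                    continue; // removed re-entrantly after the snapshot was taken
                }
                // ... drain p; events posted here may call add/removeNode ...
            }
        }
    }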

commented

https://pastebin.com/BVgkmx8U

This has happened twice in less than an hour after setting up a quantum bridge.

commented

> edit: Also tell your players not to directly connect quantum rings to an ME controller, as this causes the crash mentioned in one of these closed threads.

Reliably?

commented

So I had the same problem as the OP and searched for hours for a fix. I fixed my world the last time it got corrupted for another reason, so I assumed there should also be a way to fix it for this sort of crash. So far I have fixed 3 worlds from corruption, from different mods. Here are the steps I did for this one:

  1. Go to and open config/AppliedEnergistics2/AppliedEnergistics2.cfg
  2. Scroll down or use ctrl+f to find the following sections, and set the values as shown (the default is for them all to be true):

    networkfeatures {
    B:Channels=true
    B:QuantumNetworkBridge=false
    B:Security=true
    B:SpatialIO=true

    and

    energy {
    B:DenseEnergyCells=false
    B:EnergyAcceptor=false
    B:EnergyCells=false

    edit: the purpose of this is to bypass the energy passthrough checks and to clear any block that would make the crashing power check loop run
  3. Start your server up and, when prompted, type /fml confirm or use the -Dfml line in your launch .bat/.sh. It is very important to wait the 10 minutes for FML to load everything back up; you cannot see this, just wait for the log to start flowing again.
  4. Your server should start; check bases if you want, and notice that the items listed as false no longer exist in the game.
  5. Properly stop your server, go back to config/AppliedEnergistics2/AppliedEnergistics2.cfg, and set the previous settings all back to true. This will re-enable them on the server.
  6. Your server should start; check bases if you want, and notice the items are still gone.
  7. Ping your players that they will be required to recraft and replace any and all locations where these were placed or stored. I neglected to check the effect on crafting patterns, so you may want to test and ping them about that as well.

edit: Also tell your players not to directly connect quantum rings to an ME controller, as this causes the crash mentioned in one of these closed threads.

commented

Yep. Same here. As soon as I placed two Quantum Rings next to my controller and gave each a dense energy cell on the other side I crashed.

commented

On which version? We believe this has been fixed in 1.15+

commented

I'm on a 1.12 build. So yeah, probably fixed.

commented

I came here from Google, so just for anyone else encountering this: Add more energy cells on the controller side of the network.
I have two total "remote" energy cells on the other side of Quantum Rings. So I added 4 Energy cells to the "controller" side of the network and that seems to work around the issue.

Edit: It would seem removing and placing an energy cell on the remote side of the Quantum Ring still causes a crash. :(
Maybe just remove the remote cells altogether and find another way to power the ring?

commented

We just had another user in our Discord with this crash, I know it's anecdotal but it just so happens they also had a quantum ring connected directly to the ME Controller.