Permissions / Group being deleted when permissions are removed/expire
rgaminecraft opened this issue ยท 5 comments
This is in response to the following bug report: #2751
I would have responded there, but the issue has since been closed, and we have more information to add that will help with a possible solution. I will repost the initial report here and modify it a bit to fit the new research we've done.
Description
In a multi server environment a user is assigned a "temporary" permission node. ("temporary" in quotes to mean that it's either an actual temporary expiring permission, or temporary in the sense that another command will shortly remove it from the player)
Later on (could be many hours later) a player is assigned more groups or permissions that are permanent.
At some point the "temporary" permission nodes expire (or are removed via command) and at that same moment the permissions/groups added after the initial "temporary" node are removed at the exact same time.
Reproduction steps
Assign "temporary" permission to user
Wait indefinite amount of time
Assign permanent permissions/groups to that user
Wait indefinite amount of time
"Temporary" permission expires (or is removed via command) and any perms added after that "temporary" node are removed
NOTE: This does NOT occur all the time, and is very rare
Expected behaviour
User does not lose permanent groups / permissions
Environment details
Servers mixed 1.12.2 PaperSpigot / latest and 1.16.4 Paperspigot latest
Luckperms (Latest)
DataBase MySQL 5.7.32
Messaging - Redis
620,000 approx records in _players table
850,000 approx records in _user_permissions table
DB Server on 10GB LAN with servers and running at low cpu usage.
Any other relevant details
Troubleshooting steps followed:
Verified that permissions were inserted into _actions table, as well as user_permissions table
There is no corresponding delete in the _actions table
We added timestamps to user_permissions table and at corresponding timestamp found missing rows (based on record ID)
We added database table triggers to log inserts and deletes and verified that the permission / group was indeed in the table and was inserted and then deleted by the luckperms when luckperms removed "temporary" permissions (either by expiry or manual remove command)
UPDATES NOT INCLUDED IN PRIOR ISSUE
I understand that you get lots of requests from people that don't have a clue what they are doing, and as much as this answer typically works for most users, it's note really a solution as I believe the problem lies somewhere within your application (and here's why). I have no knowledge of the inner workings of your plugin, which is why I was hoping that mentioning the things we did would help jar something that you may be familiar with since you wrote the system. I'm hoping with the following insight you may be able to figure out what is causing this. And please take no offense to my tone, I just have peeves about people jumping to "it's not my issue".
First of all, this is an issue that only periodically pops up from time to time, and is definitely NOT the normal scenario. With our new table triggers we are now able to get a better idea of exactly how frequently this occurs, but it will take a while to gather the data (however useless it may be towards solving this).
User Cases
I have replaced our permissions/groups with 'friendly' names. Any use of the same GROUPA/GROUPB/Perm.NodeA name will reflect the same permission across all cases. Rather than forcing you to learn our schema, I simplified it.
User 7a63
2020-11-30 20:51:17 | CONSOLE at HUB ran [parent add groupa]
2020-11-30 20:51:17 | CONSOLE at HUB ran [parent add groupb]
2020-11-30 22:28:23 | Trigger=DELETE, UUID=[7a63...] Node=[group.groupa]
2020-11-30 22:28:23 | Trigger=DELETE, UUID=[7a63...] Node=[group.groupb]
2020-11-30 22:28:23 | Trigger=DELETE, UUID=[7a63...] Node=[Specific.SpecialNode] <- This node is a permanent node that is added and removed when changing servers, and is added/removed via "console" command.
User a548
2020-12-01 06:36:07 | CONSOLE at HUB ran [permission settemp Timed.PermNode true 18h]
2020-12-01 23:10:45 | CONSOLE at HUB ran [parent add groupc]
2020-12-02 00:36:08 | Trigger=DELETE, UUID=[a548...] Node=[group.groupc]
2020-12-02 00:36:08 | Trigger=DELETE, UUID=[a548...] Node=Timed.PermNode] <- This is the temporary node from above, and when removed also deleted the users group.
User 6214
2020-12-03 13:13:01 | CONSOLE at HUB ran [permission settemp Timed.PermNode true 18h]
2020-12-03 13:14:32 | CONSOLE at HUB ran [parent add groupb]
2020-12-03 16:43:20 | Trigger=DELETE, UUID=[6214...] Permission=group.groupb]
2020-12-03 16:43:20 | Trigger=DELETE, UUID=[6214...] Permission=Timed.PermNode] <- Once again when a temporary perm node is removed, the user also loses their group.
Again, I don't have knowledge of how your system works, nor do I really want to spend time tearing it apart and learning it, but after poking around for a few minutes I've got a (very) small idea of some concepts. Here's my thought on what's going on:
-
A user joins our network and is given an arbitrary permission node (these nodes are either temporary (assigned as such) or "permanent - but removed later manually via console command")
-
At some point while they are playing SOME EVENT occurs and decides to strip the "temporary" perm node described above, as well as whatever other groups/perms were added up until that point.
To help us better troubleshoot whatever SOME EVENT may be, I wrote table triggers to log every insert/delete from the user permissions table so I can see when this is occurring, but mainly initially added just to verify that the records were in fact in the table to begin with and not just somewhere uncommitted in memory (again wasn't aware of how LP works internally).
As you can see from the above examples, all of the times when a player loses permissions or groups, it occurs at the exact time that the "temporary" perm nodes are removed (which if you can't follow, "temporary" means either ACTUAL temporary(timed) nodes or nodes that are added and very shortly after removed) There has never been a case where a permission "randomly" falls off, it was just a term we used to describe the seemingly random loss of the permission.
I'm sure you read above, but we have many servers on our network that communicate via REDIS messaging with MYSQL storage (hence the use of triggers).
Correct me if I'm wrong on anything below this point, as this is pure speculation from the small insight I have.
From what I gather the messaging system simply notifies when a certain UUID has had something happen, which then makes a call to sync up it's local perms for that user. With that said, it COULD be possible that all servers are not in sync if they miss that message. HOWEVER, if they missed the message, they would only be aware of their "current state" for that user which would NOT include newer information.
Continuing on, if a server is out of sync (aka missing a new perm node such as group.groupa), it would still know about the "temporary" node that is about to expire. Internally a clock says "Do any users have any expired permissions?" at which point it loops through each user and if it finds anyone with an epoch time less than current, it runs the SaveUser() routine which dumps the player out to the database.
AH HA! So the issue is within the save user process saving an old copy of the user to the database like you said! Well, except for one pretty MAJOR flaw: it knows about the new permanent groups/perms!
You see, if your theory is that the server doesn't have the most updated information, it SHOULDN'T know that the user has (for example) "group.groupa", so when it goes to save the data to the database, it would call the SaveUser() routine but "permissions to remove list" would NOT include the permanent group, which means there should be no delete statement issued for "group.groupa".
This tells me that the server indeed DOES know about this new permission node, and for SOME REASON (this is where I'm hoping something clicks in your mind) some process decides to add the "group.groupa" node to the "permissions to remove list" ALONG WITH the temporary perm node that is being removed. My point here is that SOME EVENT happens that says SaveThisUser and something in LuckPerms decides to add the various perms/groups to the list of things to expire.
Now on to the fun part: ROOT CAUSES
Possible Cause 1: When an event comes in (such as the timer expiring actual time based perms, or a permission node was manually removed via a command and an update message is received) the LuckPerms system decides to run SaveUser() which for some reason decides that unrelated permissions/groups need to be expired along with the "temporary" nodes.
Possible Solution 1: When deleting "temporary" permissions and saving the user back to the database, LuckPerms should verify that it's ONLY expiring things that have a "time" value greater than zero (since anything else would be permanent and should not be getting stripped away) This however assumes that the system somehow things the ACTUAL permanent nodes are for some reason TIMED nodes (which they aren't), which I'm assuming is not going to be the case. Maybe in memory when they are added it has their "time" as 0 (0=non expiring perm) and some logic says "Yeah, 0 is less than the current unix timestamp, let's remove it!" Again, this is why I personally don't use EPOCH times, but I'm also not here to critique.
Possible Cause 2: The local LuckPerms server is out of sync with the database, and when calling the SaveUser() routine it notices "Hey the database told me this user has permission nodes XXXXX and YYYYY, but my copy doesn't show the user having those so I need to remove them (not knowing it's not up to date)"
Possible Solution 2: The LuckPerms server should NEVER "assume" that it knows better than the database. If the database says it has a permission/group node for a player, it should never get removed unless manually done so VIA command for that SPECIFIC node. As you can see by the timestamps, both the "temporary" and permanent nodes were removed in the exact same millisecond which tells me that LuckPerms queued+ran those delete statements together. Furthermore, since it's happening at the exact same time each time, this disproves your theory about us manually using any API calls (at least in my mind) since they times line up with the moments that your scheduler/task engine runs. However, this is assuming my basic knowledge is even remotely close to any processes that exist, and is where I'm hoping you can run through that flow in your head and identify the actual process as well as at what point this may occur.
Possble Cause 3: They MySQL server has a "hiccup" when looking up a users permissions when LuckPerms tries to load the current users permissions. When they connected the permissions loaded fine, but at the time of saving the database had a lock/deadlock/table lock/internal explosion/something else that caused it to return an empty data set. LuckPerms then says "Shit, the database says you don't have ANY of these perms, let's queue them up for deletion!" (which if this is the case, why would you delete something even though you already think it's not there to begin with?)
Possible Solution 3: If there is indeed some hiccup causing the database/LuckPerms to get out of sync, LuckPerms shouldn't try to delete permissions that don't line up with what the database says. If LuckPerms has a permission in memory for a user, and the database says they don't have it, the permission should just be dropped from memory and NOT added to a list of perms to delete.
Summary
Please keep in mind that I have very limited knowledge of your code, and have only spent about 10 minutes browsing it to come to most of my conclusions above. What I said could be WAY off from what the actual execution path is, and a lot of assumptions are the ways that I would expect it to operate.
With that said, I'd love to list more scenarios, but honestly I'd need your input to understand the process better. In my mind something in your code is bulk removing perms when the server decides to either a:(expire a timed perm) or b:(a permission is manually deleted via command). I can only continue to supply baseless guesses as to what part of your system is causing this issue, and have no explanation as to the cause. Perhaps it's missed REDIS messages causing the server to get out of sync, perhaps it's the MySQL server having a hiccup, perhaps someone is manually running MySQL statements and modifying the logs to look that way (VERY highly unlikely)
At the end of the day it's not a SUPER common occurrence, but on the same token we have only been running detailed tracing for a week. As we continue to experience these issues I will keep collecting data, but in the meantime I'm hoping some of this massive wall of text makes something in your head scream "I GOT IT!".
Thanks for the extra information & v. detailed report.
You bring up some good points, I'll have a think about how best to proceed and get back to you. :)
Again, no offense/disrespect meant at all. I just spend all day dealing with "not so happy" people and I've become a very grumpy person. No real rush so if this takes a while to fix so be it, it's just something we've always dealt with but has finally hit that level of "pain" where we want to get it resolved.
I think the likely cause of this issue will be fixed by the above commit.
I'm going to let it sit as a PR until I get a chance to properly test & re-review what I've written, but it seems to work ok at first glance :)
The PR has now been merged into master: 478fddc
I'd appreciate it if you could have a try using v5.2.55
or newer to see if it fixes the issues you're having.