LuckPerms

LuckPerms

41.4k Downloads

Retry database connection on connection failure

Eelviny opened this issue ยท 11 comments

commented

Add an option to make the server exit with a status code of 1 (basically just crash) if the connection to the external database can't be established on startup.

I run my servers in a Kubernetes cluster, and sometimes the MariaDB server will take a bit longer to start up than the Minecraft servers. As Kubernetes has a self-heal function, if the server exits, the scheduler will restart the server for me with an exponential backoff.

Currently all that happens is that the plugin gets disabled, which then requires manual intervention from me to restart the server.

commented

For reference for anyone having a similar problem, here's a solution that works outside of the plugin: https://medium.com/@xcoulon/initializing-containers-in-order-with-kubernetes-18173b9cc222

Basically create an initContainer that runs before the Minecraft server pod that checks to see if it can connect successfully to the database. If the initContainer fails to connect, exit 1 and the cluster will keep trying to run the initContainer until it succeeds.
Only after this will the Minecraft pod run.

commented

This wouldn't be a good thing at all.
At no point should a plugin intentionally crash or shutdown a server because it can't establish a Database connection or similar.
It's already enough that AuthMe does disable the server on a bad configuration and LP shouldn't follow such a trend either.

commented

Fair enough - I find this to be desirable behaviour in a microservices environment however, as I prefer to give the scheduler more freedom on placement. Either way, I'm suggesting a config option, not a default.

Maybe another option is a retry w/ backoff within Luckperms? From my experience I see that the plugin disables itself after the connection times out, instead of retrying. Correct me if I'm wrong.

commented

The plugin doesn't disable itself when it can't connect to the database, but rather denies the login of players, due to it being unable to connect to said database.
I can't see the point of this retry option here either as 99% of the time the database is already running before the server and not vice-versa.
You should have the database setup and running before you start the server, or other plugins using the remote database too might face the same issue.

commented

Strange, I'll attempt to reproduce the issue I'm seeing of the plugin getting disabled - most likely a bug causing it to crash.

@Andre601 I can definitely agree on the fact that 99% of the time the database will be running. From what I see, the vast majority of Minecraft servers run on a more traditional operating model of discrete, stateful machines.

This is proving difficult to translate into a newer containerised model as the plugin is making assumptions that services will always be online, and has no methods of automated recovery from failure. This is not always the case in Kubernetes, where occasionally the scheduler migrates services from one physical machine to the next.

Either way, I'm hoping that my use case gets considered.

commented

I mean having a config option would be fine tbh.

commented

I can imagine that some people could accidentally enable this setting without knowing what it means - the option could be available to add to the config file but not put into the file by default.

commented

An automatic retry for database connections would be an idea.
Maybe a setting like retryAttempts which is 0 by default?

commented

Maybe another option is a retry w/ backoff within Luckperms?

Seems like the best solution to me. I'll happily accept any good PR adding this behaviour. Ideally it would be optional, behind a configuration setting.

commented

That would be a decent idea. I recently had a case in the Discord where a member complained that after restarting the database LP would not automatically recover the SQL connection state.

commented

Please see my comment here: #2223 (comment)

Hopefully that ^ will also resolve this problem :)