[1.7.10] Exception in machine.lua after server reboot or chunk unload/load

Question

[1.7.10] Exception in machine.lua after server reboot or chunk unload/load

Prodavec opened this issue 4 years ago · 3 comments

Prodavec commented 4 years ago

Minecraft version: 1.7.10
OC JAR file: OpenComputers-MC1_7_10-1_7_5_6bb86d528-universal.jar
Environment: playing on dedicated server as a regular player (NOT server admin)

Hi! I've got an issue that very annoys me and some other players on our server utilizing power of OC. In short: If I have some kind of constant job working on the computer or the robot, it is highly likely that machine will throw an exception in machine.lua when chunk is reloaded on server reboot.

machine:1044: attempt to index a nil value (local 'result')
stack traceback:
/usr/bin/emmonitor.lua:654: in function </usr/bin/emmonitor.lua:652>
(...tail calls...)
machine:1044: in function machine:1042
(...tail calls...)
/lib/buffer.lua:52: in method 'flush'
/lib/buffer.lua:42: in function </lib/buffer.lua:33>
(...tail calls...)
/usr/bin/emmonitor.lua:95: in upvalue 'LogLine'
/usr/bin/emmonitor.lua:485: in function </usr/bin/emmonitor.lua:354>
[C]: in function 'xpcall'
machine:799: in global 'xpcall'
/usr/bin/emmonitor.lua:659: in main chunk
(...tail calls...)
[C]: in function 'xpcall'
machine:799: in global 'xpcall'
/lib/process.lua:63: in function </lib/process.lua:59>

Fulll log file: https://pastebin.com/Y3F0uYzj

Well, sure you want to know what is emmonitor.lua and what does it do? In general it is a LUA script running on the robot to control NuclearCraft's Fusion Reactor health, polling redstone signals and internat robot health like battery power, tool and inventory items. It is just while (true) do loop with non-blocking design allowing to pull signals (via event.pull()). Each loop iteration is being logged into the /var/log/emmonitor.log.

The architecture looks like this:

[global variables and settings]
[auxiliary functions]

MainLoop() // coroutine which does the job
{
    // poll redstone levels from the robot sides and log to the file
    // check inventory and log
    // if shit is found - use robot's tool and log
}

ExceptionHandler(msg)
{
    // write msg to the screen
    // write to /var/log/emmonitor_error.log using non-buffered write via fs.open()
    // reboot
}

PMainLoop() // Protected Main Loop
{
    return xpcall(MainLoop, ExceptionHandler)
}

main()
{
    // boot sequence: check robot compatibility, components avilability, version info and gui draw
    // create new coroutine PMainLoop()
    while PMainLoop() is suspended do:
      // resume coroutine
      // pull signals and wait for the new ones for the decent period of time: about 2-3 seconds
    end
}

// unprotected
main()

As you understand this code must work 24/7 without an unexpected issues. Things like HDD space for the log files are part of the gameplay and is not a problem. And it works pretty well if the chunk where the robot placed is always loaded. My record was about 120 hours of uptime. But when the server restarts or just everyone quit the server, the robot throws the exception listed above. I ensured that I have enough free RAM and HDD space but it doesn't matter because it may crash though.

This is how LogLine() looks like:

local function LogLine(str)
  if (not logging) then    -- logging is a global variable
    return false
  end

  local logFile = io.open(logFilePath, "a")    -- logFilePath is global var with log file path

  if (logFile ~= nil) then                       -- ensure the handle is valid
    logFile:write("["..os.date().."] "..str)    -- write log entry
    logFile:close()                                 -- close the handle

    return true
  end

  return false
end

I am closing the file handle each time I log the data to the file. As you can see in the full log file the robot stopped to write after this string
[05/05/79 21:54:36] Electric Pump is PRESENT and thown exception at the logFile:close(). It means the robot skipped one log entry with this message:

[<date/time>] Inventory HAS ENOUGH free space

and closed the log file without writing anything useful. Something like:

logFile:write("useful_data")
logFile:close()

logFile:write("bla-bla-bla_I_om_okay")
logFile:close()

logFile:write("we_are_still_good)
logFile:close()

...

logFile:write("you_will_never_see_this_line_in_the_log") // handle is VALID but data hasn't been written
logFile:close() // exception thrown

Using debug.traceback() I found that /lib/buffer.lua:52: in method 'flush' is just a string:
local result, reason = self.stream:write(tmp)
and
machine:1044: attempt to index a nil value (local 'result')
if not result[1] then
in the function processResult() called from somewhere, probably by invoke()
Obviously the local variable "result" is nil and here is unexpected behaviour.

Okay! But I decided to turn logging off. Completely. Just log the text right on the screen. It has taken some time, about 6 hours of constant logging from server, changing regions and returning to home back. And again and again.... and finally I've got the same error but at the other line of my script. This time I used custom traceback (just to give it a chance) but using default debug.traceback() the result is the completely same. Look at this

https://i.imgur.com/ZdYgMbo.png
it refers to CheckToolValidity() and then to line 1045 (function "error") in machine.lua
https://i.imgur.com/OExch0I.png
refers to line 1044

The order of the functions is:
[create_and_resume_coroutine] -> PMainLoop() -> MainLoop() -> CheckEquipment() -> CheckToolValidity() -> robot.durability() -> ...exception thrown in machine.lua:1044... -> ErrorHandler()

So it throws the error even if I don't use file system at all! But just read tool's durability. And it always happens when the chunk experiences the "shakes": loading/unloading, server reboot, reboot while player joining the chunk or vice versa when the player leaves the chunk and server reboots after.

Moreover on the first screenshot you can see weird message
a: file not found
But I do not use filesystem in that test. So I do not know the source of that message.

I found out WarpDrives software has pcall() wrapping their working thread and reboots PC if exception thrown. And this is how they... ehm... "fixed" the same problem. Moreover it is common practice in the community. You can see the title page of the program for example in Enantomorphic reactor controller program or Ship controller just because they force reboot the PC in the exception handler. But it is not solution actually. It is only workaround.

I need help to fix that exception to make LUA code more stable. I don't have access to the server (and server settings / logs) because I am not the admin but just a player which wants to make relatively stable systems in my base.

Prodavec · Answer 1 · 2022-01-22T19:55:09.000Z

Exception thrown when I even didn't run my script, at the shell's command line.

payonel · Answer 2 · 2022-06-08T14:53:25.000Z

can you disable bufferChanges in the opencomputers config for the server and retest?

Prodavec · Answer 3 · 2022-07-18T23:39:53.000Z

can you disable bufferChanges in the opencomputers config for the server and retest?

It seems it helps. Server admin turned that option to false and it fixed things. Thanks for the help but at this point buffered IO is not usable for us because of this behaviour (probablyt bug). At least at this moment I can run my robots 24/7 without worrying about exceptions thrown because of that weird error even if I don't use filesystem. But the message "a: file not found" is still present if your program works several days and you interrupt. Even if that program has simple while - do loop doing nothing but processing CTRL+C (soft interrupt) event.

Share to