unserializeJSON returns valid UTF-8 strings - Documentation/logic issue.

Question

unserializeJSON returns valid UTF-8 strings - Documentation/logic issue.

Wojbie opened this issue 6 months ago · 3 comments

Wojbie commented 6 months ago

Minecraft Version

1.20.1

Version

1.111.0

Details

Issue

When using serializeJSON on string containing characters above 128 they are represented using unicode escapes.

This means that when same string is fed into unserializeJSON it results in string that is valid UTF-8 string representation of input.

local a = string.char(36)
print(a)
local b = textutils.serialiseJSON(a)
print(b)
local c = textutils.unserialiseJSON(b)
print(c, c == a) print()

local a = string.char(129)
print(a)
local b = textutils.serialiseJSON(a)
print(b)
local c = textutils.unserialiseJSON(b)
print(c, c == a) print()

local a = string.char(255)
print(a)
local b = textutils.serialiseJSON(a)
print(b)
local c = textutils.unserialiseJSON(b)
print(c, c == a)

This result, while understandable feels quite alien to someone who does not expect this and could lead to confusion about what is going on here.
As one would expect things to be same after unserialisation as they were before serialisation

Solution

It can be solved via one extra step

local a = string.char(255,128,129,22,13,110)
print(a)
local b = textutils.serialiseJSON(a)
print(b)
local c = textutils.unserialiseJSON(b)
print(c, c == a)
local d = string.char(utf8.codepoint(c,1,-1))
print(d, d == a)

Documentation

But this extra step is non-oblivious to unexperienced CC user or ones without understanding of unicode/utf8 library in lua.

I feel like this issue could be solved by documenting this interaction and solution and/or adding a flag to unserialiseJSON where it make it do utf8 conversions back to CC characters automatically.

SquidDev · Answer 1 · 2024-07-01T12:15:28.000Z

I'm not really sure there's a good fix to the code here. JSON requires that strings (and the rest of the document) are well formed unicode¹. There's not any way to encode arbitrary binary data.

The correct behaviour of the JSON encoder/decoder probably would be that they force the input/output to be valid UTF-8, and error on anything else. But that's obviously impossible for backwards compatibility reasons.

I don't think adding a flag to enable this broken behaviour on the unserialiseJSON is the right solution, as that's just papering over existing problems. I suspect the only option is to just document this better.²

Our JSON parser is also wrong, in that it accepts textutils.unserialiseJSON('"\255"') == "\255" rather than returning a decode error. I suspect it's too late to fix that now. ↩
I guess we could also add a new cc.json module which handles unicode correctly, and deprecate the existing methods, but that feels a bit more dramatic. ↩

Wojbie · Answer 2 · 2024-07-01T13:40:24.000Z

I don't think adding a flag to enable this broken behavior on the unserialiseJSON is the right solution

I definitely agree with your point about not adding broken behavior or something custom as this can be just solved with documentation pointing it its outputting unicode encoded string and link to utf8 lib/info how to decode that.

The correct behaviour of the JSON encoder/decoder probably would be that they force the input/output to be valid UTF-8, and error on anything else. But that's obviously impossible for backwards compatibility reasons.

Our JSON parser is also wrong, in that it accepts textutils.unserialiseJSON('"\255"') == "\255" rather than returning a decode error. I suspect it's too late to fix that now.

As for this we could provide a bandaid for this issue with unicode_strings flag also for for unserialiseJSON where it would error if input string was not valid... While it does start to feel like php and would have to stay disabled by default for backwards compatibility it does mean that if someone wanted it to be more strict about its input as properly encoded Unicode?
I am assuming i remember correctly that is errors serialiseJSON if that flag is set and input is not valid unicode.

AnrDaemon · Answer 3 · 2024-08-05T10:49:23.000Z

If your source string happened to be valid UNICODE string, I see no issues.
But a documentation note is definitely required about encoding ARBITRARY data to JSON.
The accepted way of doing so is to base64encode the data to a string, and decode afterward.
An example wrapper code would also be helpful.

Share to