ComputerCraft

[1.76pr6] String Bugs

BombBloke opened this issue · 43 comments

commented

Not sure whether these count as "bugs", or even if it's practical/appropriate to "fix" them, but hey - let's recap.

At the moment:

  • All strings seem to send/receive correctly via modems (whereas they didn't use to; not sure when that got fixed, but hurray for that 👍 ).
  • Strings also seem fine when passed through commands API functions; all characters can be passed in and (where relevant) read back in some form. 👍
  • Strings saved via textmode file handles may have characters within certain ranges converted to ? (0x3F). 👎
  • Strings affected by the above also can't be sent/received via the http API. 👎

Assuming this won't be fixed/changed/whatever such that "all strings can be sent anywhere", I suggest a couple of workarounds:

textutils.serialise could be rigged to convert e.g. "ÿ" to the escape sequence "\255". textutils.unserialise is already capable of reverting this. (Edit: This is probably obvious to you, but leading zeros are important for codes less than 100 - e.g. "\025" rather than "\25" - so that a digit which follows isn't absorbed into the escape!)

textutils.urlEncode could be rigged to convert e.g. "ÿ" to "%FF".
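
A minimal sketch of how that serialiser escaping could work (illustration only, not existing CC code): every byte outside printable ASCII becomes a three-digit decimal escape, so a digit that follows can't be absorbed into it.

```lua
-- Sketch only: escape non-printable bytes as \ddd with zero padding.
local function escapeString(s)
    return (s:gsub("[^\32-\126]", function(c)
        return ("\\%03d"):format(c:byte())
    end))
end

print(escapeString("\255"))        --> \255
print(escapeString("\25" .. "5"))  --> \0255 (unescapes to char 25 followed by "5")
```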

commented

Which characters get lost when saving files? All 256 char values (with the possible exception of zero?) should be safe to save out.

Reading in is another matter. If the file contains UTF-8 encoded chars greater than 255, they're too big to fit in a Lua character, so they have to be converted if we don't want to switch to a multibyte encoding (which we don't).

commented

The short answer is "everything after char 127 fails to read or write".

Most affected values are converted to 0x3F (when read or written), and if multiple such values are present contiguously in a string, they may be merged into a single 0x3F (one such set starts at char 224 followed by whatever, though there may be others; my tests have been lazy). Some single chars are converted into multi-byte representations during writing (which I'm assuming are UTF-8, for who-knows-what reason).

Writing 0x00 works only sometimes; it may simply be omitted. Not entirely sure of the circumstances required to break it yet. It presumably has to do with the chars surrounding it.

Really what I'd expect is for every char to be written as if it were being filtered through string.byte() to a binary-mode file handle (or in Java terms, pass someString.getBytes() to a BufferedOutputStream made out of a FileOutputStream object), and for every char to be read as if it were being filtered through string.char() from a binary-mode file handle (get all the bytes from a BufferedInputStream made out of a FileInputStream and convert to chars one by one). I assume there's some benefit to allowing Java's text-handling classes to try and guess the encoding, but I'm not sure as to what that'd be, as they uniformly seem to cause only problems...? I'm not that familiar with how the Lua "strings" LuaJ is handling are actually being... handled.
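
For illustration, that expectation boils down to something like the following (a sketch, assuming the classic per-byte binary handle API, where write() takes a byte value and read() returns one, or nil at end of file):

```lua
-- Sketch: write and read a Lua string byte-for-byte via binary handles,
-- with no encoding layer in between.
local function writeRaw(path, str)
    local h = fs.open(path, "wb")
    for i = 1, #str do
        h.write(str:byte(i))
    end
    h.close()
end

local function readRaw(path)
    local h = fs.open(path, "rb")
    local bytes = {}
    for b in h.read do  -- each call returns one byte, or nil at end of file
        bytes[#bytes + 1] = string.char(b)
    end
    h.close()
    return table.concat(bytes)
end
```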

On the other hand, I notice fsFileHandle.writeLine() is making use of \r now, so that's pretty cool. :)

If it helps, here's a script that reports on fs accuracy (results go to results.txt):

http://pastebin.com/YSJ26Lw0

And here's the output under 1.76pr6:

http://pastebin.com/V7s3TegM

commented

Wait, your script is expecting that writing out unicode chars in the range 0-255 will result in those exact bytes being written to the file.

This just isn't correct. In text mode, strings are written to and read from files in UTF-8 format. The test you should really be doing is writing a string in text mode, then reading it back in text mode, and ensuring no characters are lost.

commented

[Screenshot: characters intact]

commented

That test wouldn't be as thorough; if I write char 128 in text mode, what gets saved is char 63. So whether attempting to read that back in text mode is going to get me char 128 is a foregone conclusion.

By mixing modes, it's demonstrated that garbage produced by one (textmode out) isn't affecting the garbage produced by the other (textmode in).

I have no understanding as to why you'd want anything to do with UTF-8 when that's not the encoding system ComputerCraft's using. You're using one byte per character, and I'm still stuck in the mindset of "extended ASCII" editors, built for dealing with the likes of code page 437: which seems to me to be this sort of thing? Unless there IS a working converter somewhere at play, in which case I haven't been able to find it. Maybe it's platform dependent.

Can you provide an example as to how I "should" be using text mode file handles to read/write chars such as 0xFF? If I simply save "h£llø WÖ®lÐ" into a file (using UTF-8, Unicode, Unicode big-endian...), loading it in the "edit" script mostly gives me a bunch of question marks.

commented

Writing out character 128 results in two bytes being written to the file: the UTF-8 encoding for 128, which is 0xC2 followed by 0x80. Reading that 2-byte file back in results in character 128 again. I've demonstrated that working in my above screenshot.
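
For anyone who wants to check this on their own setup, a quick test along those lines might be (a sketch, again assuming the per-byte binary read API):

```lua
-- Sketch: write char 128 in text mode, dump the raw bytes on disk, then
-- read it back in text mode. With UTF-8 text I/O the file should hold the
-- two bytes C2 80, and the text-mode read should return char 128.
local h = fs.open("utf8test.txt", "w")
h.write(string.char(128))
h.close()

h = fs.open("utf8test.txt", "rb")
local hex = {}
for b in h.read do hex[#hex + 1] = ("%02X"):format(b) end
h.close()
print("bytes on disk: " .. table.concat(hex, " "))

h = fs.open("utf8test.txt", "r")
print("round trip ok: " .. tostring(h.readAll() == string.char(128)))
h.close()
```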

commented

For me, writing "h£llø WÖ®lÐ" into a file with edit, then saving and reopening it also in edit, results in no data loss. Does it for you?

commented

Just wrote the same string in a new Sublime Text file, saved it with encoding UTF-8, and opened it in CC: all chars were present and correct.

commented

Also confirmed UTF-8 is the default for Sublime Text when just pressing "save", which I presume is common for most text editors (don't know about Notepad).

commented

As my above script demonstrates, for me, writing out character 128 in text mode results in one byte within the output file, that being 0x3F (a question mark). I'm getting the impression that if you ran it you'd get an entirely different output file, and that this is all platform dependent.

I can't load the string "h£llø WÖ®lÐ" into edit (it gets mangled into question marks seemingly regardless of the encoding the file was generated with), but I can paste it. Saving it from edit results in "h�’ll? W??l?" going to the output file (0x 68 81 92 6C 6C 3F 20 57 3F 3F 6C 3F 0A).

commented

I'll explain some reasoning here:

UTF-8 is used as the default encoding for text exchange, for the same reason it's the default encoding used by Java's text reader/writer classes themselves: because it's the most commonly used text encoding that supports all Unicode characters, and it's backwards compatible with ASCII, meaning CC will be able to produce files readable by most other text editors.

ISO/IEC 8859-1 is used as the internal text representation, because Lua chars are only 1 byte wide, a variable-width encoding would break a lot of user code, and this encoding fits in 1 byte/char, represents a lot of useful characters, and is easy to convert to Unicode.
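
As an illustration of that split (a sketch only, not the mod's actual conversion code), mapping between the one-byte-per-char internal form and UTF-8 is simple for the 0-255 range:

```lua
-- Sketch: ISO-8859-1 <-> UTF-8 for the byte range 0-255. Chars 0-127 pass
-- through unchanged; 128-255 become two-byte sequences (0xC2/0xC3 + rest).
local function latin1ToUtf8(s)
    return (s:gsub("[\128-\255]", function(c)
        local b = c:byte()
        return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
    end))
end

local function utf8ToLatin1(s)
    return (s:gsub("([\194-\195])([\128-\191])", function(lead, cont)
        return string.char((lead:byte() - 0xC0) * 64 + (cont:byte() - 0x80))
    end))
end

local original = "h\163ll\248 W\214\174l\208"  -- "h£llø WÖ®lÐ" in the internal encoding
assert(utf8ToLatin1(latin1ToUtf8(original)) == original)
```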

commented

Never mind, just noticed it writes to results.txt and not the console...

commented

So I just ran your script and got no output... maybe there is a platform issue here.

commented

I think previously LuaJ would try to use UTF-8 for the internal encoding as well, which none of the Lua code was set up to expect, which is why we had stuff like the window.blit() crash as soon as you entered a non-ASCII char.

commented

Thanks for the explanation. I'm curious as to how long this intended conversion process between the two character sets has been in ComputerCraft. Is it new for 1.76?

commented

Ok, so it seems like there is actually a bug here: my results.txt is different from yours:
http://pastebin.com/Mx4fZ5rX

commented

I'd expect the failures seen in my file given the incorrect assumptions your program makes, but I wouldn't expect those seen in yours. Which platform are you on?

commented

Just dropping in with my results.txt, which differs from both of yours.
http://pastebin.com/F6JCeR1v

System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577

commented

Indeed Dan, most (likely all) of the "errors" in your results have to do with me having no idea ComputerCraft was supposed to be performing certain textmode conversions, so they mean exactly what you thought my results meant: everything's probably ok, for you.

Thanks Wojbie, more results from other users should be useful. Your results are similar to mine, but yeah, there are some obvious differences in there.

Windows 8.1 x64
Forge 1.8, 11.14.4.1577
Java x64, 1.8.0_66 (just now updated from _40, same results).

I'm afraid it's 2:30am here at present, so I may have to leave off for the night. Thanks for looking into it.

commented

Ok, I've written a better test program that should uncover actual bugs only:
http://pastebin.com/9QJKi0kp

Could each of you run it and report your results? On my system (Mac), it only shows 2 errors: one for each of the two newline characters.

commented

I ran it on Windows 10, with 64-bit JDK 8u51. I got 34 mismatches. Here is the output: http://pastebin.com/KVYuNAN8

commented

Interesting, it seems like the C1 control characters (the ones we map to the teletext glyphs in CC) are getting turned into question marks by something... I wonder if that's the output or the input.

commented

Thanks folks, all looks good from here too. 👍

commented

Spoke too soon, sorry; forgot to test http. A paste containing h£llø WÖ®lÐ seems to read in ok:

http://pastebin.com/mAdtxtbU

But if I attempt to "put" it back, I end up with:

http://pastebin.com/DgHUWmqb

commented

Results from dan200's program:
http://pastebin.com/DZCWzKtD

It seems I got 81 mismatches.

System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577

commented

Sorry to be annoying, I wrote another version with better output; can you run this instead:
http://pastebin.com/wXw9m2WG

commented

It seems like the bug is in my assumptions: I assumed that OutputStreamWriter/Reader always used UTF-8 by default, as it does in C#, and as it seems to be doing on my Mac, when in actuality it's platform specific. I'm preparing a new version now which explicitly specifies UTF-8 for all text I/O through fs and http.

commented

Used the new program you wrote. Results here:
http://pastebin.com/z0PXP95d

I still got 81 mismatches.

System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577

commented

Thanks. Looks like whatever encoding your output is using converts everything 128-255 to '?'! Prepping pr7 now, let's see if it helps.

commented

Alright, could both of you run the test program again with this new beta:
http://minecraft.curseforge.com/projects/computercraft/files/2271506/download
(It might take a few minutes for the curseforge link to activate)

commented

Only 2 mismatches with the new version. Output: http://pastebin.com/TuMmLUdh

commented

woot woot, Wojbie?

commented

AFK for the next 2-4 hours, sorry. Will test as soon as I'm back.

commented

My results with PR6 (34 mismatches)
http://pastebin.com/Ft8gxWtf

Results with PR7 (2 mismatches)
http://pastebin.com/vvfUk0ps

System (Exact same as Wojbie)
Windows 7 Home Premium 64bit
Java 8 64bit 1.8.0_25 64bit
Forge1.8-11.14.4.1577

commented

My results with PR6 (81 mismatches)
http://pastebin.com/z0PXP95d

My results with PR7 (2 mismatches)
http://pastebin.com/HfGNWjAw

System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577

commented

Woot woot, seems like we fixed an actual bug here :)
The 2 mismatches that remain are \r and \n, i.e. if a file ends with a line ending, it gets cut off by file.readAll(). I don't mind this behaviour, because it means you can do print(file.readAll()) or otherfile.writeLine(file.readAll()) without getting a redundant newline added. Marking as closed. Thanks for everyone's help.
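
A small example of the behaviour being described, for anyone relying on it (a sketch):

```lua
-- readAll() trims a single trailing line ending, so copying a file with
-- writeLine(readAll(...)) doesn't keep adding newlines.
local h = fs.open("a.txt", "w")
h.writeLine("hello")            -- file now ends with a line ending
h.close()

h = fs.open("a.txt", "r")
local text = h.readAll()
h.close()
print(text == "hello")          -- true: the trailing newline was dropped
```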

commented

On reconsideration: I think the reason everyone was getting different numbers of errors was down to their Windows language settings: the default 8-bit codepage varies by country, so different chars from the ISO/IEC 8859-1 codepage couldn't be represented and had to be converted to ?. With UTF-8, every char can be represented.

commented

Sounds good, thanks!

commented

pr8 is out now

commented

👍

commented

Forgot to test before, but if I attempt to re-"get" the latter paste, the symbols that should be £ and ® end up as 0x3F (question marks).

commented

Here's another tester script, for http posting:

http://pastebin.com/g1VZiYhm

It first generates a list of all chars and dumps them to charlist.txt.

It then downloads a copy of that list (which I uploaded using my browser) to charlist2.txt, and compares the files. I find this works fine for me, except for char 0, which downloads as a space (32); this is probably pastebin.com converting the char at the time of upload (if the raw paste data visible on the site is any indication).

Next it uploads charlist.txt to pastebin itself (you'll be prompted to check the site for a CAPTCHA, as I started running into them), then re-downloads it as charlist3.txt. The files are then compared again - I get about 70 mismatches here; the uploaded file has a lot of question marks in it.
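
The upload step in that test boils down to something like the following (a sketch with a placeholder URL and form field; the linked script actually goes through pastebin's posting API, which also needs an API key):

```lua
-- Sketch: build the full 0-255 character list and POST it, percent-encoding
-- the body so every byte survives the trip. The URL and field name below
-- are placeholders, not the endpoint the linked script uses.
local chars = {}
for i = 0, 255 do chars[i + 1] = string.char(i) end
local charlist = table.concat(chars)

local body = "data=" .. textutils.urlEncode(charlist)
local response = http.post("http://example.com/upload", body)
if response then
    print("server replied: " .. response.readAll())
    response.close()
else
    print("post failed")
end
```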

The script output gets mirrored in results.txt:

http://pastebin.com/i7B0uGUz

commented

Thanks for the extra testing. I've now fixed these HTTP issues: they were caused by textutils.urlEncode() not properly encoding non-ASCII characters. It does now. I've also ensured the proper headers are set and read to get UTF-8 results from URLs (though you can modify these with the custom headers table if you wish).

I've also modified string.format("%q") (used by textutils.serialize) to escape all non-printable characters when producing strings.

These changes will be in pr8.
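
In the spirit of the original suggestion, a sketch of what such percent-encoding looks like (illustration only; the actual urlEncode fix may well encode non-ASCII chars via UTF-8 first):

```lua
-- Sketch: percent-encode everything except unreserved ASCII, so bytes
-- 128-255 come out as %XX rather than being mangled in transit.
local function percentEncode(s)
    return (s:gsub("[^A-Za-z0-9%-_%.~]", function(c)
        return ("%%%02X"):format(c:byte())
    end))
end

print(percentEncode("\255"))      --> %FF
print(percentEncode("h\163llo"))  --> h%A3llo
```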