[1.76pr6] String Bugs
BombBloke opened this issue · 43 comments
Not sure whether these count as "bugs", or even if it's practical/appropriate to "fix" them, but hey - let's recap.
At the moment:
- All strings seem to send/receive correctly via modems (whereas they didn't use to; not sure when that got fixed, but hurray for that 👍 ).
- Strings also seem fine when passed through commands API functions; all characters can be passed in and (where relevant) read back in some form. 👍
- Strings saved via textmode file handles may have characters within certain ranges converted to ? (0x3F). 👎
- Strings affected by the above also can't be sent/received via the http API. 👎
Assuming this won't be fixed/changed/whatever such that "all strings can be sent anywhere", I suggest a couple of workarounds (sketched after this list):
- textutils.serialise could be rigged to convert e.g. "ÿ" to "\255". textutils.unserialise is already capable of reverting this. (Edit: This is probably obvious to you, but leading zeros are important for numbers less than 100!)
- textutils.urlEncode could be rigged to convert e.g. "ÿ" to "%FF".
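Both suggestions boil down to escaping the same range of chars. A rough sketch of the idea (escapeForSerialise/escapeForUrl are made-up names for illustration, not real textutils functions):

```lua
-- Hypothetical helpers, not the real textutils implementations.
-- Escape high/control chars as decimal escapes for serialise():
local function escapeForSerialise(s)
  return (s:gsub("[%z\1-\31\127-\255]", function(c)
    -- %03d keeps the leading zeros, so "\009" followed by a digit
    -- can't be misread as a different escape when unserialised
    return ("\\%03d"):format(c:byte())
  end))
end

-- Escape everything outside the unreserved URL chars as %XX hex:
local function escapeForUrl(s)
  return (s:gsub("[^%w%-%._~]", function(c)
    return ("%%%02X"):format(c:byte())
  end))
end

print(escapeForSerialise("\255")) --> \255
print(escapeForUrl("\255"))       --> %FF
```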
Which characters get lost when saving files? All 256 char values (with the possible exception of zero?) should be safe to save out.
Reading in is another matter. If the file contains UTF-8 encoded chars greater than 255, they're too big to fit in a Lua character, so they have to be converted if we don't want to switch to a multibyte encoding (which we don't).
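For illustration, the constraint works out to something like this (a hypothetical sketch, not CC's actual decoder):

```lua
-- A decoded code point above 255 has no single-byte Lua
-- representation, so it has to collapse to a substitute char.
local function toLuaChar(codePoint)
  if codePoint <= 255 then
    return string.char(codePoint)
  else
    return "?" -- no 1-byte representation exists
  end
end

print(toLuaChar(0xFF))   --> ÿ
print(toLuaChar(0x20AC)) --> ? (the euro sign doesn't fit)
```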
The short answer is "everything after char 127 fails to read or write".
Most affected values are converted to 0x3F (when read or written), and if multiple such values are present contiguously in a string, they may be merged into a single 0x3F (one such set starts at char 224 followed by whatever, though there may be others; my tests have been lazy). Some single chars are converted into multi-byte representations during writing (which I'm assuming are UTF-8, for who-knows-what reason).
Writing 0x00 works only sometimes; it may simply be omitted. Not entirely sure of the circumstances required to break it yet. It presumably has to do with the chars surrounding it.
Really what I'd expect is for every char to be written as if it were being filtered through string.byte() to a binary-mode file handle (or in Java terms, pass someString.getBytes() to a BufferedOutputStream made out of a FileOutputStream object), and for every char to be read as if it were being filtered through string.char() from a binary-mode file handle (get all the bytes from a BufferedInputStream made out of a FileInputStream and convert to chars one by one). I assume there's some benefit to allowing Java's text-handling classes to try and guess the encoding, but I'm not sure as to what that'd be, as they uniformly seem to cause only problems...? I'm not that familiar with how the Lua "strings" LuaJ is handling are actually being... handled.
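Binary-mode handles should already behave this way; a minimal sketch of the byte-exact round trip I mean, using CC's "wb"/"rb" modes (one byte per read/write call):

```lua
-- Byte-exact round trip via binary-mode handles.
local someString = string.char(65, 128, 255)

local out = fs.open("dump.bin", "wb")
for i = 1, #someString do
  out.write(someString:byte(i)) -- write each char's byte value
end
out.close()

local inp = fs.open("dump.bin", "rb")
local chars = {}
for byte in inp.read do -- read() returns nil at EOF, ending the loop
  chars[#chars + 1] = string.char(byte)
end
inp.close()

print(table.concat(chars) == someString) --> true
```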
On the other hand, I notice fsFileHandle.writeLine() is making use of \r now, so that's pretty cool. :)
If it helps, here's a script that reports on fs accuracy (results go to results.txt):
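(The gist of it, as a simplified sketch rather than the script itself: write each char value in text mode, then inspect the raw bytes with a binary handle.)

```lua
-- Simplified version of the test: write each char in text mode,
-- then compare the raw bytes that actually landed on disk.
local log = fs.open("results.txt", "w")
for i = 0, 255 do
  local out = fs.open("test.tmp", "w")
  out.write(string.char(i))
  out.close()

  local inp, bytes = fs.open("test.tmp", "rb"), {}
  for byte in inp.read do bytes[#bytes + 1] = byte end
  inp.close()

  if #bytes ~= 1 or bytes[1] ~= i then
    log.writeLine(("char %d -> %s"):format(i, table.concat(bytes, " ")))
  end
end
log.close()
```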
And here's the output under 1.76pr6:
Wait, your script is expecting that writing out unicode chars in the range 0-255 will result in those exact bytes in the range 0-255 being written to the file.
This just isn't correct. In text mode, strings are written to and read from files in UTF-8 format. The test you should really be doing is writing a string in text mode, then reading it back in text mode, and ensuring no characters are lost.
That test wouldn't be as thorough; if I write char 128 in text mode, what gets saved is char 63. So whether attempting to read that back in text mode is going to get me char 128 is a foregone conclusion.
By mixing modes, it's demonstrated that garbage produced by one (textmode out) isn't affecting the garbage produced by the other (textmode in).
I have no understanding as to why you'd want anything to do with UTF-8 when that's not the encoding system ComputerCraft's using. You're using one byte per character, and I'm still stuck in the mindset of "extended ASCII" editors, built for dealing with the likes of code page 437: which seems to me to be this sort of thing? Unless there IS a working converter somewhere at play, in which case I haven't been able to find it. Maybe it's platform dependent.
Can you provide an example as to how I "should" be using text mode file handles to read/write chars such as 0xFF? If I simply save "h£llø WÖ®lÐ" into a file (using UTF-8, Unicode, Unicode big-endian...), loading it in the "edit" script mostly gives me a bunch of question marks.
Writing out character 128 results in two bytes being written to the file: the utf-8 encoding for 128, which is two bytes: 0xC2 followed by 0x80. Reading that 2-byte file back in results in character 128 again. I've demonstrated that working in my above screenshot.
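That's easy to verify from Lua, assuming text mode behaves as described (a sketch using a binary handle to inspect the bytes):

```lua
-- Write char 128 in text mode, then inspect the raw bytes:
local out = fs.open("utf8.tmp", "w")
out.write(string.char(128))
out.close()

local inp = fs.open("utf8.tmp", "rb")
print(("%02X"):format(inp.read())) --> C2 (on a correctly-behaving system)
print(("%02X"):format(inp.read())) --> 80
inp.close()
```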
For me, writing "h£llø WÖ®lÐ" into a file with edit, then saving and reopening it also in edit, results in no data loss. Does it for you?
Just wrote the same string in a new Sublime Text file, saved it with encoding UTF-8, and opened it in CC: all chars were present and correct.
Also confirmed UTF-8 is the default for Sublime Text when just pressing "save", which I presume is common for most text editors (don't know about Notepad).
As my above script demonstrates, for me, writing out character 128 in text mode results in one byte within the output file, that being 0x3F (a question mark). I'm getting the impression that if you ran it you'd get an entirely different output file, and that this is all platform dependent.
I can't load the string "h£llø WÖ®lÐ" into edit (it gets mangled into question marks seemingly regardless of the encoding the file was generated with), but I can paste it. Saving it from edit results in "h�’ll? W??l?" going to the output file (0x 68 81 92 6C 6C 3F 20 57 3F 3F 6C 3F 0A).
I'll explain some reasoning here:
- UTF-8 is used as the default encoding for text exchange, for the same reason it's the default encoding used by Java's text reader/writer itself: it's the most commonly used text encoding that supports all Unicode characters, and it's backwards compatible with ASCII, meaning CC will be able to produce files readable by most other text editors.
- ISO/IEC 8859-1 is used as the internal text representation, because Lua chars are only 1 byte wide and a variable-width encoding would break a lot of user code; this encoding fits in 1 byte/char, represents a lot of useful characters, and is easy to convert to Unicode (see the sketch below).
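To make the relationship concrete, here's a sketch of how the internal 1-byte chars map onto the UTF-8 bytes on disk (illustration only, not CC's actual code):

```lua
-- ISO 8859-1 code points are 0-255, so each char encodes to one
-- UTF-8 byte (0-127) or two (128-255).
local function latin1ToUtf8(s)
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
  end))
end

-- "ÿ" (0xFF) becomes the two bytes 0xC3 0xBF on disk:
print(latin1ToUtf8("\255"):byte(1, -1)) --> 195 191
```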
So I just ran your script and got no output... maybe there is a platform issue here.
I think previously LuaJ would try to use UTF-8 for the internal encoding as well, which none of the Lua code was set up to expect; that's why we had stuff like the window.blit() crash as soon as you entered a non-ASCII char.
Thanks for the explanation. I'm curious as to how long this intended conversion process between the two sets has been in ComputerCraft for? Is it new for 1.76?
Ok, so it seems like there is actually a bug here: my results.txt is different from yours:
http://pastebin.com/Mx4fZ5rX
I'd expect the failures seen in my file given the incorrect assumptions your program makes, but I wouldn't expect those seen in yours. Which platform are you on?
Just dropping in with my results.txt, which differs from both of yours.
http://pastebin.com/F6JCeR1v
System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577
Indeed Dan, most (likely all) of the "errors" in your results have to do with me having no idea ComputerCraft was supposed to be performing certain textmode conversions, so they mean exactly what you thought my results meant: everything's probably ok, for you.
Thanks Wojbie, more results from other users should be useful. Your results are similar to mine, but yeah, there are some obvious differences in there.
Windows 8.1 x64
Forge 1.8, 11.14.4.1577
Java x64, 1.8.0_66 (just now updated from _40, same results).
I'm afraid it's 2:30am here at present, so I may have to leave off for the night. Thanks for looking into it.
Ok, I've written a better test program that should uncover actual bugs only:
http://pastebin.com/9QJKi0kp
Could each of you run it and report your results? On my system (Mac), it only shows 2 errors: one for each of the two newline characters.
I ran it on Windows 10, with 64-bit JDK 8u51. I got 34 mismatches. Here is the output: http://pastebin.com/KVYuNAN8
Interesting, it seems like the C1 control characters (the ones we map to the teletext glyphs in CC) are getting turned into question marks by something... I wonder if that's the output or the input.
Spoke too soon, sorry; forgot to test http. A paste containing h£llø WÖ®lÐ seems to read in ok:
But if I attempt to "put" it back, I end up with:
Results from dan200's program:
http://pastebin.com/DZCWzKtD
It seems I got 81 mismatches.
System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577
Sorry to be annoying, I wrote another version with better output; can you run this instead:
http://pastebin.com/wXw9m2WG
It seems like the bug is in my assumptions: I assumed that OutputStreamWriter/Reader always used UTF-8 by default, as it does in C#, and as it seems to be doing on my Mac, when in actuality it's platform specific. I'm preparing a new version now which explicitly specifies UTF-8 for all text I/O through fs and http.
Used the new program you wrote. Results here:
http://pastebin.com/z0PXP95d
I still got 81 mismatches.
System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577
Thanks. Looks like whatever encoding your output is using converts everything 128-255 to '?'! Prepping pr7 now, let's see if it helps.
Alright, could both of you run the test program again with this new beta:
http://minecraft.curseforge.com/projects/computercraft/files/2271506/download
(It might take a few minutes for the curseforge link to activate)
Only 2 mismatches with the new version. Output: http://pastebin.com/TuMmLUdh
Woot woot, Wojbie?
My results with PR6 (34 mismatches):
http://pastebin.com/Ft8gxWtf
Results with PR7 (2 mismatches):
http://pastebin.com/vvfUk0ps
System (exact same as Wojbie's):
Windows 7 Home Premium 64bit
Java 8 64bit 1.8.0_25 64bit
Forge1.8-11.14.4.1577
My results with PR6 (81 mismatches)
http://pastebin.com/z0PXP95d
My results with PR7 (2 mismatches)
http://pastebin.com/HfGNWjAw
System Data:
Windows 7 Home Premium 64bit
Java 8 64bit (1.8.0_25 64bit according to f3)
Forge1.8-11.14.4.1577
Woot woot, seems like we fixed an actual bug here :)
The 2 mismatches that remain are \r and \n, i.e.: if a file ends with a line ending, it gets cut off by file.readAll(). I don't mind this behaviour, because it means you can do print(file.readAll()) or otherfile.writeLine(file.readAll()) without getting a redundant newline added. Marking as closed. Thanks for everyone's help.
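For anyone checking the behaviour later, a quick sketch of what those two remaining mismatches amount to:

```lua
-- A trailing line ending is dropped by readAll(), so
-- writeLine()/readAll() round-trips without growing newlines.
local out = fs.open("nl.tmp", "w")
out.writeLine("hello") -- file ends with a line ending
out.close()

local inp = fs.open("nl.tmp", "r")
print(inp.readAll() == "hello") --> true: no trailing newline
inp.close()
```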
On reconsideration: I think the reason everyone was getting different numbers of errors was down to their Windows language settings: the default 8-bit codepage varies by country, so different chars from the ISO/IEC 8859-1 codepage couldn't be represented and had to be converted to ?. With UTF-8, every char can be represented.
pr8 is out now
Forgot to test before, but if I attempt to re-"get" the latter paste, the symbols that should be £ and ® end up as 0x3F (question marks).
Here's another tester script, for http posting:
It first generates a list of all chars and dumps them to charlist.txt.
It then downloads a copy of that list (which I uploaded using my browser) to charlist2.txt, and compares the files. I find this works fine for me, except for char 0, which downloads as a space (32); this is probably pastebin.com converting the char at the time of upload (if the raw paste data visible on the site is any indication).
Next it uploads charlist.txt to pastebin itself (you'll be prompted to check the site for CAPTCHA, as I started running into them), then re-downloads it as charlist3.txt. The files are then compared again - I get about 70 mismatches here; the uploaded file has a lot of question marks in it.
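The shape of the upload/re-download step, as a sketch (not the actual script; "MY_DEV_KEY" is a placeholder for a real pastebin API key, and error handling is omitted):

```lua
-- Build the full 256-char string:
local list = {}
for i = 0, 255 do list[i + 1] = string.char(i) end
local chars = table.concat(list)

-- Upload via pastebin's paste API ("MY_DEV_KEY" is a placeholder):
local response = http.post(
  "http://pastebin.com/api/api_post.php",
  "api_option=paste&api_dev_key=MY_DEV_KEY" ..
  "&api_paste_code=" .. textutils.urlEncode(chars)
)
local pasteUrl = response.readAll() -- e.g. http://pastebin.com/AbCdEfGh
response.close()

-- Re-download the raw paste and compare:
local raw = http.get("http://pastebin.com/raw/" .. pasteUrl:match("([^/]+)$"))
print(raw.readAll() == chars)
raw.close()
```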
The script output gets mirrored in results.txt:
Thanks for the extra testing. I've now fixed these HTTP issues: they were caused by textutils.urlEncode() not properly encoding non-ASCII characters. It does now. I've also ensured the proper headers are set and read to get UTF-8 results from URLs (though you can modify these with the custom headers table if you wish).
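Presumably that means a char like ÿ is now percent-encoded as its UTF-8 bytes rather than as a bare %FF; a sketch of the idea (not the actual implementation):

```lua
-- Percent-encode the UTF-8 bytes of each non-ASCII char,
-- so 0xFF becomes %C3%BF rather than %FF.
local function encodeChar(c)
  local b = c:byte()
  if b < 128 then
    return ("%%%02X"):format(b)
  else
    return ("%%%02X%%%02X"):format(0xC0 + math.floor(b / 64), 0x80 + b % 64)
  end
end

print(encodeChar("\255")) --> %C3%BF
```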
I've also modified string.format("%q") (used by textutils.serialize) to escape all non-printable characters when producing strings.
These changes will be in pr8.