Encoding problem in non-ascii characters
LadyCailinBot opened this issue · 9 comments
CMDHELPER-3171 - Reported by bexco2010
If we (non-ASCII users) write non-ASCII characters (Hangul, i.e. Korean, in my case), CommandHelper won't show the correct letters from a UTF-8 encoded file. This means that if code is transferred to another computer that uses a different encoding (EUC-KR to EUC-JP, for example), code with non-ASCII characters will show broken letters when executed.
So CommandHelper needs to set the default encoding to UTF-8 for non-ASCII users.
I attached a photo showing that I tried to message "안녕" to a player from a UTF-8 encoded MethodScript file.
Also, if I try to send a non-ASCII message to the console (with the tmsg() or console() function), it shows a broken message.
Please check console output with non-ASCII characters, too.
Thanks.
Comment by LadyCailin
I don't think this is a problem with MethodScript in general. I have supported UTF-8 in source code for many years now. I also double checked, and the console method uses the appropriate print mechanism to support printing it correctly on console.
However, I did find this online: https://www.spigotmc.org/threads/utf-8-support.100607/ which seems to suggest to me that perhaps the server itself does not support UTF-8 by default? Anyways, I think I need more information in order to look into this further, or perhaps @PseudoKnight could try this on his server, and see if he can replicate it. It appears anyways that this is a Minecraft related problem, not a MethodScript problem, per se. Here are some other higher order characters that can be tested with: æøå
Comment by bexco2010
@LadyCailin The console output is not a Spigot server problem. I tried non-ASCII characters in interpreter mode, but that isn't working either.
Comment by bexco2010
@LadyCailin The file encoding problem is fixed with the -Dfile.encoding=UTF-8 parameter, but I think CommandHelper should have an encoding option in the file options, so we can define the encoding type in the file.
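For readers wondering why the -Dfile.encoding=UTF-8 flag matters: on older JVMs, code that decodes bytes without naming a charset falls back to the platform default, which on Windows is often not UTF-8. The following is a minimal sketch of the difference, not CommandHelper's actual file-loading code; the class name ExplicitCharsetRead, the helper decode, and the temporary file are all made up for illustration. The Hangul text is written with Unicode escapes so the source compiles the same way under any compiler encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {
    // Decode raw file bytes with an explicitly chosen charset,
    // so the result does not depend on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) throws IOException {
        // "\uC548\uB155" is the Hangul greeting from the report.
        String script = "msg('\uC548\uB155')";

        // Hypothetical script file, written as UTF-8 bytes.
        Path file = Files.createTempFile("utf-test", ".ms");
        Files.write(file, script.getBytes(StandardCharsets.UTF_8));
        byte[] raw = Files.readAllBytes(file);

        // What the JVM would use if no charset were named.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Decoding with an explicit UTF-8 charset always round-trips.
        System.out.println("explicit UTF-8 round trip: "
                + decode(raw, StandardCharsets.UTF_8).equals(script));

        // Decoding with the default charset only round-trips when that
        // default can represent the bytes the same way UTF-8 does.
        System.out.println("default charset round trip: "
                + decode(raw, Charset.defaultCharset()).equals(script));

        Files.delete(file);
    }
}
```

Running this once with and once without -Dfile.encoding=UTF-8 on the affected machine should show whether the default charset is what breaks the script text.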
Comment by LadyCailin
Hmm, ok. Well, it works for me on Mac. What OS are you on? Looks like Windows? In general, I intend on supporting UTF-8 by default, I can look into supporting other character encodings as well, but anyways for your use case, MethodScript should just work out of the box. If you are in fact on Windows, I can take a look at that tomorrow when I have access to a Windows machine.
Comment by bexco2010
@LadyCailin Windows 10.
Comment by PseudoKnight
I can confirm this issue locally on my Windows 7 computer. Though Windows console and PowerShell can't display "안녕" at all on my machine, "æøå" is displayed as entirely different characters after being passed through console() in jar interpreter mode. However, the text logs are correct. This means it's only incorrect in the Windows console itself. But if I use a different method of logging to console, like just typing it in chat, then it displays correctly in console. So Windows console and PowerShell just aren't working with whatever method CH is using.
Comment by PseudoKnight
As an aside: I wonder why I can't display Hangul in my console. Even installing the Korean language pack and trying all the available fonts didn't work.
Comment by LadyCailin
So, I can confirm that this is only a problem in Windows, at least in the interpreter. It even works correctly when I use the Windows Subsystem for Linux, but not when using PowerShell or cmd.exe. If I run chcp 949, I can at least get type utf-test.txt (a file that has 안녕 in it) to display correctly, but when I try to copy paste, I get ?? in the console. Interestingly, I have a Norwegian keyboard, and typing ø doesn't work correctly, it just types an o character, and when typing it in interpreter, I get ?. In general, MethodScript globally supports UTF-8 by wrapping System.out in a UTF-8 compatible PrintStream, and this is what is needed to make it work in at least some cases. So, for test purposes, I added a println using both System.out, and my StreamUtils print stream, and I get the following:
(The second command is me copy pasting "안녕" in)
Interestingly, the default charset is different for WSL and cmd.exe:
(The window on the left is WSL, and the one on the right is default cmd.exe.)
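The comparison described above (printing through plain System.out and through a UTF-8 PrintStream, and checking the default charset in each terminal) can be approximated with a small standalone probe. This is only a sketch of the technique, not the StreamUtils code from CommandHelper; the class ConsoleCharsetProbe and its helpers utf8 and bytesWritten are hypothetical names. The sample text uses Unicode escapes for "æøå" and "안녕" so the source itself is pure ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ConsoleCharsetProbe {
    // Wrap any output stream in a PrintStream that always encodes as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Capture the exact bytes the UTF-8 stream emits for a piece of text.
    static byte[] bytesWritten(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = utf8(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "\u00E6\u00F8\u00E5 \uC548\uB155";

        // Differs between terminals and JVM versions.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Encoded with whatever charset System.out was created with.
        System.out.println(sample);

        // Always encoded as UTF-8; whether it renders correctly still
        // depends on the terminal's code page and font.
        utf8(System.out).println(sample);
    }
}
```

Note that a UTF-8 PrintStream only fixes the bytes leaving the JVM. The terminal must also interpret those bytes as UTF-8 and have a font with the glyphs, which is a separate matter from what the JVM writes.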
This suggests that there are two problems. First, the font is unable to show the characters correctly in cmd.exe. Even typing "æøå" results in "æ?å" without even bringing CH into it. Secondly, using any code page, whether the default (in this case windows-1252), explicitly setting it to UTF-8, or even setting it to Cp850 (just to test), doesn't make it display even "æ?å", as one would expect based on what it looks like when typing it in. Cp850 is closest, but it displays a cent symbol rather than the expected ?. I cycled through all the available fonts on my system, and I wasn't able to properly display ø. However, if I run chcp 850, then it displays ø correctly (both in the input and the output), but it does not display the Hangul correctly.
All this to say, I'm still at a loss as to how to make this work in Windows, but I can at least confirm that it is a problem unique to Windows. I will have to continue my investigations later.
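As background for the symptoms reported in this thread, here is a hedged illustration of two generic ways a charset mismatch garbles text in Java. It is not a diagnosis of what cmd.exe is doing here. ISO-8859-1 is used as a stand-in because every JVM ships it; for the bytes involved in "æøå" it agrees with windows-1252. The class MojibakeDemo and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // UTF-8 bytes read back as a single-byte charset: each multi-byte
    // character turns into several unrelated characters.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Text encoded into a charset that cannot represent it: each
    // unmappable character is replaced, here with '?'.
    static String lossyEncode(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nordic = "\u00E6\u00F8\u00E5"; // the three test letters
        String hangul = "\uC548\uB155";       // the Hangul greeting

        // Three letters become six characters when misdecoded.
        System.out.println(misdecode(nordic).length());

        // The Nordic letters exist in ISO-8859-1, so they survive.
        System.out.println(lossyEncode(nordic).equals(nordic));

        // Hangul does not exist in ISO-8859-1, so it degrades to "??".
        System.out.println(lossyEncode(hangul));
    }
}
```

The first pattern produces "entirely different characters", and the second produces question marks. Both kinds of output were seen above, so more than one mismatch may be involved.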