CC: Tweaked


How to provide Unicode character set support for CCT?

SilverSCode opened this issue · 7 comments

commented

When will I be able to use CCT to say hello to my friends in a way they can understand?
I want to use CCT to display some text on screen, but obviously not everyone knows English.

I know the character width problem is hard to solve, but we shouldn't give up, right?

commented

Just as a bit of background, worth reading through the discussion on dan200/ComputerCraft#531 and dan200/ComputerCraft#532 first (maybe also #435, but IMO less useful). Some of the points raised aren't relevant (this was three years ago!), but many are.

The basic gist is that the issue isn't to do with character width (or Unicode's many quirks) - though those will need to be solved in due course - but API design. I'm going to largely focus on drawing to the screen here, as file IO is much easier to work around:


CC 1.76 (??) switched ComputerCraft to use ISO-8859-1 as the character set when drawing to the screen (previously it was just ASCII with quirks). This is incompatible with UTF-8, which means we need a way to handle writing both character sets.
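
To make that concrete: ü is the single byte 0xFC in ISO-8859-1, but the two bytes 0xC3 0xBC in UTF-8. A quick plain-Lua illustration (nothing CC-specific here):

    local latin1 = string.char(0xFC)        -- "ü" as ISO-8859-1: one byte
    local utf8ue = string.char(0xC3, 0xBC)  -- "ü" as UTF-8: two bytes
    print(#latin1, #utf8ue)                 --> 1   2
    -- Fed to today's terminal, the UTF-8 bytes would draw as the two
    -- ISO-8859-1 characters "Ã¼" rather than a single ü.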

I can think of two options here:

  • Separate versions of all the methods! So effectively add term.writeUtf8 and term.blitUtf8, which are utf8 variants of the existing methods. Everything gets normalised to UTF-8 internally, but the surface API stays the same for existing programs.

    Obvious problem here, which dan200/ComputerCraft#532 hit, is that you then need UTF8 versions of a bunch more stuff - at least print, write and printError. And, well, that's obviously gross and painful to use!

  • The other option is for terminals to have modes, so you'd do term.setUtf8(true) (would also accept term.setMode) and then all further calls to term.write and friends should use UTF-8 encoded strings.

    This is definitely better in some ways, though it now means APIs need to be aware of what kind of terminal they're printing to, which is Not Ideal.
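
To picture the difference, here's a rough sketch of what each option would look like to a program (term.writeUtf8 and term.setUtf8 are the hypothetical names from the bullets above, not existing APIs):

    -- Option 1: parallel UTF-8 variants of each method
    term.write("caf\xE9")          -- ISO-8859-1 bytes, as today
    term.writeUtf8("caf\xC3\xA9")  -- the same text, UTF-8-encoded

    -- Option 2: a per-terminal mode switch
    term.setUtf8(true)             -- all further writes take UTF-8
    term.write("caf\xC3\xA9")
    term.setUtf8(false)            -- back to ISO-8859-1 for old code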


Basically the TLDR is "In principle yes, but I've no clue how it'd work". Ideas very much welcome!

commented

Another issue is how we handle Unicode strings. Lua uses strict 8-bit strings, which is alright for ASCII/ANSI text, but things get a bit murky when dealing with larger Unicode codepoints. Many other languages use wider strings, like JavaScript which uses UTF-16/UCS-2. This keeps operations on strings the same, but would require a language-breaking change.

The preferred way to do this in Lua 5.3+ is to store the UTF-8 byte representation in a string and use the utf8 library, which was added in cc-tweaked/Cobalt#29, but this brings its own set of problems. UTF-8 string lengths have to be computed with the utf8 library specifically, so #str will no longer cut it (which pretty much all code today relies on); utf8.len is needed instead. In addition, string.char and string.byte need to be replaced with utf8.char and utf8.codepoint. Basically, all string operations become non-trivial once outside the ASCII range.
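
For example, with the stock Lua 5.3 utf8 library:

    local s = "h\xC3\xA9llo"            -- "héllo", UTF-8 encoded
    print(#s)                           --> 6 (bytes, not characters)
    print(utf8.len(s))                  --> 5 (codepoints)
    print(string.byte(s, 2))            --> 195 (0xC3, half of the é sequence)
    print(utf8.codepoint(s, 2))         --> 233 (U+00E9, the real codepoint)
    print(utf8.char(0x4F60, 0x597D))    --> 你好 (six bytes of UTF-8)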

Perhaps one way to solve this issue is to add a special UTFString type of sorts, that's a special table (or userdata?) that has string-like properties and can be used with the standard term functions. This type would hold the UTF-8 string inside the table, and would have the same methods as the string library but adjusted for UTF-8 operations. Operations such as __len and __concat can be overridden to function like a string as well. The term API would be able to accept these strings in the normal functions, and would use the Unicode content instead of trying to convert to a string. Finally, to make sure functions that explicitly require a string argument work fine, the __tostring metamethod would return the string in the normal ANSI representation, replacing any codepoints outside the range with ? (as is done in the fs library). (Speaking of the fs library, we'd likely want to add a flag to return either a normal string or a UTFString.)
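
A minimal sketch of what that wrapper could look like (hypothetical code, assuming the utf8 library is available and that __len on tables is honoured, i.e. Lua 5.2+ semantics):

    local UTFString = {}
    UTFString.__index = UTFString

    local function new(s)  -- s: a UTF-8 encoded Lua string
        return setmetatable({ bytes = s }, UTFString)
    end

    function UTFString.__len(self)  -- #ustr counts codepoints, not bytes
        return utf8.len(self.bytes)
    end

    function UTFString.__concat(a, b)
        local sa = type(a) == "table" and a.bytes or a
        local sb = type(b) == "table" and b.bytes or b
        return new(sa .. sb)
    end

    function UTFString.__tostring(self)
        -- Downgrade to ISO-8859-1, replacing anything above U+00FF with "?"
        local out = {}
        for _, cp in utf8.codes(self.bytes) do
            out[#out + 1] = cp <= 0xFF and string.char(cp) or "?"
        end
        return table.concat(out)
    end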


Update: I've written a basic implementation of this UTFString type: https://gist.github.com/MCJack123/0b0c3e2656da8adb043096a3da306d69 One issue with this library is that it is unable to compare UTFStrings with normal strings due to Lua checking types first. This could be fixed by modifying the Lua runtime, but this may not be the best choice, unless UTFString was integrated into the runtime and the behavior was only overridden when the value is a UTFString. Also, pattern matching will have to be rewritten to accept multibyte character classes, but I'm not about to write an entire pattern matcher for a demo.
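
The comparison limitation is inherent to stock Lua: __eq is only consulted when both operands are tables (or both userdata), so, continuing the sketch above:

    local a = new("h\xC3\xA9llo")
    print(a == "h\xC3\xA9llo")  --> false: table vs string, so Lua never
                                --  even looks at an __eq metamethod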

commented


I also found your discussion in dan200/ComputerCraft#532.

commented

I know implementing Unicode takes a lot of effort and hard choices, but Minecraft is popular all over the world, and many people who play CCT want to print messages on screen in their native language.

I'm a player who doesn't know English and can only use translation software to express what I want to say, so I apologize if I'm hard to read.

commented

I've started experimenting with Unicode support on a branch of CraftOS-PC. It is extremely incomplete, but it serves as a PoC of Unicode in CC.

This branch is a test for the possibility of Unicode support in future
versions of ComputerCraft.

Unicode strings are stored in a new UTFString type. This type is a
wrapper around a C++ UTF-32 string, with the same methods
as the base string library. UTFStrings can be created by calling
UTFString with a string, plus an optional second argument to
specify whether the string should be interpreted as raw UTF-8 data.

A UTFString is mostly compatible with normal strings, and a VM
extension allows them to be compared with strings. They also
have a version of the string library available, but note that they are
not compatible with string itself. You will need to replace direct
string calls with method calls, e.g. string.sub(str, i) -> str:sub(i).

To support showing Unicode characters on-screen, the terminal has
been adjusted to use UTF-32 characters, and characters outside the
normal CC-ANSI range will be rendered with a (bundled) monospace Unicode
font. (This may be changed in the future to use system or other fonts.)
Unicode input is accepted as well by sending a UTFString as the second
parameter to the char event.

The term and fs APIs have been updated to support UTFStrings.
Passing a UTFString to term is the only way to write Unicode text to
the screen - normal strings keep the same behavior.
If you pass a UTFString to fs functions, the path will be converted to
UTF-8 when accessing the filesystem. There is also a new mode modifier,
'u', that will cause read methods to return a UTFString instead of a
normal string. This allows reading Unicode text safely. 'u' is mutually
exclusive with 'b', and is redundant with write modes because normal
text-write file handles accept both types of strings.

On the Lua side, some functions have been updated so far. print,
cc.expect, and cc.pretty support them pretty well, and edit has
been slightly adjusted but it doesn't work very well right now.
Otherwise, it remains pretty untested.
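
If I'm reading that description right, usage on the branch would look roughly like this (untested, pieced together from the notes above; the file name and combined mode string are made up):

    -- "你好" passed in as raw UTF-8 bytes
    local s = UTFString("\xE4\xBD\xA0\xE5\xA5\xBD", true)
    term.write(s)        -- UTFString: drawn with the Unicode font
    term.write("hello")  -- plain string: behaves exactly as today

    print(s:sub(1, 1))   -- method call, not string.sub(s, 1)

    -- 'u' modifier: read methods return UTFStrings instead of strings
    local h = fs.open("motd.txt", "ru")
    local line = h.readLine()
    h.close()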

Unicode text demo

commented

Yet another person upset that the computer can't print CJK characters has found this issue.

IMO, if backward compatibility is the top concern, having a separate set of functions for Unicode text input/output would be preferable (if backward compatibility didn't matter, I'd rather change all text IO functions to accept Unicode strings only). Existing scripts would keep working, and new scripts that need Unicode handling would simply require a module providing the new functions - utf8.* or something similar should be good to go, as mentioned in dan200/ComputerCraft#532.
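
As a strawman, such a module could even start out as pure Lua on top of today's API (utf8term is a made-up name; assumes the utf8 library is available, and is only a stopgap until the terminal itself understands UTF-8):

    local utf8term = {}

    -- Accept UTF-8 input, draw what the ISO-8859-1 terminal can show,
    -- and substitute "?" for everything else.
    function utf8term.write(s)
        local out = {}
        for _, cp in utf8.codes(s) do
            out[#out + 1] = cp <= 0xFF and string.char(cp) or "?"
        end
        term.write(table.concat(out))
    end

    return utf8term

Old scripts never load it and keep today's behaviour.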

Also, I found that any codepoint larger than 255 in a Java String is wiped out into ? when it is converted into a LuaString (in the Cobalt code). So a Java-side function is also needed for addon devs to convert Unicode-containing strings into a byte representation before they reach the Lua environment.

EDIT: it seems the terminal still needs to know whether it is supposed to read/write a string as ISO-8859-1 or UTF-8, so a 'utf8 mode' switch should exist for that purpose. Unicode glyph rendering also concerns me badly: CC currently uses 6x9 pixels per glyph, which is wider than the half-width of a 9x9 full-width glyph (4.5x9) and shorter than the half-width of a 12x12 full-width glyph (6x12). If backward compatibility is retained for the glyph size too, then full-width characters have to be 12x9 - twice the current cell - which would make them look slightly wider than normal.