CJKV Information Processing, 2nd Edition — Hacker News Books

jll29 · 2022-08-20 · Original thread

Regardless of official terminology, there are two levels:

1. Map a character to a unique number in a character set (in Unicode, this number is called a code point).

2. Map the number that represents a character in a character set to a bit pattern for storage (transient or persistent, internal or external). Unicode code points can be bit-encoded in various ways: UTF-8, UTF-16 and UTF-32 (also called UCS-4); the older UCS-2 only covers code points that fit in 16 bits.
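To make the two levels concrete, here is a minimal Python sketch (my own illustration, not from the book). It takes one character, looks up its code point (level 1), then serializes that same code point with three encodings (level 2). The variable names are arbitrary, and the big-endian codec variants are chosen only so that no byte-order mark is prepended.

```python
# Level 1: character -> code point (a number in the character set).
ch = "\u00e4"            # "ä", written as an escape to avoid source-encoding issues
code_point = ord(ch)     # 0xE4

# Level 2: code point -> bytes, which depends on the chosen encoding.
# The "-be" (big-endian) codecs are used so Python adds no byte-order mark.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print(f"U+{code_point:04X}")
for name, raw in encodings.items():
    print(f"{name:10} {raw.hex(' ')}")
```

Same code point, three different byte sequences of 2, 2 and 4 bytes respectively.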

Unicode code points permit the same character to be represented in more than one way, which makes equality checks non-trivial: for instance, a character like "ä" can be represented as a single precomposed code point (U+00E4) or alternatively as "a" followed by a combining diaeresis (U+0308), i.e. two code points. Comparing such strings reliably means normalizing both to the same form first.
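A short sketch of the equality problem, using Python's standard `unicodedata` module. The variable names are mine, and NFC is just one reasonable choice of normalization form here; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

precomposed = "\u00e4"    # "ä" as one code point
decomposed = "a\u0308"    # "a" + combining diaeresis, two code points

# A naive comparison sees two different code point sequences.
same_raw = precomposed == decomposed

# Normalizing both sides to the same form (NFC here) makes them comparable.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed))  # lengths differ: 1 vs 2
print(same_raw, same_nfc)
```

The raw comparison fails even though both strings render identically; the normalized one succeeds.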

So far, this is all about plain text, so we are not talking about font families, character properties (bold, italics, underlined) or vertical positioning (superscript, subscript).

Ken Lunde's opus magnum is the standard book on representing text in various languages other than English, with a focus on Asian languages: https://www.oreilly.com/library/view/cjkv-information-proces...
