February 09, 2021

Falsehoods programmers believe about plain text

Yes, the title is hyperbolic. It's also standard, following a pattern set by many examples. A less formulaic and clickbaity explanation: this is a list of assumptions that programmers have sometimes relied on in their code. The assumptions allowed the code to be simpler and more efficient, right up until the code broke upon encountering a counter-example. Bugs in real systems have been caused by all of these assumptions.

All of these assumptions are wrong.

    Non-technical

  1. The Latin alphabet has 26 letters.

    Except they all have an uppercase form and a lowercase form, so there are really 52 different letters.

  2. Ignoring case, the Latin alphabet has 26 letters.

    What about á, ç, è, ǐ, ñ, ô, ü, etc.?

  3. Ignoring case and accents, the Latin alphabet has 26 letters.

    What about ø, ŋ, ł, ı, ħ, đ, ð, etc.?

  4. Yes, but ignoring variants, the Latin alphabet has 26 base letters.

    What about ligatures such as æ and œ, and digraphs such as ch, ng, and th, which are treated as base letters in some languages?

  5. Seriously, ignoring variants and combinations, the Latin alphabet has 26 base letters.

    English and several other languages used to have the letter thorn (Þ, þ). Although it has been replaced by the digraph th in most languages, modern Icelandic still uses it. Technically, this letter isn't from the Latin alphabet because it was borrowed from a runic alphabet. But the Icelandic alphabet is still a Latin alphabet, and contains this letter.

  6. Modern English doesn't use any of those except case.

    English has borrowed lots of words from languages which do use those, especially French, and hasn't always dropped the foreign bits. Borrowed words such as "exposé" and "résumé" would be confused with native words if the accents were removed. And let's not forget that proper names are commonly spelled with their native accents and ligatures intact.

  7. Ignoring foreign names and borrowings, Modern English doesn't use any of those except case.

    English used to use a diaeresis to mark a vowel pronounced separately from its neighbor, as in "Noël", "coöperate", and "naïve". Also, very rarely, a grave accent was used in English to mark a vowel as being non-silent, as in the adjective "learnèd". In some scattered places these traditions are preserved, despite the lack of a diaeresis key or a grave accent key on standard US keyboards.

  8. There's no such thing as "the English/French/German/Spanish/etc. alphabet": they use the Latin alphabet.

    Except they're all different from each other.

  9. There's no such thing as "the English alphabet": English uses the Latin alphabet.

    Technically, the modern English alphabet is identical to the basic Latin alphabet.

  10. Accented letters are never used as distinct letters in an alphabet.

    Counterexamples: Ñ in the Spanish alphabet, five letters in the Romanian alphabet.

  11. Ligatures are never used as distinct letters in an alphabet.

    Counterexample: Æ in the Danish/Norwegian alphabet.

  12. Digraphs are never used as distinct letters in an alphabet.

    Counterexample: seven letters in the Welsh alphabet.

  13. Trigraphs are never used as distinct letters in an alphabet.

    Counterexample: dzs in the Hungarian alphabet.

  14. Letter variants used in an alphabet always immediately follow their base letter in the alphabetic order.

    Counterexample: the Swedish alphabet, where all the letter variants come at the end of the alphabet.

  15. Every alphabet derived from the Latin alphabet puts the base letters in the same order.

    Counterexamples: in the Estonian alphabet Z is between S and T, and in the Hawaiian alphabet all the vowels come first, then the consonants.

  16. The alphabet of each language is fixed and unchanging.

    Counterexample: capital ẞ was declared a new official letter of the German alphabet in 2017. There's still some ambivalence about whether accented characters and ligatures are truly part of the German alphabet or not.

    There's often a historical progression from digraph to ligature to base letter, or from digraph to accented letter to base letter. The only remaining sign that W used to be a ligature is its name.

    Several European languages, including English, used to have a "Long s" derived from Roman cursive writing: s in the middle of a word was joined with the following letter, while s at the end of a word wasn't, so they looked different enough to develop into two different letters. (The usage rules were more complicated than that, but that's the essential origin.) In English the mid-word variant was eventually replaced by the word-final variant. In German the mid-word variant merged with s to form the ligature ß. In a parallel development in Greek, the letter σ is written ς at the end of a word. In all three cases, the corresponding upper-case letter was written the same way no matter where it was in a word.

  17. Latin alphabets are used to write every language.

    Also known as "Everyone both knows the Latin alphabet and knows how to Romanize their own language."

    Some other widely used alphabets are Cyrillic, Hangul, Armenian, Greek, and Georgian.

  18. All writing systems are alphabets.

    See Wikipedia's list of writing systems by number of users: the top ten contain only three alphabets and cover five major types of writing system.

  19. All writing systems have at most a few dozen characters.

    Syllabaries are a type of writing system where every syllable has its own symbol, instead of every phoneme. They typically have 50 to 500 unique symbols.

  20. All writing systems have at most a few hundred characters.
  21. All writing systems have a fixed inventory of characters.

    The Chinese, the Ancient Egyptians, and the Maya, among others, use or used a different character for each word and affix. These logographic writing systems are open: new characters are continuously being invented to write new words. Full literacy is generally thought to require knowledge of 3,000 to 4,000 characters, but, as in English, comprehensive dictionaries may contain tens of thousands of entries, most of which have their own character or characters. Characters also regularly fall out of use in logographic writing systems.

  22. One language won't be written using two or more writing systems.

    Counterexamples:

    • Serbian, Azeri, and several other languages are written in either Cyrillic or Latin.
    • Uzbek is written in Cyrillic, Latin, or Arabic script.
    • Hindustani is usually written using Devanagari in India and using Arabic script in Pakistan. Programmers aren't the only ones confused by this: some western sources refer to Hindi (Hindustani written with Devanagari) and Urdu (Hindustani written with Arabic) as separate languages, even though the spoken forms are mutually intelligible.
  23. Mutually unintelligible spoken languages can't use a mutually intelligible writing system.

    The same sources which call Hindi and Urdu different languages also tend to call "Chinese" one language, because Mandarin, Wu, Min, Yue (a.k.a. Cantonese), etc. are all written with the same writing system. It's like lumping all the Romance languages together as "Latin". To be fair: the written forms are mutually intelligible, mostly, even though the spoken languages are as different from each other as the Romance languages are. This is the big advantage of logographic writing systems. (If you're having trouble understanding this, mathematical notation is technically a logographic writing system: "1 + 1 = 2" means the same thing no matter which language you pronounce it in.)

  24. Authors won't need to quote text using other writing systems.

    Counter-examples: language teaching materials, multilingual manuals, annotated foreign literature, scholarly articles about writing systems, languages, history, or archaeology, and mathematics texts (mathematical notation is technically a different writing system), not to mention articles about the history of mathematics.

  25. One mono-lingual text won't contain multiple writing systems.

    Japanese mixes four: characters borrowed from Chinese (kanji) for many word roots, a syllabary (hiragana) for native Japanese words, affixes, and grammatical particles, a second syllabary (katakana) for onomatopoeia and foreign borrowings, and Latin characters for acronyms and foreign names.

  26. Characters go from left to right in horizontal lines and lines go from top to bottom on a page.

    Counter-examples: Arabic and Hebrew characters go from right to left in horizontal lines. Traditionally Chinese, Japanese, and Korean characters go from top to bottom in vertical lines from right to left on a page, though they're flexible enough to use other directions. Mongolian uses vertical lines going from left to right on a page. Bottom to top characters and lines are rare, but there are some obscure examples. See Wikipedia for more, including boustrophedon and mirror text.

  27. Quoted text is always written in the same direction as surrounding text.

    Nope. It's displayed in the quote's native direction, even if that's different from the surrounding text. This can get gnarly in multi-lingual, multi-level nested quotes, not to mention when horizontal text is quoted in vertical text or vice-versa.

    More evidence that mathematical notation is a different writing system: digits are read left-to-right even in right-to-left writing systems.

    Technical

  28. Characters are bytes (or ASCII + code page)
  29. Characters are two bytes (or UTF-16 code units).
  30. Characters are integers (or Unicode code points).
  31. Characters are the basic parts of a writing system (or graphemes).
  32. Characters in <programming language> are <one of the above>.

    In the beginning was ASCII, and every grapheme was encoded by one byte, so there was no difference between byte, code unit, code point, and grapheme: they were all called "character". As time passed, all four things became different from each other, yet "character" continued to refer to all of them, and there was much confusion.

    All of the programming languages below have built-in data types or libraries available for working with all four things. They are classified here based on the most common built-in data type named "character" or "char", or, if no such type exists, on what's counted by the most common string length function. (One of many sources.)

    Characters are bytes:
    C
    C++
    Go
    Lua
    PHP
    Ruby
    Characters are UTF-16 code units (two bytes each):
    C#
    Java
    JavaScript
    Objective-C
    Python 3.2 and earlier "narrow" builds
    Visual Basic
    Characters are Unicode code points:
    Perl 5
    Python 3.3+, 3.2 and earlier "wide" builds
    R
    Characters are graphemes:
    Perl 6 (now Raku)
    Swift

    JavaScript example: '\u{1f41b}'.length
    Result: 2

    The example above constructs a string from a single Unicode code point (the BUG emoji), but the default string is a list of UTF-16 code units, so the default length function reports the number of two-byte code units in the string. Two-byte code unit strings are particularly insidious because almost all characters are represented by one code unit, so counter-examples are rarely encountered.

    Note that databases also vary in this way, so a string column with a maximum length might not be using the same definition of "length" as your programming language.
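
    To make the distinction concrete, here's a minimal Python 3 sketch (Python 3.3+ strings count code points; the string below is made up, and grapheme counting needs a third-party library, so it's only noted in a comment):

    s = "e\u0301\U0001F41B"                  # 'e' + combining acute + the BUG emoji: 2 graphemes
    print(len(s))                            # 3 code points
    print(len(s.encode("utf-16-le")) // 2)   # 4 UTF-16 code units (the emoji needs a surrogate pair)
    print(len(s.encode("utf-8")))            # 7 bytes in UTF-8 (1 + 2 + 4)
    # Counting the 2 graphemes needs a segmentation library,
    # e.g. the third-party "regex" module: len(regex.findall(r"\X", s)) == 2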

  33. Text files can be opened and processed without an encoding.

    Most programming languages appear to provide a way to do this, but they're really opening files using a default encoding set by the operating system, compiler, interpreter, or virtual machine. This often leads to programs which behave differently on different computers (e.g. a file is saved on one computer, emailed to someone else, and corrupted when opened using the same software on a computer with a different default encoding), so now it's strongly recommended that you explicitly set an encoding either for the whole application or for every text file you touch.
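
    In Python 3, for example, that means always passing the encoding explicitly (a minimal sketch; the file name is made up):

    # Relying on the default is what breaks when files travel between machines:
    #     open("notes.txt")                 # uses the platform/locale default encoding
    # Naming the encoding makes the program behave the same everywhere:
    with open("notes.txt", encoding="utf-8") as f:
        text = f.read()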

  34. The encoding of plain text can be guessed.

    Unfortunately, there are dozens of different encodings in common use, each of which maps the same patterns of bits to different characters. 95% of problems with encodings seem to come from software trying to decode text using the wrong encoding, usually leading to mojibake.

  35. The encoding of plain text can be discovered by examining the text.

    Plain text does not contain a simple message stating its encoding. (Rich text, e.g. HTML, PDF, MS Word files, etc., should and sometimes does.) 100% reliable automatic charset detection is impossible in principle, and highly reliable charset detection seems to be impractical in practice too.
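
    The best available tools only make an educated guess and report how confident they are. A sketch using the third-party chardet library (an assumption: it's installed, and "mystery.txt" is a made-up file name):

    import chardet

    raw = open("mystery.txt", "rb").read()
    guess = chardet.detect(raw)    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")

    Note the confidence field: it's a guess, not a discovery, and it's routinely wrong for short texts.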

  36. Text in a database doesn't have an encoding.
  37. Text in a database has the same encoding as the rest of the system.

    Every database has an encoding that's used for all its text. Libraries used to access databases usually deal with this transparently (e.g. automatically encoding text going into the database and decoding text coming out of the database). Difficulties can occur when an application and a database are using incompatible encodings, or when a low-level programmer assumes they're using the same encoding when they aren't.

  38. Unicode has an elegant and harmonious design, otherwise it wouldn't be the most widely used encoding.

    Unicode is not, technically, an encoding; it's a standard which includes several encodings. UTF-8 and UTF-16 are both encodings of the Unicode character set.

    The primary goal of Unicode is to make it easy to convert text from any other encoding into a Unicode encoding and back again without loss of information. The Unicode standard describes a lot of design principles which they would like to follow (chapter 2, section 2.2), but in practice some encodings and character sets break these principles, so Unicode has incorporated into itself all the design flaws of every text encoding system ever created. Even when encodings and character sets are designed without internal flaws, different encoding systems have used incompatible design principles. So to succeed in its primary goal, Unicode has incorporated several incompatible systems into itself, as well as all the flaws of the individual systems. This has resulted in a combinatorial explosion of interactions between the different systems. Really, it's a little surprising that Unicode usually just works.

  39. All bytes are characters.
  40. All sequences of bytes are strings.

    The old "ASCII + code page" system had the enviable property that all bytes were characters and all characters were bytes. When Unicode was conceived, the original plan was just to replace each byte with two bytes, so every two-byte code-unit would be a character and every character would be two bytes. Surely 65,536 characters would be enough for everyone? This encoding system is now referred to as "UCS-2", but was originally just called "Unicode". A lot of the systems which use two-byte code units internally were conceived of during this time, notably Java, JavaScript, and Windows.

    Once it became clear that 65,536 characters would not be enough, thanks to the requirement to be compatible with all pre-existing character sets, rather than force everyone to switch to a 4-byte code unit (what a waste of space!) variable-length encodings were introduced. UTF-8 was designed this way from the ground up, but UTF-16 is essentially a hack to add variable-length encoding to UCS-2 so systems using two-byte code units wouldn't also have to be redesigned from the ground up. The hack was to reserve two blocks of 1,024 code points each as "high surrogates" and "low surrogates", which encode 1,024 x 1,024 = 1,048,576 more characters as pairs of two-byte code units. The UTF-16 hack is why Unicode is limited to 65,536 + 1,048,576 = 1,114,112 code points, making U+10FFFF the largest legal code point.

    Variable-length encodings make it impossible to maintain the 1-to-1 relationship between code-units and characters, and the UTF-16 hack creates some additional issues for UTF-8 and other Unicode encodings. Here's a list of common issues:

    • Out-of-range code points, e.g. anything > U+10FFFF could technically be encoded in UTF-8 but cannot be encoded in UTF-16.
    • Truncated characters, e.g. a high surrogate not followed by a low surrogate, or a UTF-8 start byte not followed by enough continuation bytes.
    • Decapitated characters, e.g. a low surrogate not preceded by a high surrogate, or isolated or excess UTF-8 continuation bytes.
    • Surrogates in anything except UTF-16, e.g. when converting a surrogate pair from UTF-16 to UTF-8, the result should be a single UTF-8 character, not two surrogates.
    • Overlong encodings, e.g. anything under U+80 should be encoded as one byte in UTF-8, not 2, 3, or 4.

    Most of these things weren't illegal in the original specifications, so implementations which accidentally - or in some cases intentionally - emitted now-invalid characters and strings were common. As they were found to cause compatibility issues and were invalidated by updates to the specs, some systems were repaired, but others remain in such widespread use that their particular brand of invalidity has been documented and named.
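
    A few of the issues above can be reproduced directly in Python 3, whose strict UTF-8 codec refuses to produce or accept them (each line raises an exception when run on its own):

    b"\xc3".decode("utf-8")       # truncated character: start byte with no continuation byte
    b"\x80abc".decode("utf-8")    # decapitated character: stray continuation byte
    b"\xc0\xaf".decode("utf-8")   # overlong encoding of '/'
    "\ud800".encode("utf-8")      # lone high surrogate, legal only inside UTF-16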

  41. A code point represents exactly one character.
  42. A code point represents at most two or three characters.
  43. A code point never represents a whole word, phrase, or sentence.
  44. There's a limit to the amount of text which can be represented by one code point.

    Ligatures and digraphs are part of some character sets. So are occasional weird things like Roman numerals (e.g. VIII as a single code point). Therefore Unicode has code points for them too. Technically there exists a Unicode code point which encodes the largest number of characters, but you never know what might be added in the next version. The runner-up is 8 characters in one code point (U+FDFB), a phrase in Arabic meaning roughly "may His glory be glorified". The current record-holder is 18 characters, in U+FDFA, a phrase in Arabic meaning roughly "may God honor him and grant him peace". It only holds the record because U+FDFD (the Basmala) is specially treated as a symbol which can't be decomposed; if it weren't, it would decompose into about 35 characters.
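
    You can verify the record-holders with Python's unicodedata module:

    import unicodedata

    print(len(unicodedata.normalize("NFKD", "\uFDFB")))   # 8
    print(len(unicodedata.normalize("NFKD", "\uFDFA")))   # 18
    print(len(unicodedata.normalize("NFKD", "\uFDFD")))   # 1 (no decomposition defined)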

  45. A code point represents at least one whole character.

    Some characters have parts, for example a base letter and an accent or a base letter and an attached or overlaid mark. Some encodings encode each part as its own code point, and represent a character as a list of its parts.

  46. There's a limit to the number of code points needed to represent a whole character.

    Have you encountered Zalgo text yet? (A classic example.)

    Seriously though, there are widely used writing systems where every character has many parts, like the Brahmic scripts (used widely in and around India), which attach vowel symbols to consonant symbols, or the Korean alphabet, which combines consonants and vowels into syllable blocks. The maximum seems to be three... four... five... let's just say N parts per character.
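
    A small Python 3 illustration with characters that are routinely written as several code points:

    import unicodedata

    # One Korean syllable block decomposes into its three jamo parts (U+1100 U+1161 U+11A8):
    print(list(unicodedata.normalize("NFD", "\uAC01")))
    # One Devanagari syllable as typed: consonant + vowel sign = 2 code points, 1 character on screen:
    print(len("\u0915\u093F"))                            # 2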

  47. A code point represents a character or part of a character.

    Arguable counter-examples: the space "character" and other whitespace code points, unassigned code points (which may represent a character in the future), and private-use code points (which may-or-may-not be used to represent a character).

    Counter-examples which may still have a visible effect: zero-width characters, line and paragraph markers, layout and format control characters, the replacement character.

    Definite counter-examples: control codes, UTF-16 surrogate code points, byte-order marks, and the unassigned code points which have been pre-designated as non-character code points.

  48. A code point represents something.

    There are three control codes (U+0080, U+0081, and U+0099) which were proposed in a draft standard but discarded as ill-advised, and thus were never agreed upon or implemented by anyone. But in a bit of bad luck, the draft escaped into the wild, and we're stuck with them forever.

  49. Code points are unambiguous about which character they represent.

    Typewriters saved keys by merging several similar-looking characters into one, early encodings replicated these hacks, and now we're stuck with them forever. E.g. U+002D for hyphens, dashes, and minus signs, U+0027 for apostrophes, left single quotes, and right single quotes, and U+0022 for left and right double quotation marks.

    This causes bugs through the incorrect assumption that all instances of a character represent the same thing. For example, code which treats hyphens and dashes as minus signs or vice versa, or code which treats apostrophes as single quotes or vice versa.

  50. Different code points represent different characters.

    Counter-examples: Greek capital letter Omega 'Ω' (U+03A9) and the Ohm sign 'Ω' (U+2126), Latin capital letter A with ring above 'Å' (U+00C5) and Angstrom sign 'Å' (U+212B), semicolon ';' (U+003B) and Greek question mark ';' (U+037E).

    This isn't just about the fact that there are code points to unambiguously represent left and right quotation marks while the ambiguous typewriter quotation marks still exist. Nor are these characters from different writing systems which happen to be visually identical. They're examples of the same characters being encoded in different ways depending on how they're used, e.g. in a word vs. as a symbol. Some may argue these are, in fact, distinct characters, but in most character sets they aren't. Even if they are, see the next entry.

  51. A character can be represented in one and only one way.

    Counter-example: Ą́ (U+0041 U+0328 U+0301) and Ą́ (U+0041 U+0301 U+0328)

    The parts of some characters can be encoded in different orders. In this case, the accent and the hook may be encoded in either order. In some writing systems, almost all characters have multiple parts and this issue is common.

  52. Strings with different lengths can't be equal.

    Counter-example: Á (U+0041 U+0301) and Á (U+00C1).

    Parts can be encoded separately, or each combination of parts can get its own code point. The latter makes sense when there are only a few legal combinations of parts, but the former can be more efficient when many combinations are possible.

  53. Text can be processed without normalization.

    Many character sets are designed so any given character can be encoded in one and only one way. However, because Unicode's primary goal is to make it equally easy to convert text from any character set into Unicode, it necessarily deals with character sets which were not designed this way. Even if every character set were designed this way, Unicode would still have multiple ways to encode the same character, because different character sets use incompatible strategies. For example, when one character set encodes 'ñ' as a single code point and another encodes it as 'n' plus a combining tilde, Unicode must allow both ways, or else fail its primary goal.

    In other words, in Unicode there are often several different ways to encode the same character. This makes the algorithm to decide whether two Unicode strings are equal or not much more complicated.

    To deal with this, the Unicode standard defines several "normalization forms", each of which uses one and only one way to encode each character, and an algorithm ("normalization") to convert an arbitrary string into one of the standard forms. Before comparing two strings, you must convert them both to the same normalization form or else the comparison won't work correctly on some pairs of strings.
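
    A minimal Python 3 sketch of why this matters, using the standard library's unicodedata module:

    import unicodedata

    a = "\u00C1"       # 'Á' as one precomposed code point
    b = "A\u0301"      # 'Á' as 'A' plus a combining acute accent
    print(a == b)      # False: same character, different code points
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True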

  54. Canonical normalization of text isn't necessary.

    The main problem with characters which can be encoded in multiple ways is that when you display them, they look exactly alike but behave differently in a lot of ways. For example, identical strings may be sorted into different positions in a list, de-duplication won't remove all duplicates, and search will return some results but not others. Converting every string which enters your system into a canonical normalization form avoids all these problems.

  55. Compatibility normalization of text isn't necessary.

    Most character sets are designed to leave the exact form of each character to some other layer(s) of the system. Details like size, serifs, style (e.g. italic or bold), subscripts and superscripts, ligatures, and so on, are controlled by HTML tags, CSS, fonts, text markup formats, and so on. However, some character sets don't adhere to this practice and include, for example, different code points for the numeral 2 and superscript 2, different variants of a character for use at the beginning, middle, or end of a word, or color variants of a character.

    Naturally, Unicode includes all of them.

    Compatibility character variants can cause many of the same problems as canonically identical characters, except the differences between these characters are (usually) visible as well. Converting strings to a compatibility normalization form solves these problems, although it must only be done behind the scenes since important visual differences are removed. This is analogous to converting a set of strings to lowercase (or uppercase) for case-insensitive sorting or matching.
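
    A sketch of that behind-the-scenes matching key in Python 3, combining compatibility normalization with case folding (the function name is made up):

    import unicodedata

    def match_key(s):
        # For matching/searching only: visible distinctions (case, superscripts,
        # ligatures) are deliberately erased, so never display the result.
        return unicodedata.normalize("NFKC", s).casefold()

    print(match_key("ﬁle") == match_key("FILE"))      # True: the 'fi' ligature and case fold away
    print(match_key("x\u00B2") == match_key("x2"))    # True: superscript 2 becomes plain 2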

  56. Compatibility normalization fixes all problems with look-alike characters.

    Compatibility normalization only renders two versions of the same character identical. It doesn't merge similar-looking characters from the same writing system (e.g. 1iIlL or oO0) or identical-looking characters from different writing systems, e.g. Cyrillic О and Greek Ο look just like Latin O (U+041E, U+039F, and U+004F respectively), as do some other uppercase letters (ABCEHIKMOPTX).

  57. Concatenating normalized strings results in a normalized string.

    Beware cases where the first code point in the second string represents part of a character. However, only the characters immediately adjacent to the join ever need to be renormalized, so concatenating normalized strings can at least be done efficiently, unless you're dealing with Zalgo text.
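
    In Python 3.8+ this is easy to see with unicodedata.is_normalized:

    import unicodedata

    a, b = "e", "\u0301"                             # a letter and a lone combining acute accent
    print(unicodedata.is_normalized("NFC", a))       # True
    print(unicodedata.is_normalized("NFC", b))       # True
    print(unicodedata.is_normalized("NFC", a + b))   # False: the pair should compose to 'é' (U+00E9)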

  58. Changing the case of a normalized string results in a normalized string.

    Counterexample: ǰ̣ (U+01F0 U+0323) uppercases to J̣̌ (U+004A U+030C U+0323), whose canonical normal form is J̣̌ (U+004A U+0323 U+030C).

  59. Strings don't need to be normalized before changing their case.

    Counterexample: ancient Greek's iota subscript can cause trouble. E.g. ᾷ (U+03B1 U+0345 U+0342) uppercases to ΑΙ͂ (U+0391 U+0399 U+0342). The normal form would keep U+0342 on the Alpha. (Admittedly, this is an extremely narrow edge case.)

  60. Text can be processed without a locale.

    Also known as: "None of our code is internationalized."

    To a first approximation, a "locale" is a code specifying the language of a text. However, there are also well-documented differences in how the same language is used in different places (e.g. between British and American English), so a locale optionally encodes regional dialects (e.g. en-GB for English in Great Britain and en-US for English in the United States). The writing system used may also be encoded, e.g. sr-Cyrl or sr-Latn for Serbian in Cyrillic or Latin. Locale codes can get even more specific, e.g. de-DE-u-co-phonebk is German (Deutsch) in Germany (Deutschland) with the Unicode sorting algorithm (collation) for German phonebooks, which is different from the sorting algorithm used in German dictionaries.

    Most computers have a default locale along with a default encoding. In many programming languages, text processing algorithms use the computer's default locale, but can optionally use a locale passed in explicitly. This causes the same problems as using a default encoding: programmers forget the default exists, and bugs occur when text is transferred between computers with different default locales.

    In effect, the default encoding and default locale are hidden global variables, with all the associated problems. Text processing functions which do not accept an encoding or a locale should use a global constant encoding and locale.

  61. Locale isn't necessary for changing case.

    In Turkish and Azeri, the uppercase of 'i' is 'İ' and the lowercase of 'I' is 'ı'. In Lithuanian, the lowercase of 'Ĩ' isn't 'ĩ', it's an 'i' with a tilde above the dot, with similar rules for other accents on both 'i' and 'j'. Both of these could be considered simplifications of the weird exceptional way the dot on 'i' and 'j' is treated in most languages.

    In other parts of Unicode, problems of this sort are handled differently. E.g. D with stroke, Eth, and retroflex D are uppercase characters which look identical but have different lowercase mappings. It's as though there were unique Turkish I and i characters instead of the current situation where I and i have different properties depending on the locale.

    An example of how this can go wrong, in Java: languageCode.toLowerCase()

    Normally this works, and an uppercase language code like "EN" gets converted to "en". However, toLowerCase() uses the system's default locale, so when this software is run on a Turkish system, suddenly "IT" (Italian) becomes "ıt". The solution is to explicitly specify a Locale, so the same thing will happen on every system: languageCode.toLowerCase(Locale.ROOT)

  62. Locale isn't necessary for sorting and searching text.

    Counterexample: German sorts 'ä', 'ö', and 'ü' either as if the diacritic wasn't there (for regular words, e.g. in dictionaries) or as if they were 'ae', 'oe', and 'ue' respectively (for names, e.g. in atlases and telephone books).

    More generally, different languages alphabetize the same letters in different orders and treat different letters as equal or unequal during search.

    Beyond locale, there are all sorts of special cases for sorting and searching text. You can't rely on default string equality and comparison operators.

    bird < Bird < birds, cafe < café < cafes (case- and accent-insensitive search and sort)
    "page 1" < "page 2" < "page 10" < "page 20" (lexicographic vs numeric, especially when they're in the same text)
    French: cote < côte < coté < côté (can't just compare one letter at a time)
    Japanese: カー < カア, but キア < キー (non-alphabetic writing systems may have exotic-looking rules that make sense in context)

    Chinese character strings are often sorted phonetically. The same character can have different pronunciations in different parts of the same string, and in different languages and dialects, so typically an independent phonetic representation of each string is required.
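
    A sketch of locale-aware collation in Python 3 using the standard library (an assumption: a German locale is installed on the system, and the exact ordering depends on its collation tables):

    import locale

    words = ["Ähre", "Zebra", "Apfel"]
    print(sorted(words))                       # code-point order puts 'Ähre' after 'Zebra'
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # expected: ['Ähre', 'Apfel', 'Zebra']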

  63. Locale isn't necessary for splitting text into characters.

    Digraphs such as 'ch' in Czech and Slovak, 'ij' in Dutch, or 'ng' in Tagalog have their own keyboard keys, occupy their own spots in their alphabets, and must not be split despite being encoded in the same way as text which must be split in other locales.

  64. Locale isn't necessary for splitting text into words.

    Some languages don't use whitespace between words. Closer to home, Early Modern English and Swedish use colons instead of apostrophes in contractions, and English considers contractions to be one word while French considers them separate words.

  65. Locale isn't necessary for line-breaking.

    ... unless you want to know where hyphens can be inserted into words. And this isn't just about differences between languages: en-GB is stricter about where hyphens are allowed than en-US, while in East Asian text line breaks may occur between almost any pair of characters, even in the middle of Latin words embedded in the text. Another example is Korean, which uses different line-breaking styles in formal and informal documents.

    There's also the issue of knowing when it's safe to remove hyphens from words. Consider the difference between "resort" and "re-sort".

  66. Locale isn't necessary to quote text.

    There are a bewildering number of different quotation marks, not to mention different conventions for using the same quotation marks.

  67. Locale isn't necessary for punctuation marks.

    Also known as: "Punctuation marks are cross-linguistic, so I don't have to internationalize/translate them."

    Just one counter-example: French often places narrow non-breaking spaces between punctuation marks and words, e.g. between a colon and the preceding word, or between quotation marks and the adjacent quoted words. There are many, many more counter-examples, including languages which use particular punctuation marks for completely different purposes, languages which use only a subset of the "standard" Latin punctuation marks, and languages which use completely unfamiliar punctuation marks.

  68. There are two cases: upper-case and lower-case.

    Thanks to ligatures (one character representing two or more letters), there's a third case, title-case, where the first letter of the ligature is upper-case and the rest are lower-case. Technically there could be a variety of other upper-case/lower-case combinations, but title-case is the only one which has made it into Unicode. (So far!)

    Also, many writing systems and individual characters are unicase, lacking case entirely.

  69. There's a one-to-one correspondence between upper- and lower-case characters.

    Also known as "This is guaranteed to fit back in the database."

    Counterexample: the upper-case of German 'ß' is 'SS'. ('ß' is a ligature of a long s and a "round" s. In 2017 an upper-case ligature, ẞ, was officially adopted.)

    Counterexample: the lower-case of Greek 'Σ' is 'ς' at the end of a word and 'σ' elsewhere.

    By convention, all superscript and subscript letters in Unicode are considered lower-case, despite looking like upper-case letters. Even when the corresponding lower-case superscript or subscript letters are also in Unicode. This makes it tricky to change the case of superscripts and subscripts.

    Other examples can be found in the IPA block, where some Unicode characters are considered lower-case despite lacking matching upper-case characters.
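
    The ß behaviour is easy to check in Python 3, whose built-in case conversions use Unicode's locale-independent case mappings:

    print("ß".upper())                             # 'SS': one character becomes two
    print("ß".upper().lower())                     # 'ss': the round trip doesn't give 'ß' back
    print("ẞ".lower())                             # 'ß': the 2017 capital ẞ does map one-to-one
    print(len("Straße"), len("Straße".upper()))    # 6 7: so much for "guaranteed to fit"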

  70. Only letters have case.

    Counter-examples: Roman numerals (numbers), circled letters (symbols), and superscript or subscript letters (combining marks).

    Technically numerals have case, in the sense that text figures are the same height as lowercase letters and have ascenders and descenders but lining figures are all the same height as uppercase letters. However, it's rare to see both used in the same text. Even Unicode doesn't have both types. (Yet!)

    Bonus: Regular expressions

  71. [a-zA-Z] will match any letter.

    Note that accented letters, ligatures, and non-Latin writing systems exist.

  72. [0-9] will match any numeral.

    Writing systems often come with their own numeral systems.

  73. [ \t\n\r] or \s will match any whitespace character.

    Don't forget the non-breaking space, thin spaces and their non-breaking variants, other newline characters, etc. Try matching the Unicode property \p{White_Space} instead.

  74. \p{L} or \p{Letter} will match any letter.
  75. \p{Lu} or \p{Uppercase_Letter} will match any uppercase character.
  76. \p{Ll} or \p{Lowercase_Letter} will match any lowercase character.
  77. Matching the Unicode General_Category property is the right thing to do.

    \p{L} is actually syntactic sugar for \p{GC=L}, which in turn is equivalent to \p{General_Category=Letter}. \p{Alphabetic}, on the other hand, means \p{Alphabetic=Yes}. Note how L and Letter are on the right side of the equals sign, while Alphabetic is on the left.

    Each Unicode code point belongs to exactly one General_Category, so all the weird edge cases which rightly belong to multiple categories are arbitrarily assigned to just one of the categories they could have been assigned to. Sometimes partitioning code points like this makes sense, but Unicode provides a long list of boolean properties which handle all the edge cases for the times when you want to match everything which looks even remotely like what you want. Examples include \p{White_Space}, \p{Alphabetic} for letters, \p{Uppercase}, \p{Lowercase}, properties for various types of punctuation, and so on. Unfortunately, thanks to syntactic sugar these are easily confused with General_Category values.

3 Comments:

At February 10, 2021, Blogger Fraxas said...

I appreciate this reference. Thanks for taking the time!

 
At April 27, 2022, Anonymous Anonymous said...

That last one about Unicode Letter vs Alphabetic was absolute gold!

Not to mention all the other absolute gold ones in the list, but that last one was a real eye-opener.

Thank you so much for compiling this list!

 
At December 18, 2024, Anonymous Anonymous said...

Much of it is a matter of definition and understanding of "necessity". Within a specific constraint (domain, standard, etc.), which in practice always applies, many of these statements will be true.

 
