This page collects some notes on what I think the interface of strings should look like. Hopefully this will end up in a blog post some day.
Unicode is very complex. It would be foolish to think that strings can offer a simple interface to Unicode, such as an array of characters.
Avoid the lone term "character" at all costs. It's ambiguous and nobody agrees on what it means, just like "OOP" and "FP".
How strings are represented in memory is not relevant to their interface. You could use UTF-8, UTF-32, a rope of smaller strings, whatever. The representation may be configurable by the programmer for performance reasons, but that should by no means imply a default view (see below).
Strings are not sequences of anything. Not sequences of bytes, not sequences of "characters" (whatever that may be), not sequences of code points, etc. You can't loop through strings. You can't index strings. Strings are highly opaque.
While strings are not sequences themselves, you can extract sequences from them:
- given an encoding, you can extract a sequence of code units (bytes for UTF-8, shorts for UTF-16);
- you can extract a sequence of code points;
- you can extract a sequence of abstract characters;
- you can extract a sequence of graphemes.
These views can be implemented lazily or eagerly. You can loop through and index these views.
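To make the difference between views concrete, here is a minimal sketch that counts the units of one string under three of them. It uses OCaml's built-in `string` holding UTF-8 bytes as a stand-in for the opaque string type, and assumes OCaml 4.14 or later for the standard library's UTF decoding functions. The helper names (`code_points`, `utf8_units`, `utf16_units`) are mine, for illustration only. A grapheme view is left out because it needs a segmentation library.

```ocaml
(* Assumes OCaml >= 4.14 (String.get_utf_8_uchar, Uchar.utf_decode_*). *)

(* Code point view: decode a UTF-8 encoded string one scalar value at a time.
   Invalid bytes decode to U+FFFD and still advance, so the loop terminates. *)
let code_points (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* UTF-8 code unit view: the bytes of the encoded string. *)
let utf8_units (s : string) : int list =
  List.map Char.code (List.of_seq (String.to_seq s))

(* UTF-16 code unit view: re-encode every code point, then read 16-bit values. *)
let utf16_units (s : string) : int list =
  let b = Buffer.create 16 in
  List.iter (Buffer.add_utf_16be_uchar b) (code_points s);
  let e = Buffer.contents b in
  List.init (String.length e / 2) (fun i -> String.get_uint16_be e (2 * i))

let () =
  (* "e", U+0301 COMBINING ACUTE ACCENT, U+1F600 GRINNING FACE *)
  let s = "e\u{0301}\u{1F600}" in
  Printf.printf "UTF-8 code units:  %d\n" (List.length (utf8_units s));
  Printf.printf "UTF-16 code units: %d\n" (List.length (utf16_units s));
  Printf.printf "code points:       %d\n" (List.length (code_points s))
```

The same string has 7 UTF-8 code units, 4 UTF-16 code units and 3 code points (and a reader would see 2 graphemes), which is why "the length of a string" is meaningless without naming a view.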
There should be no default view. The programmer should be forced to think about which one they need, and that requires understanding their domain (which includes the basics of Unicode). Likewise, operations like "substring" should take a unit as their argument, instead of defaulting to some arbitrary one.
    module Code_point : sig
      type t
      val from_int : int -> t option (* not all ints are valid code points *)
      val to_int : t -> int
    end = struct
      (* … *)
    end

    module Grapheme : sig
      (* … *)
    end = struct
      (* … *)
    end

    module String : sig
      type t
      val encode_utf8 : t -> byte list
      val encode_utf16 : t -> short list
      val code_points : t -> Code_point.t list
      val graphemes : t -> Grapheme.t list
      (* … *)
    end = struct
      (* … *)
    end
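The point about "substring" taking a unit could look roughly like this. It is only a sketch: the `unit_` type, `byte_offset` and `substring` are hypothetical names, the built-in UTF-8 `string` again stands in for the opaque type, positions are assumed to be non-negative, and OCaml 4.14 or later is assumed. Graphemes are omitted because they need a segmentation library.

```ocaml
(* Hypothetical unit selector; the caller must always pick one. *)
type unit_ = Utf8_code_units | Code_points

(* Byte offset of the [n]-th code point of UTF-8 string [s],
   clamped to the end of the string. Assumes n >= 0. *)
let byte_offset (s : string) (n : int) : int =
  let rec go i k =
    if k = 0 || i >= String.length s then i
    else go (i + Uchar.utf_decode_length (String.get_utf_8_uchar s i)) (k - 1)
  in
  go 0 n

(* [pos] and [len] are counted in the unit the caller names explicitly. *)
let substring ~(unit_ : unit_) ~(pos : int) ~(len : int) (s : string) : string =
  match unit_ with
  | Utf8_code_units -> String.sub s pos len
  | Code_points ->
    let start = byte_offset s pos in
    let stop = byte_offset s (pos + len) in
    String.sub s start (stop - start)

let () =
  let s = "h\u{00E9}llo" in
  (* Two code points starting at code point 1: U+00E9 followed by "l". *)
  assert (substring ~unit_:Code_points ~pos:1 ~len:2 s = "\u{00E9}l");
  (* Two bytes starting at byte 0: "h" plus half of U+00E9. *)
  assert (substring ~unit_:Utf8_code_units ~pos:0 ~len:2 s = "h\xC3")
```

The second call cuts U+00E9 in half and returns invalid UTF-8. That is a legitimate result when the caller asked for code units, and exactly the kind of surprise a silent default would hide.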