Strings

This page lists some notes about how I think the interface of strings should be. Hopefully this will end up in a blog post some day.

Unicode is very complex. It would be foolish to think that strings can offer a simple interface to Unicode, such as an array of characters.

Characters

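Avoid the lone term "character" at all costs. It's ambiguous and nobody agrees on what it means, just like "OOP" and "FP". Depending on who is speaking, it can mean a code unit, a code point, an abstract character or a grapheme.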
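To make the ambiguity concrete, here is a small sketch that measures the same user-perceived "é" in two ways. The helper `count_code_points` is a hypothetical name introduced for this example, and it assumes its input is valid UTF-8.

```ocaml
(* "é" written two ways, as UTF-8 bytes inside OCaml string literals. *)
let precomposed = "\xC3\xA9"   (* U+00E9 LATIN SMALL LETTER E WITH ACUTE *)
let decomposed  = "e\xCC\x81"  (* U+0065 followed by U+0301 COMBINING ACUTE ACCENT *)

(* Count code points by counting the bytes that are not UTF-8
   continuation bytes (10xxxxxx). Assumes the input is valid UTF-8. *)
let count_code_points (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  Printf.printf "precomposed: %d bytes, %d code points\n"
    (String.length precomposed) (count_code_points precomposed);
  Printf.printf "decomposed:  %d bytes, %d code points\n"
    (String.length decomposed) (count_code_points decomposed)
```

The precomposed form is 2 bytes and 1 code point; the decomposed form is 3 bytes and 2 code points. Both are a single grapheme, so "how many characters?" has at least three defensible answers here.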

Memory representation

How strings are represented in memory is not relevant to their interface. You could use UTF-8, UTF-32, a rope of smaller strings, whatever. The representation may be configurable by the programmer for performance reasons, but that should by no means imply a default view (see Views below).
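One way to picture this, as a rough sketch rather than a proposed API: two interchangeable representations hidden behind the same abstract type. The signature `TEXT` and the modules `Flat` and `Rope` are hypothetical names invented for this example.

```ocaml
(* Hypothetical minimal interface: the representation type is abstract. *)
module type TEXT = sig
  type t
  val of_utf8 : string -> t   (* assumes the input is valid UTF-8 *)
  val concat  : t -> t -> t
  val to_utf8 : t -> string
end

(* Representation 1: a flat UTF-8 buffer. *)
module Flat : TEXT = struct
  type t = string
  let of_utf8 s = s
  let concat a b = a ^ b
  let to_utf8 s = s
end

(* Representation 2: a rope. Concatenation builds a node in constant
   time; the bytes are only copied when the rope is flattened. *)
module Rope : TEXT = struct
  type t = Leaf of string | Node of t * t
  let of_utf8 s = Leaf s
  let concat a b = Node (a, b)
  let to_utf8 t =
    let buf = Buffer.create 16 in
    let rec go = function
      | Leaf s -> Buffer.add_string buf s
      | Node (l, r) -> go l; go r
    in
    go t;
    Buffer.contents buf
end
```

Client code written against `TEXT` cannot tell the two apart, and neither representation says anything about which view (bytes, code points, graphemes) is the "natural" one.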

Sequences

Strings are not sequences of anything. Not sequences of bytes, not sequences of "characters" (whatever that may be), not sequences of code points, etc. You can't loop through strings. You can't index strings. Strings are highly opaque.

Views

While strings are not sequences themselves, you can extract sequences from them:

  • given an encoding, you can extract a sequence of code units (8-bit units for UTF-8, 16-bit units for UTF-16);
  • you can extract a sequence of code points;
  • you can extract a sequence of abstract characters;
  • you can extract a sequence of graphemes;
  • etc.

These views can be implemented lazily or eagerly. You can loop through and index these views.
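As a sketch of what lazy and eager views could look like, the following uses the standard library's `string` as a stand-in for the opaque string type. It assumes OCaml 4.14 or later for `String.get_utf_8_uchar`; the function names `utf8_units`, `code_points` and `code_point_array` are made up for this example. Note that `Uchar.t` represents Unicode scalar values, so it is only an approximation of a code point view.

```ocaml
(* Lazy view: UTF-8 code units, produced on demand. *)
let utf8_units (s : string) : int Seq.t =
  Seq.map Char.code (String.to_seq s)

(* Lazy view: scalar values decoded on demand. The standard decoder
   yields U+FFFD for malformed input and always consumes at least one
   byte, so the traversal terminates. *)
let code_points (s : string) : Uchar.t Seq.t =
  let rec from i () =
    if i >= String.length s then Seq.Nil
    else
      let d = String.get_utf_8_uchar s i in
      Seq.Cons (Uchar.utf_decode_uchar d, from (i + Uchar.utf_decode_length d))
  in
  from 0

(* Eager view: force the lazy view into an array, which can be indexed. *)
let code_point_array (s : string) : Uchar.t array =
  Array.of_seq (code_points s)
```

For instance, `code_point_array "e\xCC\x81"` has two elements, the second being U+0301, while `utf8_units` on the same input yields three code units.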

There should be no default view. The programmer should be forced to think about which one they need, which in turn requires them to understand their domain (including the basics of Unicode). Likewise, operations like "substring" should take the unit as an explicit argument, instead of defaulting to some arbitrary one.
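A substring operation with an explicit unit might look roughly like this. It is only a sketch: `text_unit`, `byte_offset_of_code_point` and `sub` are hypothetical names, the standard `string` holding valid UTF-8 stands in for the opaque string type, and graphemes are left out because segmenting them needs Unicode data tables.

```ocaml
(* Hypothetical unit selector. *)
type text_unit = Utf8_code_unit | Code_point

(* Byte offset at which the [n]-th code point starts, or the length of
   [s] if it has [n] or fewer code points. Assumes [s] is valid UTF-8. *)
let byte_offset_of_code_point (s : string) (n : int) : int =
  let len = String.length s in
  let rec go i seen =
    if i >= len then len
    else if Char.code s.[i] land 0xC0 <> 0x80 then
      (* [i] is the first byte of a code point. *)
      if seen = n then i else go (i + 1) (seen + 1)
    else go (i + 1) seen
  in
  go 0 0

(* [pos] and [len] are both counted in [unit]. The code unit case
   raises [Invalid_argument] when the range is out of bounds. *)
let sub ~(unit : text_unit) (s : string) ~(pos : int) ~(len : int) : string =
  match unit with
  | Utf8_code_unit -> String.sub s pos len
  | Code_point ->
      let start = byte_offset_of_code_point s pos in
      let stop = byte_offset_of_code_point s (pos + len) in
      String.sub s start (stop - start)
```

With the input "héllo" encoded as `"h\xC3\xA9llo"`, asking for position 1 and length 2 gives "é" when counting UTF-8 code units and "él" when counting code points. The caller has to say which one they mean.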

Summary

module Code_point : sig
  type t
  val from_int : int -> t option (* not all ints are valid code points *)
  val to_int   : t -> int
end = struct (* … *) end

module Grapheme : sig (* … *) end = struct (* … *) end

module String : sig
  type t
  (* byte and short stand for 8-bit and 16-bit code unit types; they are not OCaml built-ins *)
  val encode_utf8  : t -> byte list
  val encode_utf16 : t -> short list
  val code_points  : t -> Code_point.t list
  val graphemes    : t -> Grapheme.t list
  (* … *)
end = struct (* … *) end
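For illustration, the elided body of `Code_point` could be filled in along these lines. This is one possible sketch, not the only reasonable design: it represents a code point as a plain `int` and accepts the whole Unicode codespace, U+0000 to U+10FFFF, surrogates included. An API built on scalar values instead would also reject U+D800 to U+DFFF.

```ocaml
module Code_point : sig
  type t
  val from_int : int -> t option (* not all ints are valid code points *)
  val to_int   : t -> int
end = struct
  (* Assumption of this sketch: a code point is stored as its integer value. *)
  type t = int

  (* The Unicode codespace is the range 0 to 0x10FFFF inclusive. *)
  let from_int n = if 0 <= n && n <= 0x10FFFF then Some n else None

  let to_int n = n
end
```

Because `t` is abstract, the only way to obtain a `Code_point.t` is through `from_int`, so every value of the type is known to be in range.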