Strings

From foldr

This page collects notes on what the interface of a string type should look like. Unicode is complex, and it would be foolish to expect strings to offer a simple interface to it, such as an array of characters.

Terminology

A string is a piece of text. Strings are not sequences. The term "character" is often used even though it is not well-defined (it may refer to a code unit, a code point, or a grapheme), and should be avoided.
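To illustrate why "character" is ambiguous, the following sketch counts the same text in two different ways. It assumes the text is held in a UTF-8 encoded OCaml string and uses the standard library's UTF-8 decoder (`String.get_utf_8_uchar`, available from OCaml 4.14). The names `e_acute`, `code_unit_count` and `code_point_count` are made up for this example.

```ocaml
(* "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT,
   spelled out as its UTF-8 bytes. *)
let e_acute = "e\xCC\x81"

(* Number of UTF-8 code units (bytes). *)
let code_unit_count = String.length e_acute

(* Number of code points, found by decoding the UTF-8 bytes one
   code point at a time. *)
let code_point_count =
  let rec go i n =
    if i >= String.length e_acute then n
    else
      let d = String.get_utf_8_uchar e_acute i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  Printf.printf "code units: %d, code points: %d\n"
    code_unit_count code_point_count
```

This prints `code units: 3, code points: 2`, while a reader sees a single grapheme. Counting graphemes needs the Unicode segmentation rules, which the OCaml standard library does not provide, so that count is left out here.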

Encoding

The way a string is encoded in memory should not influence its interface, although the encoding may be configurable by the programmer for performance reasons. While strings are not sequences themselves, sequences can be extracted from them. These sequences can be implemented lazily or eagerly, and can be iterated over. Examples of such sequences are lists of code points, lists of graphemes, and lists of UTF-8 code units.
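As a rough sketch of extracting sequences from a string, the code below derives a lazy sequence of code points and a lazy sequence of UTF-8 code units from the same text, and then forces one of them into an eager list. It assumes the underlying storage is a UTF-8 encoded OCaml string and requires OCaml 4.14 or later; `code_points`, `utf8_code_units` and `code_point_list` are hypothetical names, not part of the interface described on this page.

```ocaml
(* Lazy sequence of code points decoded from a UTF-8 encoded string.
   Invalid byte sequences decode to U+FFFD, which is the behaviour of
   Uchar.utf_decode_uchar. *)
let code_points (s : string) : Uchar.t Seq.t =
  let rec from i () =
    if i >= String.length s then Seq.Nil
    else
      let d = String.get_utf_8_uchar s i in
      Seq.Cons (Uchar.utf_decode_uchar d, from (i + Uchar.utf_decode_length d))
  in
  from 0

(* Lazy sequence of UTF-8 code units, represented as ints in 0..255. *)
let utf8_code_units (s : string) : int Seq.t =
  Seq.map Char.code (String.to_seq s)

(* Eager variant: force the lazy sequence into a list. *)
let code_point_list (s : string) : Uchar.t list =
  List.of_seq (code_points s)

let () =
  (* "hé", with é as the single code point U+00E9 (UTF-8 bytes C3 A9). *)
  let s = "h\xC3\xA9" in
  Seq.iter (fun u -> Printf.printf "U+%04X " (Uchar.to_int u)) (code_points s);
  print_newline ();
  Seq.iter (Printf.printf "%02X ") (utf8_code_units s);
  print_newline ()
```

The lazy versions do no decoding until they are iterated, whereas `code_point_list` decodes the whole string up front. Both views come from the same value, which is the sense in which the encoding need not leak into the interface.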

Summary

A reference interface is shown below.

module Code_point : sig
  type t
  val from_int : int -> t option (* not all ints are valid code points *)
  val to_int   : t -> int
end = struct (* … *) end

module Grapheme : sig (* … *) end = struct (* … *) end

module String : sig
  type t
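  (* byte and short are not built-in OCaml types; they are assumed here
     to denote 8-bit and 16-bit code unit types respectively. *)
  val encode_utf8  : t -> byte list
  val encode_utf16 : t -> short list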
  val code_points  : t -> Code_point.t list
  val graphemes    : t -> Grapheme.t list
  (* … *)
end = struct (* … *) end
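One possible way to fill in the `Code_point` module is sketched below. It is only an illustration under the assumption that a code point is represented as a plain `int` in the range U+0000 to U+10FFFF; the representation and the `max_code_point` constant are choices made for this example, not part of the reference interface.

```ocaml
module Code_point : sig
  type t
  val from_int : int -> t option (* not all ints are valid code points *)
  val to_int   : t -> int
end = struct
  (* Assumed representation: the code point's numeric value. *)
  type t = int

  (* Unicode code points span U+0000 to U+10FFFF inclusive. *)
  let max_code_point = 0x10FFFF

  let from_int n =
    if n >= 0 && n <= max_code_point then Some n else None

  let to_int cp = cp
end

let () =
  match Code_point.from_int 0x1F600 with
  | Some cp -> Printf.printf "U+%04X\n" (Code_point.to_int cp)
  | None -> print_endline "not a code point"
```

Because `t` is abstract, the only way to obtain a `Code_point.t` is through `from_int`, so every value of the type is known to be in range. Note that this sketch accepts surrogates (U+D800 to U+DFFF), which are code points but not Unicode scalar values; a stricter interface could reject them, as OCaml's own `Uchar.is_valid` does.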