(libc.info.gz) Extended Char Intro

(libc.info.gz) Character Set Handling
(libc.info.gz) Charset Function Overview
 
 6.1 Introduction to Extended Characters
 =======================================
 
 A variety of solutions is available to overcome the differences between
 character sets with a 1:1 relation between bytes and characters and
 character sets with ratios of 2:1 or 4:1.  The remainder of this section
 gives a few examples to help understand the design decisions made while
 developing the functionality of the C library.
 
    A distinction we have to make right away is between internal and
 external representation.  "Internal representation" means the
 representation used by a program while keeping the text in memory.
 External representations are used when text is stored or transmitted
 through some communication channel.  Examples of external
 representations include files waiting in a directory to be read and
 parsed.
 
    Traditionally there has been no difference between the two
 representations.  It was equally comfortable and useful to use the same
 single-byte representation internally and externally.  This comfort
 level decreases with more and larger character sets.
 
    One of the problems to overcome with the internal representation is
 handling text that is externally encoded using different character sets.
 Assume a program that reads two texts and compares them using some
 metric.  The comparison can be usefully done only if the texts are
 internally kept in a common format.
 
    For such a common format (= character set) eight bits are certainly
 no longer enough.  So the smallest entity will have to grow: "wide
 characters" will now be used.  Instead of one byte per character, two or
 four will be used.  (Three bytes are awkward to address in memory, and
 more than four bytes do not seem to be necessary.)
 
    As shown in some other part of this manual, a completely new family
 of functions has been created that can handle wide character texts in
 memory.  The most commonly used character sets for such internal wide
 character representations are Unicode and ISO 10646 (also known as UCS
 for Universal Character Set).  Unicode was originally planned as a
 16-bit character set, whereas ISO 10646 was designed to be a 31-bit
 large code space.  The two standards are practically identical.  They
 have the same character repertoire and code table, but Unicode specifies
 added semantics.  At the moment, only characters in the first '0x10000'
 code positions (the so-called Basic Multilingual Plane, BMP) have been
 assigned, but the assignment of more specialized characters outside this
 16-bit space is already in progress.  A number of encodings have been
 defined for Unicode and ISO 10646 characters: UCS-2 is a 16-bit word
 that can only represent characters from the BMP, UCS-4 is a 32-bit word
 that can represent any Unicode and ISO 10646 character, UTF-8 is an
 ASCII compatible encoding where ASCII characters are represented by
 ASCII bytes and non-ASCII characters by sequences of 2-6 non-ASCII
 bytes, and finally UTF-16 is an extension of UCS-2 in which pairs of
 certain UCS-2 words can be used to encode non-BMP characters up to
 '0x10ffff'.
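
    The way UTF-16 pairs two 16-bit words to reach code points beyond
 the BMP can be illustrated with a minimal sketch.  The function name
 'utf16_encode' is hypothetical and not part of the C library; the
 sketch assumes the usual surrogate ranges '0xd800'-'0xdbff' and
 '0xdc00'-'0xdfff'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a C library function.
   Encode the code point CP as UTF-16 into OUT, which must have room
   for two 16-bit units.  Return the number of units written, or 0 if
   CP cannot be encoded.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);

  /* The surrogate range itself does not contain characters.  */
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;

  /* BMP characters are stored as a single unit, exactly as in UCS-2.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* UTF-16 cannot reach beyond 0x10ffff.  */
  if (cp > 0x10ffff)
    return 0;

  /* Split the remaining 20 bits across a high and a low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 + (cp >> 10));
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));
  return 2;
}
```

 With this sketch the code point '0x1f600' yields the pair '0xd83d
 0xde00', while any BMP character is returned unchanged as one unit.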
 
    To represent wide characters the 'char' type is not suitable.  For
 this reason the ISO C standard introduces a new type that is designed to
 keep one character of a wide character string.  To maintain the
 similarity there is also a type corresponding to 'int' for those
 functions that take a single wide character.
 
  -- Data type: wchar_t
      This data type is used as the base type for wide character strings.
      In other words, arrays of objects of this type are the equivalent
      of 'char[]' for multibyte character strings.  The type is defined
      in 'stddef.h'.
 
      The ISO C90 standard, where 'wchar_t' was introduced, does not say
      anything specific about the representation.  It only requires that
      this type is capable of storing all elements of the basic character
      set.  Therefore it would be legitimate to define 'wchar_t' as
      'char', which might make sense for embedded systems.
 
      But in the GNU C Library 'wchar_t' is always 32 bits wide and is
      therefore capable of representing all UCS-4 values, which covers
      all of ISO 10646.  Some Unix systems define 'wchar_t' as a
      16-bit type and thereby follow Unicode very strictly.  This
      definition is perfectly fine with the standard, but it also means
      that to represent all characters from Unicode and ISO 10646 one has
      to use UTF-16 surrogate characters, which is in fact a
      multi-wide-character encoding.  But resorting to
      multi-wide-character encoding contradicts the purpose of the
      'wchar_t' type.
 
  -- Data type: wint_t
      'wint_t' is a data type used for parameters and variables that
      contain a single wide character.  As the name suggests this type is
      the equivalent of 'int' when using the normal 'char' strings.  The
      types 'wchar_t' and 'wint_t' often have the same representation if
      their size is 32 bits wide but if 'wchar_t' is defined as 'char'
      the type 'wint_t' must be defined as 'int' due to the parameter
      promotion.
 
      This type is defined in 'wchar.h' and was introduced in Amendment 1
      to ISO C90.
 
    As for the 'char' data type, macros are available for specifying
 the minimum and maximum value representable in an object of type
 'wchar_t'.
 
  -- Macro: wint_t WCHAR_MIN
      The macro 'WCHAR_MIN' evaluates to the minimum value representable
      by an object of type 'wchar_t'.
 
      This macro was introduced in Amendment 1 to ISO C90.
 
  -- Macro: wint_t WCHAR_MAX
      The macro 'WCHAR_MAX' evaluates to the maximum value representable
      by an object of type 'wchar_t'.
 
      This macro was introduced in Amendment 1 to ISO C90.
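
    As a minimal sketch of how these limits might be used, the
 following function checks whether a numeric character value lies within
 the range of 'wchar_t' on the current implementation.  The name
 'fits_in_wchar_t' is hypothetical and not part of the C library.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper, not a C library function.
   Return nonzero if the value CP lies within the range of wchar_t on
   this implementation.  CP is unsigned, so only the upper bound needs
   to be checked.  */
int
fits_in_wchar_t (uint32_t cp)
{
  /* WCHAR_MAX is positive, so converting it to uintmax_t is lossless.  */
  return (uintmax_t) cp <= (uintmax_t) WCHAR_MAX;
}
```

 On a system with a 32-bit 'wchar_t' this check succeeds for every
 value up to '0x10ffff'; with a 16-bit 'wchar_t' it fails for values
 outside the BMP.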
 
    Another special wide character value is the equivalent to 'EOF'.
 
  -- Macro: wint_t WEOF
      The macro 'WEOF' evaluates to a constant expression of type
      'wint_t' whose value is different from any member of the extended
      character set.
 
      'WEOF' need not be the same value as 'EOF' and unlike 'EOF' it also
      need _not_ be negative.  In other words, sloppy code like
 
           {
             int c;
             ...
             while ((c = getc (fp)) >= 0)
               ...
           }
 
      has to be rewritten to use 'WEOF' explicitly when wide characters
      are used:
 
           {
             wint_t c;
             ...
             while ((c = getwc (fp)) != WEOF)
               ...
           }
 
      This macro was introduced in Amendment 1 to ISO C90 and is defined
      in 'wchar.h'.
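
    A complete version of such a loop could look like the following
 sketch, which counts the wide characters remaining on a stream.  The
 function name 'count_wide_chars' is hypothetical; 'getwc' and 'WEOF'
 are the standard facilities from 'wchar.h'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper, not a C library function.
   Count the wide characters that can still be read from FP.  */
size_t
count_wide_chars (FILE *fp)
{
  wint_t c;     /* wint_t rather than wchar_t, so that WEOF fits.  */
  size_t n = 0;

  assert (fp != NULL);

  /* Compare against WEOF explicitly; its sign is unspecified.  */
  while ((c = getwc (fp)) != WEOF)
    n++;

  return n;
}
```

 The sketch does not distinguish end of file from a read or
 conversion error; a caller that cares can check 'ferror (fp)'
 afterwards.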
 
    These internal representations present problems when it comes to
 storage and transmission.  Because each single wide character consists
 of more than one byte, they are affected by byte ordering.  Thus,
 machines with different endianness would see different values when
 accessing the same data.  This byte-ordering concern also applies to
 communication protocols, which are all byte-based and therefore require
 the sender to decide how to split a wide character into bytes.  A last
 (but not least important) point is that wide characters often require
 more storage space than a customized byte-oriented character set.
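
    The byte-ordering problem can be avoided by fixing the order
 explicitly when a value leaves memory.  The following is a minimal
 sketch assuming a 32-bit character value and big-endian order on the
 wire; the names 'put_be32' and 'get_be32' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not C library functions.  */

/* Store the 32-bit value V in OUT, most significant byte first.  */
void
put_be32 (uint32_t v, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (v >> 24);
  out[1] = (unsigned char) (v >> 16);
  out[2] = (unsigned char) (v >> 8);
  out[3] = (unsigned char) v;
}

/* Read back a value stored by put_be32.  */
uint32_t
get_be32 (const unsigned char in[4])
{
  assert (in != NULL);
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

 Because only shifts are used, the result is the same on big-endian
 and little-endian machines.  Note that the value '0x20ac' still
 occupies four bytes here, which illustrates the storage overhead
 mentioned above.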
 
    For all the above reasons, an external encoding that is different
 from the internal encoding is often used if the latter is UCS-2 or
 UCS-4.  The external encoding is byte-based and can be chosen
 appropriately for the environment and for the texts to be handled.  A
 variety of different character sets can be used for this external
 encoding.  They will not be presented exhaustively here; instead, a
 description of the major groups will suffice.  All of the ASCII-based
 character sets fulfill one requirement: they are "filesystem safe."
 This means that the character '/' is used in the encoding _only_ to
 represent itself.  Things are a bit different for character sets like
 EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
 family used by IBM), but if the operating system does not understand
 EBCDIC directly, the parameters to system calls have to be converted
 first anyhow.
 
    * The simplest character sets are single-byte character sets.  There
      can be only up to 256 characters (for 8-bit character sets), which
      is not sufficient to cover all languages but might be sufficient to
      handle a specific text.  Handling of 8-bit character sets is
      simple.  This is not true for the other kinds presented later, and
      therefore the application one uses might require the use of 8-bit
      character sets.
 
    * The ISO 2022 standard defines a mechanism for extended character
      sets where one character _can_ be represented by more than one
      byte.  This is achieved by associating a state with the text.
      Characters that can be used to change the state can be embedded in
      the text.  Each byte in the text might have a different
      interpretation in each state.  The state might even influence
      whether a given byte stands for a character on its own or whether
      it has to be combined with some more bytes.
 
      In most uses of ISO 2022 the defined character sets do not allow
      state changes that cover more than the next character.  This has
      the big advantage that whenever one can identify the beginning of
      the byte sequence of a character one can interpret a text
      correctly.  Examples of character sets using this policy are the
      various EUC character sets (EUC-JP, EUC-KR, EUC-TW, and EUC-CN,
      used by Sun's operating systems) or Shift_JIS (SJIS, a Japanese
      encoding).
 
      But there are also character sets using a state that is valid for
      more than one character and has to be changed by another byte
      sequence.  Examples for this are ISO-2022-JP, ISO-2022-KR, and
      ISO-2022-CN.
 
    * Early attempts to fix 8-bit character sets for other languages
      using the Roman alphabet led to character sets like ISO 6937.
      Here bytes representing characters like the acute accent do not
      produce output themselves: one has to combine them with other
      characters to get the desired result.  For example, one writes the
      byte sequence '0xc2 0x61' (non-spacing acute accent, followed by
      lower-case 'a') to get the "small a with acute" character.  To get
      the acute accent character on its own, one has to write '0xc2 0x20'
      (the non-spacing acute followed by a space).
 
      Character sets like ISO 6937 are used in some embedded systems such
      as teletex.
 
    * Instead of converting the Unicode or ISO 10646 text used
      internally, it is often also sufficient to simply use an encoding
      different from UCS-2/UCS-4.  The Unicode and ISO 10646 standards
      even specify such an encoding: UTF-8.  This encoding is able to
      represent the full 31-bit code space of ISO 10646 in a byte string
      of length one to six.
 
      There were a few other attempts to encode ISO 10646 such as UTF-7,
      but UTF-8 is today the only encoding that should be used.  In fact,
      with any luck UTF-8 will soon be the only external encoding that
      has to be supported.  It proves to be universally usable and its
      only disadvantage is that it favors Roman languages by making the
      byte string representation of other scripts (Cyrillic, Greek, Asian
      scripts) longer than necessary if using a specific character set
      for these scripts.  Methods like the Unicode compression scheme can
      alleviate these problems.
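
    The UTF-8 scheme from the last item can be sketched as follows.
 This is an illustration, not the C library's converter.  The name
 'utf8_encode' is hypothetical, and the sketch assumes the restricted
 form of UTF-8 that stops at '0x10ffff' and therefore needs at most
 four bytes; the five- and six-byte forms mentioned above continue the
 same bit pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a C library function.
   Encode the code point CP as UTF-8 into OUT, which must have room
   for four bytes.  Return the number of bytes written, or 0 if CP is
   a surrogate or lies beyond 0x10ffff.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);

  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;               /* Surrogates are not characters.  */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

 Every byte of a multi-byte sequence has its high bit set, so the
 byte for '/' can only ever stand for '/' itself; this is why UTF-8 is
 filesystem safe.  For example, '0x20ac' is encoded as the three bytes
 '0xe2 0x82 0xac'.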
 
    The remaining question is how to select the character set or
 encoding to use.  The answer is that you cannot decide this yourself: it
 is decided by the developers of the system or the majority of the users.
 Since the goal is interoperability one has to use whatever the other
 people one works with use.  If there are no constraints, the selection
 is based on the requirements the expected circle of users will have.  In
 other words, if a project is expected to be used in only, say, Russia it
 is fine to use KOI8-R or a similar character set.  But if at the same
 time people from, say, Greece are participating one should use a
 character set that allows all people to collaborate.
 
    The most widely useful solution seems to be: go with the most general
 character set, namely ISO 10646.  Use UTF-8 as the external encoding and
 problems about users not being able to use their own language adequately
 are a thing of the past.
 
    One final comment about the choice of the wide character
 representation is necessary at this point.  We have said above that the
 natural choice is using Unicode or ISO 10646.  This is not required, but
 at least encouraged, by the ISO C standard.  The standard defines at
 least a macro '__STDC_ISO_10646__' that is only defined on systems where
 the 'wchar_t' type encodes ISO 10646 characters.  If this symbol is not
 defined one should avoid making assumptions about the wide character
 representation.  If the programmer uses only the functions provided by
 the C library to handle wide character strings there should be no
 compatibility problems with other systems.
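
    A program that does need to know whether 'wchar_t' values are ISO
 10646 code points can test the macro at compile time.  The following
 is a minimal sketch; the function name 'wchar_is_iso10646' is
 hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper, not a C library function.
   Return 1 if the implementation promises that wchar_t values are
   ISO 10646 code points, and 0 if no such promise is made.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  /* No assumption about the wide character representation is safe.  */
  return 0;
#endif
}
```

 Code that gets 0 from this check should restrict itself to the wide
 character functions of the C library instead of interpreting
 'wchar_t' values numerically.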