(libc.info.gz) glibc iconv Implementation

Info Catalog (libc.info.gz) Other iconv Implementations (libc.info.gz) Generic Charset Conversion
 
 6.5.4 The 'iconv' Implementation in the GNU C Library
 -----------------------------------------------------
 
 After reading about the problems of 'iconv' implementations in the last
 section it is certainly good to note that the implementation in the GNU
 C Library has none of the problems mentioned above.  What follows is a
 step-by-step analysis of the points raised above.  The evaluation is
 based on the current state of the development (as of January 1999).  The
 development of the 'iconv' functions is not complete, but basic
 functionality has solidified.
 
    The GNU C Library's 'iconv' implementation uses shared loadable
 modules to implement the conversions.  A very small number of
 conversions are built into the library itself but these are only rather
 trivial conversions.
 
    All the benefits of loadable modules are available in the GNU C
 Library implementation.  This is especially appealing since the
 interface is well documented (see below), and it, therefore, is easy to
 write new conversion modules.  The drawback of using loadable objects is
 not a problem in the GNU C Library, at least on ELF systems.  Since the
 library is able to load shared objects even in statically linked
 binaries, static linking need not be forbidden in case one wants to use
 'iconv'.
 
    The second mentioned problem is the number of supported conversions.
 Currently, the GNU C Library supports more than 150 character sets.  The
 way the implementation is designed the number of supported conversions
 is greater than 22350 (150 times 149).  If any conversion from or to a
 character set is missing, it can be added easily.
 
    Particularly impressive as it may be, this high number is due to the
 fact that the GNU C Library implementation of 'iconv' does not have the
 third problem mentioned above (i.e., whenever there is a conversion from
 a character set A to B and from B to C it is always possible to convert
 from A to C directly).  If the 'iconv_open' returns an error and sets
 'errno' to 'EINVAL', there is no known way, directly or indirectly, to
 perform the wanted conversion.
 
    Triangulation is achieved by providing for each character set a
 conversion from and to UCS-4 encoded ISO 10646.  Using ISO 10646 as an
 intermediate representation it is possible to "triangulate" (i.e.,
 convert with an intermediate representation).
 
    There is no inherent requirement to provide a conversion to ISO 10646
 for a new character set, and it is also possible to provide other
 conversions where neither source nor destination character set is
 ISO 10646.  The existing set of conversions is simply meant to cover all
 conversions that might be of interest.
 
    All currently available conversions use the triangulation method
 above, making conversion run unnecessarily slow.  If, for example,
 somebody often needs the conversion from ISO-2022-JP to EUC-JP, a
 quicker solution would involve direct conversion between the two
 character sets, skipping the input to ISO 10646 first.  The two
 character sets of interest are much more similar to each other than to
 ISO 10646.
 
    In such a situation one easily can write a new conversion and provide
 it as a better alternative.  The GNU C Library 'iconv' implementation
 would automatically use the module implementing the conversion if it is
 specified to be more efficient.
 
 6.5.4.1 Format of 'gconv-modules' files
 .......................................
 
 All information about the available conversions comes from a file named
 'gconv-modules', which can be found in any of the directories along the
 'GCONV_PATH'.  The 'gconv-modules' files are line-oriented text files,
 where each of the lines has one of the following formats:
 
    * If the first non-whitespace character is a '#' the line contains
      only comments and is ignored.
 
    * Lines starting with 'alias' define an alias name for a character
      set.  Two more words are expected on the line.  The first word
      defines the alias name, and the second defines the original name of
      the character set.  The effect is that it is possible to use the
      alias name in the FROMSET or TOSET parameters of 'iconv_open' and
      achieve the same result as when using the real character set name.
 
      This is quite important as a character set has often many different
      names.  There is normally an official name but this need not
      correspond to the most popular name.  Beside this many character
      sets have special names that are somehow constructed.  For example,
      all character sets specified by the ISO have an alias of the form
      'ISO-IR-NNN' where NNN is the registration number.  This allows
      programs that know about the registration number to construct
      character set names and use them in 'iconv_open' calls.  More on
      the available names and aliases follows below.
 
    * Lines starting with 'module' introduce an available conversion
      module.  These lines must contain three or four more words.
 
      The first word specifies the source character set, the second word
      the destination character set of conversion implemented in this
      module, and the third word is the name of the loadable module.  The
      filename is constructed by appending the usual shared object suffix
      (normally '.so') and this file is then supposed to be found in the
      same directory the 'gconv-modules' file is in.  The last word on
      the line, which is optional, is a numeric value representing the
      cost of the conversion.  If this word is missing, a cost of 1 is
      assumed.  The numeric value itself does not matter that much; what
      counts are the relative values of the sums of costs for all
      possible conversion paths.  Below is a more precise description of
      the use of the cost value.
 
    Returning to the example above where one has written a module to
 directly convert from ISO-2022-JP to EUC-JP and back.  All that has to
 be done is to put the new module, let its name be ISO2022JP-EUCJP.so, in
 a directory and add a file 'gconv-modules' with the following content in
 the same directory:
 
      module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
      module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
 
    To see why this is sufficient, it is necessary to understand how the
 conversion used by 'iconv' (and described in the descriptor) is
 selected.  The approach to this problem is quite simple.
 
    At the first call of the 'iconv_open' function the program reads all
 available 'gconv-modules' files and builds up two tables: one containing
 all the known aliases and another that contains the information about
 the conversions and which shared object implements them.
 
 6.5.4.2 Finding the conversion path in 'iconv'
 ..............................................
 
 The set of available conversions form a directed graph with weighted
 edges.  The weights on the edges are the costs specified in the
 'gconv-modules' files.  The 'iconv_open' function uses an algorithm
 suitable for search for the best path in such a graph and so constructs
 a list of conversions that must be performed in succession to get the
 transformation from the source to the destination character set.
 
    Explaining why the above 'gconv-modules' files allows the 'iconv'
 implementation to resolve the specific ISO-2022-JP to EUC-JP conversion
 module instead of the conversion coming with the library itself is
 straightforward.  Since the latter conversion takes two steps (from
 ISO-2022-JP to ISO 10646 and then from ISO 10646 to EUC-JP), the cost is
 1+1 = 2.  The above 'gconv-modules' file, however, specifies that the
 new conversion modules can perform this conversion with only the cost of
 1.
 
    A mysterious item about the 'gconv-modules' file above (and also the
 file coming with the GNU C Library) are the names of the character sets
 specified in the 'module' lines.  Why do almost all the names end in
 '//'?  And this is not all: the names can actually be regular
 expressions.  At this point in time this mystery should not be revealed,
 unless you have the relevant spell-casting materials: ashes from an
 original DOS 6.2 boot disk burnt in effigy, a crucifix blessed by St.
 Emacs, assorted herbal roots from Central America, sand from Cebu, etc.
 Sorry!  *The part of the implementation where this is used is not yet
 finished.  For now please simply follow the existing examples.  It'll
 become clearer once it is.  -drepper*
 
    A last remark about the 'gconv-modules' is about the names not ending
 with '//'.  A character set named 'INTERNAL' is often mentioned.  From
 the discussion above and the chosen name it should have become clear
 that this is the name for the representation used in the intermediate
 step of the triangulation.  We have said that this is UCS-4 but actually
 that is not quite right.  The UCS-4 specification also includes the
 specification of the byte ordering used.  Since a UCS-4 value consists
 of four bytes, a stored value is affected by byte ordering.  The
 internal representation is _not_ the same as UCS-4 in case the byte
 ordering of the processor (or at least the running process) is not the
 same as the one required for UCS-4.  This is done for performance
 reasons as one does not want to perform unnecessary byte-swapping
 operations if one is not interested in actually seeing the result in
 UCS-4.  To avoid trouble with endianness, the internal representation
 consistently is named 'INTERNAL' even on big-endian systems where the
 representations are identical.
 
 6.5.4.3 'iconv' module data structures
 ......................................
 
 So far this section has described how modules are located and considered
 to be used.  What remains to be described is the interface of the
 modules so that one can write new ones.  This section describes the
 interface as it is in use in January 1999.  The interface will change a
 bit in the future but, with luck, only in an upwardly compatible way.
 
    The definitions necessary to write new modules are publicly available
 in the non-standard header 'gconv.h'.  The following text, therefore,
 describes the definitions from this header file.  First, however, it is
 necessary to get an overview.
 
    From the perspective of the user of 'iconv' the interface is quite
 simple: the 'iconv_open' function returns a handle that can be used in
 calls to 'iconv', and finally the handle is freed with a call to
 'iconv_close'.  The problem is that the handle has to be able to
 represent the possibly long sequences of conversion steps and also the
 state of each conversion since the handle is all that is passed to the
 'iconv' function.  Therefore, the data structures are really the
 elements necessary to understanding the implementation.
 
    We need two different kinds of data structures.  The first describes
 the conversion and the second describes the state etc.  There are really
 two type definitions like this in 'gconv.h'.
 
  -- Data type: struct __gconv_step
      This data structure describes one conversion a module can perform.
      For each function in a loaded module with conversion functions
      there is exactly one object of this type.  This object is shared by
      all users of the conversion (i.e., this object does not contain any
      information corresponding to an actual conversion; it only
      describes the conversion itself).
 
      'struct __gconv_loaded_object *__shlib_handle'
      'const char *__modname'
      'int __counter'
           All these elements of the structure are used internally in the
           C library to coordinate loading and unloading the shared.  One
           must not expect any of the other elements to be available or
           initialized.
 
      'const char *__from_name'
      'const char *__to_name'
           '__from_name' and '__to_name' contain the names of the source
           and destination character sets.  They can be used to identify
           the actual conversion to be carried out since one module might
           implement conversions for more than one character set and/or
           direction.
 
      'gconv_fct __fct'
      'gconv_init_fct __init_fct'
      'gconv_end_fct __end_fct'
           These elements contain pointers to the functions in the
           loadable module.  The interface will be explained below.
 
      'int __min_needed_from'
      'int __max_needed_from'
      'int __min_needed_to'
      'int __max_needed_to;'
           These values have to be supplied in the init function of the
           module.  The '__min_needed_from' value specifies how many
           bytes a character of the source character set at least needs.
           The '__max_needed_from' specifies the maximum value that also
           includes possible shift sequences.
 
           The '__min_needed_to' and '__max_needed_to' values serve the
           same purpose as '__min_needed_from' and '__max_needed_from'
           but this time for the destination character set.
 
           It is crucial that these values be accurate since otherwise
           the conversion functions will have problems or not work at
           all.
 
      'int __stateful'
           This element must also be initialized by the init function.
           'int __stateful' is nonzero if the source character set is
           stateful.  Otherwise it is zero.
 
      'void *__data'
           This element can be used freely by the conversion functions in
           the module.  'void *__data' can be used to communicate extra
           information from one call to another.  'void *__data' need not
           be initialized if not needed at all.  If 'void *__data'
           element is assigned a pointer to dynamically allocated memory
           (presumably in the init function) it has to be made sure that
           the end function deallocates the memory.  Otherwise the
           application will leak memory.
 
           It is important to be aware that this data structure is shared
           by all users of this specification conversion and therefore
           the '__data' element must not contain data specific to one
           specific use of the conversion function.
 
  -- Data type: struct __gconv_step_data
      This is the data structure that contains the information specific
      to each use of the conversion functions.
 
      'char *__outbuf'
      'char *__outbufend'
           These elements specify the output buffer for the conversion
           step.  The '__outbuf' element points to the beginning of the
           buffer, and '__outbufend' points to the byte following the
           last byte in the buffer.  The conversion function must not
           assume anything about the size of the buffer but it can be
           safely assumed the there is room for at least one complete
           character in the output buffer.
 
           Once the conversion is finished, if the conversion is the last
           step, the '__outbuf' element must be modified to point after
           the last byte written into the buffer to signal how much
           output is available.  If this conversion step is not the last
           one, the element must not be modified.  The '__outbufend'
           element must not be modified.
 
      'int __is_last'
           This element is nonzero if this conversion step is the last
           one.  This information is necessary for the recursion.  See
           the description of the conversion function internals below.
           This element must never be modified.
 
      'int __invocation_counter'
           The conversion function can use this element to see how many
           calls of the conversion function already happened.  Some
           character sets require a certain prolog when generating
           output, and by comparing this value with zero, one can find
           out whether it is the first call and whether, therefore, the
           prolog should be emitted.  This element must never be
           modified.
 
      'int __internal_use'
           This element is another one rarely used but needed in certain
           situations.  It is assigned a nonzero value in case the
           conversion functions are used to implement 'mbsrtowcs' et.al.
           (i.e., the function is not used directly through the 'iconv'
           interface).
 
           This sometimes makes a difference as it is expected that the
           'iconv' functions are used to translate entire texts while the
           'mbsrtowcs' functions are normally used only to convert single
           strings and might be used multiple times to convert entire
           texts.
 
           But in this situation we would have problem complying with
           some rules of the character set specification.  Some character
           sets require a prolog, which must appear exactly once for an
           entire text.  If a number of 'mbsrtowcs' calls are used to
           convert the text, only the first call must add the prolog.
           However, because there is no communication between the
           different calls of 'mbsrtowcs', the conversion functions have
           no possibility to find this out.  The situation is different
           for sequences of 'iconv' calls since the handle allows access
           to the needed information.
 
           The 'int __internal_use' element is mostly used together with
           '__invocation_counter' as follows:
 
                if (!data->__internal_use
                     && data->__invocation_counter == 0)
                  /* Emit prolog.  */
                  ...
 
           This element must never be modified.
 
      'mbstate_t *__statep'
           The '__statep' element points to an object of type 'mbstate_t'
           ( Keeping the state).  The conversion of a stateful
           character set must use the object pointed to by '__statep' to
           store information about the conversion state.  The '__statep'
           element itself must never be modified.
 
      'mbstate_t __state'
           This element must _never_ be used directly.  It is only part
           of this structure to have the needed space allocated.
 
 6.5.4.4 'iconv' module interfaces
 .................................
 
 With the knowledge about the data structures we now can describe the
 conversion function itself.  To understand the interface a bit of
 knowledge is necessary about the functionality in the C library that
 loads the objects with the conversions.
 
    It is often the case that one conversion is used more than once
 (i.e., there are several 'iconv_open' calls for the same set of
 character sets during one program run).  The 'mbsrtowcs' et.al.
 functions in the GNU C Library also use the 'iconv' functionality, which
 increases the number of uses of the same functions even more.
 
    Because of this multiple use of conversions, the modules do not get
 loaded exclusively for one conversion.  Instead a module once loaded can
 be used by an arbitrary number of 'iconv' or 'mbsrtowcs' calls at the
 same time.  The splitting of the information between conversion-
 function-specific information and conversion data makes this possible.
 The last section showed the two data structures used to do this.
 
    This is of course also reflected in the interface and semantics of
 the functions that the modules must provide.  There are three functions
 that must have the following names:
 
 'gconv_init'
      The 'gconv_init' function initializes the conversion function
      specific data structure.  This very same object is shared by all
      conversions that use this conversion and, therefore, no state
      information about the conversion itself must be stored in here.  If
      a module implements more than one conversion, the 'gconv_init'
      function will be called multiple times.
 
 'gconv_end'
      The 'gconv_end' function is responsible for freeing all resources
      allocated by the 'gconv_init' function.  If there is nothing to do,
      this function can be missing.  Special care must be taken if the
      module implements more than one conversion and the 'gconv_init'
      function does not allocate the same resources for all conversions.
 
 'gconv'
      This is the actual conversion function.  It is called to convert
      one block of text.  It gets passed the conversion step information
      initialized by 'gconv_init' and the conversion data, specific to
      this use of the conversion functions.
 
    There are three data types defined for the three module interface
 functions and these define the interface.
 
  -- Data type: int (*__gconv_init_fct) (struct __gconv_step *)
      This specifies the interface of the initialization function of the
      module.  It is called exactly once for each conversion the module
      implements.
 
      As explained in the description of the 'struct __gconv_step' data
      structure above the initialization function has to initialize parts
      of it.
 
      '__min_needed_from'
      '__max_needed_from'
      '__min_needed_to'
      '__max_needed_to'
           These elements must be initialized to the exact numbers of the
           minimum and maximum number of bytes used by one character in
           the source and destination character sets, respectively.  If
           the characters all have the same size, the minimum and maximum
           values are the same.
 
      '__stateful'
           This element must be initialized to a nonzero value if the
           source character set is stateful.  Otherwise it must be zero.
 
      If the initialization function needs to communicate some
      information to the conversion function, this communication can
      happen using the '__data' element of the '__gconv_step' structure.
      But since this data is shared by all the conversions, it must not
      be modified by the conversion function.  The example below shows
      how this can be used.
 
           #define MIN_NEEDED_FROM         1
           #define MAX_NEEDED_FROM         4
           #define MIN_NEEDED_TO           4
           #define MAX_NEEDED_TO           4
 
           int
           gconv_init (struct __gconv_step *step)
           {
             /* Determine which direction.  */
             struct iso2022jp_data *new_data;
             enum direction dir = illegal_dir;
             enum variant var = illegal_var;
             int result;
 
             if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
               {
                 dir = from_iso2022jp;
                 var = iso2022jp;
               }
             else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
               {
                 dir = to_iso2022jp;
                 var = iso2022jp;
               }
             else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
               {
                 dir = from_iso2022jp;
                 var = iso2022jp2;
               }
             else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
               {
                 dir = to_iso2022jp;
                 var = iso2022jp2;
               }
 
             result = __GCONV_NOCONV;
             if (dir != illegal_dir)
               {
                 new_data = (struct iso2022jp_data *)
                   malloc (sizeof (struct iso2022jp_data));
 
                 result = __GCONV_NOMEM;
                 if (new_data != NULL)
                   {
                     new_data->dir = dir;
                     new_data->var = var;
                     step->__data = new_data;
 
                     if (dir == from_iso2022jp)
                       {
                         step->__min_needed_from = MIN_NEEDED_FROM;
                         step->__max_needed_from = MAX_NEEDED_FROM;
                         step->__min_needed_to = MIN_NEEDED_TO;
                         step->__max_needed_to = MAX_NEEDED_TO;
                       }
                     else
                       {
                         step->__min_needed_from = MIN_NEEDED_TO;
                         step->__max_needed_from = MAX_NEEDED_TO;
                         step->__min_needed_to = MIN_NEEDED_FROM;
                         step->__max_needed_to = MAX_NEEDED_FROM + 2;
                       }
 
                     /* Yes, this is a stateful encoding.  */
                     step->__stateful = 1;
 
                     result = __GCONV_OK;
                   }
               }
 
             return result;
           }
 
      The function first checks which conversion is wanted.  The module
      from which this function is taken implements four different
      conversions; which one is selected can be determined by comparing
      the names.  The comparison should always be done without paying
      attention to the case.
 
      Next, a data structure, which contains the necessary information
      about which conversion is selected, is allocated.  The data
      structure 'struct iso2022jp_data' is locally defined since, outside
      the module, this data is not used at all.  Please note that if all
      four conversions this modules supports are requested there are four
      data blocks.
 
      One interesting thing is the initialization of the '__min_' and
      '__max_' elements of the step data object.  A single ISO-2022-JP
      character can consist of one to four bytes.  Therefore the
      'MIN_NEEDED_FROM' and 'MAX_NEEDED_FROM' macros are defined this
      way.  The output is always the 'INTERNAL' character set (aka UCS-4)
      and therefore each character consists of exactly four bytes.  For
      the conversion from 'INTERNAL' to ISO-2022-JP we have to take into
      account that escape sequences might be necessary to switch the
      character sets.  Therefore the '__max_needed_to' element for this
      direction gets assigned 'MAX_NEEDED_FROM + 2'.  This takes into
      account the two bytes needed for the escape sequences to single the
      switching.  The asymmetry in the maximum values for the two
      directions can be explained easily: when reading ISO-2022-JP text,
      escape sequences can be handled alone (i.e., it is not necessary to
      process a real character since the effect of the escape sequence
      can be recorded in the state information).  The situation is
      different for the other direction.  Since it is in general not
      known which character comes next, one cannot emit escape sequences
      to change the state in advance.  This means the escape sequences
      that have to be emitted together with the next character.
      Therefore one needs more room than only for the character itself.
 
      The possible return values of the initialization function are:
 
      '__GCONV_OK'
           The initialization succeeded
      '__GCONV_NOCONV'
           The requested conversion is not supported in the module.  This
           can happen if the 'gconv-modules' file has errors.
      '__GCONV_NOMEM'
           Memory required to store additional information could not be
           allocated.
 
    The function called before the module is unloaded is significantly
 easier.  It often has nothing at all to do; in which case it can be left
 out completely.
 
  -- Data type: void (*__gconv_end_fct) (struct gconv_step *)
      The task of this function is to free all resources allocated in the
      initialization function.  Therefore only the '__data' element of
      the object pointed to by the argument is of interest.  Continuing
      the example from the initialization function, the finalization
      function looks like this:
 
           void
           gconv_end (struct __gconv_step *data)
           {
             free (data->__data);
           }
 
    The most important function is the conversion function itself, which
 can get quite complicated for complex character sets.  But since this is
 not of interest here, we will only describe a possible skeleton for the
 conversion function.
 
  -- Data type: int (*__gconv_fct) (struct __gconv_step *, struct
           __gconv_step_data *, const char **, const char *, size_t *,
           int)
      The conversion function can be called for two basic reason: to
      convert text or to reset the state.  From the description of the
      'iconv' function it can be seen why the flushing mode is necessary.
      What mode is selected is determined by the sixth argument, an
      integer.  This argument being nonzero means that flushing is
      selected.
 
      Common to both modes is where the output buffer can be found.  The
      information about this buffer is stored in the conversion step
      data.  A pointer to this information is passed as the second
      argument to this function.  The description of the 'struct
      __gconv_step_data' structure has more information on the conversion
      step data.
 
      What has to be done for flushing depends on the source character
      set.  If the source character set is not stateful, nothing has to
      be done.  Otherwise the function has to emit a byte sequence to
      bring the state object into the initial state.  Once this all
      happened the other conversion modules in the chain of conversions
      have to get the same chance.  Whether another step follows can be
      determined from the '__is_last' element of the step data structure
      to which the first parameter points.
 
      The more interesting mode is when actual text has to be converted.
      The first step in this case is to convert as much text as possible
      from the input buffer and store the result in the output buffer.
      The start of the input buffer is determined by the third argument,
      which is a pointer to a pointer variable referencing the beginning
      of the buffer.  The fourth argument is a pointer to the byte right
      after the last byte in the buffer.
 
      The conversion has to be performed according to the current state
      if the character set is stateful.  The state is stored in an object
      pointed to by the '__statep' element of the step data (second
      argument).  Once either the input buffer is empty or the output
      buffer is full the conversion stops.  At this point, the pointer
      variable referenced by the third parameter must point to the byte
      following the last processed byte (i.e., if all of the input is
      consumed, this pointer and the fourth parameter have the same
      value).
 
      What now happens depends on whether this step is the last one.  If
      it is the last step, the only thing that has to be done is to
      update the '__outbuf' element of the step data structure to point
      after the last written byte.  This update gives the caller the
      information on how much text is available in the output buffer.  In
      addition, the variable pointed to by the fifth parameter, which is
      of type 'size_t', must be incremented by the number of characters
      (_not bytes_) that were converted in a non-reversible way.  Then,
      the function can return.
 
      In case the step is not the last one, the later conversion
      functions have to get a chance to do their work.  Therefore, the
      appropriate conversion function has to be called.  The information
      about the functions is stored in the conversion data structures,
      passed as the first parameter.  This information and the step data
      are stored in arrays, so the next element in both cases can be
      found by simple pointer arithmetic:
 
           int
           gconv (struct __gconv_step *step, struct __gconv_step_data *data,
                  const char **inbuf, const char *inbufend, size_t *written,
                  int do_flush)
           {
             struct __gconv_step *next_step = step + 1;
             struct __gconv_step_data *next_data = data + 1;
             ...
 
      The 'next_step' pointer references the next step information and
      'next_data' the next data record.  The call of the next function
      therefore will look similar to this:
 
             next_step->__fct (next_step, next_data, &outerr, outbuf,
                               written, 0)
 
      But this is not yet all.  Once the function call returns the
      conversion function might have some more to do.  If the return
      value of the function is '__GCONV_EMPTY_INPUT', more room is
      available in the output buffer.  Unless the input buffer is empty
      the conversion, functions start all over again and process the rest
      of the input buffer.  If the return value is not
      '__GCONV_EMPTY_INPUT', something went wrong and we have to recover
      from this.
 
      A requirement for the conversion function is that the input buffer
      pointer (the third argument) always point to the last character
      that was put in converted form into the output buffer.  This is
      trivially true after the conversion performed in the current step,
      but if the conversion functions deeper downstream stop prematurely,
      not all characters from the output buffer are consumed and,
      therefore, the input buffer pointers must be backed off to the
      right position.
 
      Correcting the input buffers is easy to do if the input and output
      character sets have a fixed width for all characters.  In this
      situation we can compute how many characters are left in the output
      buffer and, therefore, can correct the input buffer pointer
      appropriately with a similar computation.  Things are getting
      tricky if either character set has characters represented with
      variable length byte sequences, and it gets even more complicated
      if the conversion has to take care of the state.  In these cases
      the conversion has to be performed once again, from the known state
      before the initial conversion (i.e., if necessary the state of the
      conversion has to be reset and the conversion loop has to be
      executed again).  The difference now is that it is known how much
      input must be created, and the conversion can stop before
      converting the first unused character.  Once this is done the input
      buffer pointers must be updated again and the function can return.
 
      One final thing should be mentioned.  If it is necessary for the
      conversion to know whether it is the first invocation (in case a
      prolog has to be emitted), the conversion function should increment
      the '__invocation_counter' element of the step data structure just
      before returning to the caller.  See the description of the 'struct
      __gconv_step_data' structure above for more information on how this
      can be used.
 
      The return value must be one of the following values:
 
      '__GCONV_EMPTY_INPUT'
           All input was consumed and there is room left in the output
           buffer.
      '__GCONV_FULL_OUTPUT'
           No more room in the output buffer.  In case this is not the
           last step this value is propagated down from the call of the
           next conversion function in the chain.
      '__GCONV_INCOMPLETE_INPUT'
           The input buffer is not entirely empty since it contains an
           incomplete character sequence.
 
      The following example provides a framework for a conversion
      function.  In case a new conversion has to be written the holes in
      this implementation have to be filled and that is it.
 
           int
           gconv (struct __gconv_step *step, struct __gconv_step_data *data,
                  const char **inbuf, const char *inbufend, size_t *written,
                  int do_flush)
           {
             struct __gconv_step *next_step = step + 1;
             struct __gconv_step_data *next_data = data + 1;
             gconv_fct fct = next_step->__fct;
             int status;
 
             /* If the function is called with no input this means we have
                to reset to the initial state.  The possibly partly
                converted input is dropped.  */
             if (do_flush)
               {
                 status = __GCONV_OK;
 
                 /* Possible emit a byte sequence which put the state object
                    into the initial state.  */
 
                 /* Call the steps down the chain if there are any but only
                    if we successfully emitted the escape sequence.  */
                 if (status == __GCONV_OK && ! data->__is_last)
                   status = fct (next_step, next_data, NULL, NULL,
                                 written, 1);
               }
             else
               {
                 /* We preserve the initial values of the pointer variables.  */
                 const char *inptr = *inbuf;
                 char *outbuf = data->__outbuf;
                 char *outend = data->__outbufend;
                 char *outptr;
 
                 do
                   {
                     /* Remember the start value for this round.  */
                     inptr = *inbuf;
                     /* The outbuf buffer is empty.  */
                     outptr = outbuf;
 
                     /* For stateful encodings the state must be safe here.  */
 
                     /* Run the conversion loop.  'status' is set
                        appropriately afterwards.  */
 
                     /* If this is the last step, leave the loop.  There is
                        nothing we can do.  */
                     if (data->__is_last)
                       {
                         /* Store information about how many bytes are
                            available.  */
                         data->__outbuf = outbuf;
 
                        /* If any non-reversible conversions were performed,
                           add the number to '*written'.  */
 
                        break;
                      }
 
                     /* Write out all output that was produced.  */
                     if (outbuf > outptr)
                       {
                         const char *outerr = data->__outbuf;
                         int result;
 
                         result = fct (next_step, next_data, &outerr,
                                       outbuf, written, 0);
 
                         if (result != __GCONV_EMPTY_INPUT)
                           {
                             if (outerr != outbuf)
                               {
                                 /* Reset the input buffer pointer.  We
                                    document here the complex case.  */
                                 size_t nstatus;
 
                                 /* Reload the pointers.  */
                                 *inbuf = inptr;
                                 outbuf = outptr;
 
                                 /* Possibly reset the state.  */
 
                                 /* Redo the conversion, but this time
                                    the end of the output buffer is at
                                    'outerr'.  */
                               }
 
                             /* Change the status.  */
                             status = result;
                           }
                         else
                           /* All the output is consumed, we can make
                               another run if everything was ok.  */
                           if (status == __GCONV_FULL_OUTPUT)
                             status = __GCONV_OK;
                      }
                   }
                 while (status == __GCONV_OK);
 
                 /* We finished one use of this step.  */
                 ++data->__invocation_counter;
               }
 
             return status;
           }
 
    This information should be sufficient to write new modules.  Anybody
 doing so should also take a look at the available source code in the GNU
 C Library sources.  It contains many examples of working and optimized
 modules.
 
Info Catalog (libc.info.gz) Other iconv Implementations (libc.info.gz) Generic Charset Conversion
automatically generated by info2html