(aspell.info.gz) Words With Symbols in Them

 
 C.2 Words With Spaces or Other Symbols in Them
 ==============================================
 
 Many languages, including English, have words with non-letter symbols in
 them, such as the apostrophe.  These symbols generally appear in the
 middle of a word, but they can also appear at the end, as in an
 abbreviation.  If a symbol can _only_ appear as part of a word, then
 Aspell can treat it as if it were a letter.
 
    The problem, however, is that most of these symbols have other uses.
 For example, the apostrophe is often used as a single quote, and the
 period that marks an abbreviation also ends a sentence.  Thus, Aspell
 cannot blindly treat them as if they were letters.
 
    Aspell currently handles the case where the symbol can only appear in
 the middle of the word fairly well.  It simply assumes that if there is
 a letter both before and after the symbol, then the symbol is part of
 the word.  This works most of the time, but it is not foolproof.  For
 example, suppose the user forgot to leave a space after the period:
 
        ... and the dog went up the tree.Then the cat ...
 
 Aspell would think "tree.Then" is one word.  A better solution might be
 to then try to check "tree" and "Then" separately.  But what if one of
 them is not in the dictionary?  Should Aspell assume "tree.Then" is one
 word?
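 
    The letter-before-and-after rule can be sketched in a few lines of
 Python.  This is only an illustration and not Aspell's actual tokenizer:
 the name `tokenize' and the hard-coded `MIDDLE_SYMBOLS' set are
 assumptions made for the example, whereas Aspell would take the
 permitted symbols from its language data.

```python
# Hypothetical set of symbols allowed in the middle of a word.  This is an
# assumption for illustration; it is not read from any Aspell data file.
MIDDLE_SYMBOLS = {"'", "."}


def tokenize(text, middle_symbols=MIDDLE_SYMBOLS):
    """Split text into words, keeping a symbol only when letters surround it."""
    words = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalpha():
            current.append(ch)
        elif (
            ch in middle_symbols
            and current                  # a letter came directly before
            and i + 1 < len(text)
            and text[i + 1].isalpha()    # and a letter comes directly after
        ):
            # Letter on both sides: treat the symbol as part of the word.
            current.append(ch)
        else:
            # Any other character ends the current word.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words
```

    Under these assumptions, `tokenize("the tree.Then the cat")' yields
 "the", "tree.Then", "the" and "cat", which reproduces the problem
 described above, while a leading or trailing quote as in "'quoted'" is
 dropped.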
 
    The case where the symbol can appear at the beginning or end of the
 word is more difficult to deal with.  The symbol may or may not
 actually be part of the word.  Aspell currently handles this case by
 first trying to spell check the word with the symbol and, if that
 fails, trying it without.  The problem is, if the word is misspelled,
 should Aspell assume the symbol belongs with the word or not?
 Currently Aspell assumes it does, which is not always the correct thing
 to do.
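 
    A minimal sketch of this "with the symbol first, then without"
 strategy follows.  The function name `check_with_edge_symbol', the
 `edge_symbols' parameter, and the use of a plain Python set as the
 dictionary are all assumptions made for illustration; they do not
 correspond to Aspell's real interface.

```python
def check_with_edge_symbol(token, dictionary, edge_symbols=(".", "'")):
    """Return (word, found) for a token that may carry leading/trailing symbols.

    `dictionary` is any container supporting `in`; `edge_symbols` is a
    hypothetical list of symbols allowed at the start or end of a word.
    """
    # First try the token exactly as written, symbol included.
    if token in dictionary:
        return token, True

    # If that fails, try again with the edge symbols removed.
    stripped = token.strip("".join(edge_symbols))
    if stripped != token and stripped in dictionary:
        return stripped, True

    # Neither form is known.  Keep the symbol attached to the word, which
    # mirrors the assumption described in the text; it is not always right.
    return token, False
```

    With a dictionary containing "etc." and "dog", the token "etc." is
 accepted as written, "dog." is accepted as "dog", and the misspelling
 "dgo." is reported with its period still attached.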
 
    Numbers in words present a different challenge to Aspell.  If Aspell
 treats numbers as letters, then every possible number a user might write
 in a document must be specified in the dictionary.  This could easily
 be solved by having special code to assume all numbers are correctly
 spelled.  Yet what about something like "4th"?  Since the "th" suffix
 can appear after any number, we are left with the same problem.  The
 solution would be to have a special symbol for "any number".
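 
    One way to picture the "any number" symbol is to replace every run
 of digits with a placeholder before the dictionary lookup.  The sketch
 below is hypothetical: the placeholder "#" and the names
 `normalize_numbers' and `is_known' are invented for this example and
 are not part of Aspell.

```python
import re

# Hypothetical placeholder standing for "any number" in dictionary entries.
ANY_NUMBER = "#"


def normalize_numbers(token):
    """Replace each run of digits in the token with the placeholder."""
    return re.sub(r"\d+", ANY_NUMBER, token)


def is_known(token, dictionary):
    """Look the token up after collapsing its digits to the placeholder."""
    return normalize_numbers(token) in dictionary
```

    A dictionary holding only "#" and "#th" then covers "1234", "4th"
 and "27th" alike.  Note that this simple scheme also accepts forms
 such as "1th", so it trades some precision for a much smaller
 dictionary.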
 
    Words with spaces in them, such as foreign phrases, are even more
 trouble to deal with.  The basic problem is that when tokenizing a
 string there is no good way to keep phrases together.  One solution is
 to use trial and error.  If a word is not in the dictionary, try
 grouping it with the previous or next word and see if the combined word
 is in the dictionary.  But if the combined word is not in the
 dictionary either, should the misspelled word be grouped when looking
 for suggestions?  One option is to also store each part of the phrase
 in the dictionary, but tag it as part of a phrase and not an
 independent word.
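 
    The trial-and-error grouping can be sketched as follows.  This is an
 illustration under simplifying assumptions: phrases are limited to two
 words, the word list and phrase list are plain Python sets, and the
 name `find_unknown' is hypothetical.  It does not address the open
 question of how to group words when generating suggestions.

```python
def find_unknown(tokens, words, phrases):
    """Return tokens that are neither known words nor part of a known phrase.

    `words` holds single words and `phrases` holds two-word phrases joined
    by one space.  Both containers are assumptions made for illustration.
    """
    unknown = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in words:
            i += 1
            continue
        # Not a known word: try grouping it with the next token.
        if i + 1 < len(tokens) and f"{tok} {tokens[i + 1]}" in phrases:
            i += 2  # both halves of the phrase are accepted
            continue
        # Otherwise try grouping it with the previous token.
        if i > 0 and f"{tokens[i - 1]} {tok}" in phrases:
            i += 1
            continue
        unknown.append(tok)
        i += 1
    return unknown
```

    For instance, with "ad hoc" stored as a phrase, the tokens "ad" and
 "hoc" are accepted together, while a lone "ad" next to unrelated words
 is still reported as unknown.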
 
    To further complicate things, most applications that use spell
 checkers are accustomed to parsing the document themselves and sending
 it to the spell checker a word at a time.  In order to support words
 with spaces in them, a more complicated interface will be required.
 