(aspell.info.gz) Words With Symbols in Them
C.2 Words With Spaces or Other Symbols in Them
==============================================
Many languages, including English, have words that contain non-letter
symbols, such as the apostrophe. These symbols generally appear in
the middle of a word, but they can also appear at the end, as in an
abbreviation. If a symbol can _only_ appear as part of a word, then
Aspell can treat it as if it were a letter.
The problem, however, is that most of these symbols have other uses. For
example, the apostrophe is often used as a single quote, and the period
that marks an abbreviation also ends a sentence. Thus, Aspell cannot
blindly treat them as if they were letters.
Aspell currently handles the case where the symbol can only appear in
the middle of the word fairly well. It simply assumes that if there is
a letter both before and after the symbol, then it is part of the word.
This works most of the time, but it is not foolproof. For example,
suppose the user forgot to leave a space after the period:
     ... and the dog went up the tree.Then the cat ...
Aspell would think "tree.Then" is one word. A better solution might be
to then try to check "tree" and "Then" separately. But what if one of
them is not in the dictionary? Should Aspell assume "tree.Then" is one
word?
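
The middle-of-word rule described above can be sketched in a few lines.
This is only an illustration of the rule, not Aspell's actual tokenizer;
the function name `tokenize` and the set `MIDDLE_SYMBOLS` are
hypothetical names chosen for this example.

```python
# Hypothetical set of symbols that may appear in the middle of a word.
MIDDLE_SYMBOLS = {"'", "."}

def tokenize(text, middle_symbols=MIDDLE_SYMBOLS):
    """Split text into words, keeping a symbol inside a word only
    when there is a letter both before and after it."""
    words = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalpha():
            current.append(ch)
        elif (ch in middle_symbols
              and current                      # a letter precedes the symbol
              and i + 1 < len(text)
              and text[i + 1].isalpha()):      # a letter follows the symbol
            current.append(ch)
        else:
            # Any other character ends the current word.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words

print(tokenize("don't say 'hello'."))
print(tokenize("up the tree.Then the cat"))
```

The second call shows the weakness discussed above: "tree.Then" comes
back as a single token because the period has a letter on each side.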
The case where the symbol can appear at the beginning or end of the
word is more difficult to deal with. The symbol may or may not
actually be part of the word. Aspell currently handles this case by
first trying to spell check the word with the symbol and, if that fails,
trying it without. The problem is, if the word is misspelled, should
Aspell assume the symbol belongs with the word or not? Currently
Aspell assumes it does, which is not always the correct thing to do.
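
The try-with, then try-without strategy might look roughly like the
following. This is a sketch under stated assumptions, not Aspell's
code: the dictionary is modeled as a plain set of strings, and the
function name `check_with_edge_symbol` and the default `edge_symbols`
are hypothetical.

```python
def check_with_edge_symbol(token, dictionary, edge_symbols=(".", "'")):
    """Return (accepted, form), where form is the spelling that was
    finally looked up or, on failure, the one to report as misspelled."""
    # First try the token exactly as written, symbol included.
    if token in dictionary:
        return True, token
    # If that fails, try again with leading/trailing symbols removed.
    stripped = token.strip("".join(edge_symbols))
    if stripped != token and stripped in dictionary:
        return True, stripped
    # Misspelled either way: keep the symbol attached to the word,
    # which is the assumption the text says Aspell currently makes.
    return False, token

dictionary = {"etc.", "dog"}  # toy dictionary for illustration only
print(check_with_edge_symbol("etc.", dictionary))
print(check_with_edge_symbol("dog.", dictionary))
print(check_with_edge_symbol("dgo.", dictionary))
```

The last call illustrates the open question: "dgo." is reported with
its period attached, even though the period is probably a full stop.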
Numbers in words present a different challenge to Aspell. If Aspell
treats numbers as letters, then every possible number a user might write
in a document must be specified in the dictionary. This could easily
be solved by having special code to assume all numbers are correctly
spelled. Yet what about something like "4th"? Since the "th" suffix
can appear after any number, we are left with the same problem. The
solution would be to have a special symbol for "any number".
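
One way to picture the "any number" symbol is to replace each run of
digits with a placeholder before the dictionary lookup. The sketch
below assumes "#" as that placeholder and a set-based dictionary; both
are illustrative choices, not something Aspell defines.

```python
import re

NUMBER = "#"  # hypothetical placeholder meaning "any number"

def normalize_numbers(word):
    """Replace every run of digits in word with the placeholder."""
    return re.sub(r"\d+", NUMBER, word)

def is_known(word, dictionary):
    """Accept a word if it, or its number-normalized form, is listed."""
    return word in dictionary or normalize_numbers(word) in dictionary

# Toy dictionary: a bare number plus the ordinal suffixes.
dictionary = {"#", "#th", "#st", "#nd", "#rd"}
print(is_known("4th", dictionary))
print(is_known("1999", dictionary))
print(is_known("4xy", dictionary))
```

With this scheme a single entry such as "#th" covers "4th", "11th" and
so on. Note that the sketch does not check that the suffix agrees with
the number, so "2th" would also be accepted.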
Words with spaces in them, such as foreign phrases, are even more
trouble to deal with. The basic problem is that when tokenizing a
string there is no good way to keep phrases together. One solution is to
use trial and error. If a word is not in the dictionary, try grouping it
with the previous or next word and see if the combined word is in the
dictionary. But what if the combined word is not in the dictionary?
Should the misspelled word be grouped when looking for suggestions? One
solution is to also store each part of the phrase in the dictionary, but
tag it as part of a phrase and not an independent word.
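
The trial-and-error grouping can be sketched as follows. This is a
minimal illustration that assumes two-word phrases stored in a
set-based dictionary with a single space between the parts; the
function name `find_unknown` is hypothetical, and the sketch does not
address the suggestion problem raised above.

```python
def find_unknown(words, dictionary):
    """Return the indices of words that are not in the dictionary,
    either alone or grouped with the previous or next word."""
    unknown = []
    for i, word in enumerate(words):
        if word in dictionary:
            continue
        # Try grouping with the previous word, then with the next one.
        with_prev = i > 0 and f"{words[i - 1]} {word}" in dictionary
        with_next = (i + 1 < len(words)
                     and f"{word} {words[i + 1]}" in dictionary)
        if not (with_prev or with_next):
            unknown.append(i)
    return unknown

dictionary = {"the", "ad hoc", "committee", "met"}  # toy dictionary
print(find_unknown(["the", "ad", "hoc", "committee", "met"], dictionary))
print(find_unknown(["the", "ad", "hok"], dictionary))
```

In the second call both "ad" and "hok" are flagged, because neither
grouping is in the dictionary and "ad" is not listed on its own.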
To further complicate things, most applications that use spell
checkers are accustomed to parsing the document themselves and sending it
to the spell checker a word at a time. In order to support words with
spaces in them, a more complicated interface will be required.