(aspell.info.gz) Phonetic Code

 
 7.3 Phonetic Code
 =================
 
 Aspell is in fact the spell checker that comes up with the best
 suggestions if it finds an unknown word.  One reason is that it does
 not just compare the word with other words in the dictionary (like
 Ispell does) but also uses phonetic comparisons with other words.
 
    The new table driven phonetic code is very flexible and setting up
 phonetic transformation rules for other languages is not difficult but
 there can be a number of stumbling blocks -- that's why I wrote this
 section.
 
    The main phonetic code is free of any language specific code and
 should be powerful enough to allow setting up rules for any language.
 Anything which is language specific is kept in a plain text file and
 can easily be edited.  So it's even possible to write phonetic
 transformation rules if you don't have any programming skills.  All you
 need to know is how words of the language are written and how they are
 pronounced.
 
 7.3.1 Syntax of the transformation array
 ----------------------------------------
 
 In the translation array there are two strings on each line; the first
 one is the search string (or switch name) and the second one is the
 replacement string (or switch parameter).  The line
 
      version   VERSION
 
 is also required to appear somewhere in the translation array.  The
 version string can be anything but it should be changed whenever a new
 version of the translation array is released.  This is important
 because it will keep Aspell from using a compiled dictionary with the
 wrong set of rules.  For example, when coming up with suggestions for
 `hallo', Aspell will use the new rules to compute the soundslike, say
 `H*L*', but if `hello' is stored in the dictionary using the old rules
 as `HL' instead of `H*L*', Aspell will never be able to come up with
 `hello'.  To solve this problem, Aspell checks if the version
 strings match and aborts with an error if they don't.  Thus it is
 important to update it whenever a new version of the translation array
 is released.  This is only a problem with the main word list as the
 personal word lists are now stored as simple word lists with a single
 header line (i.e. no soundslike data).
 
    Each non-switch line represents one replacement (transformation)
 rule.  Rules whose search strings begin with the same letter must be
 grouped together.  The order inside such a group is not alphabetical;
 it sets priorities: the earlier a rule appears, the higher its
 priority, because the first rule that matches is applied.  In the
 following example:
 
      GH   _
      G    K
 
 `GH -> _' has higher priority than `G -> K'.
 
    `_' represents the empty string "".  If `GH -> _' came after `G ->
 K', the `GH' rule would never match because the algorithm stops
 searching for more rules after the first match.  The above rules
 transform any `GH' to an empty string (that is, delete it) and
 transform any other `G' to `K'.
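
    As a rough illustration of "first match wins", here is a minimal
 Python sketch.  It is not Aspell's implementation; the names `RULES'
 and `transform' are made up for this example, and only plain search
 strings without control characters are handled.

```python
# Hypothetical mini-engine: rules are tried in order, the first match wins.
RULES = [("GH", ""), ("G", "K")]  # "_" in the table means the empty string


def transform(word, rules):
    out = []
    i = 0
    while i < len(word):
        for search, replacement in rules:
            if word.startswith(search, i):
                out.append(replacement)
                i += len(search)
                break  # stop searching after the first matching rule
        else:
            # No rule for this character: skip it, as this section
            # describes for characters without a replacement rule.
            i += 1
    return "".join(out)


print(transform("GHG", RULES))                  # GH deleted, then G -> K
print(transform("GHG", list(reversed(RULES))))  # G -> K shadows the GH rule
```

 With the rules reversed, `G -> K' matches first at every `G', so the
 `GH' rule is never reached.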
 
    At the end of the first string of a line (the search string) there
 may optionally stand a number of characters in brackets.  One (only
 one!)  of these characters must fit.  It's comparable with the `[ ]'
 brackets in regular expressions.  The rule `DG(EIY) -> J' for example
 would match any `DGE', `DGI' and `DGY' and replace them with `J'.  This
 way you can reduce several rules to one.
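
    The bracket notation can be sketched in Python as follows.  This is
 only an illustration under assumptions: the helper name `match_length'
 is invented here, and the bracketed character is counted as part of
 the replaced text because the paragraph above says that `DGE', `DGI'
 and `DGY' are replaced as a whole.

```python
import re


def match_length(search, word, pos):
    """Return how many characters `search` matches at word[pos:], or 0."""
    m = re.fullmatch(r"([A-Z]+)(?:\(([A-Z]+)\))?", search)
    if m is None:
        raise ValueError(f"unsupported search string: {search!r}")
    literal, choices = m.group(1), m.group(2)
    if not word.startswith(literal, pos):
        return 0
    if choices is None:
        return len(literal)
    end = pos + len(literal)
    # Exactly one character from the bracket must follow the literal part.
    if end < len(word) and word[end] in choices:
        return len(literal) + 1
    return 0


print(match_length("DG(EIY)", "EDGE", 1))   # matches DGE
print(match_length("DG(EIY)", "EDGAR", 1))  # A is not in the bracket
```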
 
    After the search string, one or more dashes `-' may be placed.
 Such search strings are matched in full, but only the beginning of
 the matched string is replaced.  Furthermore, for these rules no
 follow-up rule will be searched (what this is will be explained
 later).  The rule `TCH-- -> _' will match any word containing `TCH'
 (like `match') but will only replace the first character `T' with an
 empty string.  The number of dashes determines how many characters
 from the end will not be replaced.  After the replacement, the search
 for transformation rules continues with the `CH' that was not
 replaced.
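
    A small Python sketch of the dash arithmetic, assuming a search
 string that consists only of letters followed by dashes.  The function
 name `apply_dash_rule' is hypothetical and not part of Aspell.

```python
def apply_dash_rule(word, pos, search, replacement):
    """Apply one dash rule at `pos`.

    Returns (emitted text, position where the search resumes), or None
    if the rule does not match at `pos`.
    """
    literal = search.rstrip("-")
    dashes = len(search) - len(literal)
    if not word.startswith(literal, pos):
        return None
    # The whole literal must match, but the last `dashes` characters
    # are left in place and examined again by later rules.
    replaced = len(literal) - dashes
    return replacement, pos + replaced


emitted, resume = apply_dash_rule("MATCH", 2, "TCH--", "")
print(repr(emitted), "MATCH"[resume:])  # only T is consumed, CH remains
```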
 
    If a `<' is appended to the search string, the search for
 replacement rules will continue with the replacement string and not with
 the next character of the word.  The rule `PH< -> F' for example would
 replace `PH' with `F' and then start again to search for a replacement
 rule for `F...'.  If there were also rules like `FO -> O' and `F -> _',
 then words like `PHOXYZ' would be transformed to `OXYZ', and any
 occurrence of `PH' that is not followed by an `O' would be deleted, as
 in `PHIXYZ -> IXYZ'.  The second replacement however is not applied if
 the priority of this rule is lower than the priority of the first rule.
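
    The restart behaviour of `<' can be sketched like this.  The code
 is a simplified illustration, not Aspell's algorithm: priorities are
 ignored, and letters without a rule are copied to the output so that
 the `PHOXYZ' example from the text can be reproduced.  Both choices
 are assumptions of this sketch.

```python
# Hypothetical rule set taken from the example in the text.
RULES = [("PH<", "F"), ("FO", "O"), ("F", "")]


def transform(word, rules):
    out = []
    i = 0
    while i < len(word):
        for search, replacement in rules:
            restart = search.endswith("<")
            literal = search.rstrip("<")
            if not word.startswith(literal, i):
                continue
            if restart:
                # Put the replacement back into the word and search again
                # from the same position.  A rule whose replacement
                # matches itself would loop forever in this sketch.
                word = word[:i] + replacement + word[i + len(literal):]
            else:
                out.append(replacement)
                i += len(literal)
            break
        else:
            out.append(word[i])  # assumption: keep letters without a rule
            i += 1
    return "".join(out)


print(transform("PHOXYZ", RULES))
print(transform("PHIXYZ", RULES))
```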
 
    Priorities are added to a rule by putting a number between 0 and 9 at
 the end of the search string, for example `ING6 -> N'.  The higher the
 number, the higher the priority.
 
    Priorities are especially important for the previously mentioned
 follow-up rules.  Follow-up rules are searched beginning from the last
 character of the first search string.  This is a bit complicated but I
 hope this example will make it clearer:
 
      CHS      X
      CH       G
 
      HAU--1   H
 
      SCH      SH
 
    In this example `CHS' in the word `FUCHS' would be transformed to
 `X'.
 
    If we take the word `DURCHSCHNITT' then things look a bit
 different.  Here `CH' belongs together and `SCH' belongs together and
 both are spoken separately.  The algorithm however first finds the
 string `CHS', which must not be transformed as it was in the previous
 word `FUCHS'.  At this point the algorithm can find a follow-up rule.
 It takes the last character of the first matching rule (`CHS'), which
 is `S', and looks for the next match, beginning from this character.
 It finds `SCH -> SH', which has the same priority (no priority means
 standard priority, which is 5).  If the priority is the same or higher,
 the follow-up rule will be applied.
 
    Let's take a look at the word `SCHAUKEL'.  In this word `SCH'
 belongs together and may not be taken apart.  After the algorithm has
 found `SCH -> SH' it searches for a follow-up rule for `H'+`AUKEL'.  It
 finds `HAU--1 -> H', but does not apply it because its priority is
 lower than that of the first rule.
 
    You see that this is a very powerful feature but it also can easily
 lead to mistakes.  If you really don't need this feature you can turn
 it off by putting the line:
 
      followup      0
 
 at the beginning of the phonetic table file.  As mentioned, for rules
 containing a `-' no follow-up rules are searched but giving such rules
 a priority is not totally senseless because they can be follow-up rules
 and in that case the priority makes sense again.  Follow-up rules of
 follow-up rules are not searched because this is in fact not needed
 very often.
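
    The three example words can be traced with the following Python
 sketch.  It only decides which rule is chosen at a given position and
 is not Aspell's actual code.  The function names are invented, and the
 requirement that a follow-up rule be longer than one letter is an
 assumption made here so that single-letter rules do not block every
 match.

```python
import re

RULES = [("CHS", "X"), ("CH", "G"), ("HAU--1", "H"), ("SCH", "SH")]
DEFAULT_PRIORITY = 5  # a rule without a number has standard priority 5


def parse(search):
    """Split a search string into (letters, dash count, priority)."""
    letters, dashes, prio = re.fullmatch(r"([A-Z]+)(-*)([0-9]?)", search).groups()
    return letters, len(dashes), int(prio) if prio else DEFAULT_PRIORITY


def choose_rule(word, pos, rules=RULES):
    """Return the search string of the rule chosen at `pos`, or None."""
    for search, _ in rules:
        letters, dashes, prio = parse(search)
        if not word.startswith(letters, pos):
            continue
        if dashes == 0 and len(letters) > 1:
            # Look for a follow-up rule starting at the last matched letter.
            last = pos + len(letters) - 1
            blocked = any(
                len(f_letters) > 1
                and word.startswith(f_letters, last)
                and f_prio >= prio
                for f_letters, _, f_prio in (parse(s) for s, _ in rules)
            )
            if blocked:
                continue  # the follow-up rule wins; try the next rule here
        return search
    return None


print(choose_rule("FUCHS", 2))         # no follow-up for S, so CHS applies
print(choose_rule("DURCHSCHNITT", 3))  # SCH follows with equal priority
print(choose_rule("SCHAUKEL", 0))      # HAU--1 has lower priority than SCH
```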
 
    The control character `^' says that the search string only matches
 at the beginning of words, so that the rule `RH^ -> R' will only apply
 to words like `RHESUS' but not `PERHAPS'.  You can append another `^'
 to the search string.  In that case the algorithm treats the rest of
 the word totally separately from the first matched string at the
 beginning.  This is useful for prefixes whose pronunciation does not
 depend on the rest of the word and vice versa, like `OVER^^' in
 English for example.
 
    In the same way as `^', the control character `$' restricts a rule
 to words that end with the search string.  `GN$ -> N' only matches
 words like `SIGN' but not `SIGNUM'.  If you use `^' and `$' together,
 both of them must fit: `ENOUGH^$ -> NF' will only match the word
 `ENOUGH' and nothing else.
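
    How the two anchors restrict a match can be sketched in Python.
 The helper `anchored_match' is a made-up name, and the special meaning
 of a doubled `^^' is not modelled; it is treated like a single `^'.

```python
def anchored_match(search, word, pos):
    """Check whether `search` (letters plus optional ^ and $) matches at pos."""
    at_end = search.endswith("$")
    core = search.rstrip("$")
    at_start = core.endswith("^")
    letters = core.rstrip("^")
    if not word.startswith(letters, pos):
        return False
    if at_start and pos != 0:
        return False  # `^`: only at the beginning of the word
    if at_end and pos + len(letters) != len(word):
        return False  # `$`: only at the end of the word
    return True


print(anchored_match("RH^", "RHESUS", 0), anchored_match("RH^", "PERHAPS", 2))
print(anchored_match("GN$", "SIGN", 2), anchored_match("GN$", "SIGNUM", 2))
print(anchored_match("ENOUGH^$", "ENOUGH", 0))
```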
 
    Of course you can combine all of the mentioned control characters but
 they must occur in this order: `< - priority ^ $'.  All characters must
 be written in CAPITAL letters.
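
    The required order of the control characters can be written down
 as a regular expression.  The parser below is a sketch under
 assumptions: the names `SearchString' and `parse_search' are invented,
 the bracket group is assumed to come directly after the letters, and
 only the letters A to Z are accepted, although a real table may use
 other capital letters as well.

```python
import re
from typing import NamedTuple

DEFAULT_PRIORITY = 5  # standard priority when no number is given


class SearchString(NamedTuple):
    letters: str    # the letters to search for
    choices: str    # letters inside the brackets, "" if there are none
    restart: bool   # True if `<` is present
    dashes: int     # number of `-` characters
    priority: int   # 0 to 9
    carets: int     # 0, 1 or 2 occurrences of `^`
    at_end: bool    # True if `$` is present


_PATTERN = re.compile(
    r"([A-Z]+)"                         # search letters, capitals only
    r"(?:\(([A-Z]+)\))?"                # optional bracket group
    r"(<?)(-*)([0-9]?)(\^{0,2})(\$?)"   # control characters, in order
)


def parse_search(search):
    m = _PATTERN.fullmatch(search)
    if m is None:
        raise ValueError(f"malformed search string: {search!r}")
    letters, choices, restart, dashes, prio, carets, end = m.groups()
    return SearchString(
        letters,
        choices or "",
        restart == "<",
        len(dashes),
        int(prio) if prio else DEFAULT_PRIORITY,
        len(carets),
        end == "$",
    )


print(parse_search("HAU--1"))
print(parse_search("ENOUGH^$"))
```

 A string such as `GN$^' is rejected with a `ValueError' because the
 control characters are in the wrong order.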
 
    If absolutely no rule can be found (this might happen if you use
 unusual characters for which you don't have any replacement rule), the
 next character will simply be skipped and the search for replacement
 rules will continue with the rest of the word.
 
    If you want double letters to be reduced to one you must set up a
 rule like `LL- -> L'.  If double letters in the resulting phonetic word
 should be allowed, you must place the line:
 
      collapse_result     0
 
 at the beginning of your transformation table file; otherwise set the
 value to `1'.  The English rules for example strip all vowels from
 words and so the word "GOGO" would be transformed to "K" and not to
 "KK" (as desired) if `collapse_result' is set to 1.  That's why the
 English rules have `collapse_result' set to `0'.
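
    The effect of `collapse_result' on the `GOGO' example can be
 pictured with a few lines of Python.  This is only a sketch of the
 idea; the function name `collapse' is invented, and Aspell may well
 perform the reduction while it builds the result rather than in a
 separate pass.

```python
from itertools import groupby


def collapse(code):
    """Reduce each run of identical letters in a phonetic code to one letter."""
    return "".join(letter for letter, _ in groupby(code))


# With the vowels stripped and G -> K, GOGO becomes KK.
print(collapse("KK"))        # what collapse_result 1 would leave
print(collapse("KNTRTKXN"))  # no doubled letters, so nothing changes
```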
 
    By default, all accents are removed from a word before it is matched
 to the soundslike rules.  If you do not want this then add the line
 
      remove_accents      0
 
 at the beginning of your file.  The exact definition of an accent is
 language dependent and is controlled via the character set file.  If you
 set `remove_accents' to `0' then you should also set `store-as' to
 `lower' in the language data file (not the phonetic transformation
 file); otherwise Aspell will have problems when both the accented and
 the de-accented version of a word appear in the dictionary: it will
 consider one of them as incorrectly spelled.
 
 7.3.2 How do I start finally?
 -----------------------------
 
 Before you start to write an array of transformation rules, you should
 be aware that you have to do some work to make sure that things you do
 will result in correct transformation rules.
 
 7.3.2.1 Things that come in handy
 .................................
 
 First of all, you need to have a large word list of the language you
 want to make phonetics for.  It should contain about as many words as
 the dictionary of the spell checker.  If you don't have such a list,
 you will probably find an Ispell dictionary at
 `http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html' which will
 help you.  You can then perform affix expansion with `ispell -e' and
 pipe the result through `tr " " "\n"' to put one word on each line.
 After that you may have to convert special characters like `é' from
 Ispell's internal representation to latin1 encoding.  `sed "s/e'/é/g"'
 for example would replace all `e'' with `é'.
 
    Second, you should know how to use regular expressions and how to
 use `grep'.  You should for example know that:
 
      grep '^[^aeiou]qu[io]' wordlist | less
 
 will show you all words that begin with any character but `a', `e',
 `i', `o' or `u' and then continue with `qui' or `quo'.  This stuff is
 important for example to find out if a phonetic replacement rule you
 want to set up is valid for all words which match the expression you
 want to replace.  Taking a look at the regex(7) man page is a good idea.
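
    For readers who prefer to experiment outside the shell, the same
 pattern can be tried with Python's `re' module.  The word list in this
 sketch is a made-up sample, not a real dictionary, and
 `matching_words' is a hypothetical helper.

```python
import re

# Same pattern as in the grep command: one non-vowel, then "qui" or "quo".
PATTERN = re.compile(r"^[^aeiou]qu[io]")


def matching_words(words):
    """Return the words that the pattern matches at their beginning."""
    return [w for w in words if PATTERN.search(w)]


sample = ["squid", "quiet", "equip", "squire", "aquifer", "square"]
print(matching_words(sample))
```

 Note that `quiet' is not matched: the first character may be any
 non-vowel, but it must then be followed by `qu'.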
 
 7.3.2.2 What the phonetic code should do
 ........................................
 
 Normal text comparison works well as long as the typist misspells a
 word by pressing a key they did not mean to press.  In these cases,
 usually only one character differs from the original word.
 
    In cases where the writer didn't know the correct spelling of
 the word, the word may have several characters that differ from the
 original word but usually the word would still sound like the original.
 Someone might think that `tough' is spelled `taff'.  No spell checker
 without phonetic code will come up with the idea that this might be
 `tough', but a spell checker that knows that `taff' would be
 pronounced like `tough' will make good suggestions to the user.
 Another example could be `funetik' and `phonetic'.
 
    From these examples you can see that the phonetic transformation
 should not be too fussy or too precise.  If you implement a whole
 phonetic dictionary as you can find it in books, this will not be very
 useful because many characters could then still differ between the
 misspelled and the desired word.  What you should do when you
 implement the phonetic transformation table is to reduce the set of
 letters used to those that are really necessary.
 
    Characters that sound similar should be reduced to one.  In the
 English language for example `Z' sounds like `S' and that's why the
 transformation rule `Z -> S' is present in the replacement table.  `PH'
 is spoken like `F' and so we have a `PH -> F' rule.
 
    If you take a closer look you will even see that vowels sound very
 similar in the English language: `contradiction', `cuntradiction',
 `cantradiction' or `centradiction' in fact sound nearly the same, don't
 they? Therefore the English phonetic replacement rules not only reduce
 all vowels to one but even remove them all (removing is done by just
 setting up no rule for those letters).  The phonetic code of
 `contradiction' is `KNTRTKXN' and if you try to read this
 letter-monster aloud you will hear that it still sounds a bit like
 `contradiction'.  You also see that `D' is transformed to `T' because
 they sound nearly the same.
 
    If you think you have found a regularity you should _always_ take
 your word list and `grep' for the regular expression corresponding to
 the transformation rule you want to make.  An example: suppose you
 come up with the idea that all English words ending in `ough' sound
 like `AF' at the end because you think of `enough' and `tough'.  If
 you then run `grep -i ough$ wordlist' you will see that the rule you
 wanted to set up is not correct because it doesn't fit words like
 `although' or `bough'.  So you have to define your rule more precisely,
 or set up exceptions if the number of words that differ from the
 desired rule is not too big.
 
    Don't forget about follow-up rules which can help in many cases but
 which also can lead to confusion and unwanted side effects.  It's also
 important to write exceptions in front of the more general rules (`GH'
 before `G' etc.).
 
    If you think you have set up a number of rules that may produce some
 good results, try them out!  If you run Aspell as `aspell
 --lang=YOUR_LANGUAGE pipe' you get a prompt at which you can type in
 words.  If you just type words, Aspell checks them and, where
 appropriate, makes suggestions if they are misspelled.  If you type in
 `$$Sw WORD' you will see the phonetic transformation and you can test
 whether your work does what you want.
 
    Another good way to check that changes you make to your rules don't
 have any bad side effects is to create another list from your word list
 which contains not only each word of the word list but also the
 corresponding phonetic version of this word on the same line.  If you
 do this once before the change and once after the change you can make a
 diff (see `man diff') to see what _really_ changed.  To do this use the
 command `aspell --lang=YOUR_LANGUAGE soundslike'.  In this mode Aspell
 will output the original word and then its soundslike, separated by
 a tab character, for each word you give it.  If you are interested in
 seeing how the algorithm works you can download a set of useful
 programs from
 `http://members.xoom.com/maccy/spell/phonet-utils.tar.gz'.  This
 includes a program that produces a list as mentioned above and another
 program which illustrates how the algorithm works.  It uses the same
 transformation table as Aspell and so it helps a lot during the process
 of creating a phonetic transformation table for Aspell.
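
    Instead of reading a raw diff, the two word/soundslike lists can
 also be compared with a short script.  The following Python sketch
 assumes the tab-separated format described above; the function names
 and the sample data are made up for illustration.

```python
def parse_soundslike(text):
    """Turn 'word<TAB>soundslike' lines into a dict."""
    pairs = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # ignore lines that are not in the expected format
        word, code = line.split("\t", 1)
        pairs[word] = code
    return pairs


def changed_codes(before, after):
    """Map each word whose code changed to its (old, new) codes."""
    old, new = parse_soundslike(before), parse_soundslike(after)
    return {w: (old[w], new[w]) for w in old if w in new and old[w] != new[w]}


# Hypothetical output captured before and after editing the rules.
before = "hello\tHL\nsign\tSN\n"
after = "hello\tH*L*\nsign\tSN\n"
print(changed_codes(before, after))
```

 Only words whose phonetic code differs between the two runs are
 reported, which makes unintended side effects of a rule change easy
 to spot.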
 
    During your work you should write down your basic ideas so that other
 people are able to understand what you did (and so that you still
 understand it yourself after a few weeks).  The English table has
 extensive documentation appended as an example.
 
    Now you can start experimenting with all the things you just read and
 perhaps set up a nice phonetic transformation table for your language,
 so that Aspell can come up with equally good suggestions for your
 language.  Take a look at the Aspell homepage to see if there is
 already a transformation table for your language.  If there is one you
 might also take a look at it to see if it could be improved.
 
    If you think that this section helped you or if you think that this
 is just a waste of time you can send any feedback to
 <bjoern.jacke@gmx.de>.
 