Package WordUtils :: Module dawg
[show private | hide private]
[frames | no frames]

Module WordUtils.dawg

Implementation of a DAWG and associated trie.

What is a DAWG?

What is a DAWG? The acronym DAWG stands for "Directed Acyclic Word Graph." Some searching on USENET yields this explanation:
  [A t]rie is a tree based data structure for compact storage of a
  dictionary. Words that start with the same letters share a path in the
  Trie. For example, a Trie containing SIN, SINE, SING, PIN, PINE and PING
  looks like this:

                *
               / \
              P   S
             /     \
            I       I
           /         \
          N           N
         / \         / \
        E   G       E   G

  A leaf denotes the end of a word, but there are also word ends at inner
  nodes; usually a word end is represented with an extra flag in every node.

  A lot of space can be saved by merging identical subtrees, like this:

                *
               / \
              P   S
               \ /  
                I
                |
                N
               / \
              E   G

  The resulting data structure is sometimes called a DAWG (directed
  acyclic word graph). Note that this merging step doesn't affect the
  algorithms on the tree at all, since there is no pointer to the
  parent of a node.
Each node in a trie has a value, and a list of other nodes that come after it. One can determine whether a word is in the trie by searching down the tree structure. If the word has been built and the current node is marked as an endpoint or is a leaf, then the word is in the trie.

Implementation Notes

The method used below to compress a trie into a dawg is fairly much home-grown. "Hints" on the web have discussed working upwards from the bottom of the tree, and what we've come up with works off that. Check the documentation for the Dawg.compress method for more details. Many thanks to Dave Lofquist for help with the algorithm.

Most of the public methods in the Dawg object are implemented in what I think of as hierarchical fashion. What this means is, a public method will be implemented almost entirely in terms of other, private methods. Since the Dawg object supports both on-disk and in-memory trees, for instance, most of the public methods are implemented in terms of two different private methods, where the public method doesn't do much except choose between them.

I think the method naming convention should be obvious, but I'll spell it out anyway. Single-word methods are lower-case only, i.e. lower(). Multiple-word methods are lowerUpper(). The in-memory variant of a method starts with _mem, i.e. _mem_lowerUpper(), while the on-disk variant starts with _disk, i.e. _disk_lowerUpper(). A recursively called method ends with _r, i.e. _mem_lowerUpper_r(). For the most part, any sub-methods associated with a public method are grouped with it, unless they stand on their own.

Note that ON_DISK dawgs are inherently read-only (i.e., you cannot insert new words into or remove words from ON_DISK dawgs). They're also not going to be as fast as IN_MEMORY dawgs. Use them when you want to conserve memory and when performance isn't your top priority.

Database Format

The Dawg.save method provides the option to store the a dawg on disk in either a wordlist or "database" format. The "database" format is essentially an on-disk representation of the in-memory dawg, using file pointers instead of object references. Once a database has been saved, you have the option of using it to load an in-memory dawg, or instead accessing the database directly from disk, using very little memory in the process.

If you spend the time to compress a dawg generated from a large wordlist, you will definitely want to save it in database format, to avoid having to recompress the dawg every time you create it.

The database file format is not intended to be human readable. However, it should be portable across operating systems, since I use the Python marshal module to write the data to disk. As of now, I have not done extensive testing on platforms other than Linux.

Documentation Notes

In the docstrings below, I tend to use 'trie' and 'dawg' interchangeably. Technically speaking, they're not really the same thing, although I think it's fair to say that a dawg is a trie, but not vice-versa. In this code, the only difference is that a dawg is compressed, so a brand new Dawg object is really a trie.

I only provide docstrings for public interface methods. I figure that the arguments to the private methods should be fairly obvious given the documentation for the public methods.

Keep in mind that any backslash characters in Epydoc markup must be escaped, so in the markup you'll always see two of them together. This is sometimes a bit confusing.

Author: Kenneth J. Pronovici <pronovic@ieee.org>

Classes
Dawg Class representing a Directed Acyclic Word Graph, or DAWG.
Node Class representing a node in an in-memory trie or dawg.

Exceptions
DawgError General exception class for this module.

Generated by Epydoc 2.1 on Tue Apr 4 21:55:58 2006 http://epydoc.sf.net