Package WordUtils :: Module tree
[show private | hide private]
[frames | no frames]

Module WordUtils.tree

Implementation of a ternary search tree and related functionality.

What is a ternary search tree?

What is a ternary search tree? The article Ternary Search Trees by Jon Bentley and Bob Sedgewick in Doctor Dobb's Journal, April 1998, discusses them. The article has this to say:
  When you have to store a set of strings, what data structure should
  you use? You could use hash tables, which sprinkle the strings
  throughout an array. Access is fast, but information about relative
  order is lost. Another option is the use of binary search trees,
  which store strings in order, and are fairly fast. Or you could use
  digital search tries, which are lightning fast, but use lots of
  space.

  ...[T]ernary search trees...combine the time efficiency of digital
  tries with the space efficiency of binary search trees. The
  resulting structure is faster than hashing for many typical search
  problems, and supports a broader range of useful problems and
  operations. Ternary searches are faster than hashing and more
  powerful, too.

Implementation Notes

The implementation below is conceptually based on the data structure described in the Ternary Search Trees article, although I have made some changes because the article's implementation is focused toward C. The Python objects are undoubtedly larger than their C counterparts, but there is some code simplificiation due to Python's more sophisticated data structures. The Python implementation is likely also somewhat slower than the C implementation, although I do not have any benchmarks.

Most of the public methods in the TernarySearchTree object are implemented in what I think of as hierarchical fashion. What this means is, a public method will be implemented almost entirely in terms of other, private methods. Since the TernarySearchTree object supports both on-disk and in-memory trees, for instance, most of the public methods are implemented in terms of two different private methods, where the public method doesn't do much except choose between them.

On first glace through this code, it may appear obvious that I should just create two different objects (an in-memory tree and an on-disk tree) to clean things up. Actually, since there is very little in-memory structure for an on-disk tree (just a filename, really) there would be very little benefit to reorganizing the code. As it stands, it's not too hard to follow.

I think the method naming convention should be obvious, but I'll spell it out anyway. Single-word methods are lower-case only, i.e. lower(). Multiple-word methods are lowerUpper(). The in-memory variant of a method starts with _mem, i.e. _mem_lowerUpper(), while the on-disk variant starts with _disk, i.e. _disk_lowerUpper(). A recursively called method ends with _r, i.e. _mem_lowerUpper_r(). For the most part, any sub-methods associated with a public method are grouped with it, unless they stand on their own.

Note that ON_DISK trees are inherently read-only (i.e., you cannot insert new words into or remove words from ON_DISK trees). They're also not going to be as fast as IN_MEMORY trees. Use them when you want to conserve memory and when performance isn't your top priority.

The TernarySearchTree.compress method exists for the TernarySearchTree object. However, it is a no-op right now. I have provided it (along with its regression tests) so that it will be easy to support compression in the future, if I can implement a decent compression algorithm.

Database Format

The TernarySearchTree.save method provides the option to store a tree on disk in either a wordlist or "database" format. The "database" format is essentially an on-disk representation of the in-memory tree, using file pointers instead of object references. Once a database has been saved, you have the option of using it to load an in-memory tree, or instead accessing the database directly from disk, using very little memory in the process.

The database file format is not intended to be human readable. However, it should be portable across operating systems, since I use the Python marshal module to write the data to disk. As of now, I have not done extensive testing on platforms other than Linux.

Documentation Notes

I only provide docstrings for public interface methods. I figure that the arguments to the private methods should be fairly obvious given the documentation for the public methods.

Keep in mind that any backslash characters in Epydoc markup must be escaped, so in the markup you'll always see two of them together. This is sometimes a bit confusing.

Author: Kenneth J. Pronovici <pronovic@ieee.org>

Classes
Node Class representing a node in an in-memory ternary search tree.
TernarySearchTree Class representing a ternary search tree.

Exceptions
TreeError General exception class for this module.

Generated by Epydoc 2.1 on Tue Apr 4 21:55:58 2006 http://epydoc.sf.net