Package WordUtils :: Module dawg :: Class Dawg
[show private | hide private]
[frames | no frames]

Type Dawg

object --+
         |
        Dawg


Class representing a Directed Acyclic Word Graph, or DAWG.
Method Summary
  __init__(self, type, db, wl)
Default constructor.
  add(self, key)
Adds a key (word) to an existing trie/dawg.
  buildIndex(self)
Creates indexing for a DAWG.
  clearIndex(self)
Clears indexing for a DAWG.
  clone(self, node)
Creates a new trie object, which is a clone of the trie starting at the given node.
  compress(self, ratio)
Compresses a trie into a dawg, optionally returning compression ratio.
  leaves(self)
Builds a list of leaf nodes (nodes with no children) in the trie.
  level(self, level)
Builds a list of nodes that are a certain level above the bottom of a trie.
  list(self)
Returns a sorted list representation of the trie/dawg.
  loadFile(self, db, wl)
Reads a wordlist or database file into a dawg.
  loadList(self, list)
Reads a Python list into a dawg.
  patternSearch(self, pattern)
Searches for a pattern in a trie/dawg.
  remove(self, key)
Removes a key (word) from a dawg.
  save(self, db, wl, overwrite)
Writes a dawg to disk as a wordlist and/or a portable "database" format.
  search(self, key)
Searches for a complete string in a trie/dawg.
  size(self)
Returns the dawg's size in terms of nodes.
  traverse(self, func)
Traverses a dawg and executes func(word) for each word.
    Inherited from object
  __delattr__(...)
x.__delattr__('name') <==> del x.name
  __getattribute__(...)
x.__getattribute__('name') <==> x.name
  __hash__(x)
x.__hash__() <==> hash(x)
  __new__(T, S, ...)
T.__new__(S, ...) -> a new object with type S, a subtype of T
  __reduce__(...)
helper for pickle
  __reduce_ex__(...)
helper for pickle
  __repr__(x)
x.__repr__() <==> repr(x)
  __setattr__(...)
x.__setattr__('name', value) <==> x.name = value
  __str__(x)
x.__str__() <==> str(x)

Property Summary
  db: Location of database on disk.
  indexed: Indicates whether the dawg has been indexed.
  root: Root node of trie/dawg.
  type: Dawg type (IN_MEMORY or ON_DISK).
  wl: Location of wordlist on disk.

Method Details

__init__(self, type=0, db=None, wl=None)
(Constructor)

Default constructor.

See the loadFile method for details on how the db and wl arguments are used.
Parameters:
type - type of dawg to be created
           (type=one of [IN_MEMORY, ON_DISK])
db - database on disk to be used to create dawg
           (type=valid filesytem path as a string)
wl - wordlist on disk to be used to create dawg
           (type=valid filesystem path as a string)
Raises:
DawgError - under exception circumstances
Overrides:
__builtin__.object.__init__

add(self, key)

Adds a key (word) to an existing trie/dawg.

This function inserts into a dawg by building a list of the words in the dawg, inserting the new key into the list, and then re-building the dawg based on the new list. Obviously, this is rather memory intensive and inefficient. The whole point of using a dawg in the first place is to avoid having a list of words around. However, I haven't found any other good way to do this.

You may not really need this functionality. Normally, a dawg would be filled using the loadFile method. This add method would only be used to add a new word to an existing trie/dawg. If you only need to add a few words to an existing dawg, consider whether you might be better off keeping around a small Python list or dictionary to hold the extra word, instead of adding to the dawg.

Remember, once you add a word into a compressed dawg, it will no longer be compressed.
Raises:
DawgError - under exception circumstances

buildIndex(self)

Creates indexing for a DAWG.

This method generates excess indexing information for each node. Included in the indexing information is a list of the node's parents, and an indication of the node's depth from the top of the trie. This information is mostly used by the compress method, but the method to generate it has been made public since it could conceivably be useful to clients under some special circumstances.

This method can only be used for IN_MEMORY dawgs.
Raises:
DawgError - under exception circumstances

clearIndex(self)

Clears indexing for a DAWG.

This method can only be used for IN_MEMORY dawgs.
Raises:
DawgError - under exception circumstances

clone(self, node)

Creates a new trie object, which is a clone of the trie starting at the given node.

Note that the cloned trie will not be the "same" trie from the perspective of the Node.compare method, since the actual nodes will be duplicated.

This method can only be used for IN_MEMORY dawgs.
Parameters:
node - Node to clone from
           (type=Node instance)
Returns:
head of cloned trie as a Node instance
Raises:
DawgError - under exception circumstances

compress(self, ratio=False)

Compresses a trie into a dawg, optionally returning compression ratio.

The compression method is fairly simple. It may not be ideal in terms of performance, but it should yield correct results.

Before doing anything else, we index the trie using the buildIndex method. The indexing process provides each node with information about who its parents are. This information isn't needed for anything other than compression, so it isn't kept around most of the time.

When indexing is complete, the first pass at at compression can begin. The main purpose of the first pass is to merge identical leaves. We build a list of leaves, and merge any leaves that have the same value. All leaves are by definition endpoints, and can have no children, so there's no need to do any comparisons against anything other than value.

After all of the leaves have been merged, we move onto the second pass at compression. This pass always looks one "level" up from the current level. For instance, the first time through, it looks at the parents of the leaves. We'll compare each of the nodes, and if any two are identical [1], we'll merge them. When this process is complete, we'll construct a new list, up one level (i.e. parents of the nodes we just operated on, which the first time through would be the parents of the parents of the leaves) and start all over again. The whole thing continues until the new list of parents contains just one node, the top node. There is no other valid exit condition, since the trie does not necessarily have a constant depth (because our input words may vary in length).

[1] It is important to note that for the second pass, two nodes can only be considered identical if they have the same value, the same endpoint flag and exactly the same set of children - not children with the same values, but the exact same children. Since we work from the bottom to the top, any two nodes that meet these conditions must represent subtries that are recursively identical, all of the way down to the bottom of the trie. Such nodes can then safely be merged without losing any information.

On large wordlists, a large portion of the nodes in the resulting trie will be leaves, so a large space savings can be achieved just by merging down to the ideal 26 leaves (one for each letter). The remainder of the processing trims even more space. The compressed dawg often contains only 25% as many unique nodes as the original uncompressed trie did. The algorithm appears to yield databases on disk that are 1-2 times as large as the original wordlist used to construct the dawg, which makes it feasible for a program using this dawg to distribute databases, not wordlists.

The algorithm is reasonably fast for moderate-sized lists of words (a few minutes for 80k words on my Duron 850 with 800 MB of RAM). However, as the list grows larger, things slow down considerably.

I have already eliminated most of the obvious bottlenecks. For instance, instead of using lists that must be appended to, I use dictionaries when I can, and then generate the lists on the fly as I need them. This got me an order of magnitude improvement in processing time (that 80k list used to take most of five hours to complete). More time spent with the Python profiler has helped me eliminate a few other silly sources of inefficiency. However, I'm now at the point where I can't find anything else obvious with the profiler. I think that for this to get any faster, it's going to take a major algorithmic change, rather than a tweak. I would be happy if someone were to prove me wrong. :-)

This method may only be used on IN_MEMORY trees.
Parameters:
ratio - indicates whether to return compression ratio information
           (type=boolean True/False)
Returns:
  • compression relative to original, as a percentage, if ratio is True
  • None if ratio is False
Raises:
DawgError - under exception circumstances

leaves(self)

Builds a list of leaf nodes (nodes with no children) in the trie.
Returns:
list of leaves at Node instances
Raises:
DawgError - under exception circumstances

level(self, level=0)

Builds a list of nodes that are a certain level above the bottom of a trie.

We start at the leaves, and then move up a certain number of levels up from the leaves. For instance, level 0 is the leaves themselves, level 1 is the set of the parents of all of the leaves, level 2 is the parents of that set, etc. We can only do this if the DAWG has been indexed, so if it hasn't been yet, we do it here.

Note: the results from this method are probably only useful when the method is called against an entirely uncompressed trie, since a given node might be at more than one level in a compressed dawg. This is mostly a debugging method. Also, bear in mind that the "level" in this method is not the same as the depth value generated by the buildIndex method.
Returns:
list of nodes as Node instances
Raises:
DawgError - under exception circumstances

list(self)

Returns a sorted list representation of the trie/dawg.
Returns:
sorted list of words in the dawg
Raises:
DawgError - under exception circumstances

loadFile(self, db=None, wl=None)

Reads a wordlist or database file into a dawg.

Either a database or a wordlist may be used, but not both. If a wordlist is used, it is assumed to be properly sorted on disk (i.e. the words must be in alphabetical order). Neither file will ever be written to.

Calling this function on a dawg that has been previously initialized will overwrite the contents of that dawg.

You may not choose to load a wordlist into an ON_DISK dawg. First, create an IN_MEMORY dawg using the wordlist. Then, save the IN_MEMORY dawg as a database. Then, you can load use the database on disk as the source for your ON_DISK dawg.
Parameters:
db - database on disk to be used to initialize dawg
           (type=valid filesytem path as a string)
wl - wordlist on disk to be used to initialize dawg
           (type=valid filesystem path as a string)
Raises:
DawgError - under exception circumstances

loadList(self, list)

Reads a Python list into a dawg.

Calling this function on a dawg that has been previously initialized will overwrite the contents of that dawg.

You may only load a list into an IN_MEMORY dawg. If you need the list in an on-disk dawg, create an IN_MEMORY dawg using the list. Then, save the IN_MEMORY dawg as a database. Then, you can load the database on disk as the source for your ON_DISK dawg.
Parameters:
list - Python list to be used to initialize dawg
           (type=list of strings)
Raises:
DawgError - under exception circumstances

patternSearch(self, pattern)

Searches for a pattern in a trie/dawg.

A pattern uses the . (period) character to represent a wildcard. So, for instance, the pattern '.ai.' would match the words 'rail', 'hail', 'mail', etc. if those words were in the trie/dawg.
Returns:
list of words in the trie/dawg which match the pattern
Raises:
DawgError - under exception circumstances

remove(self, key)

Removes a key (word) from a dawg.

This function removes a word from a dawg by building a list of the words in the dawg, removing the key from the list, and then re-building the dawg based on the new list. Obviously, this is rather memory intensive and inefficient. The whole point of using a dawg in the first place is to avoid having a list of words around. However, I haven't found any other good way to do this.

You may not really need this functionality. If you only need to remove a few words from an existing dawg, consider whether you might be better off just keeping around a small Python list or dictionary to hold those words. You could then use the list or dictionary to validate the results returned from the patternSearch method.

Remember, once you remove a word from a compressed dawg, it will no longer be compressed.
Raises:
DawgError - under exception circumstances

save(self, db=None, wl=None, overwrite=False)

Writes a dawg to disk as a wordlist and/or a portable "database" format.

Both the wl and db arguments may be provided. This will cause the method to write both a wordlist and a database to disk.

Note: if the dawg was created as ON_DISK, then this method cannot be used to overwrite the in-use database on disk.
Parameters:
db - database to be written to on disk
           (type=valid filesytem path as a string)
wl - wordlist to be written to on disk
           (type=valid filesystem path as a string)
overwrite - indicates whether to overwrite an existing file on disk
           (type=boolean True/False)
Raises:
DawgError - under exception circumstances

search(self, key)

Searches for a complete string in a trie/dawg.
Parameters:
key - string to search for
           (type=string, of any length)
Returns:
boolean indicating whether the key was found in the trie/dawg
Raises:
DawgError - under exception circumstances

size(self)

Returns the dawg's size in terms of nodes.

This method works by creating a dictionary containing an entry for each unique node. It's rather inefficient, and you may not want to use it in production code.

This method can only be used for IN_MEMORY dawgs.
Returns:
tuple of (unique nodes, node references)
Raises:
DawgError - under exception circumstances

traverse(self, func=None)

Traverses a dawg and executes func(word) for each word.

If no function is provided, then the default function is:
  lambda x: sys.stdout.write("%s\n" % x)
which will print out each word on its own line (the words will be unordered).
Parameters:
func - function to be called when an endpoint is reached
           (type=reference to a function which takes one argument, a word)
Raises:
DawgError - under exception circumstances

Property Details

db

Location of database on disk.

indexed

Indicates whether the dawg has been indexed.

root

Root node of trie/dawg.

type

Dawg type (IN_MEMORY or ON_DISK).

wl

Location of wordlist on disk.

Generated by Epydoc 2.1 on Tue Apr 4 21:55:58 2006 http://epydoc.sf.net