Building a prefix trie

Prefix trie‘s are cool data structures that let you compress a dictionary of words based on their shared prefix. If you think about it, this makes a lot of sense. Why store abs, abbr, and abysmal when you only need to store a,b,b,r,s,y,s,m,a,l. Only storing what you have to (based on prefix) in this example gives you a 70% compression ratio! Not too bad, and it would only get better the more words you added.

The classical way of dealing with prefix tries is to store the suffixes using a map, but for fun I tried something different and used a list.

Data

The main data structure I had was like this

type Key a = [a]

data Trie key = Node (Maybe key) [Trie key] Bool deriving (Show, Eq, Read)

So a trie is really just a node that has a list of other tries (which would be its suffixes). You can see an example of the structure with the word “foo”:

[Node (Just 'f') 
     [Node (Just 'o') 
        [Node (Just 'o') [] True] 
     False] 
 False]

It’s printable to the screen, equatable, and can be serialized from text. It also contains a boolean field representing if it is the end of a word or not. You can see the boolean at the last “o” of the word, indicating the word terminates. All the other characters are non-terminating fields.

Searching

Let’s add a helper method to find a key in a list of tries. Remember that each root can contain a list of other trie’s, so it’ll be nice to be able to say “hey, we’re at root X which has a list of possible suffix starts Y. Let’s find the next root from the list of Y and return it“.

{-
    Finds a key in the trie list that matches the target key
-}
findKey :: (Eq t) => t -> [Trie t] -> Maybe (Trie t)
findKey key tries = L.find (\(Node next _ _) -> next == Just key) tries

It takes an equatable value, and a list of Trie’s of those values, and finds the key returned as an option type.

We could even now write code to find a Trie

{-
    Takes a key list and finds the trie that fullfils that prefix
-}
findTrie :: (Eq t) => Key t -> [Trie t] -> Maybe (Trie t)
findTrie [] _ = Nothing
findTrie (x:[]) tries = findKey x tries 
findTrie (x:xs) tries = findKey x tries >>= nextTrie
    where nextTrie (Node _ next _) = findTrie xs next

Since findKey is an option type, we can leverage the option monadic bind operator >>=. This operator will pass a result to the next function only if the result is a Just type. This is a way to safely process ‘nullable’ items.

Insertion

Now the insertion code, which honestly was the most complicated.

insert :: (Eq t) => Key t -> [Trie t] -> [Trie t]
insert [] t = t
insert (x:xs) tries = 
    case findKey x tries of 
        Nothing -> [(Node (Just x) (insert xs [])) isEndWord]++tries
        Just value -> 
            let (Node key next word) = value
            in [Node key (insert xs next) (toggleWordEnd word)]++(except value)
    where 
        except value = (L.filter ((/=) value) tries)
        isEndWord = if xs == [] then True else False
        toggleWordEnd old = if xs == [] then True else old

The trick to pure functional insertion is that you need to rebuild the entire data structure WHILE you search for where you want to actually add things.

In the first case where you try to find a key but didn’t find it, you want to add the key as a neighbor to the other trees. So if you have a root word of “abd” but you add “def“, the root character of “d” doesn’t exist yet and needs to be added, hence creating a new list and appending it to the original tries list. Then you recursively go through the new element and add the remaining list. This is why the creation of the new element has a recursive call back to the insert function with the remaining list xs. Also, since you created the new element as part of a recursive call, you know this element is already being properly appended to its root.

But, if the key already exists then that means you found a suffix that matches. At this point you need to bubble through the inner suffix tree to add the remaining items you want (basically following the suffix trie down to a point where you haven’t already found a suffix root). Once thing to remember is that you also need to recreate the key you found, excluding the original element (hence the except value since we want to exclude the old version of the key before).

On top of that, if you’ve already processed a root, but your word now ends at a root, you need to toggle if that suffix is a word completion. For example, if I’ve already added “abcdef” then “f” is the end of the word. But now if I add “abc” as a word, then both “c” AND “f” are end words (so “c“‘s end word boolean needs to be toggled to true).

I found this concept to be a little complicated because its both finding, updating, and rebuilding the data structure all at the same time, but with practice it does get more natural to trace through.

Getting the words

Once you can build a trie though, we can leverage haskells list comprehensions to fold the words back out

{- 
    Gives you all the available words in the trie list
-}
allWords :: [Trie b] -> [[b]]
allWords tries = 
    let raw = rawWords tries
    in map (flatMap id) raw
    where 
        flatMap f = Fold.concatMap (Fold.toList . f)
        rawWords tries = [key:next
                            | (Node key suffixes isWord) <- tries
                            , next <- 
                                if isWord then 
                                    []:(rawWords suffixes)
                                else 
                                    rawWords suffixes]

Since we have a list of roots, we can go through each suffix and append the root character to it. If we encounter a word, we can pop in an empty list to create a delimiter. The list comprehension will do every combination of every root with its suffixes, so we will get each word.

But, this makes a list of characters and so I wanted to flatmap the list to create strings.

Guessing a suffix

We can even do auto complete stuff now! Let’s look at how to guess what the available suffixes would be:

{- 
    This function takes a key list and returns the 
    full words that match the key list. But, since we already 
    know about the source key that matches (the input) we can
    prepend that information to any suffix information that is left.
    If the node found that matches the original query is a word in itself
    we can prepend that too since its a special case
-}
guess :: (Eq a) => Key a -> [Trie a] -> Maybe [Key a]
guess word trie = 
    findTrie word trie >>= \(Node _ next isWord) -> 
    return $ (source isWord) ++ (prependOriginal word $ allWords next)
    where 
        source isWord = if isWord then [word] else []
        prependOriginal word = map (\elem -> word ++ elem)

The idea is we have a string, and our original roots. We then look for the suffixes and when we find the target trie that ends at the word we have, we can find all the words after that and prepend our original word.

Examples

Lets look at some basic examples of how to use the Trie:

*PrefixTree> tries = build ["word", "wooordz", "happy", "hoppy"]

*PrefixTree> guess "h" tries
Just ["hoppy","happy"]

*PrefixTree> allWords tries
["hoppy","happy","wooordz","word"]

*PrefixTree> tries
[Node (Just 'h') [Node (Just 'o') [Node (Just 'p') [Node (Just 'p') [Node (Just 'y') [] True] False] False] False,Node (Just 'a') [Node (Just 'p') [Node (Just 'p') [Node (Just 'y') [] True] False] False] False] False,Node (Just 'w') [Node (Just 'o') [Node (Just 'o') [Node (Just 'o') [Node (Just 'r') [Node (Just 'd') [Node (Just 'z') [] True] False] False] False] False,Node (Just 'r') [Node (Just 'd') [] True] False] False] False]

*PrefixTree> guess "ha" tries
Just ["happy"]

*PrefixTree> guess "hi" tries
Nothing

Conclusion

It would be a lot faster to have leveraged a hash map instead of a list for the suffixes, since each time I need to find the suffix root I have to do an O(n) traversal, instead of an O(1), but since this isn’t a production trie I’m not worried about it. It was still a fun exercise!

For a full example check my github.

Post a comment

You may use the following HTML:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>