Search word in huge text using Trie

Hello,

In this article I will try to show how to search for words in large text. The idea is this – using the text I create a trie (comes from retrieve) in which every word has a branch/leaf. Starting at root and symbol by symbol add a branch for it if it is not added yet and when reach an break symbol(.,; [] ….) go back to root and continue.

Each TrieNode contains an HashTable (Dictionary .net) with char key and value TrieNode. So for each char there is a brach to the next node. Also there is a property – IsWord which is used to determine whether or not the word(and not a part of it – i.e. alp not a word -> alpine – is)  exists in the original text. At initialize each node receives as children (in the hash table )all separator symbols. They all point to null – this is used whenbuilding the trie to determine whether the word is finished. In the article every char that points to a next node ( or null with separators) will be called child.

So lets take for example the (not very meaningful) text “Motorway “Alpine” is alpha for race bikes.”

Start at first symbol – m(case insensitive – it could be done as case sensitive also):
The current node is the root. In the root there is no child “m” so we create one and make the current node to be that newly created one. Next symbol o -> no such child ->create a new one and make it the current. So on until we each ’empty space symbol ‘ which will be found to exist as a child and that it points to null and so the current node’s IsWord property is set to true and the current is set to the root. (the algorithm is explained below in more detail)

step1

The next is “Alpine” using steps like before we have:

step2

So when we reach “alpha” what happens ->
1. a -> exist in children of root -> don’t add and make a  current. Next -> l
2. l -> exists in children of a -> don’t add and make l current. Next -> p
3. -> exists -> make it current -> next is h
4. h -> doesn’t exist – create one node h in children of p and make i. Next is a
5. a -> doesn’t exist – create one and make it current. Next is ‘ ‘
6. ‘ ‘  -> exists in a and point to null -> so we determine this is the end of word. Do current.SetWord(//sets IsWord = true;) and make the current to be the root.
(The separator points to null because, as mentioned before – each node is initialized with all separators as children  pointing to null)

step4

and so on until we get a trie looking like this:

stepn

This is good only if many searches will be done. Otherwise the time and memory spent on building the trie will not be worth it. The search will be very fast because on every step large portions of the trie are left outside of search – so time of search will be much faster than straightforward search. The longer the text is, the faster the search – compared to straightforward search.

The search will take each symbol of the word and find the corresponding child in the current node (starting from root) and if the word ends and the current node’s property IsWord == true so the word exists in the original text. Any other case – it doesn’t.

C# implementation for a huge text( in the example I’ve used some native Language files but any file can be used – text file) – here

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s