In this article I will try to show how to search for words in large text. The idea is this – using the text I create a trie (comes from retrieve) in which every word has a branch/leaf. Starting at root and symbol by symbol add a branch for it if it is not added yet and when reach an break symbol(.,;  ….) go back to root and continue.
Each TrieNode contains an HashTable (Dictionary .net) with char key and value TrieNode. So for each char there is a brach to the next node. Also there is a property – IsWord which is used to determine whether or not the word(and not a part of it – i.e. alp not a word -> alpine – is) exists in the original text. At initialize each node receives as children (in the hash table )all separator symbols. They all point to null – this is used whenbuilding the trie to determine whether the word is finished. In the article every char that points to a next node ( or null with separators) will be called child.
So lets take for example the (not very meaningful) text “Motorway “Alpine” is alpha for race bikes.”
Start at first symbol – m(case insensitive – it could be done as case sensitive also):
The current node is the root. In the root there is no child “m” so we create one and make the current node to be that newly created one. Next symbol o -> no such child ->create a new one and make it the current. So on until we each ’empty space symbol ‘ which will be found to exist as a child and that it points to null and so the current node’s IsWord property is set to true and the current is set to the root. (the algorithm is explained below in more detail)
The next is “Alpine” using steps like before we have:
So when we reach “alpha” what happens ->
1. a -> exist in children of root -> don’t add and make a current. Next -> l
2. l -> exists in children of a -> don’t add and make l current. Next -> p
3. p -> exists -> make it current -> next is h
4. h -> doesn’t exist – create one node h in children of p and make i. Next is a
5. a -> doesn’t exist – create one and make it current. Next is ‘ ‘
6. ‘ ‘ -> exists in a and point to null -> so we determine this is the end of word. Do current.SetWord(//sets IsWord = true;) and make the current to be the root.
(The separator points to null because, as mentioned before – each node is initialized with all separators as children pointing to null)
and so on until we get a trie looking like this:
This is good only if many searches will be done. Otherwise the time and memory spent on building the trie will not be worth it. The search will be very fast because on every step large portions of the trie are left outside of search – so time of search will be much faster than straightforward search. The longer the text is, the faster the search – compared to straightforward search.
The search will take each symbol of the word and find the corresponding child in the current node (starting from root) and if the word ends and the current node’s property IsWord == true so the word exists in the original text. Any other case – it doesn’t.
C# implementation for a huge text( in the example I’ve used some native Language files but any file can be used – text file) – here