When it comes to speed, choosing a good algorithm often matters more than the time we spend trying to optimize code. There is an old story about the mathematician Gauss who, as a schoolboy, was told to add the integers from 1 to 100. While the other students laboriously added each number, Gauss realized that 100 + 1 is 101 and 99 + 2 is also 101. Guess what 98 + 3 is? Of course: 101. You can easily see that there are 50 such pairs that each add up to 101, so the answer is 5,050. No matter how fast you add, you are unlikely to beat someone who knows that trick. So here’s a question: you have a large chunk of text and you want to search it. What is the best way?
Of course, this is a loaded question. “Best” can mean many things and will depend on the kind of data you are working with and even the kind of machine you are using. If you are just looking for a string, you could, of course, use the brute-force algorithm. Suppose we want to find the word “criminal” in the text of War and Peace:
1. Start with the first letter of War and Peace.
2. If the current letter of the text is not the same as the current letter of “criminal”, move to the next letter of the text, reset back to the first letter of “criminal”, and repeat this step until you run out of text.
3. If the current letters are the same, move to the next letter of “criminal” and compare it with the next letter of the text, while remembering where you are in the text. If they keep matching, repeat this step until there are no more letters in “criminal” (at which point you have a match). If they stop matching, reset back to the first letter of “criminal”, go back to the text position you remembered, advance one letter, and return to step 2.
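Those steps are easier to show than to describe. Here is a minimal sketch of the same idea in C; this is my own illustration, not the code actually used for the measurements later in the article:

```c
/* A minimal brute-force search along the lines of the steps above.
 * This is a sketch, not the article's actual test code. */
#include <stdio.h>
#include <string.h>

/* return the offset of the first occurrence of pat in text, or -1 */
static long brute_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n)
        return -1;

    for (size_t i = 0; i + m <= n; i++) {       /* current position in the text */
        size_t j = 0;
        while (j < m && text[i + j] == pat[j])  /* walk through the pattern */
            j++;
        if (j == m)                             /* ran off the end: a match */
            return (long)i;
        /* otherwise reset to the start of the pattern and move on */
    }
    return -1;
}

int main(void)
{
    printf("%ld\n", brute_search("war and peace", "and"));  /* prints 4 */
    return 0;
}
```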
In other words, just compare the text with the search string, letter by letter, until you find a match. It works and, in fact, on modern hardware you can write pretty fast code to do it. Can we do better?
Better algorithm
Again, it really depends on your definition of good. Suppose the text contains many strings that are almost what we are looking for, but not quite. For example, War and Peace probably contains the word “the” many times. But it also has “there,” “then,” and “other,” all of which contain our target word. That’s not a big deal for a word as short as “the,” but what if you are sifting through text for very long search strings? (I don’t know; DNA genome data or something.) You will spend a lot of time chasing dead ends. Discovering that the current piece of text matches 199 of the 200 characters you are looking for is frustrating.
There is another problem. While it is easy to tell where the string matches and, therefore, where it does not, it is hard to tell whether just a small amount of text has been inserted or deleted. That matters for tools like diff and rsync, which don’t just want to know what matched; they want to understand why things don’t match.
In fact, it was looking at how rsync compares two files using a rolling checksum that got me thinking about this. While that technique may not suit every application, it is an interesting one to have in your bag of tricks. Clearly, one of the best uses of this rolling checksum is exactly what rsync does with it: it finds out very quickly when files differ, but it can also do a reasonable job of figuring out when they go back to being the same. By rolling the frame of reference, rsync can detect that something was inserted or deleted and make the appropriate changes remotely, saving network bandwidth.
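Here is a toy sketch of that idea in C. To be clear, this is not rsync’s actual algorithm (rsync pairs a weak rolling checksum with a stronger hash and a real transfer protocol); it only illustrates how rolling a window over a modified file can find the blocks of the old file that survived:

```c
/* Toy illustration of the rolling-checksum idea behind rsync.
 * NOT rsync's real algorithm; it just shows how a rolling window
 * lets you find old blocks inside a modified file cheaply. */
#include <stdio.h>
#include <string.h>

#define BLOCK 8  /* unrealistically small block size, for demonstration */

/* simple additive checksum of one block */
static unsigned sum_block(const char *p)
{
    unsigned s = 0;
    for (int i = 0; i < BLOCK; i++)
        s += (unsigned char)p[i];
    return s;
}

int main(void)
{
    const char *old_file = "the quick brown fox jumps over the lazy dog";
    const char *new_file = "the quick red brown fox jumps over the lazy dog";

    size_t old_len = strlen(old_file), new_len = strlen(new_file);
    size_t nblocks = old_len / BLOCK;
    unsigned sums[64];

    /* 1. checksum every block of the old file */
    for (size_t b = 0; b < nblocks; b++)
        sums[b] = sum_block(old_file + b * BLOCK);

    /* 2. roll a window across the new file; the checksum is updated by
     *    subtracting the byte that leaves and adding the byte that enters */
    unsigned s = sum_block(new_file);
    for (size_t i = 0; i + BLOCK <= new_len; i++) {
        if (i > 0)  /* roll: drop new_file[i-1], add new_file[i+BLOCK-1] */
            s = s - (unsigned char)new_file[i - 1]
                  + (unsigned char)new_file[i + BLOCK - 1];

        for (size_t b = 0; b < nblocks; b++)
            if (s == sums[b] &&
                memcmp(new_file + i, old_file + b * BLOCK, BLOCK) == 0)
                printf("old block %zu found at new offset %zu\n", b, i);
    }
    return 0;
}
```

Because the inserted word only disturbs one block, every other block is still located at its new offset, and that is roughly the information a tool like rsync needs to decide what it really has to send.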
In search
However, you can use the same technique to do large text searches. To make it work, you need a hashing algorithm that makes it easy to add and remove items. For example, suppose our checksum algorithm is simplistic: just add the ASCII codes of the letters together. So the string “AAAB” hashes to 65 + 65 + 65 + 66, or 261. Now suppose the next letter in the text is a C, that is, “AAABC”. We can compute the checksum of the string starting at the second position by subtracting the leading A (65) and adding the C (67). That’s silly with such a small data set, of course, but instead of adding hundreds of numbers together every time you want to compute a hash, you can now do it with one addition and one subtraction each.
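Here is that toy example as a few lines of C, just to make the update rule concrete (my own sketch):

```c
/* The simple additive rolling hash described above. */
#include <stdio.h>

int main(void)
{
    const char *text = "AAABC";
    int window = 4;                 /* hash windows of four characters */

    /* hash of "AAAB": 65 + 65 + 65 + 66 = 261 */
    int hash = 0;
    for (int i = 0; i < window; i++)
        hash += text[i];
    printf("hash of AAAB = %d\n", hash);

    /* roll forward one position: drop the leading 'A', add the trailing 'C' */
    hash = hash - text[0] + text[window];
    printf("hash of AABC = %d\n", hash);   /* 261 - 65 + 67 = 263 */
    return 0;
}
```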
We can then compute the hash of our search string and start computing hashes of the file over windows of the same length. If the hash codes don’t match, we know there is no match and we move on. If they do match, we probably have to verify the match, since hashes are, in general, not unique: two different strings can have the same hash value.
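Putting the pieces together, a rolling-hash search looks something like the sketch below. This is my own illustration of the approach using the naive additive hash, not the article’s measurement code:

```c
/* A minimal rolling-hash search as described above (a sketch). */
#include <stdio.h>
#include <string.h>

/* return the offset of the first match of pat in text, or -1 */
static long rolling_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n)
        return -1;

    /* hash the pattern and the first window of the text */
    unsigned phash = 0, thash = 0;
    for (size_t i = 0; i < m; i++) {
        phash += (unsigned char)pat[i];
        thash += (unsigned char)text[i];
    }

    for (size_t i = 0; i + m <= n; i++) {
        if (i > 0)  /* roll the window: drop text[i-1], add text[i+m-1] */
            thash = thash - (unsigned char)text[i - 1]
                          + (unsigned char)text[i + m - 1];

        /* the hash can collide, so verify before declaring a match */
        if (thash == phash && memcmp(text + i, pat, m) == 0)
            return (long)i;
    }
    return -1;
}

int main(void)
{
    const char *text = "it was the criminal, not the czar";
    printf("found at offset %ld\n", rolling_search(text, "criminal"));
    return 0;
}
```

Note the memcmp: the hash only tells you a match is possible; the verification step is what makes the answer trustworthy.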
However, there are some problems with this. If you are only looking for a single string, computing the hashes is expensive. In the worst case, you need a comparison, an addition, and a subtraction for each character of the text, plus some extra tests whenever you hit a hash collision: two strings that have the same hash but don’t actually match. With the conventional scheme, you only need one test per character plus some wasted tests on false positives.
You could optimize by using a fancier hash, but that is also more expensive to compute, which makes the overhead worse. However, what if you were searching the same text for a bunch of different strings of the same length? Then you could compute the rolling hashes once and save them. After that, every search would be very fast because you wouldn’t waste time chasing dead ends only to backtrack.
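As a sketch of what that might look like, assume every search string is exactly the same length and that we keep one hash per window of the text (again, my own illustration, not the article’s code):

```c
/* Sketch of the "compute the hashes once, search many times" idea. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PATLEN 8   /* all search strings must be this long */

static unsigned add_hash(const char *p, size_t len)
{
    unsigned h = 0;
    for (size_t i = 0; i < len; i++)
        h += (unsigned char)p[i];
    return h;
}

int main(void)
{
    const char *text = "the criminal and the innocent criminal";
    size_t n = strlen(text);
    size_t windows = n - PATLEN + 1;

    /* pay the rolling-hash cost once: one hash per window of the text */
    unsigned *h = malloc(windows * sizeof *h);
    if (!h)
        return 1;
    h[0] = add_hash(text, PATLEN);
    for (size_t i = 1; i < windows; i++)
        h[i] = h[i - 1] - (unsigned char)text[i - 1]
                        + (unsigned char)text[i + PATLEN - 1];

    /* now each search is one comparison per window plus verification */
    const char *patterns[] = { "criminal", "innocent" };
    for (size_t p = 0; p < 2; p++) {
        unsigned ph = add_hash(patterns[p], PATLEN);
        for (size_t i = 0; i < windows; i++)
            if (h[i] == ph && memcmp(text + i, patterns[p], PATLEN) == 0)
                printf("%s at offset %zu\n", patterns[p], i);
    }
    free(h);
    return 0;
}
```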
My hash algorithm is very simple, but not very good. As you can see from the example, every false positive costs an extra comparison to rule out. Better hash algorithms exist, of course, but there is always the possibility of collisions.
So what difference does this hashing technique actually make? Well, I decided to write some short code to find out. I chose to ignore the cost of computing the initial hash of the search pattern and the initial window hash, because those wash out to nearly nothing over enough iterations.
Convicted
If you search for the word “criminal” in Project Gutenberg’s text of War and Peace, you will find that it occurs only four times in about 3.3 million characters. A conventional search had to do about 4.4 million character comparisons to establish that. The hash algorithm wins easily, at just under 4.3 million. But the cost of computing the hashes spoils it: if you count an addition and a subtraction as equal to two comparisons, the total comes to about 5.8 million pseudo-comparisons.
Is that typical? Probably not, since there aren’t many false positives for “criminal.” If you run the code with the word “the,” where there should be lots of false hits, the conventional algorithm does about 4.5 million comparisons while the total for the hash algorithm comes to about 9.6 million. So you can see how false positives hurt the hashing algorithm.
You will notice that my lazy hash algorithm produces a lot of false hash positives, which erodes some of the benefit. A fancier algorithm would help, but it also costs more computation up front, so it doesn’t help as much as you might think. Almost any hash over arbitrary strings will collide with something. Of course, for short enough search strings the hash could simply be the search string itself, which would be perfect, but in general that isn’t practical.
The code does not store the hashes, but suppose it did, and suppose the false positive rate of later searches is roughly the same as in the first search. That means that once the hashes are precomputed, we save a little over 100,000 comparisons per search. So once you search for 60 or more strings, you break even. If you search for 600 strings (remember, they all have to be the same length), you save quite a bit over the simple comparison code.
I didn’t actually time anything, because I didn’t want to get caught up in optimizing every bit of code. In general, though, fewer operations are going to be better than more operations. There are plenty of ways to make the code more efficient, and there are heuristics you can apply if you do a little analysis of the search string. But I just wanted to check my gut-level feeling for how much work each algorithm spends on a text search.
Reflection
I started thinking about this after reading the code for rsync and the backup program kup. As it turns out, there is a name for this technique: the Rabin-Karp algorithm. There are good hash functions for it that reduce false positives and pick up a few extra points of efficiency.
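For instance, one common choice is the polynomial rolling hash usually presented with Rabin-Karp: treat each window as a number in some base, reduced modulo a large prime. The sketch below is my own illustration, and the base and modulus are arbitrary choices:

```c
/* Polynomial rolling hash in the style of textbook Rabin-Karp (a sketch). */
#include <stdio.h>
#include <string.h>

#define BASE 256ULL
#define MOD  1000000007ULL   /* a large prime keeps collisions rare */

int main(void)
{
    const char *text = "war and peace";
    const char *pat  = "and";
    size_t n = strlen(text), m = strlen(pat);

    /* precompute BASE^(m-1) mod MOD, used to remove the outgoing character */
    unsigned long long high = 1;
    for (size_t i = 1; i < m; i++)
        high = (high * BASE) % MOD;

    unsigned long long ph = 0, th = 0;
    for (size_t i = 0; i < m; i++) {
        ph = (ph * BASE + (unsigned char)pat[i]) % MOD;
        th = (th * BASE + (unsigned char)text[i]) % MOD;
    }

    for (size_t i = 0; i + m <= n; i++) {
        if (i > 0) {
            /* roll: remove text[i-1] from the top, shift, add text[i+m-1] */
            th = (th + MOD - ((unsigned char)text[i - 1] * high) % MOD) % MOD;
            th = (th * BASE + (unsigned char)text[i + m - 1]) % MOD;
        }
        if (th == ph && memcmp(text + i, pat, m) == 0)
            printf("match at offset %zu\n", i);
    }
    return 0;
}
```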
So what’s my takeaway? I’m not suggesting that a Rabin-Karp search is the best approach for most things. To get the most out of it, you need a large data set and many searches for strings of a specific size. If you think about something like rsync, it is really using the hashes to search for the places where two very long strings might be equal. But I think there are cases where these oddball algorithms make sense, so they are worth knowing about. It is also fun to challenge your intuition by writing a little code and getting some idea of how much better or worse one algorithm is than another.