Thoughts on implementing your own search engine

Imagine you have a big database of products which you want to make accessible to your customers via a search engine - what would you do? Of course you can bootstrap a first working version using a third-party solution, as we did with Swiftype. However, as with most other third-party solutions, you'll eventually hit a point where the third-party doesn't satisfy your needs anymore. In our case our "need" was simply improved search results tailored for our usecase, but because there's not many settings you can tweak in Swiftype and we couldn't find a viable alternative we decided to roll our own search engine. I had a chat with someone who feels very comfortable with databases and search engines in general and he gave me some tips to make our new search shine. Here's what he told me / what we eventually came up with:

Tokenization

The first and probably most important step is the tokenization of your data. It's translating a string like "first-party search engines rock" into a set of strings"first, party, search, engines, rock". So basically you split the string into all its words. Sounds easy? Almost, if it weren't for compound words and other funny language-specific characteristics. Compound words are not so common in English, even so that I can't think of one right now? But in German they are VERY common. If you can't handle compound words then your search probably sucks for German data. So our solution to this problem is to first create a set of "known words". In our case, we used the data we want to tokenize in order to tokenize it. Inception. So if we have a product called "milk" and another called "milkshake" we split the latter into "milk" and "shake" because we know that "milk" is a real word. Obviously this can also lead to false positives where you don't want to associate a product with "milk" although its name contains those characters, but that is a whole set of new problems we won't address today. Another set of "known words" could come from previous queries of your users. However, you should handle those separately, i.e. keep a set of "clean" and "dirty" keywords.

Stemming

The next step would be to "normalize" words so that "run", "running" and "ran" are all associated with "run". This is called stemming. We skipped this step because it doesn't make much sense for our kind of data (mostly names, so no verbs) and is quite hard to implement - usually involving some kind of dictionary.

Scoring

What you want to do next is to score your tokens so that we only need to query some kind of database in order to get the results later. For each token you count how often it appears for one product and then assign that as a score for this particular keyword for this product. For example a product called "yummy milk", which is in a category of products called "milk products" you assign a score of 2 for the keyword "milk" for the product "yummy milk". Bonus: weight your score by assigning a higher score if a keyword appears in certain fields (e.g. increase the score by 2 if the keyword appears in the name of the product and only increase by 1 if it appears in the category).

Phonetics

Now we have lots of keywords per product, but what if the user searches for something we don't have a keyword for or if he mistypes his query? First, we create phonetics for each keyword and store that. It's how a word is pronounced, so if the user searches for a word that sounds similar to one of our keywords we'll still be able to return a result. For German data you can use "cologne phonetics".
So what about typos like "mlik" instead of "milk"? We store each keyword with its characters sorted. Boom. Both "mlik" and "milk" will be translated to "ilkm" first so they both return the same results. Again, this can lead to false positives in some cases.

Further improvements

Other things you can do to improve your search engine (not yet implemented by us):
- scrape a word's synoms from Wiktionary and apply them during tokenization
- generate possible typos for a word by looking at the keys surrounding a character, e.g. for "i" in "milk" there is "u", "o", "j" and "k" around it on the keyboard, so we create alternative keywords: "mulk", "molk", ... you get the idea.

The final improvement that would improve your search engine: natural language processing. Someone who is searching for "milk" is probably only interested in actual milk, nothing else. No idea how to implement that, only Google knows I guess...

Just for fun, here's what our search data looks like in Google AppEngine Search: