Friday, June 3, 2016

AppEngine 101: Datastore Consistency

The beauty of cloud solutions like AppEngine and its database, Datastore, is that they just scale. Datastore does indeed scale very well, but it does so by imposing a few restrictions. One of them is "eventual consistency", something you're probably not used to if you come from conventional databases like MySQL.

What does eventual consistency mean?

Here's a really simple example to describe it: You have a table called Messages where you store messages sent by the users of your website (a chatroom, or guestbook, etc). When the page is reloaded you query all data from the Messages-table and display it. Someone enters a message and it's stored in the database. The page is reloaded moments later and all Messages are looked up using a query. The recently stored message does not show up though. Because eventual consistency!

After changing (creating, updating, deleting) data in your database, queries executed moments later might (!) not return the latest data in some cases. Eventually, though, it is going to return those changes. It might be nanoseconds, seconds, ... later.

Your first thought might be that this is awful, but it isn't. It is what allows us to scale virtually infinitely. Eventual consistency is completely fine for lots of use cases: Facebook News Feed (who cares if those status updates show up a few moments earlier or later?), or even static data (a shop which changes its product assortment only once a week during a maintenance timeframe).
Of course there are times when consistent data is crucial: anything involving real money, mission-critical data used for real-time status monitoring, etc. This is why Datastore has "Transactions". Every database action executed within a transaction is consistent. However, if consistency can't be assured because someone else is changing the same data at the same time, the transaction fails and you have to retry it, for example.
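
To make the "retry" part a bit more concrete, here is a minimal sketch of a retry loop in plain Java. It does not use the Datastore API at all: the helper name runWithRetries, the backoff values and the use of ConcurrentModificationException as the "someone else changed the data" signal are assumptions for illustration. In a real app the unit of work would begin a transaction, read and write, and commit.

```java
import java.util.ConcurrentModificationException;
import java.util.function.Supplier;

public class TransactionRetry {

    /**
     * Runs the given unit of work and retries it when it fails with a
     * ConcurrentModificationException (assumed here to signal a conflict).
     * Gives up and rethrows the last conflict after maxAttempts tries.
     */
    public static <T> T runWithRetries(Supplier<T> work, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConcurrentModificationException lastConflict = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Hypothetical: begin transaction, read, write, commit.
                return work.get();
            } catch (ConcurrentModificationException conflict) {
                lastConflict = conflict;
                if (attempt < maxAttempts) {
                    pause(10L * attempt); // simple linear backoff
                }
            }
        }
        throw lastConflict;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        String result = runWithRetries(() -> {
            // Simulate a conflict on the first two attempts.
            if (++calls[0] < 3) {
                throw new ConcurrentModificationException("simulated conflict");
            }
            return "committed after " + calls[0] + " attempts";
        }, 5);
        System.out.println(result); // committed after 3 attempts
    }
}
```

The unit of work has to be safe to run more than once, because every retry executes it again from the start.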

For a much more detailed explanation check out this article: Balancing Strong and Eventual Consistency with Google Cloud Datastore

Tuesday, May 31, 2016

Java 101: Collections

One thing you really have to understand when learning Java is its data structures, commonly known as Collections in Java. As the name suggests, Collections are classes which allow you to collect a certain amount of data. If you're coming from JavaScript for example, it's what you refer to as arrays and objects. There are tons of different Collections available in the standard libraries of Java - even more are available from third-party sources - but there are only a few types and implementations of them which you really have to know by heart for day-to-day development. Let's discuss them briefly one by one:

List

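A list has an order and does allow duplicate values*. The two most commonly used implementations are ArrayList and LinkedList. ArrayList is the sensible default: it grows automatically, so you don't have to know upfront how much data will be added to the list later (if you do know, you can pass the expected size to its constructor to avoid some re-allocations). LinkedList is mostly worth considering if you frequently add or remove elements at the beginning or end of the list.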

Set

A set has no order and does not allow duplicate values*. You usually use it in the form of a HashSet. JavaScript has had a comparable built-in Set since ES2015; before that there was nothing comparable.

Map

A map has no order and does not allow duplicate keys* (the same value may be stored under several keys, though). HashMap is how you use it in most cases. Each value is assigned a key and can be accessed efficiently using that key. In JavaScript, an object behaves very similarly.

*when I talk about order and duplicate values, I'm talking about the most common implementations. Theoretically you could have each type of collection with any characteristics you want, based on its implementation.

How do you decide which one to use?

Given enough experience in hands-on programming you instinctively know which data structure to use for a certain task. Here are some rules of thumb for quick reference:
  1. for a certain amount of data without duplicates, which you want to access efficiently (e.g. using the contains-method), use a Set
  2. for a certain amount of data without any further requirements (duplicates allowed, no specific performance requirements) use a List. ArrayList is the usual choice; consider LinkedList only if you mostly add and remove elements at the ends of the list.
  3. if you have to look up data repeatedly (not iterate over it one by one, but get a specific item) use a Map where you assign a key which is easy to create and look up your data using that.
Here's a short code snippet which I hope makes it clear what I'm talking about:
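
A minimal sketch of the three rules of thumb in Java. The class name, the helper methods distinct and countOccurrences, and the sample data are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollectionsDemo {

    /** Set: drops duplicates and offers a fast contains(). */
    public static Set<String> distinct(List<String> items) {
        return new HashSet<>(items);
    }

    /** Map: each item is a key, the value counts how often it occurred. */
    public static Map<String, Integer> countOccurrences(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // List: keeps insertion order and allows duplicates.
        List<String> messages = new ArrayList<>();
        messages.add("hello");
        messages.add("world");
        messages.add("hello");
        System.out.println(messages.size()); // 3
        System.out.println(messages.get(1)); // world

        Set<String> unique = distinct(messages);
        System.out.println(unique.size());            // 2
        System.out.println(unique.contains("world")); // true

        Map<String, Integer> counts = countOccurrences(messages);
        System.out.println(counts.get("hello")); // 2
        System.out.println(counts.get("world")); // 1
    }
}
```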

I found that beginnersbook.com has a few nice examples for each of the aforementioned implementations. Check them out for more details on each of them.

Monday, May 30, 2016

Mac OSX Virtual Machine guest lags

If you have a Mac VM running as a guest on a non-Mac host (Windows in my case) you might experience some serious lag and graphic glitches like I did. Fixing it was pretty easy by disabling "BeamSync" - the Mac-equivalent to VSync if I understood that correctly. Anyway, you don't need it for normal Mac usage (i.e. no videos, games, etc I guess), so download "BeamSyncDropper" and keep on using your Hackintosh efficiently :)

You can download the tool + read instructions on how to use it here: http://www.tonymacx86.com/threads/beamsyncdropper-tool-to-disable-beamsync-permanently.92201/page-2#post-666095

PS: Mac VMs are great for developers who only develop for iOS if they really have to! ;)

Saturday, May 28, 2016

Thoughts on implementing your own search engine

Imagine you have a big database of products which you want to make accessible to your customers via a search engine - what would you do? Of course you can bootstrap a first working version using a third-party solution, as we did with Swiftype. However, as with most other third-party solutions, you'll eventually hit a point where the third party doesn't satisfy your needs anymore. In our case our "need" was simply improved search results tailored to our use case, but because there aren't many settings you can tweak in Swiftype and we couldn't find a viable alternative, we decided to roll our own search engine. I had a chat with someone who feels very comfortable with databases and search engines in general and he gave me some tips to make our new search shine. Here's what he told me / what we eventually came up with:

Tokenization

The first and probably most important step is the tokenization of your data. It's translating a string like "first-party search engines rock" into a set of strings: "first", "party", "search", "engines", "rock". So basically you split the string into all its words. Sounds easy? Almost, if it weren't for compound words and other funny language-specific characteristics. Compound words are not that common in English (I can hardly think of one right now), but in German they are VERY common. If you can't handle compound words then your search probably sucks for German data. Our solution to this problem is to first create a set of "known words". In our case, we used the data we want to tokenize in order to tokenize it. Inception. So if we have a product called "milk" and another called "milkshake", we split the latter into "milk" and "shake" because we know that "milk" is a real word. Obviously this can also lead to false positives where you don't want to associate a product with "milk" although its name contains those characters, but that is a whole new set of problems we won't address today. Another set of "known words" could come from previous queries of your users. However, you should handle those separately, i.e. keep one set of "clean" and one set of "dirty" keywords.
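
A rough sketch of both steps in Java, not our actual implementation: the class and method names are made up, and splitting at the longest known word a token starts or ends with is just one possible strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class Tokenizer {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /**
     * Splits a compound at the longest known word it starts or ends with.
     * Tokens without a known prefix or suffix are returned unchanged.
     * If several known words of equal length match, which one wins is unspecified.
     */
    public static List<String> splitCompound(String token, Set<String> knownWords) {
        String best = null;
        for (String known : knownWords) {
            boolean matches = !known.isEmpty()
                    && token.length() > known.length()
                    && (token.startsWith(known) || token.endsWith(known));
            if (matches && (best == null || known.length() > best.length())) {
                best = known;
            }
        }
        if (best == null) {
            return Arrays.asList(token);
        }
        if (token.startsWith(best)) {
            return Arrays.asList(best, token.substring(best.length()));
        }
        return Arrays.asList(token.substring(0, token.length() - best.length()), best);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("first-party search engines rock"));
        // [first, party, search, engines, rock]

        Set<String> knownWords = new HashSet<>(Arrays.asList("milk"));
        System.out.println(splitCompound("milkshake", knownWords)); // [milk, shake]
        System.out.println(splitCompound("milk", knownWords));      // [milk]
    }
}
```

Note that this sketch happily produces the false positives mentioned above: any token that merely starts or ends with a known word gets split.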

Stemming

The next step would be to "normalize" words so that "run", "running" and "ran" are all associated with "run". This is called stemming. We skipped this step because it doesn't make much sense for our kind of data (mostly names, so no verbs) and is quite hard to implement - usually involving some kind of dictionary.

Scoring

Next, you want to score your tokens so that you only need to query some kind of database in order to get the results later. For each token, count how often it appears for a product and assign that count as the score of this keyword for this product. For example, for a product called "yummy milk" in a category called "milk products", you assign a score of 2 to the keyword "milk" for the product "yummy milk". Bonus: weight your score by assigning a higher score if a keyword appears in certain fields (e.g. increase the score by 2 if the keyword appears in the name of the product and only by 1 if it appears in the category).
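
The scoring described above can be sketched in a few lines of Java. The class name, the method signature and the two-field layout (name and category) are assumptions for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordScorer {

    /**
     * Adds nameWeight for every occurrence of a token in the product name
     * and categoryWeight for every occurrence in the category.
     */
    public static Map<String, Integer> score(List<String> nameTokens,
                                             List<String> categoryTokens,
                                             int nameWeight,
                                             int categoryWeight) {
        Map<String, Integer> scores = new HashMap<>();
        for (String token : nameTokens) {
            scores.merge(token, nameWeight, Integer::sum);
        }
        for (String token : categoryTokens) {
            scores.merge(token, categoryWeight, Integer::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> name = Arrays.asList("yummy", "milk");
        List<String> category = Arrays.asList("milk", "products");

        // Plain counting: "milk" appears once in the name and once in the category.
        System.out.println(score(name, category, 1, 1).get("milk")); // 2

        // Weighted: name counts 2, category counts 1.
        System.out.println(score(name, category, 2, 1).get("milk")); // 3
    }
}
```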

Phonetics

Now we have lots of keywords per product, but what if users search for something we don't have a keyword for, or mistype their query? First, we create phonetics for each keyword and store that. It's how a word is pronounced, so if the user searches for a word that sounds similar to one of our keywords we'll still be able to return a result. For German data you can use "Cologne phonetics".
So what about typos like "mlik" instead of "milk"? We store each keyword with its characters sorted. Boom. Both "mlik" and "milk" will be translated to "iklm" first so they both return the same results. Again, this can lead to false positives in some cases.
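
The sorted-characters trick is tiny in Java. The class and method names are made up, and lower-casing the word first is an extra assumption:

```java
import java.util.Arrays;
import java.util.Locale;

public class TypoKey {

    /** Returns the characters of the word in sorted order, e.g. "milk" -> "iklm". */
    public static String sortedKey(String word) {
        char[] chars = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(sortedKey("milk")); // iklm
        System.out.println(sortedKey("mlik")); // iklm
    }
}
```

Every anagram of a keyword ends up with the same key, which is exactly where the false positives come from.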

Further improvements

Other things you can do to improve your search engine (not yet implemented by us):
- scrape a word's synonyms from Wiktionary and apply them during tokenization
- generate possible typos for a word by looking at the keys surrounding a character, e.g. for "i" in "milk" there are "u", "o", "j" and "k" around it on the keyboard, so we create alternative keywords: "mulk", "molk", ... you get the idea.
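
Since we haven't implemented the keyboard idea ourselves, the following Java sketch is only a guess at what it could look like. The class name is made up and the neighbor table only contains the entry for "i" from the example above; a real one would need every key of the keyboard layout you care about.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class KeyboardTypos {

    // Deliberately incomplete neighbor table: only the "i" example is filled in.
    private static final Map<Character, String> NEIGHBORS = new HashMap<>();
    static {
        NEIGHBORS.put('i', "uojk");
    }

    /** Replaces each character, one at a time, with each of its keyboard neighbors. */
    public static Set<String> variants(String word) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            String nearby = NEIGHBORS.get(word.charAt(i));
            if (nearby == null) {
                continue; // no neighbors known for this character
            }
            for (char replacement : nearby.toCharArray()) {
                result.add(word.substring(0, i) + replacement + word.substring(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("milk")); // [mulk, molk, mjlk, mklk]
    }
}
```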

The final step that would improve your search engine: natural language processing. Someone who is searching for "milk" is probably only interested in actual milk, nothing else. No idea how to implement that, only Google knows I guess...

Just for fun, here's what our search data looks like in Google AppEngine Search:
columns: sorted characters of a keyword used as a field name. rows: score for each keyword per product

Thursday, December 31, 2015

How to get the most out of CloudFlare

What are the benefits of using CloudFlare?

Quite a few things, all of them being free to use:

Using CloudFlare: performance comparison before and after

Unfortunately blogs hosted by Blogger are designed in a way that makes it impossible for CloudFlare to fully optimize them: static resources (JavaScript, images, etc) are hosted on dozens of different domains, but CloudFlare can only optimize content hosted on your own domain. There are still a few things CloudFlare can do, most importantly: delaying JavaScript until the page is loaded, which reduces the time it takes to see content on your website.

Here's the raw data:
before enabling CloudFlare - WebPagetest
after enabling CloudFlare - WebPagetest

before enabling CloudFlare - PageSpeed Insights
after enabling CloudFlare - PageSpeed Insights

How to configure CloudFlare?

  1. Sign up at cloudflare.com
  2. Follow CloudFlare setup
    1. Add your domain
    2. Make sure they imported all DNS entries for your domain (about half of them were missing in my case). Also make sure the "Status" of each entry is an orange cloud icon. That means that all traffic is going through CloudFlare's servers. Only if you enable this will you benefit from the features offered by CloudFlare - otherwise it's just a plain DNS server.
    3. Change nameserver at your domain registrar
  3. Configure CloudFlare
    1. Default settings are mostly fine. I turned down "Security Level" to "Low" because I want to avoid false positives where some of my visitors have to enter a captcha before reading my blog...
    2. Turn on "Auto Minify" for HTML, JS and CSS. There's almost no risk of breaking something, unless you're doing funky stuff in your JavaScript (which you shouldn't be doing anyway).
    3. Wait for the DNS changes to kick in (1-2 days), see if everything still works fine and then give "Rocket Loader" a try. Set it to "Automatic", force-reload your website and see if everything works as expected.
    4. Create a "Page Rule" for "yourdomain.com/*" (e.g. "tomtasche.at/*") and set "Custom Caching" to "Cache everything"