The Great database debate –What’s the deal ?

When I look at the online game industry and its dynamics, it reminds me somewhat of earlier innovations – pivoting around data management – that took place in other industries. Consider for example, the automation of reservation systems in airlines with SABRE, or the blinding competitive advantage that Wal*Mart build up (and enjoys even today) with its automated supply chain and real time inventory management. These were among other companies and products with visions about the potential power of data. Capturing and storing data is one thing, realizing the competitive advantage of data is another.

Ok, so what does it mean to the online gaming industry?

I have had many a conversation and I still see a lack of “thinking through the database strategy”. Some in the industry are now realizing that the data is strategic to their business and that in terms of game performance, the database engine needs to perform and scale like their game engines. Yet, many companies out there seem to settle on their decision with open source, specifically, MySQL, too quickly. Here are some of the things I usually get:

  • It’s free
  • It does “most” of what we want it to do
  • Google, Wikipedia use MySQL – it scales for them!

Now I will say upfront that I have nothing personally against MySQL – there are scenarios where it meets the needs of certain business requirements adequately. But there’s a lot of omission of detail in much of these statements (MySQL does advertise that Google uses them) that is actually presenting a lot of misleading conclusions – I think it’s probably time to put things in the right perspective. So , you might want to grab that cup of coffee and a snack to bite on , this could be some long reading.

Let’s put cost in perspective – “It’s free” refers to the licensing cost of acquiring the open source community edition. Most people I know always sign up for some level of support which has to be paid for. You then need to install it, run it, design database schemas, load data and maintain it. That’s investment in time and DBA resources, aka total cost of ownership. Calculate those costs for at least three years out for all the instances that need to be deployed and managed (and the supporting hardware and people needed to manage it). Add to that any additional costs for upcoming requirements such as cluster support or high availability. That’s the price you are going to have to pay. That, my friends, does not spell FREE.

In terms of “it does most of what we want it to do” – the biggest risk you face there is working around capabilities that the database management system doesn’t provide (but ought to). Think about it, this can not only be a huge cost, but also sub-optimal in terms of performance and efficiency. It means:

  • that prior to MySQL 5.0 (which is very recent), code that dbmses were efficient at executing such as stored procedures and triggers had to be written in application logic. This further means that business logic that needs to be shared, cannot be and have to be made redundant across applications
  • that additional support components for functionality such as load balancing and failover have to be acquired or again coded in the application.

If you can see through that, you realize that those are hidden costs that are not accounted for. Another risk that I have seen materialize is when customers find that a critical capability they need is missing, they have to invest in porting a lot of their code  – adding to the cost bucket significantly.

Now on to some interesting stuff – Google and Wikipedia!

First, yes, (a) Google uses MySQL and yes (b) Google scales but concluding (b) happens because of (a) – er, that’s really stretching it and is not factual.

Google has tens of billions of web pages, over 8 billion indexes, Google Earth data and Google Analytics and of course Google’s AdWords application. The only publicly cited application of MySQL at Google is AdWords. To my knowledge, once you sign up for AdWords and fill in your ad and keywords, then it’s pretty much a read-only data model where the ads are looked up and placed in Google pages/results and websites that cater to AdSense. Google’s infrastructure is much more beyond the AdWords application.

So Google does not scale on MySQL. In factt I did some research and have come to learn that Google scalability is built on:

  1. Google File System (distributed file system clusters)
  2. BigTable ( a specific data management system for a specific need)
  3. Chubby – Distributed Lock service
  4. MapReduce

They need the combination of the above four to store and scale large amounts of data – essentially they needed to scale across hundreds of thousands of distributed servers that collectively store petabytes of data.

Without going into too much detail, (there’s lot of info already on the internet) BigTable is a persistent multi-dimensional sorted map – the map is indexed by a row key , a column key and timestamp. In order to perform parallel computations on the data stored in BigTable, Google uses a software framework called MapReduce. All this runs on top of Google’s distributed file system cluster.

As you can see, Google made huge investments in building their own technology to scale their infrastructure – including their own data management system – BigTable. So there you go.

Wikipedia is built on top of the MediaWiki application server. Again Wikipedia does not scale on MySQL alone, in fact, there is a rich set of supporting scaling infrastructure that sits above MySQL:

  1. A complete content delivery and acceleration network front-ended with caching servers – squid servers, load-balancing with Linux Virtual service – this load balances between the CDN, Application (MediaWiki) and Search (Lucene)
  2. PowerDNS for geographical distribution
  3. MediaWiki Application Server
  4. memcached for distributed object cache
  5. A separate Fulltext search component using Lucene – because MySQL fulltext is only available on MySQL ISAM and does not perform or scale as well as Lucene
  6. Separate Media storage and server for static files including images
  7. And finally, MySQL to host the core database – one database per wiki – core data consists of metadata, users, article revision history, article references. 90% of the queries will be single table lookups on primary keys – there are roughly about 12 tables per core database. When an edit is submitted, mediawiki writes it to the database, without deleting previous versions.

So how does Wikipedia scale? Look closely it’s a combination of Squid caching servers + MediaWiki + Lucene+MySQL+memcached.

A lot of hits are serviced by the cache and they “avoid” expensive database queries.

So in summary this works for WikiPedia because:

  1. of lots of cacheable content.
  2. scaling users is almost completely handled at the application level – MediaWiki and powerDNS
  3. database requirements are very simple

The beauty with Informix is that it does not require applications to “help” out and does not need to rely on a host of supporting software to get work done. On the other hand, if the enterprise uses third party solutions as part of their infrastructure, Informix works well and complements many third party software and hardware high availability and scalability solutions. So stay tuned for a look at how built in Informix capabilities work to maximize your applications performance, scaling and availability requirements.

I hope that gives things a little bit more perspective. And hopefully, you made it through without a refill of your coffee cup! At the end of the day, what works for Google works for Google’s business needs, what works for Wikipedia works for Wikipedia’s needs. Your business is your business and the infrastructure needs have to be thought through at every level.

Stay tuned for more !!

Leave a Comment