In his thinking about Redis, Kenne Jima argues that the next holy grail for the NoSQL trend will be to evaluate how a system behaves when the amount of data to manage is ten times bigger than the available RAM.
This big picture view is really interesting, but from my limited experience of "web scale" it has one big issue: it assumes that all web scale problems are similar in nature. Let me explain with some examples of web scale systems:
- a search engine like Google;
- twitter;
- a blog hosting platform;
- a large analytics system.
These four examples all operate at web scale, but with very different usage patterns, which makes developing a generic answer to the problem very difficult. Keep in mind that in the following train of thought, the scale is the web, that is, at least hundreds of GB of data.
The Search Engine
The search engine is the meta web scale system, as its goal is to make sense of the complete web. It can be decomposed into two parts:
- the index, fully in RAM.
- the raw data and associated support data, kept on disk when not hot and in use.
When you think about it, storing the index in RAM is a totally different problem from storing the raw data and support data on disk, so one system covering both needs is nearly unfeasible. You also need the index fully in RAM if you want millisecond response times. It was said that one search request could hit up to 1000 servers in the Google farm; if you hit them in a tree fashion, depending on your buckets, you may have say a 5 to 10 server depth. If each level requires 3 disk seeks (10 ms each), plus the read of the data, maybe 2 ms of communication latency, and the time to aggregate the results, and since the request fans out to hundreds of servers so the slowest path dominates, you would very often end up with something like 3 to 5 seconds of waiting time to get an answer.
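A quick back-of-the-envelope model makes the point. The per-level figures are the illustrative numbers above plus a couple of assumed values for reading and aggregation, not measurements of any real system:

```python
# Back-of-the-envelope latency model for a disk-bound, tree-shaped index lookup.
# All figures are the illustrative ones from the text, not measured values.

SEEK_MS = 10         # one disk seek
SEEKS_PER_LEVEL = 3  # seeks needed at each level of the tree
READ_MS = 5          # assumed time to read the data after seeking
NETWORK_MS = 2       # communication latency between levels
AGGREGATION_MS = 3   # assumed time to merge partial results at each level

def lookup_latency_ms(depth: int) -> float:
    """Latency of one request walking `depth` levels of servers serially."""
    per_level = SEEKS_PER_LEVEL * SEEK_MS + READ_MS + NETWORK_MS + AGGREGATION_MS
    return depth * per_level

for depth in (5, 10):
    print(f"depth {depth:2d}: ~{lookup_latency_ms(depth):.0f} ms per request")

# With contention, queuing, and the slowest of up to ~1000 servers dominating
# each fan-out, the observed waiting time easily grows into multiple seconds.
```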
So, the index is in RAM, fully customized to do one thing and do it well.
Twitter/Timeline System
In the post about the Twitter streaming API, we learn that Twitter produces about 35 KB of data per second (including the packaging of the data in JSON format). But when you really think about Twitter usage, it is:
- very dynamic for about 48h;
- totally dead for tweets older than that (even Twitter search does not return them).
So, they only need to be very efficient with 48 × 3600 × 35 KB ≈ 6 GB. Yes, only about 6 GB of data. You can multiply that by 100 to 600 GB if you want: since you can rent a system with 12 GB of RAM for 100€ per month, you can keep it all in RAM for about 5000€ per month. Of course, at such a scale you have your own staff and colocation, and it gets even cheaper.
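The arithmetic in one place, as a minimal sketch; the 35 KB/s rate and the rental price are the figures assumed above:

```python
# Hot data volume for a Twitter-like timeline, using the figures from the text.
RATE_KB_PER_S = 35         # streaming output, JSON packaging included
HOT_WINDOW_S = 48 * 3600   # tweets stay "hot" for about 48 hours

hot_data_gb = RATE_KB_PER_S * HOT_WINDOW_S / 1_000_000
print(f"hot data: ~{hot_data_gb:.1f} GB")   # ~6 GB

# Even at 100x that volume, keeping everything in RAM stays affordable
# (assumed rental price: 12 GB of RAM for 100 EUR/month).
target_gb = 600
servers = target_gb / 12
print(f"{servers:.0f} servers, ~{servers * 100:.0f} EUR/month")
```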
So, Twitter only needs to keep a small fraction of its data hot in RAM, with nearly no need to access the old data. This makes the access performance of the old data a non-problem: one or two seeks will be fine.
A Blog Hosting Platform
The blog hosting platform is basically disk bound. The general usage pattern, outside of the editors, is: "search something on Google, click, see the page, go somewhere else". You may have a search engine (mostly used by the editors) and a timeline (for the feeds), but these fall into the search engine and Twitter cases described above.
In this case, all you need is an RDBMS (or simple file-system-based storage) with asynchronous updates of the indexes/feeds. It is simple, and any kind of well designed "bulk storage" will work. You will be able to go a long way with Lucene for search, so you do not really need more.
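As a minimal sketch of that asynchronous pattern, assuming nothing about the actual storage or search product (the store and index classes below are stand-ins):

```python
# Minimal sketch of "store now, index later": the write path only touches the
# primary storage, while a background worker updates the search index later.
import queue
import threading

class PrimaryStore:
    def insert(self, post: dict) -> None:
        print("stored:", post["title"])    # stand-in for the RDBMS/file write

class SearchIndex:
    def add_document(self, post: dict) -> None:
        print("indexed:", post["title"])   # stand-in for e.g. a Lucene update

store, index = PrimaryStore(), SearchIndex()
index_queue: queue.Queue = queue.Queue()

def save_post(post: dict) -> None:
    """Synchronous write path: persist first, defer the index/feed update."""
    store.insert(post)
    index_queue.put(post)                  # indexing does not block the visitor-facing path

def index_worker() -> None:
    while True:
        post = index_queue.get()
        index.add_document(post)
        index_queue.task_done()

threading.Thread(target=index_worker, daemon=True).start()
save_post({"title": "Hello web scale"})
index_queue.join()                         # wait for the background indexing (demo only)
```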
A Large Analytics System
This one is interesting: you have a lot of data, and when you process it you really need to process it fast. Here, a column-oriented design, with the capacity to compress data per column, is needed to minimize seek time. This is the case where you really need an extremely efficient 1:10 ratio. But a generic NoSQL store will not help you, because most are not column oriented and do not offer the per-column compression needed to achieve high performance. You end up using a system designed for column-based access.
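To see why the layout matters, here is a toy comparison of row versus column storage for a scan that only touches one column. The data and sizes are illustrative, not benchmarks of any particular system:

```python
# Toy illustration of row-oriented vs column-oriented storage for an analytics scan.
# Scanning one column out of many only has to read (and decompress) that column
# when the data is stored column-wise; row storage drags every field along.
import zlib

rows = [{"user_id": i, "country": "FR", "amount_cents": (i * 7) % 500}
        for i in range(100_000)]

# Row-oriented: every record serialized whole.
row_blob = "\n".join(f"{r['user_id']},{r['country']},{r['amount_cents']}"
                     for r in rows).encode()

# Column-oriented: one contiguous, individually compressed blob per column.
columns = {name: "\n".join(str(r[name]) for r in rows).encode()
           for name in ("user_id", "country", "amount_cents")}

print("row blob, compressed:       ", len(zlib.compress(row_blob)), "bytes")
for name, blob in columns.items():
    print(f"column {name!r:15} compressed:", len(zlib.compress(blob)), "bytes")

# Summing 'amount_cents' only needs its small, highly compressible column,
# which is why columnar engines minimize I/O and seeks for analytics.
```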
Why So Many NoSQL Systems, and Why It Does Not Matter
So, different data access patterns, coming from different business needs and different constraints, are the reason we have so many alternatives. Most of these alternatives were developed by a company to solve a specific problem when it was already at web scale (or on the trajectory to reach it); they answer the sweet-spot requirements of that company's business goals.
But which one you pick does not matter much. The real point is that you need a reasonably well designed system, with good base performance (nearly all of them provide that now), which lets you develop fast and answer your business goals, because the day you approach web scale you will need to go custom anyway. So your building blocks need to play well and integrate with others to make the transition to web scale.
The Future is Integrated
Thrift, Protocol Buffers: why are the big players developing such systems? Because the real key to achieving web scale is integrating many components efficiently. The same way you, as a person, work better when you integrate all the data coming from your five senses into one picture of your world, to give one answer to your visitors you will need to tap into many different sources with different constraints.
The future is integrated, and the systems built from the ground up to be integrated and to communicate well with others will win the race. Today I see three very promising components:
- Mongrel2, more than a web server, a data delivery hub;
- MongoDB, with the potential to become the new MySQL, as it is insanely easy to store vast amounts of complex data in it. You will outgrow it, but it will bring you a long way toward web scale;
- ØMQ, the high performance hub infrastructure (a minimal sketch of this kind of glue follows below).
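As a sketch of what that glue looks like, here is a minimal request/reply pair with pyzmq. The endpoint, port, and message contents are arbitrary choices for the example, not anything prescribed by ØMQ:

```python
# Minimal ØMQ request/reply sketch (pyzmq): one component exposes a service,
# another talks to it, regardless of what language or datastore sits behind each.
import threading
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"   # arbitrary port chosen for the example

def echo_service() -> None:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)      # reply side: receives a request, sends an answer
    sock.bind(ENDPOINT)
    msg = sock.recv_string()
    sock.send_string(f"served: {msg}")

threading.Thread(target=echo_service, daemon=True).start()

ctx = zmq.Context.instance()
client = ctx.socket(zmq.REQ)        # request side: any component can be a client
client.connect(ENDPOINT)
client.send_string("timeline for user 42")
print(client.recv_string())
```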
I do not think in terms of a silver bullet (and when you pose a challenge like the 1:10 ratio, you tend to ask for a silver bullet answer); I think in terms of well integrated diversity. It is a bit harder to manage, but it adds flexibility and freedom. That is where the challenges lie, not really in finding another perfect hammer.