Recently I started working on a project in my spare time. I hope it will eventually handle large amounts of data, so I started thinking about scalability very early. I built something like a prototype with MySQL as the data store and Hibernate as the ORM, and then decided to move to Cassandra in order to solve all my future scalability problems.
First I have to clarify that my project is far from complete and has not even been stress-tested, so you are free to call me a “premature optimizer”. But in order to save myself from “the sharding hell”, I decided to select a NoSQL solution. I reviewed MongoDB, HBase, Cassandra and a few more, and decided Cassandra would best fit my case.
Cassandra is a great product that is developing fast. But after a week or so of trying to modify my DAO layer to use Cassandra rather than Hibernate + MySQL, I decided it’s not the time. Why?
- Ease of use – that’s the key aspect. I’m aware scalability comes at a price, so with Cassandra your possibilities are somewhat limited (yes, it is simple and hence scalable). Secondary indexes only appeared in the latest snapshots (but they did appear, so things are on the right track). The main point is that you must define your views in advance. I.e. you can’t say “Oh, I want to extract these pieces of information in such a way”. You must have made this decision upfront, and inserted your data in such a way that this query is possible. And defining everything upfront is rarely a successful task. My data model is (currently) relatively simple, so I haven’t even experienced these difficulties to the full extent. Moreover, what I’m currently doing is more of a prototype. It’s not unlikely that, after I have defined all my functionality and made it work using an RDBMS, I will migrate to Cassandra. As Twitter does, actually.
- It’s moving. Cassandra is still at version 0.7, and is rapidly changing. And so is everything that revolves around it – API, tools, libraries.
- Tools. There is barely an administrator tool that allows you to view your “schema”, your current entries. The best I could find is this cassandra-webconsole. It’s really neat, but I had to adapt it to the newest version of Cassandra before I was able to run it. And it was still not showing my data.
- APIs – the Thrift API is ugly. And it lacks some “extras”. That’s why APIs like Hector and Pelops appear. But they are still not mature enough, partly because they have to continuously mirror the changes in the Thrift API, partly because they haven’t been widely used yet (because Cassandra is not as widely used yet). The people, at least those behind Hector, are very responsive and active – I suggested (committed) a few improvements that were gladly accepted. But there is still some way to go.
- Additional frameworks – that’s related to the previous point. For example, Spring integration is an important thing. Some of my contributions to Hector were in that direction. Another thing is object mapping. We are all spoilt by ORMs and it’s always good to work with objects, since our systems are all object-oriented (or at least we believe so). There were some attempts at that. I started my own – helenus. And of course it’s nowhere near ready for production, or even for development use. Indexing – there is Lucandra. I had a quick look and it was not able to translate my use of Hibernate Search (with Lucene, of course) – but I’m not a Lucene expert, so it might be just me in this case.
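To make the first point (defining your views upfront) more concrete, here is a minimal sketch in plain Python. It is not Cassandra or Hector code; `MessageStore`, `messages_by_author` and the other names are made up for illustration. It only assumes a store that can look things up by row key and nothing else, so any query other than "get by id" has to be prepared at insert time.

```python
from collections import defaultdict


class MessageStore:
    """Toy stand-in for a key-based store: lookups work only by row key."""

    def __init__(self):
        # "Column family" 1: message id -> message data
        self.messages = {}
        # "Column family" 2: a hand-maintained view, author -> message ids
        self.messages_by_author = defaultdict(list)

    def insert(self, message_id, author, text):
        self.messages[message_id] = {"author": author, "text": text}
        # The view must be written together with the data. There is no
        # ad-hoc "WHERE author = ?" to fall back on later.
        self.messages_by_author[author].append(message_id)

    def by_author(self, author):
        # Possible only because insert() prepared this view in advance.
        ids = self.messages_by_author.get(author, [])
        return [self.messages[message_id]["text"] for message_id in ids]


store = MessageStore()
store.insert(1, "alice", "hello")
store.insert(2, "bob", "hi there")
store.insert(3, "alice", "bye")
print(store.by_author("alice"))
```

If I later decide I also want "all messages containing a given tag", the data already inserted is of no help: I would have to add a third structure and backfill it from everything stored so far. In a relational database the same change is usually just another query, possibly with a new index.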
I’ve listed a lot of cons above, which are rather logical, and I’ll say again that things are changing and will be better in, say, six months (perhaps with the exception of the first point).
And then I asked myself whether I need that much scalability. Twitter, I think, still uses its MySQL + memcached solution, and although it is not “a piece of cake”, it’s working for their 2 billion tweets.
And there are a lot of ways to optimize data access:
- Hibernate, for example, has lots of caching options.
- My web layer is currently designed to cache the currently active data so that data storage is not even touched for the most recent data.
- Hibernate Shards, although not updated to conform to JPA or to handle HQL queries, is still a powerful option that makes sharding transparent to the application.
- They say the problem lies in joins. One can minimize the joins even in a relational database.
- MySQL has MyISAM, InnoDB and Falcon (see benchmarks), and one can choose the most efficient engine for the case at hand.
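The caching idea from the list above (keeping the currently active data in memory so the storage is not touched) can be sketched roughly like this. It is a generic read-through cache in plain Python, not my actual web layer and not how Hibernate's caches are implemented; `RecentItemsCache`, `load_from_storage` and `capacity` are hypothetical names chosen for the example.

```python
from collections import OrderedDict


class RecentItemsCache:
    """Read-through cache that keeps only the most recently used items."""

    def __init__(self, load_from_storage, capacity=1000):
        self._load = load_from_storage   # called only on a cache miss
        self._capacity = capacity
        self._items = OrderedDict()      # ordered from least to most recent

    def get(self, key):
        if key in self._items:
            # Hit: mark as most recently used, storage is not touched.
            self._items.move_to_end(key)
            return self._items[key]
        # Miss: go to the storage once and remember the result.
        value = self._load(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value


storage_hits = []

def load_from_storage(key):
    # Stand-in for a real database query.
    storage_hits.append(key)
    return "row-%s" % key

cache = RecentItemsCache(load_from_storage, capacity=2)
cache.get(1)
cache.get(1)
print(len(storage_hits))
```

With something like this in front of the database, the hot (most recent) data is served from memory, and the relational store only sees the misses. The capacity and the eviction policy are assumptions and would need tuning against real traffic.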
On the plus side, I learnt a lot about scalability. The main ideas behind Cassandra, the CAP theorem, “eventual consistency”, the “storage proxy” and so on are really important concepts that will come in handy with or without Cassandra.
(To clarify – I’m not saying Cassandra is not usable in production – it obviously is, since Facebook, Digg and Twitter are using it (or are about to). But these companies have the capacity to cope with all the drawbacks. My point is that it’s not yet a preferable option for mainstream development)
Update: There is a nice project by SpringSource – spring-data, which has the ambition to provide a unified interface for NoSQL stores.