Why I Decided Not to Use Cassandra
Recently I started working on a project in my spare time. I hope it will eventually handle large amounts of data, so I started thinking about scalability very early. I built something like a prototype with MySQL as the data store and Hibernate as the ORM, and then decided to move to Cassandra in order to solve all my future scalability problems.
First I have to clarify that my project is far from complete and has not even been stress-tested, so you are free to call me a “premature optimizer”. But in order to save myself from “sharding hell”, I decided to pick a NoSQL solution. I reviewed MongoDB, HBase, Cassandra and a few more, and decided Cassandra would best fit my case.
Cassandra is a great product that is developing fast. But after a week or so of trying to modify my DAO layer to use Cassandra rather than Hibernate + MySQL, I decided it’s not the time. Why?
- Ease of use – that’s the key aspect. I’m aware that scalability comes at a price, so with Cassandra your possibilities are somewhat limited (yes, it is simple and hence scalable). Secondary indexes appeared only in the latest snapshots (but they did appear, so things are on the right track). The main point is that you must define your views in advance. That is, you can’t say “Oh, I want to extract these pieces of information in such a way” after the fact. You must have made this decision upfront and inserted your data in a way that makes the query possible. And defining everything upfront rarely succeeds. My data model is (currently) relatively simple, so I haven’t even experienced these difficulties to the full extent. Moreover, what I’m currently doing is more of a prototype. It’s not unlikely that, once I have defined all my functionality and made it work on an RDBMS, I will migrate to Cassandra. As Twitter is doing, actually.
- It’s a moving target. Cassandra is still at version 0.7 and is changing rapidly. So is everything that revolves around it: the API, the tools, the libraries.
- Tools. There is barely any administration tool that lets you view your “schema” and your current entries. The best I could find is cassandra-webconsole. It’s really neat, but I had to adapt it to the newest version of Cassandra before I could run it, and even then it did not show my data.
- APIs – the Thrift API is ugly and lacks some “extras”. That’s why APIs like Hector and Pelops have appeared. But they are not mature enough yet, partly because they must continuously mirror the changes in the Thrift API, and partly because they haven’t been widely used yet (because Cassandra itself is not widely used yet). The people, at least those behind Hector, are very responsive and active – I suggested (committed) a few improvements that were gladly accepted. But there is still some way to go.
- Additional frameworks – this is related to the previous point. Spring integration, for example, is important; some of my contributions to Hector were in that direction. Another thing is object mapping. We are all spoilt by ORMs, and it’s always good to work with objects, since our systems are all object-oriented (or at least we believe so). There have been some attempts at that. I started my own, helenus, and of course it’s nowhere near ready for production, or even for development use. As for indexing, there is Lucandra. I had a quick look and could not get it to handle my use of Hibernate Search (with Lucene, of course) – but I’m not a Lucene expert, so it might be just me.
I’ve listed a lot of cons above, all of which are rather understandable, and I’ll say again that things are changing and will be better in, say, six months (perhaps with the exception of the first point).
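To make the first point concrete, here is a minimal sketch of what “defining your views in advance” means. It is plain in-memory Java, not a Cassandra client: the class `MessageStore`, its two views and all method names are hypothetical and exist only for illustration. Every query the application will need gets its own structure, and every insert has to write to all of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a query-first data model (not a Cassandra API).
class MessageStore {
    // View 1, decided upfront: messages per author, sorted by timestamp.
    private final Map<String, TreeMap<Long, String>> byAuthor = new HashMap<>();
    // View 2, also decided upfront: messages per tag, sorted by timestamp.
    private final Map<String, TreeMap<Long, String>> byTag = new HashMap<>();

    // Each insert writes the same text into every view that must be queryable later.
    void insert(String author, String tag, long timestamp, String text) {
        byAuthor.computeIfAbsent(author, k -> new TreeMap<>()).put(timestamp, text);
        byTag.computeIfAbsent(tag, k -> new TreeMap<>()).put(timestamp, text);
    }

    List<String> latestByAuthor(String author, int limit) {
        return latest(byAuthor.get(author), limit);
    }

    List<String> latestByTag(String tag, int limit) {
        return latest(byTag.get(tag), limit);
    }

    // Reads one pre-built row, newest entry first, up to the given limit.
    private static List<String> latest(TreeMap<Long, String> row, int limit) {
        List<String> result = new ArrayList<>();
        if (row == null) {
            return result;
        }
        for (String text : row.descendingMap().values()) {
            if (result.size() == limit) {
                break;
            }
            result.add(text);
        }
        return result;
    }

    public static void main(String[] args) {
        MessageStore store = new MessageStore();
        store.insert("alice", "java", 1L, "first");
        store.insert("alice", "nosql", 2L, "second");
        System.out.println(store.latestByAuthor("alice", 10));
    }
}
```

A query that was not planned for, say “all messages in a given date range regardless of author”, has no view to read from here. Supporting it means adding a third structure and re-inserting the existing data, which is roughly the cost described above.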
And then I asked myself whether I need that much scalability. Twitter, I think, still uses its MySQL + memcached solution, and although it is not “a piece of cake”, it’s working for their 2 billion tweets.
And there are a lot of ways to optimize data access:
- Hibernate, for example, has lots of caching options.
- My web layer is currently designed to cache the currently active data so that data storage is not even touched for the most recent data.
- Hibernate Shards, although not updated to conform to JPA or to handle HQL queries, is still a powerful option that makes sharding transparent to the application.
- They say the problem lies in joins. One can minimize the joins even in a relational database.
- MySQL has MyISAM, InnoDB and Falcon (see benchmarks), and one can choose the most efficient engine for the case at hand.
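As an illustration of the web-layer caching mentioned in the list above, here is a minimal sketch of a small read-through cache that keeps the most recently used entries in memory and only calls the data store on a miss. The class name `RecentDataCache` and the `loader` function (standing in for a DAO call) are assumptions made for this example, not code from the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache with least-recently-used eviction.
class RecentDataCache<K, V> {
    private final int capacity;
    private final Function<K, V> loader; // stand-in for a DAO lookup
    private final Map<K, V> entries;

    RecentDataCache(int capacity, Function<K, V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry drops it once the capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > RecentDataCache.this.capacity;
            }
        };
    }

    // Returns the cached value, loading it from the data store only on a miss.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                entries.put(key, value);
            }
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        RecentDataCache<Integer, String> cache =
                new RecentDataCache<>(2, id -> "row-" + id);
        System.out.println(cache.get(1)); // loaded through the loader
        System.out.println(cache.get(1)); // served from memory
    }
}
```

This sketch ignores invalidation on writes and expiry, both of which a real cache in front of a database would need. It only shows why the storage layer is not touched for data that is currently active.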
On the plus side, I learnt a lot about scalability. The main ideas behind Cassandra – the CAP theorem, “eventual consistency”, the “storage proxy” and so on – are really important concepts that will come in handy with or without Cassandra.
(To clarify – I’m not saying Cassandra is not usable in production. It obviously is, since Facebook, Digg and Twitter are using it (or are about to). But these companies have the capacity to cope with all the drawbacks. My point is that it’s not yet a preferable option for mainstream development.)
Update: There is a nice project by SpringSource, spring-data, which aims to provide a unified interface for NoSQL stores.
// Returns the latest change date among attributes whose attributeId or
// dependantAttributeId equals the given id, or null if there is none.
private Date getLastChangeDate(Integer attributeId) {
    String[] paramNames = {"attribId"};
    Integer[] vals = {attributeId};
    List<Timestamp> rows = getHibernateTemplate().findByNamedParam(
            "select max(change.lastChanged) from Attribute a, AttributeChange change "
            + "where change.attributeId = a.attributeId "
            + "AND (a.attributeId = :attribId OR a.dependantAttributeId = :attribId)",
            paramNames, vals);
    // max() yields one row holding null when nothing matches, so check the value too.
    if (rows != null && !rows.isEmpty() && rows.get(0) != null) {
        return new Date(rows.get(0).getTime());
    }
    log.debug("no changes found, returning null");
    return null;
}
Thanks for this inspiring article. Actually, you have saved me a lot of time. I was thinking about carrying out the same tests to find out whether it makes any sense to use Cassandra in production. In fact, I already expected the API, the tools etc. to be far from mature.
Nevertheless, I am still looking forward to using Cassandra as soon as it reaches a level that makes it a real alternative.
Cheers
S.T.
Well, Cassandra is of course usable in production – after all it’s in use by Facebook and Digg at least (and Twitter?). But these sites have the time to invest in using the ugly Thrift API, and the capacity to gather a number of experts and make important decisions upfront. For general-purpose usage – not yet.
Also check code.google.com/p/kundera/
-Animesh
Most social networks (like Facebook) have stopped using MySQL as their main database and switched to Cassandra or another NoSQL DB. We can consider this change a big boost for this new open-source data store, Cassandra, which was originally developed by Facebook to solve the inbox search problem and to be fast, reliable and able to handle read and write requests at the same time.
source: Why does large Social Network projects switch to use Cassandra instead of Mysql?
Facebook uses MySQL for most of their data, and does not use Cassandra at all. They originally developed Cassandra, but do not use it. Google “facebook mysql” or look at the Wikipedia page for Cassandra for more info.
Some of the issues they have to deal with include complex sharding and maintaining consistency of data across shards.
They used to use it, I think. Probably they dropped it at some point.