I attended the first NoSQL meetup in SF last Thursday and took some notes. This meeting was a presentation and discussion of distributed, non-relational database systems (DNRDBMS). Examples include Google’s Bigtable, Amazon’s Dynamo, and Yahoo!’s UDB and Sherpa. A couple of the systems presented were built on Hadoop’s HDFS.
This was an exciting meeting: the vibe was great, and everyone was supportive and interested. Lots of emphasis on architecture, features and technology, and lots of smart people. I learned a ton. This writeup is basically a narrative of my impressions during the presentations. A more structured comparison of the projects would be interesting too. The focus was mainly, but not totally, on open source implementations.
Please note that this is written from my point of view as the product manager for Sherpa, Yahoo!’s internal cloud key-value store. Despite any real or perceived snark, I was very impressed by all of the people and projects presented, and by the excitement the event generated. Kudos to Johan Oskarsson from Last.fm who put it together (from London, no less), CBS Interactive for hosting, and Digg for buying the beer.
Here at Yahoo!, we’re interested in helping out with any follow-up meetups. This is a really interesting area of research and experimentation right now, as the variety of implementations attests.
It helps to be familiar with the Bigtable and Dynamo papers before reading this summary; they explain what people are trying to do in this space.
There were 3 fairly distinct types of non-relational databases shown:
- Distributed key-value stores (Dynamo-like systems)
- Distributed column stores (Bigtable-like systems)
- Something a little different
Yahoo! has a lot of experience with key-value stores. Our venerable user database (UDB) runs on thousands of machines, in dozens of datacenters, with dynamic, record-level replication. As far as I know, we’ve never shared any architectural or performance information about the UDB externally. It will be nice to be more open about Sherpa, our next-generation key-value store.
Many of these systems are application specific; the trade-offs involved in building distributed systems (CAP, latency vs. durability, etc.) seemed driven by the needs of the application, rather than by awareness and consideration of what the trade-offs meant. For example, presenters discussed write latencies so low that there was clearly no disk access involved. No one felt the need to acknowledge that there would be data loss if the machine crashed before a sync.
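To make the latency vs. durability point concrete, here is a minimal Python sketch. It is not taken from any of the systems presented; the function names and the 100-byte record size are my own choices for illustration. A write that returns before `os.fsync` has only reached the OS page cache, so it looks very fast but can be lost if the machine crashes.

```python
import os
import tempfile
import time


def append_record(path, record, durable):
    """Append one record; wait for an fsync before returning only if durable."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()                 # push Python's buffer to the OS page cache
        if durable:
            os.fsync(f.fileno())  # block until the OS reports the data is on disk


def time_writes(path, count, durable):
    """Return the average seconds per write for `count` 100-byte appends."""
    start = time.perf_counter()
    for _ in range(count):
        append_record(path, b"x" * 100, durable)
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "log.bin")
    for durable in (False, True):
        avg = time_writes(log_path, 50, durable)
        print(f"durable={durable}: {avg * 1e6:.0f} us per write")
```

The numbers depend entirely on the disk and filesystem, but the gap between the two modes is usually large. That gap is the trade-off the low-latency claims were skipping over.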
I didn’t hear a mention of the “cloud computing” buzzword until 3:30 in the afternoon. Very few of the apps presented are actually ready for a cloud (massive scale, multi-colo) deployment, but most of the users would like them to be.
I was a little disappointed in the testing rigor of the projects. Performance testing is an underappreciated art. At Yahoo!, we’ve spent a lot of time testing to precisely characterize our system performance. There are a number of factors that have a big impact on performance, including record size, consistency model, cache hit ratio, read/write ratio, etc., and it would be good to see these reflected in system characterization.
Here are my notes from the specific speakers:
Todd Lipcon, Cloudera
Really excellent summary of many of the issues we are trying to solve with Sherpa and these systems in general. His deck is on Slideshare.net. The issues he covered:
- Partitioning schemes (does maintaining a map really not scale?)
- Data models
- Consistency models
- Conflict resolution
- Storage layout
- Cluster state management
- API (get, set, delete, multi-get)
- Automated recovery
I could also think of a few other issues that we’ve had to deal with:
- Cross datacenter replication
- Network architecture (guess no one here has melted a switch)
- Operational scale
- Cluster expansion
It would also be interesting to explore the trade-offs inherent in some of the design decisions. Maybe Yahoo! could contribute here, since we deal with these trade-offs every day.
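Going back to the partitioning question in Todd's list: the usual alternative to maintaining an explicit partition map is consistent hashing, as described in the Dynamo paper. Here is a minimal Python sketch of a hash ring. The node names, the `vnodes` count and the use of MD5 are assumptions for illustration, not details from any system presented.

```python
import bisect
import hashlib


def _hash(value):
    # Any stable hash works; MD5 is used here only for its even spread.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    keys = [f"user{i}" for i in range(1000)]
    small = HashRing(["node-a", "node-b", "node-c"])
    large = HashRing(["node-a", "node-b", "node-c", "node-d"])
    moved = sum(1 for k in keys if small.node_for(k) != large.node_for(k))
    print(f"{moved} of {len(keys)} keys moved after adding a node")
```

Only the keys that land on the new node move when the cluster grows. What you give up relative to an explicit map is direct control over where a given record lives.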
Jay Kreps, LinkedIn
Voldemort is a distributed key-value store written in Java. Jay is an interesting, laid-back guy with some strong opinions. This was one of the best things about the presentations: people said what they thought and the audience seemed to listen and understand, rather than attack (until later, when we got to the Java/C++ stuff). These opinions were definitely a memorable part of his presentation.
- Voldemort is a partitioned key-value store, more like Sherpa than Dynamo.
- Conventional RDBMS don’t scale for services
- Make model fit implementation (make it easy for developers to understand the impact of specific methods)
- Sherpa scan is a counter-example — it looks easy but crushes the system
- 90% of caches fix problems that shouldn’t exist; all caching should be in the storage layer
- Voldemort has a flexible deployment architecture that makes it possible to reduce the number of hops an operation takes (e.g., put the router in the client)
- All serialization methods suck — in 5 years we will forget about this problem. Voldemort supports a whole bunch of them
- Partitioning should handle nodes with different performance characteristics.
- Storage is pluggable.
- HTTP client doesn’t perform; they use a custom socket protocol. We’ve had the same problems with almost all HTTP clients.
- They are planning for Hadoop integration; haven’t done it yet.
- Vector clocks for conflict resolution
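For anyone who hasn't run into vector clocks, here is a minimal Python sketch of the idea. This is not Voldemort's implementation; the function names and the dict-based representation are my own assumptions. Each clock maps a node name to a counter, and two versions conflict when neither clock dominates the other.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Return 'equal', 'before', 'after' or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the application must reconcile


def merge(a, b):
    """Element-wise maximum, used to stamp the reconciled version."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


if __name__ == "__main__":
    v1 = increment({}, "A")   # first write, handled by node A
    v2 = increment(v1, "A")   # a later write through A
    v3 = increment(v1, "B")   # a write through B that never saw v2
    print(compare(v1, v2))    # v1 happened before v2
    print(compare(v2, v3))    # v2 and v3 are in conflict
    print(merge(v2, v3))
```

When `compare` reports a conflict, the store has to hand both versions back to the client or apply an application-specific rule to pick a winner.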
Avinash Lakshman, Facebook
Definitely the belle of the ball — the presenter was one of the designers of Amazon’s Dynamo, and brought a lot of those ideas to Cassandra, a Bigtable clone used for Facebook’s mail search. Avinash had a very polished presentation with a bit more rigor than many of the other presenters.
Rackable seems to be planning a cloud deployment of Cassandra in the near future.
- Highly available; eventual consistency only
- Replication knobs available (like Dynamo)
- For performance, they rely on application specific behavior — when a user positions their mouse in their inbox search box, their index gets pulled into memory before the search term is entered.
- Does not like pluggable storage models; says that app-specific optimizations are too valuable to generalize away. Specific example was a zero-copy streaming network copy enabled by a Linux call — data can be streamed from/to disks over the network. Hmmm…
- 180 node/50 TB system
- Messaging includes failure simulations (like dropping random messages)
- Future directions: ACLs (very familiar for Yahoos), transactions, compression, commutative ops, pluggable (app-specific) inconsistency reconciliation
- Simple designs are better designs.
- Test by tee-ing network traffic
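The "replication knobs" in Dynamo are N (replicas per key), W (replicas that must acknowledge a write) and R (replicas consulted on a read). The toy Python simulation below is a sketch of that idea only, not Cassandra's code. The class names are hypothetical, and a single global counter stands in for real versioning. When R + W > N, every read set overlaps the latest write set.

```python
import random


class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)


class QuorumStore:
    def __init__(self, n, r, w, seed=0):
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w
        self._rng = random.Random(seed)
        self._version = 0  # stand-in for timestamps or vector clocks

    def put(self, key, value):
        # Only W randomly chosen replicas see the write; the rest stay stale.
        self._version += 1
        for replica in self._rng.sample(self.replicas, self.w):
            replica.data[key] = (self._version, value)

    def get(self, key):
        # Ask R replicas and keep the answer with the highest version.
        answers = [
            replica.data[key]
            for replica in self._rng.sample(self.replicas, self.r)
            if key in replica.data
        ]
        return max(answers, key=lambda a: a[0])[1] if answers else None


if __name__ == "__main__":
    for r, w in ((2, 2), (1, 1)):
        store = QuorumStore(n=3, r=r, w=w)
        stale = 0
        for i in range(1000):
            store.put("k", i)
            if store.get("k") != i:
                stale += 1
        print(f"N=3 R={r} W={w}: {stale} stale reads out of 1000")
```

With R=2 and W=2 there are no stale reads. With R=1 and W=1, reads are cheaper but frequently miss the latest write, which is the eventual-consistency end of the dial.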
Cliff Moon, Microsoft (Powerset)
A Dynamo clone written in Erlang. Dynomite and CouchDB are both written in Erlang, which seems to shorten development time at the expense of performance. Cliff talked about the need to rewrite the critical bits in C. He showed some code samples. Erlang is not immediately intuitive for rusty C/Perl programmers.
Dynomite seemed to be fairly immature technology — while it’s being used in production for image-scraping Wikipedia, the cluster size was small (12 systems) and the system only ran in batch mode. System doesn’t support delete (hard in distributed environments).
Cliff’s best point was about the need for consistent APIs between layers. Clearly there is a proliferation of key-value stores and we should be thinking about standardization at some level.
- Performance at the low end of the scale: 99% reads (100 byte records) at 1 sec.
- Very configurable — lots of knobs to turn for performance/scale. Very Dynamo-esque in this way
- Lots of serialization protocols.
Ryan Rawson, StumbleUpon
Hbase is a Bigtable clone written on top of HDFS. It’s used by StumbleUpon and Powerset, among others. Hbase is Java (Hypertable is essentially the same thing written in C++). Hbase is a cool project and they’ve made impressive performance improvements on previous versions. Stumble is actually serving data off Hbase, which I always thought was prohibitively slow. Ryan began by explaining how Stumble chose Hbase: they considered Cassandra, Hypertable, and Hbase. Cassandra didn’t work, Hypertable was fast and Hbase was slow. They chose Hbase because of the great community.
- The Bigtable bells and whistles: optimized for scans, not random access; rudimentary indexes; fast sorts.
- Uses Zookeeper (w00t!)
- Says that retrieving 500 rows takes 30 ms; however, the total data size is about 1k (for all the rows). Not clear if this is average/best/99.9%, but still a major improvement over the same op in previous versions.
- Well supported operationally — monitoring, rolling upgrades, etc.
- No cross-datacenter replication — next version.
Doug Judd, Zvents
Hypertable is a Bigtable clone built on HDFS. It’s basically the same as Hbase, but written in C++. This is because, according to Doug, Java is not suitable for high-performance, memory-intensive applications. Zvents and Baidu use Hypertable extensively; Zvents for offline crawling and processing.
Hypertable is cool, but it is very similar in features to other Bigtable projects, and I was starting to get tired. If you have a deep love for C++, Hypertable may be your thing. But then, HDFS is written in Java. At least it’s not Erlang.😉
- Has request throttling to protect performance
- Group commits
- Failure inducer class
J. Chris Anderson, Couch.io
If Cassandra was the belle of the ball, CouchDB was definitely the jester. While I’m not totally sure I agree with what they are trying to do, it’s totally cool seeing someone blast away at the assumptions behind datastores in general and try to build something expressly for the web. Also, I totally recommend seeing J. Chris speak — he’s got some very interesting ideas and is both clear and entertaining.
CouchDB approaches many of the problems that we are dealing with by avoiding them. For example, the focus is totally on availability; if two records are inconsistent, it just returns both with the query and lets the app deal with it. Data is stored in append-only files. I didn’t totally catch this, but I believe the entire structure is rewritten to disk if there are any changes, with a checkpoint that allows quick recovery. The logs are periodically compacted for space. J. Chris said CouchDB should perform at about 80% of the disk performance. While time will tell about their design decisions, simpler is definitely better.
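To illustrate the general append-only idea (and only the general idea: this is not CouchDB's file format, which I did not catch in detail), here is a minimal Python sketch of a log-structured key-value store with compaction. The class name and the JSON-lines layout are assumptions made for the example.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> latest value, rebuilt by replaying the log
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    self.index[record["key"]] = record["value"]

    def put(self, key, value):
        # Updates never overwrite old bytes; they are appended as new records.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
        self.index[key] = value

    def get(self, key):
        return self.index.get(key)

    def compact(self):
        # Write only the live records to a new file, then swap it in atomically.
        tmp = self.path + ".compact"
        with open(tmp, "w", encoding="utf-8") as f:
            for key, value in self.index.items():
                f.write(json.dumps({"key": key, "value": value}) + "\n")
        os.replace(tmp, self.path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "db.log")
    store = AppendOnlyStore(path)
    for i in range(100):
        store.put("counter", i)
    print("before compaction:", os.path.getsize(path), "bytes")
    store.compact()
    print("after compaction:", os.path.getsize(path), "bytes")
```

Recovery is just a replay of the file, and writes are sequential, which is where the simplicity comes from. The cost is the space held by dead records until the next compaction.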
CouchDB is designed to run everywhere: instead of accessing data on a remote server, you just access it locally, on your phone, Mac, or toaster. All the users of the application maintain their own datastore, which is replicated very cheaply via HTTP. Super low latency with offline access. Again, I don’t know if this will work (J. Chris said that his approach to replication might be a bit simplistic), but it’s a neat idea.
Meebo has a production Couch serving deployment called Lounge. I forget what they use it for.
If any of this sounds interesting, I recommend J. Chris’s blog — it explains where Couch is coming from/going to much better than I could.
After CouchDB, we had a few more quick talks. I’ll list these with a few comments:
- Vpork: performance tester for key-value stores. Something that would be nice to see more of.
- MongoDB: As someone who has done their share of database demos, I can tell you that the MongoDB demo wasn’t super interesting. However, the database itself is very cool — they are trying to do what Sherpa is doing — provide a scalable system with as many querying capabilities as possible.
- Google Megastore: Building secondary indexes on Bigtable. The presenter said that “write-read” consistency was actually really important for web applications. (Consider the case of writing a status update, and not being able to read it.) Dynamo does not provide this level of consistency; Megastore provides consistency across Bigtable entity groups.
I also recommend the #nosql tag on Twitter. You’ll find links to videos from the event and ongoing conversation.