Gizzard Anyone? Twitter Offers up Code for Distributed Data

Twitter last night released the code for Gizzard, an open-source framework for quickly accessing distributed, scalable data stores. It could become an important component of building out a web-based business, much as Facebook’s Cassandra project has swept through the ranks of webscale startups and even big companies.

Gizzard is a middleware networking service that sits between the front-end web site client and the database. It attempts to divide and replicate stored data in intelligent ways that allow the site to access it quickly. From the Twitter blog post:

Twitter has built several custom distributed data-stores. Many of these solutions have a lot in common, prompting us to extract the commonalities so that they would be more easily maintainable and reusable. Thus, we have extracted Gizzard, a Scala framework that makes it easy to create custom fault-tolerant, distributed databases.

Gizzard is a framework in that it offers a basic template for solving a certain class of problem. This template is not perfect for everyone’s needs but is useful for a wide variety of data storage problems. At a high level, Gizzard is a middleware networking service that manages partitioning data across arbitrary backend datastores (e.g., SQL databases, Lucene, etc.).
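
To make the idea of partitioning and replicating data across backends concrete, here is a minimal sketch in Python. It is not Gizzard's actual API or design (Gizzard itself is written in Scala); the `Backend` and `Middleware` classes, the hash-range routing, and the "write to every replica, read from the first one that answers" policy are all assumptions chosen for illustration, with in-memory dictionaries standing in for real datastores.

```python
import bisect
import zlib


class Backend:
    """Hypothetical stand-in for an arbitrary datastore (e.g., a SQL database)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self.rows[key] = value

    def get(self, key):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return self.rows.get(key)


class Middleware:
    """Routes each key to one partition and copies writes to its replicas."""

    def __init__(self, partitions):
        # partitions: list of (lower_bound, [replica backends]), sorted by
        # lower_bound, with the first lower bound equal to 0.
        self.lower_bounds = [lower for lower, _ in partitions]
        self.replica_sets = [replicas for _, replicas in partitions]

    def _replicas_for(self, key):
        # crc32 gives a stable unsigned 32-bit hash, so routing is repeatable.
        hashed = zlib.crc32(key.encode("utf-8"))
        # Pick the last partition whose lower bound is <= the hash.
        index = bisect.bisect_right(self.lower_bounds, hashed) - 1
        return self.replica_sets[index]

    def put(self, key, value):
        # Simplification: a real system would also have to cope with a
        # replica being down at write time; this sketch just raises.
        for backend in self._replicas_for(key):
            backend.put(key, value)

    def get(self, key):
        # Fall through to the next replica if one is unreachable.
        for backend in self._replicas_for(key):
            try:
                return backend.get(key)
            except ConnectionError:
                continue
        raise ConnectionError("no replica available")


# Two partitions splitting the 32-bit hash space, two replicas each.
a1, a2, b1, b2 = (Backend(name) for name in ("a1", "a2", "b1", "b2"))
store = Middleware([(0, [a1, a2]), (2**31, [b1, b2])])
store.put("user:1", "alice")
```

The calling code only ever talks to `store`; which backends hold a given key, and which replica answers a read, is decided in the middle layer. That separation is the point of the middleware approach described in the quote.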

The goal is to deliver relevant information to users faster across the huge data sets that Twitter manages. Twitter said its FlockDB distributed graph database can serve 10,000 queries per second per commodity machine using Gizzard. I heard Twitter’s Kevin Weil talk about the project a few weeks ago at SXSW, and at the time he said the company was building something to help manage distributed data sets using a Scala framework. This appears to be exactly that.

Whether Gizzard turns into another Cassandra or fizzles is open for debate, but the act of figuring out how to work with giant data sets and then sharing that information with others is an essential step in creating webscale businesses. Thus, Twitter’s decision to solve its own problem and then share its solution is beneficial for the startup community.

I’ve chatted with developers who feel that Google’s development of BigTable and its decision to keep it to itself stalled the progress of building out webscale infrastructure for a few years until Facebook opened up Cassandra. This may be sour grapes (after all, a company does not have to open up code that gives it a strategic advantage), but it does highlight how difficult it is to build code that can handle and scale for millions of users. Sharing ways to do that lowers the barriers to entry for startups, much as compute clouds such as Amazon’s EC2 or Rackspace’s CloudServers can.

So for anyone who wants some Gizzard, Twitter is happy to share.

