{"id":642985,"date":"2013-02-19T13:59:40","date_gmt":"2013-02-19T18:59:40","guid":{"rendered":"http:\/\/gigaom.com\/?p=611709"},"modified":"2013-02-19T13:59:40","modified_gmt":"2013-02-19T18:59:40","slug":"citusdb-today-sql-on-hadoop-tomorrow-the-world","status":"publish","type":"post","link":"https:\/\/mereja.media\/index\/642985","title":{"rendered":"CitusDB: Today, SQL on Hadoop. Tomorrow, the world!"},"content":{"rendered":"<p>Database startup <a href=\"http:\/\/www.citusdata.com\/\">Citus Data<\/a> on Tuesday joined those trying to enable fast SQL queries on Hadoop data, but it has much larger goals. It thinks it can be the only analytic database that anyone needs, able to query data wherever it\u2019s stored across a company\u2019s environment \u2014 in relational databases, Hadoop, MongoDB, Amazon S3 and elsewhere.<\/p>\n<p>Big data has opened companies\u2019 eyes to the importance of analytics and alternative data stores, but combining the two often means learning new languages, using multiple tools and probably sacrificing the performance they\u2019re used to from analytic platforms.<\/p>\n<p>Citus Data\u2019s flagship product, called CitusDB, is actually built atop PostgreSQL and its first iteration was designed for <a href=\"http:\/\/gigaom.com\/2012\/05\/01\/google-opens-up-its-biq-query-data-analytics-service-to-all\/\">Google Dremel-like scale and speed<\/a> on relational data. Thanks to a feature called \u201cforeign data wrappers,\u201d though, it\u2019s able to run SQL on numerous data types (e.g., CSV, log and JSON files) that don\u2019t comport with how Postgres formats data natively. 
So, while CitusDB now officially supports the Hadoop Distributed File System in addition to Postgres, it is by no means limited to them.<\/p>\n<p>Matt Ocko, managing partner at <a href=\"http:\/\/gigaom.com\/2012\/08\/09\/big-data-vc-firm-data-collective-steps-out-of-the-shadows\/\">Data Collective<\/a> and one of Citus Data\u2019s early investors, says the database can technically support any data source with an ODBC driver, and could even query something like log files straight from a data store. In fact, Citus is working on extending its support to MongoDB \u2014 a capability that\u2019s in beta right now. Ocko is also particularly impressed with CitusDB\u2019s ability to act like a fabric connecting all these data sources rather than making users query each independently and then manually join the data. He cited a demonstration in which CitusDB carried out a query that required executing a join across Postgres and Hadoop.<\/p>\n<p>The other big thing about CitusDB is that it\u2019s not just flexible but fast, too. Ocko said CitusDB has outperformed Oracle\u2019s vaunted Exadata machine on a TPC-H benchmark test with data stored primarily on hard disk. That Postgres-Hadoop query he referenced completed in just a few seconds while running on the Amazon EC2 cloud.<\/p>\n<p>CitusDB is so fast, Citus co-founder Umur Cubukcu told me, because of how it\u2019s architected. It moves the computation to where the data is rather than trying to move data across the network, and it has some impressive load-balancing and resource-management abilities baked in. If, for example, it needs data housed on a slow-running node in order to complete a task, the software will look for that data elsewhere rather than just wait for the congested resource to free up.<\/p>\n<p>In the case of Hadoop, MapReduce brings the computation to the data, too, but every job requires a scan over the entire dataset. 
This is why early SQL-on-Hadoop tools such as Hive <a href=\"http:\/\/drawntoscale.com\/is-there-a-database-in-big-data-heaven-understanding-the-world-of-sql-on-hadoop\/\">are still relatively slow<\/a>. Citus software engineer Carl Steinbach, who came to the company from Cloudera, said CitusDB is between 3 and 20 times faster than Hive depending on the query type.<\/p>\n<p>It\u2019s actually much faster for short queries that might be typical in an interactive environment, but he acknowledged those aren\u2019t really what Hive was designed to do.<\/p>\n<p><a href=\"http:\/\/gigaom2.files.wordpress.com\/2013\/02\/citus_hadoop_architecture.png\"><img decoding=\"async\" alt=\"Citus_Hadoop_Architecture\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/02\/citus_hadoop_architecture.png?w=708\" class=\"aligncenter size-full wp-image-611818\"><\/a><\/p>\n<p>Rather, CitusDB\u2019s real competition is the spate of SQL-on-Hadoop projects, products and startups of which it\u2019s now a part. We\u2019ll have a whole session dedicated to this topic at <a href=\"http:\/\/event.gigaom.com\/structuredata\/?utm_source=data&#38;utm_medium=editorial&#038;%2338;utm_campaign=intext&#038;%2338;utm_term=611709+citusdb-today-sql-on-hadoop-tomorrow-the-world&#038;%2338;utm_content=dharrisstructure\">Structure: Data<\/a> next month, and there isn\u2019t enough room for everything on the market right now \u2014 <a href=\"http:\/\/gigaom.com\/2012\/10\/17\/batten-down-the-analysts-its-a-big-data-bi-storm\/\">Aster Data<\/a>, <a href=\"http:\/\/gigaom.com\/2012\/11\/13\/plotting-a-bi-coup-hadoop-startup-platfora-raises-20m\/\">Platfora<\/a>, Cloudera (<a href=\"http:\/\/gigaom.com\/2012\/10\/24\/cloudera-makes-sql-a-first-class-citizen-in-hadoop\/\">with Impala<\/a>), <a href=\"http:\/\/gigaom.com\/2012\/08\/17\/for-fast-interactive-hadoop-queries-drill-may-be-the-answer\/\">Apache Drill<\/a>, <a 
href=\"http:\/\/gigaom.com\/2012\/07\/24\/how-one-startup-wants-to-inject-hadoop-into-your-sql\/\">Drawn to Scale<\/a> and <a href=\"http:\/\/gigaom.com\/2012\/10\/16\/hadapt-does-big-love-for-big-data-and-hints-at-hadoops-future\/\">Hadapt<\/a>, to name several.<\/p>\n<p>These are impressive technologies (at least in theory where they\u2019re still under development), and Citus would be remiss to ignore them. But, aside from the ability to query multiple data sources, the company has something the others don\u2019t, Cubukcu said: It has the Postgres community and all the features they\u2019ve built into that database already. Things like connectors, authentication, full-text search and PostGIS for geospatial data that go beyond just running fast queries.<\/p>\n<p>\u201cWhen you\u2019re talking about an enterprise-class database,\u201d Steinbach said, \u201cyou\u2019re talking about more than a query execution engine.\u201d<\/p>\n<p><strong>Related research and analysis from GigaOM Pro:<\/strong><br \/>Subscriber content. 
<a href=\"http:\/\/pro.gigaom.com\/?utm_source=data&#038;utm_medium=editorial&#038;utm_campaign=auto3&#038;utm_term=611709+citusdb-today-sql-on-hadoop-tomorrow-the-world&#038;utm_content=dharrisstructure\">Sign up for a free trial<\/a>.<\/p>\n<ul>\n<li><a href=\"http:\/\/pro.gigaom.com\/2012\/05\/the-importance-of-putting-the-u-and-i-in-visualization\/?utm_source=data&#038;utm_medium=editorial&#038;utm_campaign=auto3&#038;utm_term=611709+citusdb-today-sql-on-hadoop-tomorrow-the-world&#038;utm_content=dharrisstructure\">The importance of putting the U and I in visualization<\/a><\/li>\n<li><a href=\"http:\/\/pro.gigaom.com\/2012\/04\/infrastructure-q1-cloud-and-big-data-woo-the-enterprise\/?utm_source=data&#038;utm_medium=editorial&#038;utm_campaign=auto3&#038;utm_term=611709+citusdb-today-sql-on-hadoop-tomorrow-the-world&#038;utm_content=dharrisstructure\">Infrastructure Q1: Cloud and big data woo enterprises<\/a><\/li>\n<li><a href=\"http:\/\/pro.gigaom.com\/2012\/11\/real-%C2%ADtime-query-for-hadoop-democratizes-access-to-big-data-analytics\/?utm_source=data&#038;utm_medium=editorial&#038;utm_campaign=auto3&#038;utm_term=611709+citusdb-today-sql-on-hadoop-tomorrow-the-world&#038;utm_content=dharrisstructure\">Real-time query for Hadoop democratizes access to big data analytics<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Database startup Citus Data on Tuesday joined those trying to enable fast SQL queries on Hadoop data, but it has much larger goals. 
It thinks it can be the only analytic database that anyone needs, able to query data wherever it\u2019s stored across a company\u2019s environment \u2014 in relational databases, Hadoop, MongoDB, Amazon S3 and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-642985","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/642985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/comments?post=642985"}],"version-history":[{"count":0,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/642985\/revisions"}],"wp:attachment":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/media?parent=642985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/categories?post=642985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/tags?post=642985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}