{"id":645010,"date":"2013-03-03T15:00:13","date_gmt":"2013-03-03T20:00:13","guid":{"rendered":"http:\/\/gigaom.com\/?p=616112"},"modified":"2013-03-03T15:00:13","modified_gmt":"2013-03-03T20:00:13","slug":"how-and-why-linkedin-is-becoming-an-engineering-powerhouse","status":"publish","type":"post","link":"https:\/\/mereja.media\/index\/645010","title":{"rendered":"How and why LinkedIn is becoming an engineering powerhouse"},"content":{"rendered":"<p>Most LinkedIn users know \u201cPeople You May Know\u201d as one of that site\u2019s flagship features \u2014 an omnipresent reminder of other LinkedIn users with whom you probably want to connect. Keeping it up to date and accurate requires some heady data science and impressive engineering to keep data constantly flowing between the various LinkedIn applications. When Jay Kreps started there five years ago, this wasn\u2019t exactly the case.<\/p>\n<p>\u201cI was here essentially before we had any infrastructure,\u201d Kreps, now principal staff engineer, told me during a recent visit to LinkedIn\u2019s Mountain View, Calif., campus.\u00a0He actually came to LinkedIn to do data science, thinking the company would have some of the best data around, but it turned out the company had an infrastructure problem that needed his attention instead.<\/p>\n<p>How big? The version of People You May Know in place then was running on a single Oracle database instance \u2014 a few scripts and heuristics provided intelligence \u2014 and it took six weeks to update (longer if the update job crashed and had to restart).\u00a0And that\u2019s only if it worked. 
At one point, Kreps said, the system wasn\u2019t working for six months.<\/p>\n<p>When the scale of data began to overload the server, the answer wasn\u2019t to add more nodes but to cut out some of the matching heuristics that required too much compute power.<\/p>\n<p>So, instead of writing algorithms to make People You May Know more accurate, he worked on getting LinkedIn\u2019s Hadoop infrastructure in place and built a distributed database called <a href=\"http:\/\/data.linkedin.com\/opensource\/voldemort\">Voldemort<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" alt=\"tracking_high_level\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/tracking_high_level.png?w=300&#038;h=230\" width=\"300\" height=\"230\" class=\"alignright size-medium wp-image-616287\">Since then, he\u2019s built <a href=\"http:\/\/data.linkedin.com\/opensource\/azkaban\">Azkaban<\/a>, an open source scheduler for batch processes such as Hadoop jobs, and <a href=\"http:\/\/data.linkedin.com\/opensource\/kafka\">Kafka<\/a>, another open source tool that Kreps called \u201cthe big data equivalent of a message broker.\u201d At a high level, Kafka is responsible for managing the company\u2019s real-time data and getting those hundreds of feeds to the apps that subscribe to them with minimal latency.<\/p>\n<h2 id=\"espresso-anyone\">Espresso, anyone?<\/h2>\n<p>But Kreps\u2019s work is just a fraction of the new data infrastructure that LinkedIn has built since he came on board. 
It\u2019s all part of a mission to create a data environment at LinkedIn that\u2019s as innovative as that of any other web company around, and that means the company\u2019s applications developers and data scientists can keep building whatever products they dream up.<\/p>\n<p>Bhaskar Ghosh, LinkedIn\u2019s senior director of data infrastructure engineering \u2014 who\u2019ll be part of our guru panel at <a href=\"http:\/\/event.gigaom.com\/structuredata\/?utm_source=data&#38;utm_medium=editorial&#038;%2338;utm_campaign=intext&#038;%2338;utm_term=616112+how-and-why-linkedin-is-becoming-an-engineering-powerhouse&#038;%2338;utm_content=dharrisstructure\">Structure: Data on March 20-21<\/a> \u2014 can\u2019t help but find his way to the whiteboard when he gets to discussing what his team has built. It\u2019s a three-phase data architecture composed of online, offline and nearline systems, each designed for specific workloads. The online systems handle users\u2019 real-time interactions; offline systems, primarily Hadoop and a Teradata warehouse, handle batch processing and analytic workloads; and nearline systems handle features such as People You May Know, search and the LinkedIn social graph, which update constantly but require slightly less than online latency.<\/p>\n<div id=\"attachment_616138\" class=\"wp-caption aligncenter\" style=\"width: 718px\"><a href=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/20130226_145754.jpg?w=708\"><img loading=\"lazy\" decoding=\"async\" alt=\"Ghosh's diagram of LinkedIn's data architecture\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/20130226_145754.jpg?w=708&#038;h=531\" width=\"708\" height=\"531\" class=\"size-large wp-image-616138\"><\/a><\/p>\n<p class=\"wp-caption-text\">Ghosh\u2019s diagram of LinkedIn\u2019s data architecture<\/p>\n<\/div>\n<p>One of the most important things the company has built is a new database system called <a href=\"http:\/\/data.linkedin.com\/projects\/espresso\">Espresso<\/a>. 
Unlike Voldemort, which is an eventually consistent key-value store modeled after Amazon\u2019s Dynamo database and used to serve certain data at high speeds, Espresso is a transactionally consistent document store that\u2019s going to replace legacy Oracle databases across the company\u2019s web operations. It was originally designed to provide a usability boost for LinkedIn\u2019s InMail messaging service, and the company\u00a0plans to open source Espresso later this year.<\/p>\n<p>According to Director of Engineering Bob Schulman, Espresso came to be \u201cbecause we had a problem that had to do with scaling and agility\u201d in the mailbox feature. It needs to store lots of data and stay consistent with users\u2019 activity. It also needs a functional search engine so users \u2014 even those with lots of messages \u2014 can find what they need in a hurry.<\/p>\n<p>With the previous data layer intact, he explained, developers had to solve scalability and reliability issues in the application itself.<\/p>\n<p>However, Principal Software Architect Shirshanka Das\u00a0noted, \u201ctrying to scale [your] way out of a problem\u201d with code isn\u2019t necessarily a long-term strategy. 
\u201cThose things tend to burn out teams and people very quickly,\u201d he said, \u201cand you\u2019re never sure when you\u2019re going to meet your next cliff.\u201d<\/p>\n<div id=\"attachment_616132\" class=\"wp-caption aligncenter\" style=\"width: 718px\"><img loading=\"lazy\" decoding=\"async\" alt=\"L to R: Kreps, Shirshanka Das, Bhaskar Ghosh, Bob Schulman\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/20130226_153554-e1362319622641.jpg?w=708&#038;h=471\" width=\"708\" height=\"471\" class=\"size-large wp-image-616132\"><\/p>\n<p class=\"wp-caption-text\">L to R: Kreps, Das, Ghosh and Schulman<\/p>\n<\/div>\n<p>Schulman and Das have also worked together on technologies such as <a href=\"http:\/\/data.linkedin.com\/opensource\/helix\">Helix<\/a> \u2014 an open-source cluster management framework for distributed systems \u2014 and Databus. The latter, which has been around since 2007 and <a href=\"http:\/\/engineering.linkedin.com\/data-replication\/open-sourcing-databus-linkedins-low-latency-change-data-capture-system\">the company just open sourced<\/a>, is a tool that pushes changes in what Das calls \u201csource of truth\u201d data environments like Espresso to downstream environments such as Hadoop so that everyone can ensure they\u2019re working with the freshest data.<\/p>\n<p>In an agile environment, Schulman said, it\u2019s important to be able to change something without breaking something else. 
The alternative is to bring stuff down to make changes, he added, and \u201cit\u2019s never a good time to stop the world.\u201d<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" alt=\"databus-usecases\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/databus-usecases.jpg?w=708&#038;h=243\" width=\"708\" height=\"243\" class=\"aligncenter size-large wp-image-616291\"><\/p>\n<h2 id=\"next-up-hadoop\">Next up, Hadoop<\/h2>\n<p>Thus far, LinkedIn\u2019s biggest push has been in improving its nearline and online systems (\u201cBasically, we\u2019ve hit the ball out of the park here,\u201d Ghosh said), so its next big push is offline \u2014 Hadoop, in particular. The company already uses Hadoop for the usual gamut of workloads \u2014 ETL, model-building, exploratory analytics and pre-computing data for nearline applications \u2014 and Ghosh wants to take it even further.<\/p>\n<p>He laid out a multipart vision, most of which centers around tight integration between the company\u2019s Hadoop clusters and relational database systems. Among the goals: better ETL frameworks, ad-hoc queries, alternative storage formats and an integrated metadata framework \u2014 which Ghosh calls the holy grail \u2014 that will make it easier for various analytic systems to use each other\u2019s data. He said LinkedIn has something half-built that should be finished this year.<\/p>\n<p>\u201c[<a href=\"http:\/\/gigaom.com\/2013\/02\/21\/sql-is-whats-next-for-hadoop-heres-whos-doing-it\/\">SQL on Hadoop<\/a>] is going to take two years to work,\u201d he explained. \u201cWhat do we do in the meanwhile? We cannot throw this out.\u201d<\/p>\n<p>Actually, the whole of LinkedIn\u2019s data engineering efforts right now put a focus on building services that can work together easily, Das said. 
The Espresso API, for example, allows developers to connect a columnar storage engine and do some limited online analytics right from within the transactional database.<\/p>\n<div id=\"attachment_616137\" class=\"wp-caption aligncenter\" style=\"width: 718px\"><a href=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/20130226_153409.jpg?w=708\"><img loading=\"lazy\" decoding=\"async\" alt=\"With Hadoop plans laid out\" src=\"http:\/\/gigaom2.files.wordpress.com\/2013\/03\/20130226_153409.jpg?w=708&#038;h=531\" width=\"708\" height=\"531\" class=\"size-large wp-image-616137\"><\/a><\/p>\n<p class=\"wp-caption-text\">With Hadoop plans laid out.<\/p>\n<\/div>\n<h2 id=\"good-infrastructure-makes-for-\">Good infrastructure makes for happy data scientists<\/h2>\n<p>Yael Garten, a senior data scientist at LinkedIn, said better infrastructure makes her job a lot easier. Like Kreps, she was drawn to LinkedIn (from her previous career doing bioinformatics research at Stanford) because the company has so much interesting data to work with, only she was fortunate enough to miss the early days of spotty infrastructure that couldn\u2019t handle 10 million users, much less today\u2019s more than 200 million users. To date, she said, she hasn\u2019t come across a problem she couldn\u2019t solve because the infrastructure couldn\u2019t handle the scale.<\/p>\n<p>The data science team embeds itself with the product team and they work together to either prove out product managers\u2019 hunches or build products around data scientists\u2019 findings. In 2013, Garten said, developers should expect infrastructure that lets them prototype applications and test ideas in near real time. 
And even business managers need to see analytics as close to real time as possible so they can monitor how new applications are performing.<\/p>\n<p>And infrastructure isn\u2019t just about making things faster, she noted: \u201cSome things wouldn\u2019t be possible.\u201d She wouldn\u2019t go into detail about what this magic piece of infrastructure is, but I\u2019ll assume it\u2019s the company\u2019s top-secret distributed graph system. Ghosh was happy to go into detail about a lot of things, but not that one.<\/p>\n<h2 id=\"a-virtuous-hamster-wheel\">A virtuous hamster wheel<\/h2>\n<p>Neither Ghosh nor Kreps sees LinkedIn \u2014 or any leading web company, for that matter \u2014 quitting the innovation game any time soon. Partially, this is a business decision. Ghosh, for example, cites the positive impact on company culture and talent recruitment, while Kreps points out the difficult total-cost-of-ownership math when comparing paying for software licenses or hiring open source committers versus just building something internally.<\/p>\n<p>Kreps acknowledged that the constant cycle of building new systems is \u201ckind of a hamster wheel,\u201d but there\u2019s always an opportunity to do new stuff and build products with their own unique needs. Initially, for example, he envisioned two target use cases for Hadoop but now the company has about 300 individual workloads; it went from two real-time data feeds to 650.<\/p>\n<p>\u201cBut companies are doing this for a reason,\u201d he said. \u201cThere is some problem this solves.\u201d<\/p>\n<p>Ghosh, well, he shot down the idea of relying too heavily on commercial technologies or existing open source projects almost as soon as he suggested it\u2019s a possibility. 
\u201cWe think very carefully about where we should do rocket science,\u201d he told me, before quickly adding, \u201c[but] you don\u2019t want to become a systems integration shop.\u201d<\/p>\n<p>In fact, he said, there will be a lot more development and a lot more open source activity from LinkedIn this year: \u201c[I&#8217;m already] thinking about the next two or three big hammers.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most LinkedIn users know \u201cPeople You May Know\u201d as one of that site\u2019s flagship features \u2014 an onmipresent reminder of other LinkedIn users with whom you probably want to connect. Keeping it up to date and accurate requires some heady data science and impressive engineering to keep data constantly flowing between the various LinkedIn applications. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-645010","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/645010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/comments?post=645010"}],"version-history":[{"count":0,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/645010\/revisions"}],"wp:attachment":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/media?parent=645010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/categories?post=645010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/tags?post=645010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}