{"id":656152,"date":"2013-05-04T13:30:25","date_gmt":"2013-05-04T17:30:25","guid":{"rendered":"http:\/\/gigaom.com\/?p=642005"},"modified":"2013-05-04T13:30:25","modified_gmt":"2013-05-04T17:30:25","slug":"careful-your-big-data-analytics-may-be-polluted-by-data-scientist-bias","status":"publish","type":"post","link":"https:\/\/mereja.media\/index\/656152","title":{"rendered":"Careful: Your big data analytics may be polluted by data scientist bias"},"content":{"rendered":"<p>Expectations surrounding the future of <a href=\"http:\/\/bl-1.com\/click\/load\/VmQKO1E1AjNUN1A0Umw-b0231\">big data<\/a> range from the just huge to absolutely enormous \u2013 a reflection perhaps of both its real inherent potential and all the massive hype. Certainly though there is no dispute that companies can reap big benefits from exploring patterns found in the data they already generate and collect. Further, depending on the algorithms used, machine learning can even serve as a real-world crystal ball. Among countless examples, two stand out: <a href=\"http:\/\/bl-1.com\/click\/load\/AzFdbAZiV2ZfPFI2V2g-b0231\">Target\u2019s ability to predict pregnancies<\/a> by analyzing customers\u2019 purchasing patterns, and how well-known statistician <a href=\"http:\/\/bl-1.com\/click\/load\/U2EKO1QwVmdXNFcyBDI-b0231\">Nate Silver predicted the winner in all 50 states during last November\u2019s presidential election<\/a>.<\/p>\n<p>But the fact remains that big data can only ever be as good as the machine learning used to extract insight from it, and even the most sophisticated machine learning techniques aren\u2019t omniscient \u2013 the old adage &#8220;garbage in, garbage out&#8221; sums up this dilemma perfectly. 
Businesses planning to invest in big data science, in the hope of reaping the potential wealth of insights available, must at all costs avoid introducing bias into the process \u2013 or risk jeopardizing everything.<\/p>\n<h2 id=\"data-bias-syndrome\">Data bias syndrome<\/h2>\n<p>Data bias comes in many forms. It can come from poorly defined business objectives. It can come from opting to gather data that are easy to collect rather than data that are most informative. Data scientists can also receive data that have already been biased by domain experts\u2019 incorrect assumptions. (As a footnote, the recent <a href=\"http:\/\/www.newscientist.com\/article\/dn23448-how-to-stop-excel-errors-driving-austerity-economics.html\">austerity economics Excel scandal<\/a> shows how a minute data error can have cascading and devastating effects.)<\/p>\n<p>Likewise, data scientists themselves are not immune to bias. Some run afoul of their own preconceived notions about the business domain \u2013 too much knowledge can lead one to filter out data that may actually be helpful. Scientists with deep experience in a particular data set may come to rely too heavily on pre-existing algorithms without re-examining their validity for a particular use case.<\/p>\n<p>Finally, data quantity is a common problem. Intelligent learning requires abundant data, and often the data available are not sufficient to draw accurate conclusions \u2013 a problem known as data sparsity. This may sound unbelievable considering that data volume is doubling every two years, according to an <a href=\"http:\/\/bl-1.com\/click\/load\/BjReb1A0UGFePVYzV2A-b0231\">EMC study<\/a>, but there\u2019s a difference between a dense data set populated by similar data points and the far more diverse sets of user data points we find in the real world. 
In these cases, the gaps in the data are filled by machine learning algorithms that may themselves be biased by the assumptions the data scientist made when designing them. The trick is to find the right balance between unbiased data exploration and data exploitation.<\/p>\n<h2 id=\"removing-bias\">Removing bias<\/h2>\n<p>As companies bring data science in-house or purchase tools that act as a data abstraction layer, the need to address data bias becomes more immediate. The smart move is to build bias-quelling tactics into the data science process itself. Here\u2019s how:<\/p>\n<ul>\n<li><b>Employ domain experts.<\/b> Rely on them to help select relevant data and explore which features, inputs and outputs produce the best results. If heuristics are used to gain insights into smaller data sets, the data scientist should work with the domain expert to test the heuristics and ensure they actually produce better results. Like a pitcher and catcher in a baseball game, they are on the same team, with the same goal, but each brings different skills to complementary roles.<\/li>\n<li><b>Look for white spaces.<\/b> Data scientists who work with one data set for long periods of time risk complacency, making it easier to introduce bias that reinforces preconceived notions. Don\u2019t settle for what you have; instead, look for the \u201cwhite spaces\u201d in your data sets and search for alternate sources to supplement \u201csparse data.\u201d<\/li>\n<li><b>Open a feedback loop.<\/b> This will help data scientists react to changing business requirements with modified models that can be accurately applied to the new business conditions. 
Applying Lean Startup-style <a href=\"http:\/\/bl-1.com\/click\/load\/AzFaawRgVmcFZlE0Umc-b0231\">continuous delivery<\/a> methodologies to your big data approach will help you keep your model fresh.<\/li>\n<li><b>Encourage your data scientists to explore.<\/b> If you can afford your own team of data scientists, be sure they have the space and autonomy to explore freely. Some <a href=\"http:\/\/bl-1.com\/click\/load\/U2ENPFE1ATBfPFYzUmA-b0231\">equate big data to the solar system<\/a>, so get out there and explore this uncharted universe!<\/li>\n<\/ul>\n<p>Whatever you do, don\u2019t ignore the issue: The last thing you want is to implement a system that develops and propagates data, only to learn it&#8217;s hopelessly biased. If you don\u2019t solve this problem sooner rather than later, your organization will miss out on what many analysts are calling the next frontier for innovation.<\/p>\n<p><em>Haowen Chan is currently a principal scientist at <a href=\"http:\/\/www.baynote.com\/\">Baynote<\/a>, a provider of personalization solutions for online retailers. Robin D. Morris is a senior data scientist at Baynote; he is also associate adjunct professor in the Department of Applied Math and Statistics at the University of California, Santa Cruz.<\/em><\/p>\n<p><em>Have an idea for a post you\u2019d like to contribute to GigaOm? 
Click\u00a0<a href=\"http:\/\/gigaom.com\/2012\/11\/28\/have-an-idea-for-a-great-guest-post-heres-what-you-need-to-know\/\">here for our guidelines<\/a>\u00a0and contact info.<\/em><\/p>\n<p><em>Photo courtesy pzAxe\/Shutterstock.com.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Expectations surrounding the future of \u00a0big data range from the just huge to absolutely enormous \u2013 a reflection perhaps of both its real inherent potential and all the massive hype. Certainly though there is no dispute that companies can reap big benefits from exploring patterns found in the data they already generate and collect. 
Further, [&hellip;]<\/p>\n","protected":false},"author":8227,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-656152","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/656152","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/users\/8227"}],"replies":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/comments?post=656152"}],"version-history":[{"count":0,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/656152\/revisions"}],"wp:attachment":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/media?parent=656152"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/categories?post=656152"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/tags?post=656152"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}