{"id":647183,"date":"2013-03-17T17:16:11","date_gmt":"2013-03-17T21:16:11","guid":{"rendered":"http:\/\/gigaom.com\/?p=621428"},"modified":"2013-03-17T17:16:11","modified_gmt":"2013-03-17T21:16:11","slug":"can-we-please-stop-saying-unstructured-data","status":"publish","type":"post","link":"https:\/\/mereja.media\/index\/647183","title":{"rendered":"Can we please stop saying \u201cunstructured\u201d data?"},"content":{"rendered":"<p>Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn\u2019t fit neatly into tidy little rows and columns these days, can we please stop calling it \u201cunstructured\u201d? I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on search and text processing (including <a href=\"http:\/\/www.amazon.com\/Taming-Text-Find-Organize-Manipulate\/dp\/193398838X\">writing a book on the subject<\/a>) I can\u2019t help but feel that it\u2019s time for the word \u201cunstructured\u201d to be retired and for us to find a better term to describe all of this stuff spewing from us and our computational creations.<\/p>\n<p>Why all the (somewhat tongue-in-cheek) vitriol towards such a simple word? When I\u2019m feeling cynical, I think that, in the early days of databases, someone coined \u201cunstructured\u201d as a derogatory term to mean \u201call the stuff a database isn\u2019t good at working on.\u201d If \u201cstructured\u201d is good, then \u201cun\u201d-structured must be bad, right? 
The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it, and not just so we can better target ads to consumers. It\u2019s too full of promise to describe with such a dismissive word as \u201cunstructured.\u201d Numerical data? Child\u2019s play! Text? Now there\u2019s a real challenge.<\/p>\n<p>Text is easily one of the most highly structured data types we face, filled with misspellings, misdirection, flowery language, ambiguity and implicit knowledge. Text is so often misunderstood that researchers in the field even have a metric (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Inter-rater_agreement\">inter-annotator agreement<\/a>) that tracks how often two people examining the same piece of text agree on the answer to some question about the text. Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can\u2019t, for whatever reason, write a complete sentence. I hate them and love them all, all at the same time. How good do you think a computer can be at parsing a single sentence that spans multiple pages, much less at making sense of a sentence that doesn\u2019t even follow basic grammar rules?<\/p>\n<p>Sure, we\u2019ve made great progress, especially in recent years, in dealing with rich data like text, and big data and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Deep_learning\">deep learning<\/a> techniques hold even more promise for unlocking some of the mystery. We can now detect the end of a sentence with a high degree of accuracy, find sentiment in tweets and locate the mentions of people on a page, just like your average fifth grader!<\/p>\n<p>Yet, despite all of these advances, we need at least an order-of-magnitude advance (if not two or more) in our ability to process rich data across a variety of domains for us to truly harness the opportunity this data presents in moving civilization forward. 
I hope you\u2019ll forgive me if the word \u201cunstructured\u201d leaves me feeling a bit empty inside when I think about that opportunity and the lack of inspiration it provides to potential contributors. As for me, I\u2019ll start by calling it \u201crich data\u201d from here on out, windmills be damned.<\/p>\n<p><i>Grant Ingersoll, who will be speaking at <\/i><a href=\"http:\/\/event.gigaom.com\/structuredata\/?utm_source=data&#38;utm_medium=editorial&#038;%2338;utm_campaign=intext&#038;%2338;utm_term=621428+can-we-please-stop-saying-unstructured-data&#038;%2338;utm_content=gigaguest\"><i>Structure:Data on March 20-21<\/i><\/a><i>, is the CTO and cofounder of LucidWorks<\/i><i>. He also coauthored <\/i><a href=\"http:\/\/www.amazon.com\/Taming-Text-Find-Organize-Manipulate\/dp\/193398838X\">Taming Text<\/a><i>, cofounded Apache Mahout and is a long-standing committer on the Apache Lucene and Solr open source projects.<\/i><i> He\u2019s engineered a variety of search, question-answering and natural-language processing applications. You can follow Grant on Twitter <a href=\"http:\/\/twitter.com\/gsingers\">@gsingers<\/a>.<\/i><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn\u2019t fit neatly into tidy little rows and columns these days, can we please stop calling it \u201cunstructured\u201d? 
I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on [&hellip;]<\/p>\n","protected":false},"author":7802,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-647183","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/647183","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/users\/7802"}],"replies":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/comments?post=647183"}],"version-history":[{"count":0,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/647183\/revisions"}],"wp:attachment":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/media?parent=647183"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/categories?post=647183"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/tags?post=647183"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}