{"id":218446,"date":"2010-01-19T14:30:57","date_gmt":"2010-01-19T19:30:57","guid":{"rendered":"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=2059"},"modified":"2010-01-19T14:30:57","modified_gmt":"2010-01-19T19:30:57","slug":"mining-a-year-of-speech","status":"publish","type":"post","link":"https:\/\/mereja.media\/index\/218446","title":{"rendered":"Mining a year of speech"},"content":{"rendered":"<p>John Coleman was on the BBC <em>Digital Planet<\/em> program a couple of weeks ago, <a href=\"http:\/\/www.bbc.co.uk\/programmes\/p005m6zn#p005w1t1\">discussing<\/a> a recently-awarded grant from the (British\/American\/Canadian) <a href=\"http:\/\/www.diggingintodata.org\/Default.aspx\">&#8220;Digging into Data&#8221; challenge<\/a>.\u00a0 The proposal was submitted under the title &#8220;Mining a Year of Speech&#8221;, and also involves the British Library Sound Archive, and some researchers at Penn, including <a href=\"http:\/\/www.ling.upenn.edu\/~jiahong\/\">Jiahong Yuan<\/a>, <a href=\"http:\/\/www.ldc.upenn.edu\/About\/Staff\/index.shtml#chris\">Chris Cieri<\/a>, and me.\u00a0 An Oxford University press release is <a href=\"http:\/\/www.ox.ac.uk\/media\/news_stories\/2009\/091204_2.html\">here<\/a>.<\/p>\n<p><span id=\"more-2059\"><\/span>Last week, John was in Philadelphia, discussing plans for who&#8217;ll do what when.\u00a0 On the U.K. side, the primary goal is to index the audio of the <a href=\"http:\/\/www.natcorp.ox.ac.uk\/docs\/URG\/cdifsp.html\">spoken part of the British National Corpus<\/a>. On the U.S. 
side, we&#8217;ll be indexing a variety of other spoken materials, and working with our British partners on issues of pronunciation modeling across dialects, integration of diverse metadata from different sources, and approaches to web-based search and retrieval for various types of researchers.<\/p>\n<p>One of the things that I learned during John&#8217;s visit is that while he was at <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bell_Labs\">Bell Labs<\/a>, before he took the job at Oxford, he occupied the office that I had used during my last few years there. And as it happens, one of the other awards in the Digging into Data challenge was to a <a href=\"http:\/\/www.news.cornell.edu\/stories\/Jan10\/DiggingData.html\">group involving Mats Rooth at Cornell<\/a> &#8212; and Mats, I believe, occupied the same office during the interval between John&#8217;s time there and mine.<\/p>\n<p>For an example of what can be done with this sort of text\/audio alignment, take a look at the <a href=\"http:\/\/www.oyez.org\/\">oyez.org website<\/a>&#8217;s presentation of U.S. Supreme Court oral arguments (e.g. 
<a href=\"http:\/\/www.oyez.org\/cases\/2000-2009\/2007\/2007_07_290\">this one<\/a>).\u00a0 The techniques we&#8217;ll be using on the Digging into Data project were developed (mainly by Jiahong Yuan) for the SCOTUS application, under an NSF grant that just ended this past year.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>John Coleman was on the BBC Digital Planet program a couple of weeks ago, discussing a recently-awarded grant from the (British\/American\/Canadian) &#8220;Digging into Data&#8221; challenge.\u00a0 The proposal was submitted under the title &#8220;Mining a Year of Speech&#8221;, and also involves the British Library Sound Archive, and some researchers at Penn, including Jiahong Yuan, Chris Cieri, [&hellip;]<\/p>\n","protected":false},"author":4144,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-218446","post","type-post","status-publish","format-standard","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/218446","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/users\/4144"}],"replies":[{"embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/comments?post=218446"}],"version-history":[{"count":0,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/posts\/218446\/revisions"}],"wp:attachment":[{"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/media?parent=218446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/categories?post=218446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mereja.media\/index\/wp-json\/wp\/v2\/tags?p
ost=218446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}