EARS Metadata Extraction

Overview

As part of the EARS program, LDC creates annotated resources to support a metadata extraction (MDE) evaluation. The goal of EARS MDE is to enable technology that can take raw speech-to-text (STT) output and refine it into forms that are more useful to humans and to downstream automatic processes. In simple terms, this means creating automatic transcripts that are maximally readable. Readability might be achieved in a number of ways: removing non-content words such as filled pauses and discourse markers from the text; removing sections of disfluent speech; and marking boundaries at natural breakpoints in the flow of speech, so that each sentence or other meaningful unit of speech can be presented on a separate line in the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements of the readable transcript. In support of the EARS program, LDC has defined a SimpleMDE annotation task and has annotated English telephone and broadcast news transcripts to provide training data for MDE. The links below provide additional information about the EARS MDE Annotation project.
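As a rough illustration of the kind of cleanup described above, the sketch below turns a labeled token stream into a more readable transcript. It is not the SimpleMDE format or any EARS system: the Token type, its kind labels ("content", "filler", "edit") and the ends_unit flag are hypothetical names chosen for this example, and the labels are assumed to be supplied already, whether by annotators or by an automatic MDE system. Appending a period to every unit is also a simplification.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One word of STT output with hypothetical MDE-style labels."""
    text: str
    kind: str = "content"    # "content", "filler" (filled pause or discourse marker), or "edit" (disfluent speech)
    ends_unit: bool = False  # True if a sentence-like unit ends after this token


def format_unit(words: List[str]) -> str:
    """Join words, capitalize the first letter, and add a final period."""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


def clean_transcript(tokens: List[Token]) -> List[str]:
    """Drop fillers and edited speech; return one cleaned unit per line."""
    lines: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok.kind == "content":
            current.append(tok.text)
        # A boundary closes the unit even when it falls on a removed token.
        if tok.ends_unit:
            if current:
                lines.append(format_unit(current))
            current = []
    if current:  # flush a trailing unit that has no explicit boundary
        lines.append(format_unit(current))
    return lines


# Raw STT: "um i i want to uh go to boston you know it is nice"
example = [
    Token("um", "filler"), Token("i", "edit"), Token("i"), Token("want"),
    Token("to"), Token("uh", "filler"), Token("go"), Token("to"),
    Token("boston", ends_unit=True),
    Token("you", "filler"), Token("know", "filler"),
    Token("it"), Token("is"), Token("nice", ends_unit=True),
]

for line in clean_transcript(example):
    print(line)
```

Note that this sketch capitalizes only the first word of each unit; proper-noun capitalization, real punctuation and speaker-turn formatting would need additional information.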

Data and Timeline
Data annotated, timeline of annotation

2004 Annotation Effort

2003 Annotation Effort

Annotation Guidelines

2004 Official Annotation Guidelines (SimpleMDE V6.2)

2003 Official Annotation Guidelines (SimpleMDE V5.0)

Simple Pilot Exercise

Administrative (password protected)
Information about work assignments, progress and administrative details for LDC annotators 

Tools
Download the latest free MDE Annotation Toolkit for Windows and *NIX

NIST's RT-03 site

MIT-LL's MDE site (password protected)



strassel@ldc.upenn.edu
4/18/2004