Resources for SimpleMDE Pilot Annotation Exercise
- Data Content
- Tool and Pilot Study Data Distributions
- Tool User Manual
- Annotation Guidelines (.doc format, .pdf format)
- NEW: MDTM/CTM/UEM Files for Sites' Data (First-pass and Adjudication-pass)
- MDTM/CTM/UEM Files for Gold Standard Data
- NEW: AG/AIF->MDTM/CTM/UEM Exporter Package
Data Content
The data for the pilot annotation exercise consists of 30 minutes of telephone and 30 minutes of broadcast data. We have targeted 10 minutes of each data type as "mandatory" -- that is, all sites participating in the exercise must annotate these files. Sites should make every effort to annotate the remaining data as time allows.
CTS (Switchboard-1)
Mandatory: sw4927, sw4940
As time allows: sw4908, sw4936
Still more data: sw4917, sw4928
BN (Hub4)
Mandatory: ea980107
As time allows: ed980104
Still more data: ee970625
MDTM/CTM/UEM Files for Sites' First Pass and Adjudication Data (05/16/03)
The following packages contain MDTM, CTM and UEM files generated from sites' annotation files. The sites-1p.tar.gz file contains the data from the first-pass annotation, and the adj.tar.gz file contains the data from the adjudication pass.
adj.tar.gz
sites-1p.tar.gz
MDTM/CTM/UEM Files for Gold Standard Data (5/5/03)
The purpose of this distribution is to allow sites to do preliminary testing of the scoring tools and of the conversion process. This package contains MDTM/CTM/UEM files generated for LDC's three gold standard files (sw4927, sw4940 and ea980107). These were generated using LDC's AG->AIF converter and NIST's MDTM exporter. We will distribute binary builds of the complete converter (AG->AIF->MDTM/CTM/UEM) shortly. Please note that we are in the process of debugging/coordinating the conversion process, and these files are not final.
mdtm.20030505.1600.tar.gz
mdtm.20030505.1600.zip
AG/AIF->MDTM/CTM Exporter Package (5/7/03)
We are releasing an AG/AIF->MDTM/CTM exporter package for the current simple MDE data. Please note that we are in the process of coordinating the conversion process, and the output format is subject to change. The AG->AIF part was developed at the LDC. The AIF->MDTM/CTM part was developed by NIST.
These binary distributions contain an AG/AIF->MDTM/CTM exporter and related tools and libraries.
Solaris 8 (Sparc)
Linux (i386, Glibc 2.1.3)
Windows (i386)
Installation (On Unix/Linux):
The AIF->MDTM/CTM exporter requires Java 2 Standard Edition version 1.4 or later. It can be downloaded from Sun's Java site. Once it is installed, you may need to change your JAVA_HOME variable or your PATH variable. If your environment variable points to an older java executable, the script will fail.
Please go to the installation directory (one above the mdeTool directory), and unpack the package as usual.
gzip -cd mde-pilot-exporter1-*.tar.gz | tar xfv -
This will add a directory named "conversion" under the "mdeTool" directory. The exporter script (RTExporter.sh) is located in that directory.
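The Java requirement above can be checked before running the exporter. The following is a minimal Python sketch, not part of the distribution. It assumes the version banner has the usual form printed by `java -version`, for example `java version "1.4.2_03"`; the function names are hypothetical.

```python
import re

# Minimum version stated in the installation notes: J2SE 1.4.
MINIMUM = (1, 4)


def parse_java_version(banner):
    """Return (major, minor) parsed from a banner such as: java version "1.4.2_03".

    Assumption: the banner contains the word 'version' followed by a quoted
    dotted number. A missing minor component is treated as 0.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in: %r" % banner)
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) is not None else 0
    return major, minor


def java_is_new_enough(banner):
    """True if the banner reports a version at or above MINIMUM."""
    return parse_java_version(banner) >= MINIMUM
```

The banner text itself would have to be captured separately. Note that `java -version` normally writes to standard error rather than standard output.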
Installation (On Windows):
The AIF->MDTM/CTM exporter requires Java 2 Standard Edition version 1.4 or later. It can be downloaded from Sun's Java site.
Please go to the installation directory (one above the mdeTool directory), and unpack the zip package as usual.
This should add a directory named "conversion" under the "mdeTool" directory. The exporter script (rtexporter.bat) is located in the directory.
Exporter Usage (On Unix/Linux):
> RTexporter.sh (options) File.ag.xml
For BN files, the options "-r -I -z" are required.
For CTS files, the options "-r -z" are required.
Examples of actual usage:
cd ...../mdeTool/data
../conversion/RTexporter.sh -r -z sw....ag.xml
or
../conversion/RTexporter.sh -r -I -z ea....ag.xml
This should create .mdtm, .ctm, .uem (etc.) files in the data directory.
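The choice between the BN and CTS option sets can be sketched as a small Python helper that builds the command line. This is an illustration only: the function name is hypothetical, and deciding the data type from the "sw" file name prefix is an assumption based on the file names used in this pilot study.

```python
import os

# Option sets taken from the usage notes above.
BN_OPTIONS = ["-r", "-I", "-z"]
CTS_OPTIONS = ["-r", "-z"]


def exporter_command(ag_file, script="../conversion/RTexporter.sh"):
    """Build the exporter argument list for one .ag.xml file.

    Assumption: Switchboard (CTS) file names start with "sw"; any other
    file is treated as broadcast news (BN).
    """
    name = os.path.basename(ag_file)
    options = CTS_OPTIONS if name.startswith("sw") else BN_OPTIONS
    return [script] + options + [ag_file]
```

The returned list can be printed for inspection or passed to a process launcher from the mdeTool/data directory.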
Exporter Usage (On Windows):
The exporter runs from the command line. To open a terminal, you can choose the "Run.." item from the "Start" menu, and type in "cmd".
> rtexporter.bat (options) File.ag.xml
For BN files, the options "-r -I -z" are required.
For CTS files, the options "-r -z" are required.
Examples of actual usage:
cd \mdeTool\data
..\conversion\rtexporter.bat -r -z sw....ag.xml
or
..\conversion\rtexporter.bat -r -I -z ea....ag.xml
This should create .mdtm, .ctm, .uem (etc.) files in the data directory.
Broadcast News Data Supplement (4/15/03)
These binary distributions contain a supplementary package containing the BN data. These should be installed after the main package is installed.
Solaris 2.8 (Sparc)
Linux (i386, Glibc 2.1.3)
Windows (i386)
Note (4/15 2:30pm EST): A user reported a BN-specific problem that occurs when adding an interruption point in the complex depod window. We have fixed the bug, and replaced the distributions above around 2pm EST on 4/15. If you downloaded a package before that time and are having the same problem, you can replace the files mdeTool/bin/mdeMain.py and mdeTool/bin/mdeMain.pyc with the following files. We are sorry for the inconvenience. These files are common across all platforms.
mdeMain.py
Installation Instructions for Solaris and Linux:
Adding this package to the original package will not destroy any existing sw* files, but we strongly recommend that you make back-up copies of the annotation files that you have worked on (sw*.ag.xml files in the mdeTool/data directory) to keep in a safe place (such as your home directory).
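The recommended back-up step can be sketched in Python as follows. This is a minimal illustration, not part of the distribution; the function name and the choice of back-up directory are assumptions.

```python
import glob
import os
import shutil


def backup_annotations(data_dir, backup_dir):
    """Copy every sw*.ag.xml file from data_dir into backup_dir.

    Returns the names of the copied files. The originals are left in place.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(data_dir, "sw*.ag.xml"))):
        shutil.copy2(path, backup_dir)  # copy2 also preserves timestamps
        copied.append(os.path.basename(path))
    return copied
```

For example, `backup_annotations("mdeTool/data", os.path.expanduser("~/mde-backup"))` would copy the Switchboard annotation files to a directory under the home directory.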
First, please go to the installation directory of the original package --- namely, one directory above the mdeTool directory.
cd installationDirectory
The ls command should show the 'mdeTool' directory.
ls
mdeTool
Then the following command will unpack the update package.
gzip -cd mde-pilot-update1-*.tar.gz | tar xfv -
This will add speech files, annotation files and scripts for the BN data to appropriate directories (such as mdeTool/data and mdeTool/scripts). It will also install updated libraries.
After a successful installation, you will find the three scripts for the BN data: ea980107, ed980104 and ee970625 in the mdeTool/scripts directory.
Installation Instructions for Windows:
Adding this package to the original package will not destroy any existing sw* files, but we strongly recommend that you make back-up copies of the annotation files that you have worked on (sw*.ag.xml files in the mdeTool\data directory) to keep in a safe place (such as your home directory).
First, please unpack the .zip file using a utility such as WinZIP.
When extracting the update files, specify the same installation directory (one directory above the mdeTool directory) as the original package: for example, if the original package was extracted in C:\mde, then the update package should be extracted in C:\mde.
The (un)zip program may ask you if it's okay to overwrite certain files (.py, .pyc and .dll files). Answer YES to all.
This will add speech files, annotation files and scripts for the BN data to appropriate directories (such as mdeTool\data and mdeTool\scripts). It will also install updated libraries.
After a successful installation, you will find the three scripts for the BN data: ea980107, ed980104 and ee970625 in the mdeTool\scripts directory.
Using Tool with BN files:
The tool can be started from the newly added scripts/batch files in the mdeTool/scripts directory. The tool uses only one text panel for the BN files, and all speakers are displayed there. The tool does not allow users to highlight regions across speaker boundaries. This behavior is normal because annotations are not to be created across speaker boundaries.
Distributions of SimpleMDE Annotation Tool and CTS Data (4/11/2003)
These packages include both the annotation tool and the data for the simple MDE pilot study. Please note that the tool included in these packages has been updated since the beta release, and there is no need for you to keep the beta tool previously distributed. These packages are self-contained, and no additional packages are necessary.
The following binary distributions are available:
Solaris 2.8 (Sparc)
Linux (i386, Glibc 2.1.3)
Windows (i386)
** The BN data will be delivered in a separate package **
Installation Instructions for Solaris and Linux:
gzip -cd mde-pilot-*.tar.gz | tar xfv -
This will create a directory named mdeTool. About xx MB of hard disk space is required.
In the directory mdeTool/scripts, you will find shell scripts to start the tool and load data.
cd mdeTool/scripts
ls
sw4927 sw4917 ...
The following command (in mdeTool/scripts) will start the MDE annotation tool and load the speech file and the transcriptions for sw4927. It might take a while to display the text.
./sw4927 ...
Installation Instructions for Windows:
Please unpack the .zip file using a utility such as WinZIP.
When the zip file is unpacked, a directory named mdeTool is created. Go to the mdeTool directory, and then to a subdirectory named scripts.
Double-click on a desired script. For example, double-clicking on sw4927 (or sw4927.bat) will start the tool and load the speech file plus the transcriptions for sw4927. It might take a while to display the text.
Installation Problems:
If you have any problems installing and starting the software, please send a message to Haejoong Lee and Kazuaki Maeda (haejoong@ldc.upenn.edu, maeda@ldc.upenn.edu).
Submission Instructions
After you've completed annotation, you should package up the annotation files and email them to Chris Walker and Stephanie Strassel at LDC.
Annotation files are saved with the suffix .ag.xml. These files can be found in the distribution directory structure at:
{ROOT}/mdeTool/data
where: '{ROOT}' denotes the directory within which the distribution has been installed.
The full paths for the mandatory files are:
{ROOT}/mdeTool/data/sw4927-ms98-a-word.ag.xml
{ROOT}/mdeTool/data/sw4940-ms98-a-word.ag.xml
{ROOT}/mdeTool/data/ea980107.ag.xml
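Packaging the mandatory files for submission can be sketched in Python as below. This is a hedged illustration only: the function name and the archive format (.tar.gz) are assumptions, since the instructions above only ask that the files be packaged and emailed.

```python
import os
import tarfile

# Mandatory annotation files, as listed above.
MANDATORY = [
    "sw4927-ms98-a-word.ag.xml",
    "sw4940-ms98-a-word.ag.xml",
    "ea980107.ag.xml",
]


def package_mandatory(root, archive_path):
    """Bundle the mandatory files under root/mdeTool/data into a .tar.gz.

    Raises FileNotFoundError if any mandatory file is missing, so an
    incomplete package is never produced silently.
    """
    data_dir = os.path.join(root, "mdeTool", "data")
    missing = [name for name in MANDATORY
               if not os.path.isfile(os.path.join(data_dir, name))]
    if missing:
        raise FileNotFoundError("missing annotation files: " + ", ".join(missing))
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in MANDATORY:
            # Store files under their bare names, without the directory path.
            archive.add(os.path.join(data_dir, name), arcname=name)
    return archive_path
```

Any additional files annotated "as time allows" would need to be added to the list by hand.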
Comparison and Adjudication (4/25/03)
We have prepared binary distributions of a comparison/adjudication tool and gold standard files for the three required files (sw4927, sw4940 and ea980107). These packages include scripts to compare your annotation files against the gold standards. The tool will let you create adjudicated files as well as records of discrepancies and resolutions.
The following binary distributions are available:
Solaris 2.8 (Sparc)
Linux (i386, Glibc 2.1.3)
Windows (i386)
Installation Instructions for Solaris and Linux:
Adding this package to the original package will not destroy any existing *.ag.xml files, but we strongly recommend that you make back-up copies of any annotation files that you have worked on (*.ag.xml files in the mdeTool/data directory).
First, please go to the installation directory of the original package --- namely, one directory above the mdeTool directory.
cd installationDirectory
The ls command should show the 'mdeTool' directory.
ls
mdeTool
Then the following command will unpack the comparison package.
gzip -cd mde-pilot-comparison-*.tar.gz | tar xfv -
This will add the comparison tool and the gold standard files to the existing software and data.
After a successful installation, you will find three scripts named compare-sw4927, compare-sw4940 and compare-ea980107 in the mdeTool/scripts directory.
The following command will start the comparison tool (mdeCompare) with sw4927 files, for example.
./compare-sw4927
Installation Instructions for Windows:
Adding this package to the original package will not destroy any existing *.ag.xml files, but we strongly recommend that you make back-up copies of any annotation files that you have worked on (*.ag.xml files in the mdeTool\data directory).
First, please unpack the .zip file using a utility such as WinZIP.
When extracting the comparison files, specify the same installation directory (one directory above the mdeTool directory) as the original package: for example, if the original package was extracted in C:\mde, then the comparison package should be extracted in C:\mde.
The (un)zip program may ask you if it's okay to overwrite certain files (.py, .pyc and .dll files). Answer YES to all.
This will add the comparison tool and the gold standard files to the existing software and data.
After a successful installation, you will find three batch scripts named compare-sw4927, compare-sw4940 and compare-ea980107 in the mdeTool\scripts directory. Double-clicking on them will start the comparison tool (mdeCompare) with appropriate files.
Installation Problems:
If you have any problems installing and starting the software, please send a message to Kazuaki Maeda and Haejoong Lee (maeda@ldc.upenn.edu, haejoong@ldc.upenn.edu).
Adjudication Process with "compare-*" Scripts:
When one of the provided scripts (compare-*) is started for the first time, the tool creates a third file containing annotations where your annotations and the gold standards match. It will also create a list of differences (diffList -- *.diff files).
The tool will then let you select a resolution for each mismatch.
If you choose "File->Save (Both File3 and DiffList)" (after or before you start the adjudication process), the adjudication file and the diffList file will be saved.
When the script is started for the second time, it will use the existing adjudicated file and the diffList file. If you would like to start the adjudication process over, you can simply remove or rename the *-adjudicated-*.ag.xml file and the *.diff file in the data directory, and restart the script.
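Starting over by renaming, rather than deleting, keeps the earlier adjudication work recoverable. The following Python sketch shows one way to do this. The function name and the ".old" suffix are hypothetical, and the file name patterns are assumptions based on the adjudicated file names used in this pilot study.

```python
import glob
import os


def reset_adjudication(data_dir, file_id, suffix=".old"):
    """Rename the adjudicated file and the diffList file for one recording.

    file_id is a recording name such as "sw4927" or "ea980107". Returns the
    original names of the files that were renamed. Other annotation files
    for the same recording are left untouched.
    """
    patterns = [file_id + "-adjudicated*.ag.xml", file_id + ".diff"]
    renamed = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
            # Note: on Windows this fails if the target name already exists.
            os.rename(path, path + suffix)
            renamed.append(os.path.basename(path))
    return renamed
```

After the rename, restarting the compare script for that recording should behave as on a first run.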
User Manual (mdeCompare):
See the User Manual Section
Data Submission Instructions:
After you've completed the adjudication process, you should package up the annotation files and email them to Chris Walker and Stephanie Strassel at LDC.
The required files are:
mdeTool/data/sw4927-adjudicated-ms98-a-word.ag.xml
mdeTool/data/sw4940-adjudicated-ms98-a-word.ag.xml
mdeTool/data/ea980107-adjudicated.ag.xml
mdeTool/data/sw4927.diff
mdeTool/data/sw4940.diff
mdeTool/data/ea980107.diff
Comparison Tool Update (4/29/03)
This tool update adds the functionality to add comments for each comparison in the adjudication process. This is not a bug-fix; it's a tool enhancement.
This update is compatible with the '.ag.xml' and '.diff' files created with the original comparison package. As a result, if sites have already finished the adjudication process, the updated tool will let them add comments to the selections they have already made.
The following packages contain the updated mdeCompare.py and mdeText.py files. They are common for all platforms, but the .tar.gz file can be used for Unix/Linux and the .zip file can be used for Windows.
Comparison Tool Update (Tar gzip file for Unix/Linux)
Comparison Tool Update (Zip file for Windows)
Installation:
Adding this package to the original package will not destroy any existing *.ag.xml or *diff files, but we recommend that you make back-up copies of any files that you have worked on (*.ag.xml and *diff files in the mdeTool/data directory). Note: Do not remove the *adjudicated*ag.xml and *diff files that have been created in previous adjudication sessions unless you want to restart the adjudication process.
Installation is similar to all other previous packages. First, please go to the installation directory (one above the mdeTool directory).
If it's Unix/Linux, you can unpack it with the command:
gzip -cd mde-pilot-comparison-update.tar.gz | tar xfv -
If it's Windows, please unpack the zip file using a utility such as WinZip. It might ask you if it's okay to replace some *.py and *.pyc files; answer YES.
When the package is extracted, the mdeTool/bin/mdeCompare.py and mdeTool/bin/mdeText.py files will be replaced.
Adding Adjudication Comments:
If the update package is installed, the comparison tool should show a "comments" box above the selection buttons, as well as in the diffList table.
Comments can be typed in the "comments" box. If you select "File1", "File2" or "File3", the comments in the comments box will be associated with the selection in the diffList. When files are saved, the comments will be saved in the diffList (*.diff) file.
You can also add comments to existing diffList files. To do this, simply start the comparison scripts (without removing existing *adjudicated*.ag.xml and *diff files). Then, go to the comparison you would like to add comments to, type in comments in the comments box, and click on the "Add Comments" button.
Please note that if you move to another comparison without making a selection, or without "adding comments", the text in the comments box will disappear, and be replaced with the comments for the new comparison (often an empty string).
(Note: Hitting the space bar in the comment boxes no longer triggers audio output.)
User Manual (mdeCompare)
Display:
For both file types, there are 3 rows:
1) Input File 1
2) Input File 2
3) The annotation output file (File 3)
For Switchboard, each channel is presented in its own column:
1) Channel A is on the left
2) Channel B is on the right
Adjudication Process:
When the tool is loaded for a given file, the first annotational difference will be selected in the Adjudication Display at the bottom of the screen.
The Adjudication Display allows the user to select between the two annotations provided by the respective input files, or to choose an entirely different annotation.
To choose between pre-existing annotations:
1) Press the button 'Select File X' where 'X' is the file which contains the desired annotation.
2) Press the 'Next' Button.
3) The Adjudication Display should now present a new difference to be adjudicated.
-> NOTE: if the wrong file is ever selected, it may be unselected by pressing the 'Unselect' button.
To choose an annotation which is entirely new:
1) Add the annotation to File 3 in the usual way.
2) Press the button 'Select File 3'.
3) Press the 'Next' Button.
4) The Adjudication Display should now present a new difference to be adjudicated.
'QC' Process:
Occasionally, there will be an incorrect annotation in both channels (or an incorrect 'non-annotation'). These mistakes can be fixed by altering the annotation in File 3 in the usual way.
Note, however, that if annotations are added/removed/changed at points where there is no disagreement, then the 'Select File 3' process need not be followed.
A Note on Order of Presentation:
It may seem like the tool is asking you to adjudicate the same annotation multiple times. This is not a bug. If you look at the contents of the fields in the Adjudication Display, the details of the specific annotational difference in question are presented. (When the difference is 'Only in File X', then the field for File Y will be empty.)
In addition to seemingly re-presenting differences, the tool also frequently seems to 'jump around' in the file. This is due to the order in which the differences are listed in the underlying '.diff' file generated by the tool. The order of presentation there is:
1) All of Channel A followed by all of Channel B.
2) Within each channel:
Discourse Markers
Filled Pauses
Depods
Asides
EETs
QT
NoRTMetadata
SUs
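The ordering above explains why the tool appears to jump around: differences are grouped by channel and annotation type, not by time. A minimal Python sketch of that ordering follows. The tuple layout (channel, annotation type, start time) is a hypothetical stand-in and is not the actual '.diff' file format.

```python
# Order of annotation types within a channel, as listed above.
TYPE_ORDER = [
    "Discourse Markers",
    "Filled Pauses",
    "Depods",
    "Asides",
    "EETs",
    "QT",
    "NoRTMetadata",
    "SUs",
]
CHANNEL_ORDER = ["A", "B"]


def presentation_key(diff):
    """Sort key: channel first, then annotation type. Time is not used."""
    channel, annotation_type = diff[0], diff[1]
    return CHANNEL_ORDER.index(channel), TYPE_ORDER.index(annotation_type)


def presentation_order(diffs):
    """Return the differences in the order the tool would present them."""
    return sorted(diffs, key=presentation_key)
```

In this sketch a Depod difference late in Channel A is presented before an SU difference early in Channel A, which matches the 'jumping' described above.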
User Manual (mdeTool)
We plan to add an integrated user manual in the actual annotation tool. The following describes how to add, delete and check annotations. (This section is provided for documentation purposes. You may find in actual practice that the tool is largely self-explanatory.)
Text Manipulation
Text Display
Channel A in left pane, Channel B in right pane
[Note: Speaker turns should line up temporally --this needs to be added to next version of tool.]
Text Scrolling
Leftmost scrollbar controls Speaker A
Rightmost scrollbar controls Speaker B
Middle scrollbar controls A and B in sync
Text Selection
Click on word to select
To select span, click on first and last word in span
[Note: This selection is progressive - clicking on the third word then clicking on the fifth word then clicking on the ninth word will select the fifth through ninth words, not the third through ninth words.]
To move to the other speaker's transcript, hit <TAB>
Audio Display and Playback
When you select a region of text in the main transcription window, the corresponding audio region is automatically selected in the wavefile below. To play this section of audio, hit <space>. Note that the granularity of the playback selection is determined by the pre-existing segmentation boundaries in the underlying transcript file.
You can also select a region to play from within the waveform by swiping your mouse across the audio file and hitting the play button (the black arrow). To play just one speaker at a time, hit the open arrow from the playback controls. The slider bar to the left of the playback controls allows you to adjust the granularity of the waveform display. The slider bar below the waveform image allows you to scroll back and forth in the waveform.
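The progressive selection rule described under Text Selection (the span always runs between the two most recent clicks) can be sketched in Python as below. This illustrates the rule only and is not the tool's actual code; the class name is hypothetical, and treating the span as unordered (smaller position first) is an assumption.

```python
class ProgressiveSelection:
    """Track word clicks; the selected span joins the two latest clicks."""

    def __init__(self):
        self.clicks = []  # holds at most the two most recent word positions

    def click(self, word_position):
        """Record a click and return the resulting (first, last) span."""
        self.clicks = (self.clicks + [word_position])[-2:]
        return self.span()

    def span(self):
        """Return the selected span, or None if nothing has been clicked."""
        if not self.clicks:
            return None
        return min(self.clicks), max(self.clicks)
```

Clicking words 3, 5 and 9 in turn yields the spans (3, 3), (3, 5) and (5, 9), matching the example in the note.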
Annotation
Edit Disfluencies and Fillers
To add an annotation layer
- Select region of text to be annotated
- Click right mouse button or <enter>/<return>
- Add annotation window appears
- Select edit or filler type
- Hit Submit to record annotation
- You'll then see the annotation represented within the transcript. Each annotation layer is represented with a color-coded underlining. Fillers are further displayed in colored font.
To remove an annotation layer
- Select affected text
- Hit control-d
- Dialog box will pop open displaying all layers of annotation affecting that region. Select the layer you wish to delete and hit 'delete'
Fillers
Filled Pauses
Filled pauses belonging to the list of targeted FP words are pre-annotated and displayed in blue font and blue underlining.
Discourse Markers
DM tokens belonging to the list of targeted DM words are pre-identified in red font but are *not* pre-annotated. You must select and tag all DMs.
Explicit Editing Terms and Asides/Parentheticals
These are not pre-identified or pre-tagged by the tool. You must identify and tag all EETs and Asides/Parentheticals manually.
Note: Colored fonts are used to highlight particular tokens that belong to a list of pre-identified words (of, e.g., filled pause words). Only items on these pre-identified lists show up with colored font. Annotation (whether manual or automatic) shows up as color-coded underlining.
Edit Disfluencies
Simple Depods (one edit disfluency)
Simply select 'Depod' and submit
Annotation tool will automatically identify the IP at right edge of the depod
Complex depods with multiple interruption points
Select 'Depod with multiple IP' and submit
A new window will open that allows you to identify the multiple IPs
Within this new window you'll see the text of the disfluency displayed. Click on the word that precedes the interruption point and select 'Add IP'. Repeat for each word that precedes an IP. The annotation tool will automatically identify the IP at the rightmost edge of the depod (after the final word).
SUs
SUs are annotated as a second pass over the data after edit and filler annotation is complete.
To begin SU annotation, select Edit > SU Annotation from pulldown menu
A new SU annotation window will open
To add a new SU boundary
- Select the final word of the SU in the main transcription window
- Select SU type to be added in SU annotation window
- Hit 'Insert' to save annotation
- The SU annotation will be displayed in the main transcription window. Annotations are displayed using a color-coded slash followed by a punctuation character following the final word of the SU
To remove an existing SU boundary
- Select the final word of the SU in the main transcription window.
- Hit 'Delete' to remove the SU boundary
- The SU annotation will disappear from the main transcription window.
When you are finished with SU annotation, click 'Exit' to dismiss the SU annotation window.
Options
For each annotation layer you add, you can also select one or more of the following options:
Difficult Decision
To be used for difficult annotation decisions
You can use the comments text box to describe difficulty
Problems
Questionable Transcription
Identifies regions where transcript does not match audio
These regions should still be annotated for metadata
No RT/Metadata
Identifies regions that are too difficult to annotate at all (because of badly mismatched audio/transcript, overlapping speech or some other complicating factor)
Saving Annotations
To save the annotation file, select File > Save from the pulldown menu.
You also have the option of saving the file under a different name, using the File > Save As option.
Checking Annotations
These are all options under the Edit pulldown menu
Show rendered text
This option will open a window that displays a text rendering of the cleaned-up transcript. The window will display both speakers for telephone data. Within this window, depods and fillers are removed from the transcript and are not displayed. SU boundaries are displayed with a slash plus punctuation. Unannotated regions of text are not displayed.
Show selected annotations
This option will open a window that displays just the selected annotation layers for a particular channel. This is a useful option for second passing when you want to focus on one type of annotation. (Note: This window will display the tagged transcript for whichever channel your cursor is currently positioned in. Newer versions of the tool will display selected annotations for both speakers).
Show List of Annotations
This option allows you to view the underlying annotation layers for careful second passing and QC purposes. This option will open a window allowing you to select particular annotation records for viewing in a separate window. The righthand column lists the annotation types. You can choose one or multiple annotation types to display. The lefthand column allows you to select which features to display for each annotation record. The recommended feature selections are checked by default. The additional feature selections are details of the structured annotation record that won't be illuminating for quality checking.
After making your selection, hit 'Show'. Another window opens that displays each of the selected annotation types and their features. You can click on any of these annotation records and the tool will automatically select that annotation span within the main transcription window. (Note: This window will display the list of annotations for whichever channel your cursor is currently positioned in. Newer versions of the tool will display selected annotation layers for both speakers).
Validate Annotations
This option can be selected by choosing File > Validate Data from the pulldown menu. This option will check for common errors including:
- Content of filled pauses
- Content of discourse markers
- Content of backchannels
- SU assignment
A dialog box will open reporting any invalid annotations including words not assigned to an SU.