To: "TDT distribution list" <tdt-distrib@ldc.upenn.edu>
From: "George Doddington" <doddington@email.msn.com>
Subject: TDT Dev Test issues -- some comments
Date: Mon, 7 Sep 1998 22:30:53 -0700
* Regarding story boundaries: A "story" sometimes has no words, in which
case its boundary word IDs are null. This is especially common in ASR
transcriptions. These stories will, however, be included in the test and
in the training as they naturally occur.
* Regarding missing ASR files: In the Dev Set, the ASR file is missing for
1 (out of 230) CNN file and for 30 (out of 57) VOA files. When this happens,
a manual transcription file will be substituted for the missing ASR file.
The reason for doing this is to keep the results as comparable as possible
by keeping the story set the same.
* Regarding the latent miss rate in the annotated tag table: LDC is working
on reducing this miss rate. One potential way of reducing the miss rate is
to use the ensemble output of all the experimental systems as a candidate
list. I question whether this is practicable, but we should discuss this at
the meeting at IBM.
* Regarding the processing of MISCELLANEOUS stories: MISC stories
are not used in training for the tracking task. These stories are included,
however, during test. (Story type is not permissible information to the
TDT system, so there is no choice but to process all stories.) In scoring,
however, performance will be computed with MISC stories excluded.
* Regarding the beginning word ID in test files: This word ID is where
testing is to begin. While it is typically also the beginning word in a
story, that is not necessarily the case.
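
The substitution and MISC-exclusion rules above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not
the official scoring software: the Story fields, the "MISCELLANEOUS" label
string, the choose_transcript and score helpers, and the use of miss and
false-alarm rates as the performance measure are all hypothetical choices
made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Story:
    story_id: str
    story_type: str  # hypothetical label; not visible to the system under test
    on_topic: bool   # reference annotation
    decision: bool   # system's YES/NO decision for this story


def choose_transcript(asr_path: Optional[str], manual_path: str) -> str:
    """Use the ASR file when present, otherwise fall back to the manual one."""
    return asr_path if asr_path is not None else manual_path


def score(stories: List[Story]) -> Dict[str, float]:
    """Compute miss and false-alarm rates with MISC stories excluded.

    The system has already processed every story; the exclusion happens
    only here, at scoring time.
    """
    scored = [s for s in stories if s.story_type != "MISCELLANEOUS"]
    targets = [s for s in scored if s.on_topic]
    nontargets = [s for s in scored if not s.on_topic]
    misses = sum(1 for s in targets if not s.decision)
    false_alarms = sum(1 for s in nontargets if s.decision)
    return {
        "p_miss": misses / len(targets) if targets else 0.0,
        "p_fa": false_alarms / len(nontargets) if nontargets else 0.0,
    }
```

Note that a story with no words still appears in the list passed to score;
nothing in this sketch filters it out.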
--------
George Doddington in Orinda, CA, 925/631-6628, doddington@msn.com
Last updated Wed Sep 9 09:40:57 1998