(163) previous ~ index ~ next

To: tdt-distrib@ldc.upenn.edu
From: Ralf Brown <Ralf_Brown@v.gp.cs.cmu.edu>
Subject: more missed labels in devset
Date: Wed, 09 Sep 1998 08:39:37 -0400

I ran the entire devset through DTree as both training and test documents,
and checked the resulting 58 "false alarms" on the 7 events with at least
24 YES stories. Ten more errors:

Event 39:
APW19980301.0188 should be YES

Event 41:
CNN19980421.1130.0900 and CNN19980422.1130.0946 should be YES
CNN19980422.1130.0000 should be BRIEF

Event 44:
NYT19980304.0423 should be at least BRIEF
CNN19980307.1300.0342 should be YES (assuming the Minnesota case is
    on-topic, as it seems to be from the given examples)

Event 47:
CNN19980323.1600.1057 should be at least BRIEF
CNN19980404.1000.0109 should be YES
CNN19980421.1600.0350 should be at least BRIEF

Event 65:
CNN19980419.1130.0948 should be BRIEF

Note that some of these errors affect the split points because they are YESes
in the "training" portion of the March/April data....

So far, assuming that all of my own judgements are confirmed, we have miss
rates (YES+BRIEF) of
Event 39: 3 of 54
Event 41: 3 of 27
Event 42: 6 of 29
Event 43: 1 of 15
Event 44: 21 of 174, or 26 of 179 (if teen smoking is on-topic)
Event 47: 3 of 29
Event 48: 24 of 159
Event 56: 10 of 52
Event 57: 1 of 13
Event 63: 2 of 17
Event 64: 1 of 12
Event 65: 2 of 46

Thus we have labeling miss rates between about 4 percent (Event 65) and
21 percent (Event 42)....
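
For anyone who wants to recheck the arithmetic, the per-event rates can be recomputed from the counts listed above. This is only a rough sketch: the names MISSES and miss_rates are made up for illustration and are not part of DTree or any TDT tooling, the second number in each pair is taken to be the story total for the event, and Event 44 uses the narrower count (teen smoking treated as off-topic).

```python
# Missed (YES+BRIEF) labels and story totals per event, copied from the
# list in this message. The dictionary layout is an illustrative assumption.
# Event 44 uses the narrower count (teen smoking treated as off-topic).
MISSES = {
    39: (3, 54),
    41: (3, 27),
    42: (6, 29),
    43: (1, 15),
    44: (21, 174),
    47: (3, 29),
    48: (24, 159),
    56: (10, 52),
    57: (1, 13),
    63: (2, 17),
    64: (1, 12),
    65: (2, 46),
}


def miss_rates(counts):
    """Return {event: missed / total} for each (missed, total) pair."""
    return {event: missed / total for event, (missed, total) in counts.items()}


rates = miss_rates(MISSES)

# Print one line per event, rate shown as a percentage with one decimal.
for event, rate in sorted(rates.items()):
    print(f"Event {event}: {rate:.1%}")

# Events with the lowest and highest miss rate.
lowest = min(rates, key=rates.get)
highest = max(rates, key=rates.get)
print(f"lowest: Event {lowest}, highest: Event {highest}")
```

Swapping the Event 44 entry for (26, 179) gives the rate under the broader reading in which teen smoking is on-topic.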

Ralf

Last updated Wed Sep 9 09:40:57 1998