To: TDT Distrib <tdt-distrib@ldc.upenn.edu>
From: Paul van Mulbregt <paulvm@dragonsys.com>
Subject: Anomalies in tokenization
Date: Thu, 28 Sep 2000 10:29:09 -0400
1. In the TDT2 file tdt2v3.1/tkn/19980401_1300_1319_APW_ENG.tkn, some
anomalies in tokenization have appeared.
This is a long story (7659 tokens), and about 150 of its words are
out-of-vocabulary (OOV) for our word list. Of those 150, about 100 seem to
be "double words": words that belong apart but have been collapsed into one.
<W recid=39> thecountry's
<W recid=50> havesuprised
<W recid=64> in12
<W recid=75> allowedunder
<W recid=84> bankincreased
<W recid=102> takesinto
<W recid=111> profitmargin.
<W recid=120> byincreasing
<W recid=164> Bank,HongKong
<W recid=184> exchange(forex)
<W recid=194> thecurrencies
<W recid=204> PoundSterling.
...
Looking at the context of these makes it clear that they really are joined
inappropriately.
E.g. "time in12 years", "Hock Hua Bank,HongKong Bank, Hong Leong Bank" etc.
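For anyone who wants to check other files for the same problem, here is a
rough Python sketch of the kind of check described above: flag an OOV token
when it can be cut into two pieces that are both in the word list. It is
only an illustration, not the procedure we actually ran. The line pattern is
assumed from the excerpts above, `vocab` stands in for whatever word list
you use, and the function names are made up for this example.

```python
import re

# Assumed line shape, based on the excerpts above: "<W recid=39> thecountry's"
TOKEN_LINE = re.compile(r"<W recid=(\d+)>\s*(\S+)")


def parse_tokens(lines):
    """Yield (recid, token) pairs from .tkn-style lines; skip anything else."""
    for line in lines:
        m = TOKEN_LINE.match(line.strip())
        if m:
            yield int(m.group(1)), m.group(2)


def normalize(word):
    """Lowercase and trim surrounding punctuation (a deliberately crude choice)."""
    return word.strip(".,()\"").lower()


def split_candidates(token, vocab):
    """Return (left, right) splits of an OOV token where both halves are in vocab."""
    core = normalize(token)
    if not core or core in vocab:
        return []
    splits = []
    for i in range(1, len(core)):
        left, right = normalize(core[:i]), normalize(core[i:])
        if left in vocab and right in vocab:
            splits.append((left, right))
    return splits


def flag_collapsed(lines, vocab):
    """Return (recid, token, splits) for every token with at least one valid split."""
    flagged = []
    for recid, token in parse_tokens(lines):
        splits = split_candidates(token, vocab)
        if splits:
            flagged.append((recid, token, splits))
    return flagged
```

Calling `flag_collapsed(open(path), vocab)` on a .tkn file would list the
suspect records. A two-way split will miss triples such as "Bank,HongKong",
and a large word list will produce false alarms on legitimate compounds, so
the output still needs a look by eye.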
This is an AP newswire file. Did it arrive with this level of dropped
spacing? Is this common?
2. I also see that the first line in this file
<W recid=1> MALAYSIAN
is all capitalized, a phenomenon that doesn't occur in the few other APW
files I checked. Are we at the mercy of whatever the APW decides to do?
-Paul
------------------------------------------------------------------
Paul van Mulbregt, Dragon Systems, a Lernout & Hauspie Company, Newton, MA.
(617) 965-5200
email: paulvm@dragonsys.com
Last updated Fri Oct 6 09:38:56 2000