To: tdt-distrib@unagi.cis.upenn.edu
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Mandarin segmenter, again
Date: Wed, 14 Apr 1999 14:23:05 EDT

TDT-ers,

Yesterday afternoon, Zhibiao Wu realized that setting up CGI scripts
and database procedures to keep track of (individual and
institutional) licenses for Dragon's Mandarin segmentation code would
be roughly the same amount of work as writing a new segmenter, namely
about half a day. So he did the latter rather than the former.

The resulting 172-line Perl script implements the same simple
algorithm as the Dragon software, which (I believe) was originally
designed by Paul Bamberg and implemented by Dean Bandes in 1995, as
part of setting up the scoring system for the Mandarin section of the
CallHome project. The algorithm is simply to find the
maximum-probability segmentation of the character string relative to
estimated word probabilities in the LDC CallHome lexicon; thus it is a
simple dynamic programming graph search problem. Zhibiao's version,
though independently implemented, ought therefore to produce output
identical to that of the Bamberg/Bandes code in all cases, and does
so in the couple of tests that we have run. It also seems to run
about twice as fast as the Dragon code, though I believe this is due
to some debugging/checking in the Bandes version that could in
principle be turned off.
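
For readers who want to see the shape of the search, here is a
minimal sketch of maximum-probability segmentation by dynamic
programming. It is an illustration in Python, not Zhibiao's Perl
script and not the Bamberg/Bandes code. The names segment, TOY_LEXICON,
UNKNOWN_CHAR_PROB and max_word_len are hypothetical, the probabilities
are invented, and the floor probability for characters missing from
the lexicon is an assumption made only so that every input has some
segmentation.

```python
import math

# Assumption: floor probability for a single character not in the lexicon.
UNKNOWN_CHAR_PROB = 1e-8

# Invented toy probabilities, NOT values from the LDC CallHome lexicon.
TOY_LEXICON = {
    "北京": 0.02, "大学": 0.01, "大学生": 0.005, "北京大学": 0.004,
    "学生": 0.008, "大": 0.003, "学": 0.002, "生": 0.001,
}


def segment(text, word_prob, max_word_len=4):
    """Return the highest-probability split of text into lexicon words."""
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i],
    #            start index of the last word in that split)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue  # text[:start] has no valid split
            piece = text[start:end]
            prob = word_prob.get(piece)
            if prob is None:
                if end - start != 1:
                    continue  # unknown multi-character strings are not words
                prob = UNKNOWN_CHAR_PROB
            score = prev_score + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    words.reverse()
    return words


if __name__ == "__main__":
    print(" ".join(segment("北京大学生", TOY_LEXICON)))
```

With these toy numbers the split into 北京 and 大学生 beats the split
into 北京大学 and 生, because the product of the word probabilities is
larger. Summing log-probabilities rather than multiplying raw
probabilities keeps long strings from underflowing.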

We'll release Zhibiao's program under an open-source license. It
should appear on the http://www.ldc.upenn.edu/Projects/Mandarin page
this afternoon, if it is not already there. Like the Bamberg/Bandes
program, it requires a copy of the LDC 'CallHome' Mandarin lexicon,
whose license terms will not change.

I'd like to thank Paul and Dean again for their 1995 work, and
especially to thank Steve and various Dragon lawyers for recent hard
work to get the various licenses drafted and approved. I apologize to
Steve for not having saved him quite a bit of trouble by urging
Zhibiao to do this right after the DARPA workshop instead of
yesterday. My mistake was thinking that the earlier license structure
would work without much extra tweaking; I should have known that
simple lawyering is almost always more trouble than simple
programming.

-Mark Liberman


Last updated Thu May 13 09:28:23 1999