To: tdt-distrib@unagi.cis.upenn.edu
From: Shudong Huang <shudong@unagi.cis.upenn.edu>
Subject: Evaluation of Chinese-English Word Lists (Text)
Date: Thu, 2 Sep 1999 16:37:08 -0400 (EDT)
Dear All,
I'm sending you an evaluation of Chinese-English word lists compiled at LDC
and unofficially distributed to you through our website. This is much
delayed information and I sincerely apologize for this.
Here's the plain text file.
Sincerely,
Shudong
===============================================================
Evaluation of LDC's Bilingual Dictionaries
This note evaluates the Chinese-English bilingual word lists collected by
LDC and unofficially distributed to TDT'ers for reference, which are
available at http://www.ldc.upenn.edu/Projects/Chinese. The goal is to
explore why the collections differ so much between the pronunciation
dictionary that comes from the hub4 and hub5 projects and the dictionaries
compiled from various resources (particularly the Internet). Please send
your comments and questions to shudong@unagi.cis.upenn.edu.
I. Data Description
Three dictionaries are evaluated in this note. They are briefly described
in Table I.
                        Table I  Data Summary

Original File   Short Name        Total Number of
Name            Henceforth Used   Unique Entries    Description
=============   ===============   ===============   ===========================
ma_lex.v03      ma_lex.v03        43,968            Mandarin Pronunciation
                                                    Lexicon (v.3), based on
                                                    hand-segmented/corrected
                                                    data of hub4 and hub5
-------------   ---------------   ---------------   ---------------------------
ldc_ce_dict.    ce.v1             24,298            Chinese to English
1.0.gb                                              dictionary (v.1), based on
                                                    CEDICT and LDC resources
-------------   ---------------   ---------------   ---------------------------
ldc_ce_dict.    ce.v2             128,341           Chinese to English
2.0.txt                                             dictionary (v.2), based on
                                                    ce.v1 and reversed English
                                                    to Chinese dictionaries v.1
                                                    and v.2
(Note: 1. For ma_lex.v03, words with different pronunciations are listed
separately. The total number of entries, counting these non-unique ones,
is 44404.
2. ldc_ce_dict.2.0.txt is not completely 'clean': counting non-unique
entries, it has 25 more than the number given above.)
II. Comparison
Ma_lex.v03 and ce.v1 have a total of 13,525 entries in common. Although
ce.v2 is significantly larger than ce.v1, the number of entries shared
between this 'huge' bilingual word list and the pronunciation dictionary
only increases to 19,409 - in other words, there are still 24,559 items in
ma_lex.v03 missing from ce.v2. That's over 50%.
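The comparison above amounts to a set intersection and a set difference. A
minimal sketch follows, assuming each dictionary has already been reduced to
a set of unique headwords; the toy entries and the function name
overlap_stats are hypothetical and only illustrate the arithmetic.

```python
def overlap_stats(reference, other):
    """Return (shared, missing, missing_pct) for `reference` relative to `other`."""
    reference, other = set(reference), set(other)
    shared = reference & other            # entries found in both lists
    missing = reference - other           # entries only in the reference list
    pct = 100.0 * len(missing) / len(reference) if reference else 0.0
    return len(shared), len(missing), pct

# Toy data standing in for the real word lists (hypothetical entries).
ma_lex = {"w1", "w2", "w3", "w4"}
ce = {"w2", "w4", "w5"}
shared, missing, pct = overlap_stats(ma_lex, ce)

# The figures reported in this note, checked the same way.
total, common = 43968, 19409
reported_missing = total - common
reported_pct = 100.0 * reported_missing / total
print(reported_missing, round(reported_pct, 1))
```

With the reported totals, the difference is 24,559 and the share of missing
entries comes to a little under 56%.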
Since ma_lex.v03 is larger than ce.v1, the first question that comes to
mind is whether there's anything wrong with ma_lex.v03, for example that
too many of the items not shared with ce.v1 are proper names, or phrases
that should be further segmented. Since ma_lex.v03 already has
part-of-speech (POS) information, the easiest thing to do is simply to
compare the POS distribution of the missing words with the overall POS
distribution of ma_lex.v03. The results are in Table II. (POS information
is not available for either ce.v1 or ce.v2.)
Table II  Distribution of ma_lex.v03 words missing from ce.v2 in
          comparison with the overall distribution of ma_lex.v03,
          each in descending order of word count

   Missing Words              All Words
====================      ====================
noun           13515      noun           23428
verb            4927      verb           10313
name            2058      adj             3365
adj             1354      name            2419
phrase          1014      phrase          1232
adv              580      for_name        1123
for_name         470      adv             1113
adj_r             99      number           197
verb_r            94      pro              149
pro               85      class            146
acronym           71      conj             114
conj              50      adj_r            112
onom              49      surname          107
surname_affix     47      verb_r            97
number            46      onom              95
name_affix        43      acronym           94
class             31      interj            64
number_class      21      surname_affix     49
name_seg          20      name_affix        44
interj            17      prep              42
adv_r             14      number_class      31
surname           12      name_seg          20
prep              11      part_final        16
part_final         6      adv_r             15
part               2      part_struc         7
class_r            1      affix              4
part_struc         1      part_asp           4
affix              0      part               3
part_asp           0      class_r            1
(Note: 1. For an explanation of terms used here, please consult
documentation that comes with the released pronunciation dictionary.
2. For words with more than one POS tag, the first one is used here
as it was considered to be the primary category when tags were added.)
A quick look at the above table suggests that the number of missing words
for a particular category is more or less proportional to the number of
words in ma_lex.v03 for that category. For instance, noun and verb rank
first and second in both lists.
There is, however, a certain degree of disproportion. This is best seen
from Table III, which shows what percentage of the words in ma_lex.v03 are
missing for each POS category.
Table III  Missing Words over All Words in ma_lex.v03
           in Descending Order of Missing Percentage

POS            Missing     All  Missing/All (%)
=============  =======  ======  ===============
class_r              1       1              100
name_seg            20      20              100
name_affix          43      44               98
verb_r              94      97               97
surname_affix       47      49               96
adv_r               14      15               93
adj_r               99     112               88
name              2058    2419               85 !
phrase            1014    1232               82 !
acronym             71      94               76
number_class        21      31               68
part                 2       3               67
noun             13515   23428               58 !
pro                 85     149               57
adv                580    1113               52
onom                49      95               52
verb              4927   10313               48
conj                50     114               44
for_name           470    1123               42
adj               1354    3365               40
part_final           6      16               38
interj              17      64               27
prep                11      42               26
number              46     197               23
class               31     146               21
part_struc           1       7               14
surname             12     107               11
affix                0       4                0
part_asp             0       4                0
Total            24638   44404               55
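The percentages in Table III can be recomputed directly from the counts in
Table II. The sketch below is illustrative only: it uses a handful of rows
copied from the tables, and the names counts and missing_pct are
hypothetical. It ranks the categories by missing percentage and picks out
those above the overall figure.

```python
# (missing, all) counts for a few POS categories, copied from Tables II and III.
counts = {
    "name": (2058, 2419),
    "phrase": (1014, 1232),
    "noun": (13515, 23428),
    "verb": (4927, 10313),
    "for_name": (470, 1123),
    "adj": (1354, 3365),
    "surname": (12, 107),
}

def missing_pct(missing, total):
    """Percentage of a category's words that are missing from ce.v2."""
    return 100.0 * missing / total

overall = missing_pct(24638, 44404)   # the 'Total' row of Table III

# Sort categories by missing percentage, highest first.
rows = sorted(
    ((pos, missing_pct(m, a)) for pos, (m, a) in counts.items()),
    key=lambda row: row[1],
    reverse=True,
)
above_overall = [pos for pos, pct in rows if pct > overall]

for pos, pct in rows:
    flag = " !" if pct > overall else ""
    print(f"{pos:<10}{pct:6.1f}{flag}")
```

Among these seven rows, only name, phrase and noun come out above the
overall percentage, which matches the rows marked '!' in Table III.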
At the top of the list are some minor categories with a small number of
words - most of which are reduplicated words. According to the
segmentation guidelines for the pronunciation dictionary, certain types of
reduplicated words are not segmented.
Along with reduplicated words at the top are name-related words. This is
to be expected: whenever names are collected in a lexicon, significant
differences may exist among dictionaries depending on the source data and
collection criteria.
Note that foreign names score relatively low on the list (lower than the
overall percentage). This reflects a very important distinction between
Chinese and foreign (Western) names - there's no limit on the former, but
the latter are more or less of a closed set, despite regional differences
of transliteration.
After name-related words, the category showing the highest degree of
disproportion is phrases; ma_lex.v03 contains many phrases not found in
ce.v2 (as many as 82%). Obviously the two dictionaries have very different
criteria for collecting or defining phrases. Fortunately, phrases make up
less than 3% of ma_lex.v03.
The major POS category with a higher-than-overall missing percentage is
(common) noun, though it is not much higher than the overall percentage.
This category deserves some special attention, as noun is the biggest POS
category and over 50% of the missing words are nouns.
III. Missing Nouns - A Case Study
For reference, the medium-sized Chinese-English Dictionary, published by
the Commercial Press of mainland China in 1980, is used. This has been
considered a more or less classic and standard translation
dictionary. Although it is nearly two decades old and may lack many words
that reflect more recent social, political, scientific and technological
developments, it is still widely used and serves as a good reference. This
dictionary will be referred to as CED.
CED collects over 6,000 individual characters and over 50,000 multiple
character words. Compounds and lexicalized phrases (excluding 4-character
idioms and proverbs) are not entered as individual lexical items but are
listed after the head (or the first element).
A random sample of 100 missing nouns was examined in detail. Among them,
47 appear in CED as individual words and 12 as compounds/lexicalized
phrases. If CED is taken as the standard, these 59 items should be included
in a medium-sized lexicon.
22 of the remaining 41 items may also be treated as words, even though they
are not in CED. The details are as follows:
* 2 have an -r suffix;
* 1 has a -zi suffix;
* 3 have an affix other than -r or -zi;
* 3 are somewhat technical;
* 2 are somewhat colloquial/dialectal;
* 11 are regular words - some represent relatively new concepts (such as
evening university) but a couple have rather vague meanings unless
contextualized.
The -r suffix is a bit tricky though. In most cases, it does not change the
meaning of the stem to which it's attached. However, it has some
phonological/phonetic effects - complicated though often predictable - on
the pronunciation of the whole string. For this reason, words with an -r
suffix are included in LDC's pronunciation dictionary.
Of the final 19 items, 11 have 4 or more characters and, by some
criteria of wordhood, should perhaps be more properly tagged as phrases
instead of simply nouns (unless of course the expression is an idiom). The
wordhood of the rest - 8 in total - is much harder to decide,
e.g. jin1kuang4, meaning 'gold ore' (as opposed to, say, jin1bi4, meaning
'golden coin').
Therefore as far as nouns are concerned and based on the random sample,
about 80% of those missing from ce.v2 should be included in a medium-sized
lexicon; only 10-20% may be treated as phrases and hence excluded from the
lexicon if stricter segmentation principles are to be followed.
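The percentages in this conclusion follow from tallying the sample
categories above. A small sketch of that arithmetic, with hypothetical
labels for the categories:

```python
# Counts from the random sample of 100 missing nouns described above.
sample = {
    "in CED as individual words": 47,
    "in CED as compounds/lexicalized phrases": 12,
    "word-like but not in CED": 22,
    "4+ characters, arguably phrases": 11,
    "wordhood hard to decide": 8,
}

total = sum(sample.values())

# Items that should be in a medium-sized lexicon: the first three groups.
keep = (
    sample["in CED as individual words"]
    + sample["in CED as compounds/lexicalized phrases"]
    + sample["word-like but not in CED"]
)
keep_pct = 100.0 * keep / total

# Possible phrases: at least the long items, at most the long items plus
# the hard-to-decide ones.
phrase_low = 100.0 * sample["4+ characters, arguably phrases"] / total
phrase_high = 100.0 * (total - keep) / total

print(total, keep_pct, phrase_low, phrase_high)
```

This gives 81% for the words to keep and a range of 11% to 19% for the
possible phrases, which the text rounds to "about 80%" and "10-20%".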
IV. The Other Direction
Now that there doesn't seem to be anything fundamentally wrong with the
pronunciation dictionary - at least for the category of common nouns, what
about those that are collected in the bilingual wordlists but missing from
the pronunciation dictionary? Since ce.v2 has too many 'non-words', only
ce.v1 is assessed.
A random sample of 100 items was collected. POS tagging was applied by hand
and all items were checked against CED. Here are the results:
* 21 foreign transliteration words (usually nominals), none of which is in
CED;
* 28 single characters - the total number of single characters collected in
ce.v1 is 6129 (note: the criterion for single characters in ma_lex.v03 is
that they are included only if they appear as words in the data);
* 2 with a wrong character in them;
* 1 with a traditional character;
* 32 phrases, 27 of which are NOT in CED and 5 of which are. A few of them
are new scientific expressions;
* 16 'true' words, 9 of which are in CED and 7 of which are not. In
addition, 7 out of the 16 words are formal, or new/scientific words. In
terms of POS, only 3 are verbs - all the rest are nouns.
Based on the above numbers, we can infer that nearly 50% of the ce.v1 items
missing from the pronunciation dictionary are either foreign
transliterations or single-character words. There's a small percentage
(about 3%) of items that have something wrong - such as the use of
incorrect characters. Only about 20% are real words and lexicalized phrases
by the CED standard. The rest are hard-to-decide phrases.
V. Concluding Remarks
It appears that the Internet CEDICT, upon which the first version of LDC's
Chinese-to-English wordlist is largely based, has a very different
conception of words and segmentation principles (if any) from that of
LDC's pronunciation dictionary.
First of all, in ce.v1, 6129 (out of 24298), or 25% are single
characters. By contrast, the pronunciation dictionary only has 3507 out of
44404 (or 3211 unique) single characters, less than 10%. Note that the
total number of Chinese characters in GB 2312-80 is 6763. So basically,
ce.v1 includes almost all GB-encoded characters. In fact, the original
CEDICT has 8833 double-byte characters, some of which cannot be displayed
under cxterm or mule on Unix.
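A rough way to reproduce this kind of count is sketched below. It assumes
the headwords are available as Unicode strings (the real files are
GB-encoded and would need decoding first); the toy entries and the function
names single_char_share and in_gb2312 are hypothetical. The second function
uses Python's gb2312 codec to test whether a character can be encoded in
GB 2312.

```python
def single_char_share(entries):
    """Return (count, percentage) of entries consisting of one character."""
    singles = [e for e in entries if len(e) == 1]
    return len(singles), 100.0 * len(singles) / len(entries)

def in_gb2312(ch):
    """True if `ch` can be encoded with Python's gb2312 codec."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False

# Toy headwords standing in for a real word list (hypothetical entries).
entries = ["中", "中国", "人", "词典"]
count, pct = single_char_share(entries)

# The shares reported in this note.
ce_v1_pct = 100.0 * 6129 / 24298     # single characters in ce.v1
ma_lex_pct = 100.0 * 3507 / 44404    # single characters in ma_lex.v03
print(count, pct, round(ce_v1_pct, 1), round(ma_lex_pct, 1))
```

A check like in_gb2312 would also flag entries written with traditional
characters that fall outside GB 2312.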
Secondly, assuming that an initial capital letter in a gloss indicates a
proper name (true in most cases but not all), there are about 4,000 (out
of 24298, i.e. 16.5%) such items. Most of these are transliterations of
foreign names.
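The capital-letter heuristic can be sketched as follows. This is only an
illustration: it assumes each entry has been parsed into a headword and a
list of English glosses, and that the first gloss is the one tested; the
actual file layout and the exact test used for the count above are not
described in this note. The entries and the name looks_like_proper_name are
hypothetical.

```python
def looks_like_proper_name(glosses):
    """Heuristic: the first English gloss starts with a capital letter."""
    return bool(glosses) and glosses[0][:1].isupper()

# Hypothetical entries: (headword, list of English glosses).
entries = [
    ("lun2dun1", ["London"]),
    ("shu1", ["book", "letter"]),
    ("ying1yu3", ["English (language)"]),
    ("pao3", ["to run"]),
]

flagged = [head for head, glosses in entries if looks_like_proper_name(glosses)]
share = 100.0 * len(flagged) / len(entries)

# The share reported in this note: about 4,000 out of 24298 entries.
reported_share = 100.0 * 4000 / 24298
print(flagged, share, round(reported_share, 1))
```

The second flagged entry shows why the heuristic holds in most cases but not
all: a capitalized gloss such as a language name is not necessarily a proper
name in the intended sense.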
These two types of entries alone account for about 40% of the collection in
ce.v1. By contrast, the same types of items in ma_lex.v03 account for only
about 17%.
There are some other interesting differences between ce.v1 and ma_lex.v03
as compared to CED. First of all, it seems that some contributors to the
Internet CEDICT (upon which ce.v1 is largely based) started the project by
'copying' CED. This is clearly shown by the observation that many words
that appear early in the alphabetical order of CEDICT have almost identical
translations to those in CED. But that's not the case for words appearing
later in the list.
To further compare both ce.v1 and ma_lex.v03 with CED, both dictionaries
are sorted in alphabetical order (with those that cannot be sorted by the
Unix command 'sort' truncated). The middle point of CED falls on Pinyin
"mian". For ce.v1, this point is "lai", but for ma_lex.v03, it's "man",
much closer to "mian".
Now compare the dictionaries with respect to their first quarter point. CED
falls on "gong", but ce.v1 falls on "dong" - separated from "gong" by the
"e-" and "f-" words. This indicates that CEDICT's collection leans toward
the beginning of the alphabetical order. However, for ma_lex.v03, this
point is also on "gong". What a coincidence!
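The middle and first-quarter points can be found by sorting the Pinyin keys
and taking the entries at the corresponding positions. A minimal sketch,
assuming every entry has a plain ASCII Pinyin key; the toy list and the
function name position_key are hypothetical, and Python's default string
ordering is used here rather than the Unix 'sort' command.

```python
def position_key(keys, fraction):
    """Return the key found at the given fraction of the sorted list."""
    ordered = sorted(keys)
    if not ordered:
        raise ValueError("empty key list")
    # Clamp the index so that fraction=1.0 still returns the last key.
    index = min(int(len(ordered) * fraction), len(ordered) - 1)
    return ordered[index]

# Toy Pinyin keys standing in for a sorted dictionary (hypothetical data).
keys = ["zhong", "a", "mian", "dong", "ba", "lai", "gong", "man"]

quarter_point = position_key(keys, 0.25)
middle_point = position_key(keys, 0.5)
print(quarter_point, middle_point)
```

A list whose entries cluster near the beginning of the alphabet will show
earlier keys at both positions than a list with a more even spread.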
It follows that ce.v1 has a distribution problem compared to CED.
In summary, while LDC's pronunciation dictionary suffers from a minor
problem of treating some phrases as words (the lexicon's size cannot really
be considered a problem because all entries are based on real data), the
bilingual list (v.1) suffers from a much more serious problem of coverage,
both in terms of POS and distribution (note: the original CEDICT has not
been expanded since last November). Although the second version is
significantly larger, the real coverage does not increase much, and it
introduces a new problem by including too many lengthy phrases (and even
brackets!).
Last updated Thu Sep 2 18:19:20 1999