
To: tdt-distrib@unagi.cis.upenn.edu
From: Shudong Huang <shudong@unagi.cis.upenn.edu>
Subject: Evaluation of Chinese-English Word Lists (Text)
Date: Thu, 2 Sep 1999 16:37:08 -0400 (EDT)

Dear All,

I'm sending you an evaluation of Chinese-English word lists compiled at LDC
and unofficially distributed to you through our website. This information
is much delayed, and I sincerely apologize for that.

Here's the plain text file.

Sincerely,

Shudong


===============================================================

Evaluation of LDC's Bilingual Dictionaries

This note evaluates the Chinese-English bilingual word lists collected by
LDC and unofficially distributed to TDT'ers for reference, which are
available at http://www.ldc.upenn.edu/Projects/Chinese. The goal is to
explore why there is such a large discrepancy in collected entries between
the pronunciation dictionary that comes from the hub4 and hub5 projects and the
dictionaries compiled from various resources (particularly the
Internet). Please send your comments and questions to
shudong@unagi.cis.upenn.edu.

I. Data Description

Three dictionaries are evaluated in this note. They are briefly described
in Table I.
Table I  Data Summary

Original File	Short Name	Total Number of
Name		Henceforth Used	Unique Entries	Description
============	=============== =============== ===========

ma_lex.v03	ma_lex.v03	43,968		Mandarin Pronunciation
						Lexicon (v.3), based on
						hand-segmented/corrected
						data of hub4 and hub5
------------	---------------	---------------	------------------------
ldc_ce_dict.	ce.v1		24,298		Chinese to English
1.0.gb						dictionary (v.1), based on
						CEDICT and LDC resources
------------	---------------	---------------	------------------------
ldc_ce_dict.	ce.v2		128,341		Chinese to English
2.0.txt						dictionary (v.2), based on
						ce.v1 and reversed English
						to Chinese dictionaries v.1
						and v.2

(Note: 1. For ma_lex.v03, words with different pronunciations are listed
separately. The total number of non-unique entries is 44404.
2. ldc_ce_dict.2.0.txt is not completely 'clean': its total number of
entries exceeds the number of unique entries by 25.)

II. Comparison

Ma_lex.v03 and ce.v1 have a total of 13,525 entries in common. Although
ce.v2 is significantly larger than ce.v1, the number of entries shared
between this 'huge' bilingual word list and the pronunciation dictionary
only increases to 19,409. In other words, there are still 24,559 items in
ma_lex.v03 (43,968 - 19,409) missing from ce.v2. That's over 50%.
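
The overlap figures above amount to a set intersection over headwords. A
minimal sketch, assuming each dictionary has already been reduced to a set
of unique headword strings (the function name and the toy data are
illustrative only, not LDC's actual tooling):

```python
def overlap_stats(lexicon, wordlist):
    """Return (shared, missing, missing_pct) for lexicon entries vs. a word list."""
    lexicon, wordlist = set(lexicon), set(wordlist)
    shared = lexicon & wordlist      # entries found in both
    missing = lexicon - wordlist     # lexicon entries absent from the word list
    missing_pct = 100.0 * len(missing) / len(lexicon) if lexicon else 0.0
    return len(shared), len(missing), missing_pct

# Toy data standing in for ma_lex.v03 and ce.v2 headwords (hypothetical).
toy_lexicon = {"a", "b", "c", "d"}
toy_wordlist = {"b", "d", "e"}
print(overlap_stats(toy_lexicon, toy_wordlist))

# The figures reported in this note: 43,968 unique entries, 19,409 shared.
reported_missing = 43968 - 19409
reported_pct = 100.0 * reported_missing / 43968
print(reported_missing, round(reported_pct, 1))
```

With the reported counts this gives 24,559 missing entries, or roughly 56%
of ma_lex.v03.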

Since ma_lex.v03 is larger than ce.v1, the first question that comes to
mind is whether there is anything wrong with ma_lex.v03: for example,
whether too many of the items not shared with ce.v1 are proper names, or
phrases that should be further segmented. Since ma_lex.v03 already has
part-of-speech (POS) information, the easiest thing to do is simply to
compare the POS distribution of the missing words with the overall POS
distribution of ma_lex.v03. The results are in Table II. (POS information
is not available for either ce.v1 or ce.v2.)
Table II Distribution of ma_lex.v03 words missing from ce.v2 in
         comparison with the overall distribution of ma_lex.v03,
         both in descending order of word count

Missing Words		All Words
====================	=====================
noun		13515	noun		23428
verb		4927	verb		10313
name		2058	adj		3365
adj		1354	name		2419
phrase		1014	phrase		1232
adv		580	for_name	1123
for_name	470	adv		1113
adj_r		99	number		197
verb_r		94	pro		149
pro		85	class		146
acronym		71	conj		114
conj		50	adj_r		112
onom		49	surname		107
surname_affix	47	verb_r		97
number		46	onom		95
name_affix	43	acronym		94
class		31	interj		64
number_class	21	surname_affix	49
name_seg	20	name_affix	44
interj		17	prep		42
adv_r		14	number_class	31
surname		12	name_seg	20
prep		11	part_final	16
part_final	6	adv_r		15
part		2	part_struc	7
class_r		1	affix		4
part_struc	1	part_asp	4
affix		0	part		3
part_asp	0	class_r		1


(Note: 1. For an explanation of terms used here, please consult
documentation that comes with the released pronunciation dictionary.
2. For words with more than one POS tag, the first one is used here
as it was considered to be the primary category when tags were added.)
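
The tallying rule in note 2 can be sketched as follows. This is only an
illustration: it assumes each entry is a (word, tag string) pair and that
multiple tags are separated by a slash. The separator and the toy entries
are assumptions, since the actual file layout is described only in the
dictionary's own documentation.

```python
from collections import Counter

def primary_pos(tags, sep="/"):
    """Return the first POS tag, treated as the primary category."""
    return tags.split(sep)[0].strip()

def pos_distribution(entries, sep="/"):
    """Count entries by primary POS, most frequent first."""
    counts = Counter(primary_pos(tags, sep) for _word, tags in entries)
    return counts.most_common()

# Hypothetical entries; "w1" carries two tags, so only "noun" is counted.
toy_entries = [("w1", "noun/verb"), ("w2", "verb"), ("w3", "noun"), ("w4", "adj/noun")]
print(pos_distribution(toy_entries))
```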

A quick look at the above table suggests that the number of missing words
for a particular category is more or less proportional to the number of
words in ma_lex.v03 for that category. For instance, noun and verb rank
first and second in both lists.

There is, however, a certain degree of disproportion. This is best seen in
Table III, which shows what percentage of the words in ma_lex.v03 is
missing for each POS category.
Table III	Missing Words over All Words in ma_lex.v03
		in Descending Order of Missing/All (%)

POS		Missing	All	Missing/All (%)
============	=======	===	===============
class_r		1	1	100
name_seg	20	20	100
name_affix	43	44	98
verb_r		94	97	97
surname_affix	47	49	96
adv_r		14	15	93
adj_r		99	112	88
name		2058	2419	85	!
phrase		1014	1232	82	!
acronym		71	94	76
number_class	21	31	68
part		2	3	67
noun		13515	23428	58	!
pro		85	149	57
adv		580	1113	52
onom		49	95	52
verb		4927	10313	48
conj		50	114	44
for_name	470	1123	42
adj		1354	3365	40
part_final	6	16	38
interj		17	64	27
prep		11	42	26
number		46	197	23
class		31	146	21
part_struc	1	7	14
surname		12	107	11
affix		0	4	0
part_asp	0	4	0
Total		24638	44404	55
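
The percentages in Table III are simply missing/all for each category. A
short sketch that recomputes a few rows from the Table II counts and
compares them with the overall rate (the variable and function names are
illustrative):

```python
# (missing, all) counts copied from Table II for a few categories.
counts = {
    "name": (2058, 2419),
    "phrase": (1014, 1232),
    "noun": (13515, 23428),
    "verb": (4927, 10313),
    "for_name": (470, 1123),
    "surname": (12, 107),
}
TOTAL = (24638, 44404)

def missing_pct(missing, total):
    """Percentage of a category's words that are missing from ce.v2."""
    return 100.0 * missing / total

overall = missing_pct(*TOTAL)
# Rank categories by missing percentage, highest first.
ranked = sorted(counts.items(), key=lambda kv: missing_pct(*kv[1]), reverse=True)
for pos, (missing, total) in ranked:
    pct = missing_pct(missing, total)
    side = "above" if pct > overall else "below"
    print(f"{pos:10s} {missing:6d} {total:6d} {pct:5.1f}  {side} overall ({overall:.1f})")
```

Of these six categories, name, phrase and noun come out above the overall
rate, matching the rows flagged with '!' in Table III.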

At the top of the list are some minor categories with a small number of
words, most of which are reduplicated words. According to the
segmentation guidelines for the pronunciation dictionary, certain types of
reduplicated words are not segmented.

Along with reduplicated words at the top are name-related words. This is
to be expected: whenever names are collected in a lexicon, significant
differences may exist among dictionaries depending on the source data and
collection criteria.

Note that foreign names score relatively low on the list (lower than the
overall percentage). This reflects a very important distinction between
Chinese and foreign (Western) names - there's no limit on the former, but
the latter are more or less of a closed set, despite regional differences
of transliteration.

The next category showing a high degree of disproportion is phrases; as
many as 82% of the phrases in ma_lex.v03 are not found in ce.v2. Obviously
the two dictionaries have very different criteria for collecting or
defining phrases. Fortunately, phrases account for less than 3% of the
entries in ma_lex.v03.

The major POS category with a higher-than-overall missing percentage is
(common) noun, though it is not much higher than the overall percentage.
This category deserves special attention, as noun is the biggest POS
category and over 50% of the missing words are nouns.


III. Missing Nouns - A Case Study

For reference, the medium-sized Chinese-English Dictionary, published by
the Commercial Press of mainland China in 1980, is used. It has been
regarded as a more or less classic and standard translation
dictionary. Although it is nearly two decades old and may lack many words
that reflect more recent social, political, scientific and technological
developments, it is still widely used and serves as a good reference. This
dictionary will be referred to as CED.

CED collects over 6,000 individual characters and over 50,000 multiple
character words. Compounds and lexicalized phrases (excluding 4-character
idioms and proverbs) are not entered as individual lexical items but are
listed after the head (or the first element).

A random sample of 100 missing nouns was examined in detail. Among them,
47 appear in CED as individual words and 12 as compounds/lexicalized
phrases. If CED is taken as the standard, these 59 items should be included
in a medium-sized lexicon.

22 of the remaining 41 items may also be treated as words, even though they
are not in CED. The details are as follows:
* 2 have an -r suffix;
* 1 has a -zi suffix;
* 3 have an affix other than -r or -zi;
* 3 are somewhat technical;
* 2 are somewhat colloquial/dialectal;
* 11 are regular words - some represent relatively new concepts (such as
  evening university) but a couple have rather vague meanings unless
  contextualized.

The -r suffix is a bit tricky, though. In most cases, it does not change the
meaning of the stem to which it's attached. However, it has some
phonological/phonetic effects - complicated though often predictable - on
the pronunciation of the whole string. For this reason, words with an -r
suffix are included in LDC's pronunciation dictionary.

Of the final 19 items, 11 have 4 or more characters and, by some
criteria of wordhood, should perhaps be more properly tagged as phrases
rather than simply nouns (unless of course the expression is an idiom). The
wordhood of the rest, 8 in total, is much harder to decide,
e.g. jin1kuang4, meaning 'gold ore' (as against, say, jin1bi4, meaning
'gold coin').

Therefore, as far as nouns are concerned and based on the random sample,
about 80% of those missing from ce.v2 (the 59 found in CED plus the 22
other words) should be included in a medium-sized lexicon; only 10-20%
(the remaining 11 to 19 items) may be treated as phrases and hence
excluded from the lexicon if stricter segmentation principles are to be
followed.
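
The arithmetic behind these estimates can be checked with a small tally of
the sample counts given in this section (the category labels are
paraphrases, not terms from the dictionary documentation):

```python
# Counts from the random sample of 100 missing nouns.
noun_sample = {
    "in CED as individual words": 47,
    "in CED as compounds/lexicalized phrases": 12,
    "words by other criteria, not in CED": 22,
    "4+ characters, arguably phrases": 11,
    "wordhood hard to decide": 8,
}
sample_size = sum(noun_sample.values())

# Items that should be in a medium-sized lexicon: CED items plus other words.
keep = (noun_sample["in CED as individual words"]
        + noun_sample["in CED as compounds/lexicalized phrases"]
        + noun_sample["words by other criteria, not in CED"])

# Lower bound counts only the long items; upper bound adds the undecided ones.
phrase_low = noun_sample["4+ characters, arguably phrases"]
phrase_high = phrase_low + noun_sample["wordhood hard to decide"]

print(f"keep: {100 * keep / sample_size:.0f}%")
print(f"possible phrases: {100 * phrase_low / sample_size:.0f}-"
      f"{100 * phrase_high / sample_size:.0f}%")
```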

IV. The Other Direction

Now that there doesn't seem to be anything fundamentally wrong with the
pronunciation dictionary (at least for the category of common nouns), what
about the items that are collected in the bilingual wordlists but missing
from the pronunciation dictionary? Since ce.v2 has too many 'non-words',
only ce.v1 is assessed.

A random sample of 100 items was collected. POS tagging was applied by hand
and all items were checked against CED. Here are the results:
* 21 foreign transliteration words (usually nominals), none of which is in
   CED; 
* 28 single characters - the total number of single characters collected in
  ce.v1 is 6129 (note: the criterion for single characters in ma_lex.v03 is
  that they are included only if they appear as words in the data); 
* 2 with a wrong character in them;
* 1 with a traditional character;
* 32 phrases, 27 of which are NOT in CED, 5 of which are. A few of them are
  new scientific expressions; 
* 16 'true' words, 9 of which are in CED and 7 of which are not. In
  addition, 7 out of the 16 words are formal, or new/scientific words. In
  terms of POS, only 3 are verbs - all the rest are nouns.

Based on the above numbers, we can infer that nearly 50% of the ce.v1
entries missing from the pronunciation dictionary are either foreign
transliterations or single-character words. A small percentage (about 3%)
of the items have something wrong with them, such as the use of incorrect
characters. Only about 20% are real words and lexicalized phrases by the
CED standard. The rest are hard-to-decide phrases.
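
These shares follow from the sample counts listed above. The sketch below
tallies them; note that the grouping used for "real words and lexicalized
phrases" (all 16 true words plus the 5 phrases listed in CED) is an
inference that happens to reproduce the "about 20%" figure, and is not
spelled out in the note itself.

```python
# Counts from the 100-item sample of ce.v1 entries absent from ma_lex.v03.
reverse_sample = {
    "foreign transliteration": 21,
    "single character": 28,
    "wrong character": 2,
    "traditional character": 1,
    "phrase in CED": 5,
    "phrase not in CED": 27,
    "true word in CED": 9,
    "true word not in CED": 7,
}
reverse_total = sum(reverse_sample.values())

foreign_or_single = (reverse_sample["foreign transliteration"]
                     + reverse_sample["single character"])
faulty = reverse_sample["wrong character"] + reverse_sample["traditional character"]
# Assumed grouping: every 'true' word plus the phrases that CED lists.
words_and_ced_phrases = (reverse_sample["true word in CED"]
                         + reverse_sample["true word not in CED"]
                         + reverse_sample["phrase in CED"])
undecided_phrases = reverse_sample["phrase not in CED"]

for label, n in [("foreign or single character", foreign_or_single),
                 ("faulty entries", faulty),
                 ("words and CED phrases", words_and_ced_phrases),
                 ("hard-to-decide phrases", undecided_phrases)]:
    print(f"{label:28s} {100 * n / reverse_total:.0f}%")
```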

V. Concluding Remarks

It appears that the Internet CEDICT, upon which the first version of LDC's
Chinese-to-English wordlist is largely based, has a very different
conception of words and segmentation principles (if any) from LDC's
pronunciation dictionary.

First of all, in ce.v1, 6129 entries (out of 24298), or 25%, are single
characters. By contrast, the pronunciation dictionary has only 3507 out of
44404 (or 3211 unique) single characters, less than 10%. Note that the
total number of Chinese characters in GB 2312-80 is 6763. So basically,
ce.v1 includes almost all GB-encoded characters. In fact, the original
CEDICT has 8833 double-byte characters, some of which cannot be displayed
under cxterm or mule on Unix.

Secondly, assuming that an initial capital letter in a gloss indicates a
proper name (true in most cases but not all), there are about
4,000 (out of 24298, i.e. 16.5%) such items. Most of these are
transliterations of foreign names.

These two types of entries alone account for about 40% of the collection in
ce.v1. By contrast, the same types of items in ma_lex.v03 account for only
about 17%.
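
The two measures can be sketched as a simple profile over (headword, gloss)
pairs. This is a rough illustration: the pair layout, the function name and
the toy entries are assumptions, and the capital-letter test is only the
heuristic described above, so the two categories may overlap.

```python
def entry_profile(entries):
    """Return percentage shares of single-character and capitalized-gloss entries."""
    entries = list(entries)
    n = len(entries)
    if n == 0:
        return {"single_char_pct": 0.0, "capitalized_pct": 0.0}
    single = sum(1 for head, _gloss in entries if len(head) == 1)
    # Heuristic: a gloss that starts with a capital letter suggests a proper name.
    capitalized = sum(1 for _head, gloss in entries if gloss[:1].isupper())
    return {"single_char_pct": 100.0 * single / n,
            "capitalized_pct": 100.0 * capitalized / n}

# Hypothetical entries, not taken from ce.v1.
toy_pairs = [("人", "person"), ("北京", "Beijing"), ("中国", "China"), ("学习", "to study")]
print(entry_profile(toy_pairs))

# Shares computed from the counts reported in this note for ce.v1.
single_pct = 100.0 * 6129 / 24298
proper_pct = 100.0 * 4000 / 24298
print(round(single_pct, 1), round(proper_pct, 1), round(single_pct + proper_pct, 1))
```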

There are some other interesting differences between ce.v1 and ma_lex.v03
as compared to CED. First of all, it seems that some contributors to the
Internet CEDICT (upon which ce.v1 is largely based) started the project by
'copying' CED. This is clearly shown by the observation that many words
that appear early in the alphabetical order of CEDICT have translations
almost identical to those in CED. But that's not the case for words
appearing later in the list.

To further compare both ce.v1 and ma_lex.v03 with CED, both dictionaries
are sorted in alphabetical order (with entries that cannot be sorted by the
Unix command 'sort' removed). The middle point of CED falls on Pinyin
"mian". For ce.v1, this point is "lai", but for ma_lex.v03, it's "man",
much closer to "mian".

Now compare the dictionaries with respect to their first quarter point. CED
falls on "gong", but ce.v1 falls on "dong", separated from "gong" by the
"e-" and "f-" words. This indicates that CEDICT's collection leans toward
the beginning of the alphabetical order. However, for ma_lex.v03, this
point is also on "gong". What a coincidence!

It follows that ce.v1 has a distribution problem compared to CED.
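
The midpoint and quarter-point comparison can be sketched as picking the
key at a given fraction of the sorted list. This is a minimal illustration
with invented Pinyin keys; it uses Python's default string ordering, which
may differ from the collation of the Unix 'sort' used for the figures
above.

```python
def quantile_key(keys, fraction):
    """Return the key sitting at the given fraction of the sorted list."""
    ordered = sorted(keys)
    if not ordered:
        raise ValueError("empty key list")
    index = min(int(len(ordered) * fraction), len(ordered) - 1)
    return ordered[index]

# Invented key lists: one spread evenly, one leaning toward early letters.
balanced = ["ai", "dong", "gong", "lai", "mian", "pin", "shi", "zhong"]
front_loaded = ["ai", "ba", "chang", "dong", "fei", "lai", "mian", "zhong"]

for name, keys in [("balanced", balanced), ("front_loaded", front_loaded)]:
    print(name, quantile_key(keys, 0.25), quantile_key(keys, 0.5))
```

A list that leans toward the beginning of the alphabet reaches its quarter
and middle points at earlier keys, which is the pattern reported for ce.v1.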

In summary, while LDC's pronunciation dictionary suffers from a minor
problem of treating some phrases as words (the lexicon's size cannot really
be considered a problem because all entries are based on real data), the
bilingual list (v.1) suffers from a much more serious problem of coverage,
both in terms of POS and distribution (note: the original CEDICT has not
been expanded since last November). Although the second version is
significantly larger, its real coverage does not increase much, and it
introduces a new problem by including too many lengthy phrases (and even
brackets!).


Last updated Thu Sep 2 18:19:20 1999