LDC has created a list of Chinese characters paired with their Pinyin representations. A character with more than one Pinyin representation is listed on a separate line for each one. The list contains 7,809 entries.
You can view the list on-line or download the zipped version.
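Because a character with several readings occupies several lines, a program reading the list usually has to group the lines by character. The following is a minimal sketch of that grouping. It assumes each line holds a character and one Pinyin string separated by whitespace, and the tone-numbered spellings in the sample are illustrative only; the actual file layout may differ. The name `load_pinyin_list` is hypothetical.

```python
from collections import defaultdict

def load_pinyin_list(lines):
    """Group Pinyin readings by character.

    Assumed layout (not confirmed by the list itself): one
    'character<whitespace>pinyin' pair per line.
    """
    readings = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        char, pinyin = parts[0], parts[1]
        if pinyin not in readings[char]:
            readings[char].append(pinyin)
    return dict(readings)

# Illustrative sample lines, not taken from the LDC list.
sample = ["行 xing2", "行 hang2", "好 hao3"]
table = load_pinyin_list(sample)
print(table)
```

Keeping the readings in a list preserves the order in which they appear in the file.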
LDC has prepared a pair of bilingual glossaries, Chinese-to-English and
English-to-Chinese, each of which consists of a list of words in the source language,
paired with a list of sets of words in the target language. They have been compiled from a
set of diverse resources, partly LDC-internal but mainly from the Internet. Some of the
materials from the Internet come with copyright restrictions, which we have retained in
the compilation. Please read the copyright notice.
If you know of other resources that we can incorporate, please contact us.
Two versions are currently available. The second version of the Chinese-to-English glossary is recommended for reference only! Many of its added entries were produced by reversing the English-to-Chinese wordlist, so they are Chinese phrases rather than true words.
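The effect of reversing a wordlist can be seen in a small sketch. It assumes a glossary is held as a mapping from a source word to a list of target words; the entries and the name `reverse_glossary` are made up for illustration and are not taken from the LDC files.

```python
from collections import defaultdict

def reverse_glossary(glossary):
    """Turn a source-to-targets mapping into a target-to-sources mapping."""
    reversed_entries = defaultdict(list)
    for source, targets in glossary.items():
        for target in targets:
            if source not in reversed_entries[target]:
                reversed_entries[target].append(source)
    return dict(reversed_entries)

# Hypothetical English-to-Chinese entries.
english_to_chinese = {
    "computer": ["电脑", "计算机"],
    "overeat": ["吃得过多"],  # the Chinese side is a phrase, not a single word
}

chinese_to_english = reverse_glossary(english_to_chinese)
print(chinese_to_english)
```

After reversal, the phrase 吃得过多 becomes a Chinese headword even though it is not a lexical word, which is the kind of entry the note above warns about.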
Click on the following links to view the files on-line. (The Chinese characters are encoded in GB.)
You can also download all the files in a single compressed package by clicking here.
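Since the files are GB-encoded, they have to be decoded before the Chinese text can be processed as characters. The sketch below assumes that "GB" means GB2312 and decodes with Python's `gb18030` codec, which is a superset that also accepts GB2312 bytes. The sample bytes are built in the code itself and do not come from the glossary files; `decode_gb` is a hypothetical helper name.

```python
def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes into a Python string.

    Assumption: the files use GB2312; the gb18030 codec is used here
    because it is a superset and tolerates GBK extensions as well.
    """
    return raw.decode("gb18030")

# Build a small GB2312 byte string to stand in for file contents.
raw = "中文 Chinese".encode("gb2312")
text = decode_gb(raw)

# Re-encode as UTF-8, for example before saving a converted copy.
utf8_bytes = text.encode("utf-8")
print(text, len(raw), len(utf8_bytes))
```

For a real file, the same idea applies with `open(path, encoding="gb18030")`, where `path` is wherever the downloaded file was saved.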
Zhibiao Wu wrote a Chinese segmenter, a 288-line Perl script, which is now available for download.
To run this script, you should also download a frequency dictionary into the same directory.
Tip: to download the above files instead of opening them in a browser, right-click on the link and choose "Save Link As ..." in Netscape or "Save Target As ..." in Internet Explorer.
You may click here to download the whole compressed package, including a readme file.
You may click here to download the newest version (1.2), which works with Perl 5 and supports UTF-8-encoded Chinese text.
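To illustrate why a segmenter needs a frequency dictionary, here is a generic sketch of frequency-based segmentation by dynamic programming. It is not the Perl script described above and is not claimed to match its method. The function `segment`, the `max_word_len` limit, the fallback score for unknown characters, and the toy counts are all assumptions made for this example.

```python
import math

def segment(text, freq, max_word_len=4):
    """Split text into the word sequence with the highest total log-frequency."""
    total = sum(freq.values())
    # Unknown single characters get a small fallback score (an assumption),
    # so that any input can still be segmented.
    unknown_logp = math.log(0.5 / total)
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                logp = math.log(freq[word] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue  # multi-character strings must be in the dictionary
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy counts, invented for illustration.
toy_freq = {"北京": 50, "大学": 40, "北京大学": 30, "大学生": 20,
            "生": 5, "北": 3, "京": 2, "大": 10, "学": 8}
result = segment("北京大学生", toy_freq)
print(result)
```

With these toy counts the split 北京 / 大学生 scores higher than 北京大学 / 生, because both of its words are relatively frequent. A real dictionary would give different scores and possibly different splits.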