
To: tdt-distrib@ldc.upenn.edu
From: Xiaoyi Ma <xma@unagi.cis.upenn.edu>
Subject: parallel news released by Hong Kong Government
Date: Wed, 13 Oct 1999 11:28:36 -0400 (EDT)

TDTers,

I've put the parallel Hong Kong News onto our members-only ftp site.

[For ftp instructions, contact Dave Graff.]

The compressed tar file is 13,514,831 bytes.
Please untar the file and read the README file first.

Strictly speaking, the corpus is Cantonese-English parallel. Please note
that there ARE differences between Cantonese and Chinese.

The Cantonese part of the corpus is encoded in BIG5, which is widely used
in Taiwan and Hong Kong. For those of you who want to convert BIG5 to GB,
you can find converters at
ftp://ftp.ifcss.org/pub/software/unix/convert/
BeTTY-1.53, EHZ-2.0 and hc-30 can convert between Big5 and GB.

It's usually considered "safe" to convert Big5 to GB. However, the
"translation" has never been perfect and will never be perfect in either
direction, because neither the GB character set nor the BIG5 character set
is a subset of the other. BIG5 contains thousands of Han characters not
defined in GB, while GB contains a few Han characters and a number of
symbols not defined in BIG5 or its vendor-specific variants. Currently,
the conversion software replaces such "untranslatable" characters with a
placeholder, for example the "white-square" symbol. Although untranslatable
characters occur rarely in practice, this situation is nevertheless not
fully satisfactory.
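
To illustrate the placeholder behaviour described above, here is a minimal
Python sketch. It is not how BeTTY, EHZ or hc work internally. It only
re-encodes each character through Python's standard "big5" and "gb2312"
codecs and does no traditional-to-simplified mapping, so many more
characters come out as placeholders than with a real converter. The
function name big5_to_gb and the choice of U+25A1 as the white square are
assumptions made for this example.

```python
# Sketch only: re-encode Big5 bytes as GB2312, replacing every character
# that GB2312 cannot represent with a white square. Unlike a real
# converter, this does NOT map traditional forms to simplified forms.

WHITE_SQUARE = "\u25a1"  # assumed placeholder; it is defined in GB2312


def big5_to_gb(data: bytes) -> tuple[bytes, int]:
    """Return (GB2312 bytes, number of characters replaced)."""
    text = data.decode("big5")  # raises UnicodeDecodeError on bad input
    out = []
    replaced = 0
    for ch in text:
        try:
            ch.encode("gb2312")  # probe: is this character in GB2312?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(WHITE_SQUARE)
            replaced += 1
    return "".join(out).encode("gb2312"), replaced


if __name__ == "__main__":
    # The second character below is a traditional form absent from GB2312.
    sample = "\u4e2d\u570b".encode("big5")
    gb_bytes, n = big5_to_gb(sample)
    print(gb_bytes.decode("gb2312"), n)
```

The returned count gives a rough idea of how often untranslatable
characters turn up in a given file.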

Please feel free to contact me if you encounter any problems.


Best,
Xiaoyi

Last updated Tue Oct 19 10:10:09 1999