Before you download the corpora, please read the README file and the COPYRIGHT notice. The compressed file is about 16.5 MB and expands to about 61 MB of files after decompression. WinZip 7.0 can decompress the file automatically on a Windows-based PC.
Thanks to the Government of the Hong Kong Special Administrative Region (HKSAR) for making
the bilingual laws of the HKSAR available to the public for research and educational
purposes. The Linguistic Data Consortium retrieved all the laws of the HKSAR from the
website of the Department of Justice of the HKSAR at http://www.justice.gov.hk/ during
January 1999. The retrieved files were cleaned up and aligned at the sentence level.
COPYRIGHT:
----------
The corpora are available for research and educational purposes only. Please read the
copyright notice before you proceed.
ENCODING:
---------
The Chinese part of the corpora is encoded in Big5. However, the Hong Kong Special
Administrative Region Government has defined more than 3,000 user-defined characters, and
it is not clear how many of them are actually used in the corpora. For more information, see http://www.info.gov.hk/gccs/ .
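Because the number of user-defined characters in the data is unknown, it may help to decode
the Chinese files defensively and count what does not map. The following is a minimal
sketch, assuming Python 3 and its built-in "big5" codec; the function name decode_big5 is
illustrative and not part of this distribution. Python also ships a "big5hkscs" codec, but
whether it covers the user-defined characters in these files is not established here.

```python
def decode_big5(raw):
    """Decode Big5 bytes and report how many spots could not be decoded.

    Bytes that the codec cannot map (for example, user-defined characters)
    are replaced with U+FFFD instead of raising an error.
    """
    text = raw.decode("big5", errors="replace")
    # Each undecodable sequence becomes one or more U+FFFD characters.
    return text, text.count("\ufffd")
```

For example, reading hklaws.00.c in binary mode and passing its contents to decode_big5
returns the decoded text together with a rough count of characters that fell outside the
codec's table.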
STRUCTURE OF THE DATA:
----------------------
The corpora are organized as a series of sentences identified by sequential numeric
indices, with paired files in English and Chinese containing the corresponding sets of
sentences. The first sentence index (in files hklaws.00.e and hklaws.00.c) is "1", and
the last sentence index (in files hklaws.23.e and hklaws.23.c) is "238271". The sentence
numbering establishes the parallelism: two sentences having the same number are purported
to be parallel in content.
Each file contains up to 10,000 sentences. The two digits in the file name are the first
two digits of the sentence indices in that file, with each index read as a six-digit
number. The first 10 file names therefore carry a leading zero, although the sentence
indices inside the files are not zero-padded.
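The naming rule can be sketched as a small lookup from sentence index to file name. This is
an illustration only: the helper file_stem_for is hypothetical, and the exact file
boundaries (index 10000 starting file 01, and so on) are an assumption inferred from the
description above rather than checked against the data.

```python
FIRST_INDEX = 1
LAST_INDEX = 238271  # last sentence index stated in this README


def file_stem_for(index):
    """Return the file name stem (without .e or .c) expected to hold `index`.

    Assumes the two digits are the index divided by 10,000, zero-padded,
    which matches index 1 in hklaws.00 and index 238271 in hklaws.23.
    """
    if not FIRST_INDEX <= index <= LAST_INDEX:
        raise ValueError(f"sentence index out of range: {index}")
    return f"hklaws.{index // 10000:02d}"
```

Appending ".e" or ".c" to the returned stem gives the name of one file of the pair.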
Each line of each file begins with an SGML tag of the form:
<s id=#>
where "#" represents a one- to six-digit sentence index number. Following the
closing angle bracket is a space character and then the sentence itself. Each sentence is
terminated by a single line-feed character.
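The line format above lends itself to a simple parser. The sketch below, in Python 3, reads
the tag, the single space, and the sentence, and then pairs sentences that share an index.
The names parse_line and pair_sentences are illustrative assumptions, not tools shipped
with the data, and the lines are expected to be already decoded to text.

```python
import re

# "<s id=#>" with a one- to six-digit index, one space, then the sentence.
LINE_RE = re.compile(r"<s id=(\d{1,6})> (.*)")


def parse_line(line):
    """Split one line into (sentence_index, sentence_text)."""
    match = LINE_RE.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    return int(match.group(1)), match.group(2)


def pair_sentences(english_lines, chinese_lines):
    """Yield (index, english, chinese) for indices present on both sides."""
    chinese = dict(parse_line(line) for line in chinese_lines)
    for line in english_lines:
        index, english = parse_line(line)
        if index in chinese:
            yield index, english, chinese[index]
```

Matching on the index rather than on line position keeps the pairing correct even if one
side happens to be missing a sentence.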
LICENSE STATEMENT for the laws, press releases and news of the Hong Kong Special
Administrative Region (HKSAR).
Copyright (C) 1999, The Government of the Hong Kong Special Administrative Region (HKSAR)
This license statement and copyright notice applies to the laws, press releases and news
of the Hong Kong Special Administrative Region.
COPYING AND DISTRIBUTION
Permission is granted to the Linguistic Data Consortium to make and distribute copies of
the laws, press releases and news of Hong Kong Special Administrative Region provided this
copyright notice and permission notice are distributed with all copies.
USAGE
The permission has been given to reproduce and include the laws of Hong Kong and the press
releases and/or news items on the Hong Kong Special Administrative Region Government
website for research and educational purposes.
The permission is for the mentioned purposes only and prior permission must be sought from
"The Government of the Hong Kong Special Administrative Region" if the materials
are to be used for any other purposes.
The files, extracts from the files, and translations of the files must not be sold as part
of any commercial software package, nor must they be incorporated in any printed document
without the specific permission of the copyright holders.
COPYRIGHT
Copyright over the documents covered by this statement is held by The Government of the
Hong Kong Special Administrative Region.