Before you download the corpora, please read the README file and the COPYRIGHT notice. The compressed file is about 16.5 MB and expands to about 61 MB of files after decompression. WinZip 7.0 can automatically decompress the file on a Windows-based PC.

README

Thanks to the Government of the Hong Kong Special Administrative Region (HKSAR) for making the bilingual laws of the HKSAR available to the public for research and educational purposes.

The Linguistic Data Consortium retrieved all the laws of the HKSAR from the website of the Department of Justice of the HKSAR at http://www.justice.gov.hk/ during January 1999. The retrieved files were cleaned up and aligned at the sentence level.

COPYRIGHT:
----------
The corpora are available for research and educational purposes only. Please read the copyright notice before you proceed.

ENCODING:
---------
The Chinese part of the corpora is encoded in Big5. However, the Hong Kong Special Administrative Region Government has defined more than 3,000 user-defined characters, and it is not clear how many of them are actually used in the corpora. For more information, see http://www.info.gov.hk/gccs/ .
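Because some user-defined characters may have no mapping in a standard Big5 decoder, it can help to count how many byte sequences fail to decode. The following is a minimal Python sketch, not part of the corpus distribution; the function name decode_big5 is hypothetical, and it assumes that replacing undecodable sequences with U+FFFD is acceptable for your purposes.

```python
from typing import Tuple


def decode_big5(raw: bytes, codec: str = "big5") -> Tuple[str, int]:
    """Decode Big5 bytes; return the text and a count of replaced sequences.

    Undecodable sequences (for example, user-defined characters that the
    codec does not know) become U+FFFD instead of raising an error.
    """
    text = raw.decode(codec, errors="replace")
    # Each U+FFFD marks one spot the codec could not decode.
    return text, text.count("\ufffd")
```

Python also ships a "big5hkscs" codec, which can be passed as the codec argument. Whether it covers the user-defined characters that occur in these files is not known and would need to be checked against the data.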

STRUCTURE OF THE DATA:
----------------------
The corpora are organized as a series of sentences, identified by sequential numeric indices, with parallel files in English and Chinese containing the corresponding sets of sentences. The first sentence index (in files hklaws.00.e and hklaws.00.c) is "1", and the last sentence index (in files hklaws.23.e and hklaws.23.c) is "238271". The sentence numbering establishes the parallelism: two sentences with the same number are purported to be parallel in content.

Each file contains up to 10,000 sentences. The two digits in the file name are the first two digits of the sentence indices contained in the file, when those indices are written with six digits. (Leading zeros are applied for the first 10 file names, though these leading zeros are not used in the sentence indexing within the files.)
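The naming rule can be sketched in Python as follows. This is an illustration only: the function name file_names is hypothetical, and it assumes that file NN holds exactly the indices from NN * 10000 to NN * 10000 + 9999, which is consistent with the first and last indices given above but is not otherwise stated here.

```python
from typing import Tuple

SENTENCES_PER_FILE = 10_000
LAST_INDEX = 238_271  # last sentence index, per the README


def file_names(index: int) -> Tuple[str, str]:
    """Return the (English, Chinese) file names expected to hold a sentence."""
    if not 1 <= index <= LAST_INDEX:
        raise ValueError(f"sentence index out of range: {index}")
    # Integer division gives the two-digit file number; pad to two digits.
    stem = f"hklaws.{index // SENTENCES_PER_FILE:02d}"
    return stem + ".e", stem + ".c"
```

For example, file_names(1) gives the pair hklaws.00.e and hklaws.00.c, and file_names(238271) gives hklaws.23.e and hklaws.23.c.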

Each line of each file begins with an SGML tag of the form:

<s id=#>

where "#" represents a one- to six-digit sentence index number. Following the closing angle bracket is a space character and then the sentence itself. Each sentence is terminated by a single line-feed character.
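The line format described above can be parsed with a short regular expression. The Python sketch below is one possible approach, not a tool shipped with the corpora; the names parse_line and align are hypothetical, and the lines are assumed to have been decoded to text already.

```python
import re
from typing import Iterable, Iterator, Tuple

# "<s id=#>", one space, then the sentence; "#" is one to six digits.
LINE_RE = re.compile(r"^<s id=(\d{1,6})> (.*)$")


def parse_line(line: str) -> Tuple[int, str]:
    """Split one corpus line into (sentence index, sentence text)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    return int(match.group(1)), match.group(2)


def align(english: Iterable[str],
          chinese: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (index, English, Chinese) for indices present on both sides."""
    chinese_by_index = dict(parse_line(line) for line in chinese)
    for line in english:
        index, sentence = parse_line(line)
        if index in chinese_by_index:
            yield index, sentence, chinese_by_index[index]
```

Pairing by index rather than by line position means that a sentence missing from one side is skipped instead of shifting every later pair.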

COPYRIGHT

LICENSE STATEMENT for the laws, press releases and news of the Hong Kong Special Administrative Region (HKSAR).

Copyright (C) 1999, The Government of the Hong Kong Special Administrative Region (HKSAR)

This license statement and copyright notice applies to the laws, press releases and news of the Hong Kong Special Administrative Region.

COPYING AND DISTRIBUTION

Permission is granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region provided this copyright notice and permission notice are distributed with all copies.

USAGE

The permission has been given to reproduce and include the laws of Hong Kong and the press releases and/or news items on the Hong Kong Special Administrative Region Government website for research and educational purposes.

The permission is for the mentioned purposes only and prior permission must be sought from "The Government of the Hong Kong Special Administrative Region" if the materials are to be used for any other purposes.

The files, extracts from the files, and translations of the files must not be sold as part of any commercial software package, nor must they be incorporated in any printed document without the specific permission of the copyright holders.

COPYRIGHT

Copyright over the documents covered by this statement is held by The Government of the Hong Kong Special Administrative Region.
