Frequently Asked Questions


1. How do I get GALE data?


There are two types of GALE data: corpora created specifically for the program, and previously-released corpora that have been designated as GALE-relevant. New and planned GALE releases (kickoff and quarterly) are described on the GALE data matrix.  Authorized GALE sites will automatically receive copies of these releases once their appointed data contact person has signed the GALE user agreement and provided contact information.  Previously-released corpora that have been designated as GALE-relevant are described on the GALE catalog query page.  Authorized GALE sites should follow instructions on that page for requesting corpora.

2. How will data be distributed?

GALE kickoff and quarterly releases will be distributed in one of two ways, depending on the type of data.  Text and annotations will be distributed via web download.  On the release date, each site's data contact person will receive an email containing a URL where each corpus can be downloaded.  Audio will be released on physical media.  On the release date, LDC will prepare a shipment containing CDs, DVDs and/or hard drives, depending on the size of the corpus.  Packages will be shipped via DHL or FedEx.  All authorized GALE sites will receive text and annotation corpora.  Given the high cost of creating and shipping hard drives, audio data will not be distributed to sites that are not participating in GALE evaluations.

3. How do I know what data will be released in each quarter?

The GALE data matrix reflects LDC's current plan for data to be distributed each quarter.  The matrix is updated on a regular basis to reflect changes to the plan.  As details of each release are finalized, that information is also added to the matrix and can be viewed by clicking on the name of the release under the Deliveries column (for instance, see the details for the Kickoff1 Release).

4. My site has annotated some LDC data and I want to share it with other sites on my team.  How can I do that?

GALE researchers can redistribute annotations in stand-off form without involving LDC in any way.  We understand that UIMA annotations are stand-off in nature, that UIMA is the basis for integrating sites' technology into the common platform, and that integration is supposed to happen as the technology is developed.  There should therefore be no impediments to sharing annotations within teams.
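As a rough illustration of why stand-off annotations can be shared freely, the sketch below stores each annotation as character offsets into a source document rather than as a copy of the text.  This is a generic example, not UIMA's own representation; the class name, field names, document ID and sample sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    """A label attached to a character span; holds no source text itself."""
    doc_id: str   # hypothetical identifier of the source document
    start: int    # offset of the first character in the span
    end: int      # offset one past the last character in the span
    label: str


def resolve(annotation, source_text):
    """Recover the annotated string; requires a local copy of the source."""
    return source_text[annotation.start:annotation.end]


def to_inline(source_text, annotations):
    """Merge annotations into the text as inline tags.

    Assumes the spans do not overlap. The result embeds the source text,
    which is what distinguishes inline from stand-off form.
    """
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        pieces.append(source_text[cursor:ann.start])
        span = source_text[ann.start:ann.end]
        pieces.append(f"<{ann.label}>{span}</{ann.label}>")
        cursor = ann.end
    pieces.append(source_text[cursor:])
    return "".join(pieces)


# Made-up sample data, for illustration only.
source = "Alice visited Paris."
annotations = [
    StandoffAnnotation("doc001", 0, 5, "PERSON"),
    StandoffAnnotation("doc001", 14, 19, "LOCATION"),
]

person = resolve(annotations[0], source)
inline = to_inline(source, annotations)
```

A site receiving only the `annotations` list learns nothing about the document's content unless it already holds its own licensed copy of `source`, whereas the `inline` string carries the full text with it.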

However, we do recognize that some legacy annotations may exist in inline rather than stand-off form.  Because inline annotations are embedded in the source text, sharing them means sharing the underlying data, and redistribution of copyrighted data across sites is prohibited by GALE user agreements and by LDC's agreements with our data providers.

To make it easier for sites to share this kind of data among their team members, we are creating a "transshipment point" at LDC.  We have set up a local LDC machine to act as an scp server.  When a GALE participant wants to distribute something to other sites on the team, (s)he can upload the file(s) and inform the other participants, including responsible LDC parties, about what has been deposited and why.  The other parties can then download the deposit if they want to.

Each site will receive a user account per the Resource Distribution Task Specification.  These accounts will allow password-less sftp or WinSCP access.  Each account will be a member of one of three user groups (corresponding to the three teams).  Each file deposited by a site will inherit that site's group permissions.  Sites will have read/write access to the data deposited by other members of their group.  Sites will not have read or write access to data deposited or owned by groups other than their own.
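The access rules above can be sketched as a small model.  This is only an illustration of the intended behavior, not a description of how the LDC server is configured; the site names, team names and file name are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping of site accounts to the three team groups.
SITE_GROUP = {
    "site_a": "team1",
    "site_b": "team1",
    "site_c": "team2",
    "site_d": "team3",
}


@dataclass(frozen=True)
class Deposit:
    """A file uploaded to the transshipment point by one site."""
    filename: str
    owner_site: str

    @property
    def group(self):
        # A deposited file inherits the group of the site that uploaded it.
        return SITE_GROUP[self.owner_site]


def can_access(site, deposit):
    """Return True if the site may read and write the deposit.

    Access is granted only when the site belongs to the same group
    as the deposit; all other groups get neither read nor write access.
    """
    return SITE_GROUP[site] == deposit.group


# Made-up deposit, for illustration only.
deposit = Deposit("legacy_inline_annotations.tgz", "site_a")
same_team = can_access("site_b", deposit)
other_team = can_access("site_c", deposit)
```

Here `same_team` is True because `site_b` shares a group with the depositor, and `other_team` is False because `site_c` belongs to a different team.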

We will make a general announcement to the GALE Data Announcement List when this service is up and running (ETA: early December 2005).