Annotation Guide
TDT 2003


Introduction and terminology

Events and Topics
The notions of event and topic are crucial to TDT annotation.  A TDT-style event is defined as a specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences.  It is defined by first identifying an initial seminal event, which is the starting point for the event.  For instance, when a U.S. Marine jet sliced a funicular cable in Italy in February 1998, the cable car's crash to earth and the subsequent injuries were all unavoidable consequences and thus part of the same event.  For the purposes of TDT, a topic is defined as an event or activity, along with all directly related events and activities.  It is important to understand the difference between a TDT topic and the notion of topic in normal discourse.  While one might normally think of a topic as something broad like "accidents", a TDT topic is limited to a specific collection of related events of the type accident, in this case a particular cable car crash.

Rules of Interpretation
To increase the consistency of judgments about what constitutes "related" events, annotators refer to a set of rules of interpretation. These rules state, for each type of seminal event, what other types of events should be considered related. This then informs the annotators' judgments about which stories are "on-topic". In the example above, stories about the investigation, the Marine pilot, the repercussions for his unit, the victims' families and their quest for justice were all on topic.

In TDT-4, there are thirteen topic types with corresponding rules of interpretation, as follows:

1. Elections, e.g. 30030: Taipei Mayoral Elections
Seminal events include: a specific political campaign, election day coverage, inauguration, voter turnouts, election results, protests, reaction.
Topic includes: the entire process, from announcements of a candidate's intention to run through the campaign, nominations, election process and through the inauguration and formation of a newly-elected official's cabinet or government.
2. Scandals/Hearings, e.g. 30038: Olympic Bribery Scandal
Seminal events include: media coverage of a particular scandal or hearing, evidence gathering, investigations, legal proceedings, hearings, public opinion coverage.
Topic includes: everything from the initial coverage of the scandal through the investigation and resolution.
3.  Legal/Criminal Cases, e.g. 30003: Pinochet Trial
Seminal events include: the crime itself, arrests, investigations, legal proceedings, verdicts and sentencing.
Topic includes: the entire process from the coverage of the initial crime through the entire investigation, trial and outcome.  Changes in laws/policies as a result of a crime are not generally on-topic unless a clear and direct connection between the specific crime and the legislation is made.
4.  Natural Disasters, e.g., 30002: Hurricane Mitch
Seminal events include: weather events (El Nino, tornadoes, hurricanes, floods, droughts), other natural events like volcanic eruptions, wildfires, famines and the like, rescue efforts, coverage of economic or human impact of the disaster.
Topic includes: the causal (weather/natural) activity including predictions thereof, the disaster itself, victims and other losses, evacuations and rescue/relief efforts.
5.  Accidents, e.g., 30014: Nigerian Gas Line Fire
Seminal events include: transportation disasters, building fires, explosions and the like.
Topic includes: causal activities and all their unavoidable consequences like death tolls, injuries, economic losses, investigations and any legal proceedings, victims' efforts for compensation.
6.  Acts of Violence or War, e.g., 30034: Indonesia/East Timor Conflict
Seminal events include: a specific act of violence or terrorism or series of directly related incidents (such as a strike and retaliation).
Topic includes: Direct causes and consequences of a particular act of violence such as preparations (including technological/weapons development), coverage of the particular action, casualties/loss of life, negotiations to resolve the conflict, direct consequences including retaliatory strikes.  This topic type is difficult to define across the board, and can easily become extremely broad and far-reaching.  As such, each topic of this type is treated individually and is defined in such a way as to sensibly limit its scope and make annotation manageable.
7.  Science and Discovery News, e.g., 31019: AIDS Vaccine Testing Begins
Seminal events include: announcement of a discovery or breakthrough, technological advances, awards or recognition of a scientific achievement.
Topic includes: Any aspect of the discovery, impact on everyday life, the researchers or scientists involved, descriptions of research and technology directly involved in the discovery.
8.  Financial News, e.g., 30033: Euro Introduced
Seminal events include: specific economic or financial announcements (like a specific merger or bankruptcy announcement); reactions to the event; direct impact on the economy or business world.  General economic trends or patterns without a clear seminal event are not appropriate as TDT topics.
Topic includes: the specific event, its direct causes, impacts on finance, government interventions or investigations, public or business world reactions, media coverage and analysis of the event.
9.  New Laws, e.g., 30009: Anti-Doping Proposals
Seminal events include: announcement of new legislation or proposals, acceptance or denial of the legislation, reactions.
Topic includes: the entire process, from announcement of the proposal, lobbying or campaigning, voting surrounding the legislation, reactions from within the political world and from the public, challenges to the proposal, analysis and opinion pieces concerning the legislation.
10.  Sports News, e.g., 31016: ATP Tennis Tournament
Seminal events include: a particular sporting event or tournament, sports awards, coverage of a particular athlete's injury, retirement or the like.
Topic includes: training or preparations for a competition, the game itself, results.  For tournament and championship events like the World Series or Super Bowl, only direct precedents are considered on topic.  Therefore, semi-final and final games leading up to the championship are on topic, but regular season play is not.
11.  Political and Diplomatic Meetings, e.g., 30018: Tony Blair Visits China
Seminal events include: preparations for the meeting, the meeting itself, outcomes, reactions.
Topic includes: the whole process from the preparations and travel, the meeting itself, media coverage and public reaction, any outcome including legislation or policies adopted as a direct outcome of the meeting.  Sources often report on one of a series of meetings between two officials or delegations; in these cases, only the current meeting is part of the topic, although planning for a future meeting that is a direct outcome of the current meeting and is discussed as part of the current meeting will be considered on topic.
12.  Celebrity/Human Interest News, e.g., 31036: Joe DiMaggio Illness
Seminal events include: most often, the death of a famous person or another significant life event such as a marriage; or a noteworthy tidbit about an ordinary person, such as someone setting a world record or giving birth to septuplets.
Topic includes: the specific event, causes (such as illness in the case of a celebrity's death) or consequences (such as a funeral or memorial service), public reaction or media coverage, editorials and opinion pieces, retrospectives or life histories that are a direct consequence of the seminal event.
13.  Miscellaneous News, e.g., 31024: South Africa to Buy $5 Billion in Weapons
Seminal events include: all specific events or activities that do not fall into one of the above categories.
Topic includes: the event itself, direct causes and unavoidable consequences thereof.
This particular conceptualization of topic is a critical component of TDT annotation, as it allows annotators to potentially identify all the stories in the corpus that discuss some pre-defined topic. The topic definitions and rules of interpretation ensure that each annotator is working with the same understanding of the topic at hand and, at least in theory, that all annotators will identify the same stories as on-topic.

The format of the topic definition is fixed for each topic to enforce annotation consistency.  The topic title is a brief phrase that is easy to remember and immediately evokes the topic.  Each topic is accompanied by a topic icon, which provides the annotator with a visual reminder.  The seminal event that gave rise to the topic is described by answering the questions what, who, when and where with regard to the event.  The topic explication provides further details.  Links to the relevant rule of interpretation, topic research, Chinese- and Arabic-specific topic information and sample on-topic stories are also provided.
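
As a rough illustration of the fixed format described above, a topic definition could be modeled as a record with one field per component.  This is only a sketch: the class name, field names and the missing_fields helper are hypothetical and are not part of the actual TDT tools.  Only the topic number and title in the example come from this guide; the other fields are deliberately left empty.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicDefinition:
    """Hypothetical record mirroring the fixed topic definition format."""
    topic_id: str
    title: str
    icon: str = ""                    # name of the topic icon (assumed to be a string)
    what: str = ""
    who: str = ""
    when: str = ""
    where: str = ""
    explication: str = ""
    rule_of_interpretation: str = ""  # e.g. "Elections"
    sample_story_ids: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of the core fields that are still empty."""
        required = ["title", "what", "who", "when", "where", "explication"]
        return [name for name in required if not getattr(self, name).strip()]


# Topic number and title taken from the list of topic types above.
topic = TopicDefinition(topic_id="30030", title="Taipei Mayoral Elections",
                        rule_of_interpretation="Elections")
print(topic.missing_fields())
```

A check like missing_fields is one way to confirm that every topic follows the same format before annotation starts.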


The Annotation Strategy

TDT2003 involves 40 topics -- 10 chosen from English seed stories, 10 chosen from Chinese seed stories, and 20 chosen from Arabic seeds.  The corpus contains 4 months' worth of data (October 2000 through January 2001) from the following sources:

                   English                          Mandarin        Arabic
Newswire           NYT                              Zaobao          An-Nahar
                   APW                              Xinhua          Al-Hayat
                                                                    AFP - Agence France Presse
(Web) Radio        PRI The World                    VOA Mandarin    VOA Arabic
                   VOA English                      CNR
(Web) Television   CNN Headline News                CCTV            Nile TV
                   ABC World News Tonight           CTS
                   MSNBC News With Brian Williams   CBS - Taiwan
                   NBC Nightly News

Your job is to search through the 4 months' worth of news to find stories that are related to the topic.  You will have several resources at your disposal:
Using these tools, you will search for stories in the corpus that discuss your topic.  When you find a likely story, you will read it and label it in the following way:
YES: this story discusses the topic in a substantial way.  Stories that you label as YES should give some information about the topic.  It doesn't have to be new information about the topic -- a story that summarizes a topic's history or gives a snippet of information that you've read about before still counts as a YES.  Even if the document contains a relatively small amount of information about a topic, it should be considered a YES.

NO: this story does not discuss the topic at all, or only mentions the topic in passing without giving any information about the topic.  If a document names a topic or makes reference to it but does not provide any information about the topic, it should be considered a NO.

The decision between YES and NO stories is usually clear, but some cases will be tricky.  If you're having trouble deciding between YES and NO, ask yourself whether you learned anything about the topic by reading the story, no matter how small and no matter if you've seen that same information before.  If you learn something about the topic by reading the story, then it should count as YES.  If you're still having trouble making up your mind, consult with your team leaders.  When in doubt, a story will usually fall on the side of YES.

NOT EASY: if you have real difficulty making a decision, you should indicate that by adding the additional label NOT EASY.  All stories must receive a YES or NO label, but you can also use the NOT EASY button to indicate that you struggled to make a decision for a particular story.  A NOT EASY judgment on a story will trigger additional quality control checking.
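
The labeling rules above (every story gets YES or NO, and NOT EASY is only an additional flag) can be sketched as a small record.  This is an illustrative sketch, not the interface's actual data format; the names Judgment, doc_id and needs_quality_control are assumptions, and the document ID is made up.

```python
from dataclasses import dataclass

VALID_LABELS = {"YES", "NO"}


@dataclass(frozen=True)
class Judgment:
    """One annotator decision about one story (hypothetical structure)."""
    doc_id: str
    label: str              # must be "YES" or "NO"
    not_easy: bool = False  # optional extra flag, never a label on its own

    def __post_init__(self):
        if self.label not in VALID_LABELS:
            raise ValueError(f"label must be YES or NO, got {self.label!r}")


def needs_quality_control(judgment: Judgment) -> bool:
    """A NOT EASY judgment triggers additional quality control checking."""
    return judgment.not_easy


hard_case = Judgment(doc_id="EXAMPLE-DOC-0001", label="YES", not_easy=True)
print(needs_quality_control(hard_case))
```

Modeling NOT EASY as a flag rather than a third label makes it impossible to record a story as NOT EASY without also choosing YES or NO.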


Your search for on-topic stories will be done in a number of stages, outlined below:
STAGE 1: Initial Query

-submit all known on-topic stories as a query
    [If there are no known on-topic stories in your language for the topic, then skip to Stage 3]
-search engine returns relevance-ranked list of stories
-read and annotate (YES/NO, adding NOT EASY where needed) ALL stories in the relevance-ranked list until

              you have found 5 additional on-topic stories
OR UNTIL
              you have read at least 2 off-topic stories for every on-topic story found and the last 10 stories were off-topic
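
The Stage 1 stopping rule can be written out as a small check.  This is a minimal sketch under stated assumptions: labels are given in reading order as True (YES) or False (NO), only stories from the current ranked list are counted, and the function names are hypothetical rather than taken from the annotation interface.

```python
from typing import Sequence


def off_topic_threshold_reached(labels: Sequence[bool],
                                ratio: int = 2,
                                tail: int = 10) -> bool:
    """True when at least `ratio` off-topic stories have been read for every
    on-topic story found AND the last `tail` stories were all off-topic."""
    on_topic = sum(labels)
    off_topic = len(labels) - on_topic
    last_all_off = len(labels) >= tail and not any(labels[-tail:])
    return off_topic >= ratio * on_topic and last_all_off


def stage1_can_stop(labels: Sequence[bool], needed_on_topic: int = 5) -> bool:
    """Stage 1 ends after 5 additional on-topic stories OR at the threshold."""
    return sum(labels) >= needed_on_topic or off_topic_threshold_reached(labels)


# Two on-topic stories followed by ten off-topic ones: 10 >= 2 * 2, and the
# last ten stories were off-topic, so annotation of this list can stop.
print(stage1_can_stop([True, True] + [False] * 10))
```

Note that both conditions of the threshold must hold together: six on-topic stories followed by ten off-topic ones do not satisfy it, because 10 is less than 2 * 6.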

STAGE 2: Improved Query Based on Additional On-Topic Stories

-issue a new query using a concatenation of all known on-topic stories
-search engine returns relevance-ranked list of news stories (excluding those already annotated)
-read and annotate (yes/no) ALL stories in relevance-ranked list until you have read at least 2 off-topic stories for every on-topic story found and the last 10 stories were off-topic

STAGE 3: Text-based Query

-issue a new query using the TOPIC RESEARCH DOCUMENT PLUS ANY ADDITIONAL SEARCH TERMS APPROPRIATE (e.g., parts of the topic explication)
-search engine returns relevance-ranked list of news stories not already seen
-read and annotate (yes/no) ALL stories in relevance-ranked list until you have read at least 2 off-topic stories for every on-topic story found and the last 10 stories were off-topic

STAGE 4: Creative Searching

You are encouraged to use your specialized knowledge (drawn from topic research and the known on-topic stories) to conduct additional manual searches through the corpus.  These additional searches will be based on keywords, names, particular on-topic stories, etc.   Think creatively!  If you come up with a novel way to search for additional on-topic stories, let us know.

If you find additional information (names, places, dates, events) about your topic, you should revise the topic research page for that topic.   Then re-submit the topic research page as a text query to find additional on-topic stories.


Using the Annotation Interface

I. Getting Started

Start Netscape.

In an xterm, type label-tdt4 or tdt4-label.

This window will open, containing the list of files that have been assigned to you.

Click on one of the file names (topic numbers).

Click Annotate to begin annotation.
 

Before you begin your search for additional on-topic stories, you must

Once you've done this, you're ready to begin annotation!

II. Step-by-step guide to annotation

Following the annotation strategy outlined above, follow these steps IN ORDER.

STAGE 1: Initial Query

To begin annotation, click on Annotate. The main annotation window will close and will be replaced by two other windows:

the judgement file will contain a list of stories for you to read and label as YES or NO
the article window will display the story
The initial file you see will be named judge.000.  This file is automatically generated by the search engine.  It is based on either the original seed story or a translated document containing information about the topic and seed story.

Subsequent searches (either document searches or text queries) will create additional files, named judge.001, judge.002, and so on.
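
The file naming pattern (judge.000, judge.001, judge.002, ...) corresponds to a zero-padded three-digit counter.  The helper below is a hypothetical illustration of that pattern only; how the interface actually generates the names, and what happens beyond judge.999, is not stated in this guide.

```python
def judge_file_name(search_index: int) -> str:
    """Return the judgement file name for the Nth search, counting from 0."""
    if search_index < 0:
        raise ValueError("search_index must be non-negative")
    return f"judge.{search_index:03d}"


print(judge_file_name(0))  # initial, automatically generated file
print(judge_file_name(2))  # file created by the second subsequent search
```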

Displaying and Labeling Stories

Click on a document ID to display the highlighted story.  Read through this list of stories, labeling each story YES or NO by clicking on the YES or NO buttons on the labeling window.  If necessary, click on the NOT EASY button to register a difficult decision.  Once you've entered a label for a story, the interface will automatically display the next story for you.

Continue reading your initial judge.000 file until you have found 5 additional on-topic stories OR UNTIL you have read at least 2 off-topic stories for every on-topic story found and the last 10 stories were off-topic (this is known as the off-topic threshold).

The interface will keep track of how many on-topic and off-topic stories you've located, and will let you know when you've reached the off-topic threshold by displaying a message at the bottom of the judgement file window.

Remember to use the NOT EASY option when you have trouble making a decision. NOT EASY must be used in conjunction with a YES or NO label.

Once you have reached the off-topic threshold, you should move on to the next phase of searching.  The interface will automatically keep track of how many on-topic and off-topic documents you've found. In subsequent rounds of searching, the interface will not display documents you've already labeled as YES or NO.
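
Hiding already-labeled documents in later rounds amounts to filtering each new ranked list against the set of documents judged so far.  The sketch below only illustrates that idea; the function name and the document IDs are made up, and the interface's real implementation is not described in this guide.

```python
from typing import Iterable, List


def unseen_results(ranked_doc_ids: Iterable[str],
                   already_labeled: Iterable[str]) -> List[str]:
    """Drop documents that already carry a YES or NO label, keeping rank order."""
    seen = set(already_labeled)
    return [doc_id for doc_id in ranked_doc_ids if doc_id not in seen]


new_ranking = ["doc-c", "doc-a", "doc-d", "doc-b"]
labeled_so_far = ["doc-a", "doc-b"]
print(unseen_results(new_ranking, labeled_so_far))
```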

NOTE: If you don't find any on-topic stories during Stage 1, then you will move directly to Stage 3, Text-based Searching.
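
The order of the stages, including the skip from Stage 1 to Stage 3, can be summarized as a simple decision function.  This is a sketch with hypothetical names; it only encodes the ordering described in this guide and is not part of the annotation interface.

```python
from typing import Optional


def next_stage(current: int, on_topic_found_in_stage1: int) -> Optional[int]:
    """Return the stage that follows `current`, or None after Stage 4."""
    if current == 1:
        # With no on-topic stories there is nothing to build an improved
        # query from, so Stage 2 is skipped.
        return 2 if on_topic_found_in_stage1 > 0 else 3
    if current in (2, 3):
        return current + 1
    if current == 4:
        return None
    raise ValueError(f"unknown stage: {current}")


print(next_stage(1, 0))  # no on-topic stories in Stage 1
print(next_stage(1, 3))  # some on-topic stories in Stage 1
```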

STAGE 2: Find On-topic Stories based on Improved Query

This stage is designed to refine the initial search.  Assuming you've found on-topic documents during Stage 1, you will issue a new query to the search engine that uses the on-topic stories you've already found as the query.

The search engine returns a relevance-ranked list of stories not already seen.
You will read and annotate ALL stories in the relevance-ranked list until you've reached the off-topic threshold (you have read at least 2 off-topic stories for every on-topic story found and the last 10 stories were off-topic).  Again, the interface will notify you when you've reached the threshold.

 

STAGE 3: Text Queries

The goal of stage 3 is to find stories that are on-topic but unlike those already seen.  This stage of searching uses text that you input as the query.  This text might consist of (parts of) the topic research document, topic explication/definition, parts of on-topic stories, keyword terms you've identified, information from sources outside of the TDT corpus, or anything else you think might be useful.

The 'create text query' button will open a language-specific text editor for you to work in.  For English you can simply copy and paste in sections of the topic research document to the search engine, and/or type in additional terms as necessary.  For Chinese and Arabic, you must type in the text you wish to use as a query.

Chinese Text Editing Help
Arabic Text Editing Help
 
NOTE: DO NOT open a previous text search with 'VIEW', edit it, and resubmit it as a new text query -- this creates serious problems for the interface and record keeping.  This is a known bug in the interface; until it is resolved, you must copy and paste information into a NEW text query rather than resubmitting an edited version of an old one.

Once again, after you've submitted the text query to the search engine, it will return a relevance-ranked list of news stories not already seen.  You must read and annotate (YES/NO) ALL stories in the relevance-ranked list until you've reached the off-topic threshold.

STAGE 4: Creative Searching

You are encouraged to use your specialized knowledge (drawn from topic research and the known on-topic stories you've already seen) to conduct additional manual searches through the corpus.  These additional searches will be based on keywords, names, particular on-topic stories, etc.  By this point, you should be an expert on this topic.  Use your intuition, and use the knowledge you've already gained about the peculiarities of this topic.  Think creatively!  If you come up with a novel way to search for additional on-topic stories, let us know.

Strategies to think about may include:

If you find additional information (names, places, dates, events) about your topic, you should revise the topic research page for that topic, or make sure that that information is shared with the group.  Then re-submit the topic research page as a text query to find additional on-topic stories, or modify the translations that you have made.

This stage is finished whenever you are satisfied that you've found ALL of the on-topic stories in the collection.  If you have a sneaking suspicion there are additional on-topic stories out there but you're having trouble finding them, let us know.

COMPLETION:
When you are ready to move on to a new topic, simply hit DONE and change the status to DONE.  You can then move on to the next assigned file.  Notify your supervisor if you run out of topics.



strassel@ldc.upenn.edu
8/5/2003