Annotator Manual - TDT3

Introduction
Annotation Goals
Segmentation
    Second pass
Topic Labeling
    Topic Definition
    Topic Explications
    Rules of interpretation
Topic Research
Information Aids
    Financial News

Introduction
    The Topic Detection and Tracking (TDT) project is an ongoing research effort to establish and enhance state-of-the-art techniques in information retrieval and in the detection and tracking of topics from a continuous stream of newswire or speech data.
    The initial TDT pilot corpus was created at UMASS in 1997. The Linguistic Data Consortium has continued the development of the TDT corpora for TDT2 and TDT3. The procedures for the annotation of the corpora have remained largely consistent, with only slight variations.

Annotation Goals

    There are two main task objectives regarding the annotation of TDT broadcast and newswire data.
The first task is the identification and classification of the contents of broadcast reports. In the segmentation task, the start points of story types within the broadcast are identified. Most of these story boundaries are fairly evident. There are some types of news reports, however, that require some thought regarding the content. This is particularly true for those reports that are political or financial in nature.

The second annotation task involves the 'labeling' of identified news stories. From a list of selected events, the task is to identify all news stories that report on a specified event, as well as on the direct consequences of the event.
 

Segmentation

  All timestamps at segment boundaries must be correct: when you listen to the portion of the recording that starts at the time stamp for a boundary, you should hear all of the text that follows that boundary without any clipping of the first word, and you should NOT hear any of the text that precedes the boundary. If the first word following the boundary is not fully audible, or if you hear text that comes before the boundary, the time stamp needs to be shifted.
 
Section Type   Meaning                         Comment
sr             Section Report                  Valid News Story
su             Section Un/Under Transcribed    Miscellaneous Text
sn             Section Non-News                Non-news Story

    Every segment must be categorized as a news report, using "<sr", as non-news, using "<sn", or as un/under-transcribed, using "<su". In the original form of the text (as produced from the closed-caption signal), all segment boundaries are marked with "<sx"; every "x" must be changed to one of these types.
    Not all of the "<sx" locations are correct; they merely serve as guidelines to the approximate location of the breaks. One CANNOT assume that the physical location of the mark is the true location of the boundary within the transcript, nor, conversely, that no boundary exists where an "<sx" has not been inserted. Listen carefully to the content in order to make the correct determination. In the same light, when you come to a topic boundary in the middle of a broadcast and find that it clearly should NOT be a boundary (that is, the material following the boundary is clearly a continuation of the story that precedes the boundary), you may delete this boundary.
    In the event that a report is encountered that has not been transcribed with enough content to understand the story, the segment boundary is marked with "<su" (for "untranscribed" or "under-transcribed" segment).
    Please note: when you come to a topic boundary in the middle of a broadcast and find that text from a commercial has been included as the last portion of a previous "<sr" (news story) segment, trace back in the text and insert "<sn", with a correct time stamp, to mark where that previous story ends and the commercial begins.
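
A quick mechanical check can help catch placeholder markers that were left unclassified. The following is only an illustrative sketch, not part of any LDC tool: it assumes the transcript is available as plain text in which each boundary marker starts with "<s" followed by a single type letter, and that a time stamp follows the marker on the same line. The function name and the sample text are hypothetical.

```python
import re
from collections import Counter

# Assumption: markers look like "<sr", "<su", "<sn", or the placeholder "<sx".
MARKER = re.compile(r"<s([a-z])\b")
VALID_TYPES = {"r", "u", "n"}  # report, un/under-transcribed, non-news


def check_markers(text):
    """Count markers by type and list the line numbers of markers
    that are still "<sx" or use any other unknown type letter."""
    counts = Counter()
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            kind = match.group(1)
            counts[kind] += 1
            if kind not in VALID_TYPES:
                problems.append(lineno)
    return counts, problems


# Hypothetical sample: the second boundary was never reclassified.
sample = "<sr 12.5\nstory text\n<sx 98.0\nmore text\n<sn 140.2\ncommercial text"
counts, problems = check_markers(sample)
print(problems)  # [3]
```

A check of this kind only finds leftover placeholders. It cannot tell whether a boundary is in the right place, which still requires listening to the audio.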

News Story
   - section news report (story)
News items - No limit in terms of phrases or clauses is placed on what is considered to be a valid report. All forms of news stories (including those that are shorter than two declarative clauses), as well as list items, are allowed into this category. ("List items" include lists of sports scores, stock quotes, and weather reports for various regions.)
News stories do, however, have to contain enough information to identify what the event or subject matter is that the article discusses. Again, if the article is very limited in the transcription provided, it is classified as 'undertranscribed'.
 
 

Not News
   There are several types of information that are not considered to be news for segmentation purposes. 'Filler' items consist of summaries of upcoming news items that will be repeated and discussed in greater detail later in the broadcast. This category is particularly evident at the start of a broadcast, during the introductions. Other items that are not considered news are commercials, station identifications, and pledge drives. If you are in doubt about the classification of a news type, please ask a project manager.
 

Untranscribed News
      In some instances the transcript does not cover the broadcast all the way through.
    Particularly in the case of closed-caption material, partially transcribed sections are very apparent.
    In most cases, the existence of a report boundary is fairly evident. Often, however, it is difficult to decide on a particular boundary when potentially related stories are adjacent to one another. In these situations, annotators are encouraged to use both audio cues (music, reporter pauses, and speaker changes) and the informational content to determine where the story boundaries lie.

NOTE - if the amount that has been transcribed is substantive enough to get some information from the report, the story is labeled as <sr, even though some text is missing. Make sure that the closing tag indicates the end of the report proper, not just the end of the transcribed portion.

Make extensive use of the story cues that are present in each of the broadcasts.
 

Segmentation - Second Pass

All files are to be second-passed by an individual other than the initial annotator.
All second passes must incorporate
   *A full review of the entire file, with particular attention paid to the start and end regions of the story boundaries.
   *A complete re-examination of all existing stories, with particular attention given to stories that appear complicated or that appear to contain ambiguously related subject matter.
During the second pass phase, do not rely only on visual inspection of the existing boundaries - there may be errors in the transcripts, particularly in closed-captioned files.
 
 

Annotation

Annotation Instructions

Your task is to review news stories from a variety of sources, and to label (annotate) each one as relevant or not relevant to a set of pre-defined topics. There are a total of 11 news sources for TDT3 annotation:
 
              English                           Mandarin
Newswire      NYT                               Zaobao
              APW                               Xinhua
Radio         PRI The World                     VOA Mandarin
              VOA English
Television    CNN Headline News
              ABC World News Tonight
              MSNBC News With Brian Williams
              NBC Nightly News

The timespan of the stories is October 1 - December 31, 1998.

In total there will be 60 topics relating to news events, listed in 3 topic lists of 20 topics each. These topics have been selected by the LDC, according to the selection criteria outlined here.

Most stories will match none of the topic definitions; some will match one; a very few will match more than one. Your task is to apply all relevant topics to each story. You will identify the story as NO (this is achieved by simply hitting the submit button in the interface), YES, BRIEF, or REJECT. Use the YES label for stories that discuss the topic in a substantial way. Use the BRIEF label for stories that make only a passing reference to the topic. To choose between YES and BRIEF, use the following criteria:

Please REJECT a story if:

Use Comments generously when rejecting files. Also use Comments to specify an "oddity" about an article. For instance, if the segmentation seems odd, but could be correct, label it or not as appropriate and comment on it.

The listing for each topic includes helpful, but incomplete, information about the derivative event. Dates or date ranges are included where they are relevant. A link to each topic's seed story is also available. Use this story as a reference only, an example of the language or vocabulary used in discussing this topic. You may take advantage of the headline information associated with each article, but cannot rely on that to be complete.  Topics may be mentioned in the middle of long articles.

There is more specific information related to topic definition below. If you have questions about how to label a specific story,  please address them at the weekly meetings, and post to the group mailing list.


Topics

"Topic" is defined in a special way specifically for TDT research. For the purposes of this project, topics refer to specific events or activities, such as the crash of a China Airlines airplane in Taipei, Taiwan on February 16, 1998, and encompass all facts, events and activities that are directly related to them. Here is the definition of topic and a few other essential terms, as used in TDT research:



Rules of Interpretation

Topics generally fall into a few general categories. As an aid to topic labeling consistency and to make the process efficient and accurate, please use the following guidelines in expanding from an event (the list provided in your labeling interface) to a topic.
1. Elections:
Examples - New people in office, new public officials, change in governments or parliaments (in other countries), voter scandals.
The event might be the confirmation of a new person into office, the activity around voting in a particular place and time, the opposing parties' or people's campaigns, or the election results. The topic would be the entire process: nominations, campaigns, elections, voting, ceremonies of inauguration.

2. Scandals/Hearings:
Examples - Monica Lewinsky, Kenneth Starr's investigations.
The event could be the investigation, independent counsels assigned to a new case, the discovery of a potential scandal, the subpoena of political figures. The topic would include all pieces of the scandal or the hearing including the allegations or the crime, the hearings, the negotiations with lawyers, the trial (if there is one), and even media coverage.

3. Legal/Criminal Cases:
Examples - crimes, arrests, cases.
The event might be the crime, the arrest, the sentencing, the arraignment, the search for a suspect. The topic is the whole package: crime, investigation, searches, victims, witnesses, trial, counsel, sentencing, punishment and other similarly related things.

4. Natural Disasters:
Examples - tornadoes, snow and ice storms, floods, droughts, mud-slides, volcanic eruptions.
The event would include causal activity (El Nino, in many cases this year) and direct consequences. The topic would also include the declaration of a Federal Disaster Area, victims and losses, rebuilding, any predictions that were made, evacuation and relief efforts.

5. Accidents:
Examples - plane, car, or train crashes, bridge collapses, accidental shootings, boats sinking.
The event would be causal activities and unavoidable consequences like death tolls, injuries, loss of property. The topic includes mourners' pursuit of legal action, investigations, and issues with responsible parties (like drug and alcohol tests for drivers, etc.).

6. Ongoing violence or war:
Examples - terrorism in Algeria, crisis in Iraq, the Israeli/Palestinian conflict.
In these cases the event might be a single act of violence, a series of attacks based on a single issue or a retaliatory act. The topic would expand to include all violence related to the same people, place, issue and time frame. These are the hardest to define, since war is often so complex and multi-layered. Consequences or causes often include (and would therefore be topic relevant) preparations for fighting, technology, weapons, negotiations, casualties, politics, underlying issues.

7. Science and Discovery News:
Examples - John Glenn being sent back into space, archaeological discoveries.
The event is the discovery, the decision, or the breakthrough. The topic, then, would include the technology developed to make this event happen, the researchers/scientists involved in the process, the impact on everyday life, and all history and research that was involved in the discovery.

8. Finances:
Examples - Asian economy, major corporate mergers.
The topic here could include information about job losses, impacts on businesses in other countries, IMF involvement and sometimes bail-out, NYSE reactions (heavy trading BECAUSE Tokyo closed incredibly low). Again, anything that can be defined as a CAUSE of the event or a direct consequence of the event is topic-relevant.

9. New Laws:
Examples - Proposed Amendments, new legislation passed.
While the event may be the vote to pass a proposed amendment, or the proposal for new legislation, the topic includes the proposal, the lobbying or campaigning, the votes (either public voting or House or Senate voting, etc.), and consequences of the new legislation like protesting or court cases testing its constitutionality.

10. Sports News:
Examples - Olympics, Super Bowl, Figure Skating Championships, Tournaments.
The event is probably a particular competition or game, and the topic includes the training for the game or competition, announcements of (medal) winners or losers, injuries during the game or competition, stories about athletes or teams involved and their preparations and stories about victory celebrations.

11. MISC. News:
Examples - Dr. Spock's Death, Madeleine Albright's trip to Canada, David Satcher's confirmation.
These events are not easily categorized but might trigger many stories about the event. In these cases, keep in mind that we are defining topic as the seminal event and all directly related events and activities (including causes and consequences). If the event is the death of someone, the causes (illness) and the consequences (memorial services) will all be on topic. For a diplomatic trip, the plans made for the trip and the results of the trip (a GREAT relationship with Canada??) would be on topic.
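
For anyone keeping track of which rule a topic falls under, the eleven categories above can be held in a simple lookup table. This is only a sketch for bookkeeping purposes; the names RULES and rule_label are hypothetical and do not correspond to anything in the labeling interface.

```python
# Rule numbers and names as listed in the Rules of Interpretation above.
RULES = {
    1: "Elections",
    2: "Scandals/Hearings",
    3: "Legal/Criminal Cases",
    4: "Natural Disasters",
    5: "Accidents",
    6: "Ongoing violence or war",
    7: "Science and Discovery News",
    8: "Finances",
    9: "New Laws",
    10: "Sports News",
    11: "MISC. News",
}


def rule_label(number):
    """Format a rule reference the way topic listings cite it."""
    return f"Rule {number}: {RULES[number]}"


print(rule_label(7))  # Rule 7: Science and Discovery News
```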
 

Topic
3012. Leonid Meteor Shower

Seminal Event
WHAT:  Earth passes through comet Tempel-Tuttle's trail of debris, creating a spectacular light show in Earth's atmosphere.
WHERE:  The meteor shower is visible in Europe and Asia.
WHEN:  Late October through November 1998;  the shower peaks on November 16-17.
Topic Explication
The 1998 Leonid Meteor Shower was particularly strong, and scientists from around the world gathered to watch the display.  Although the Shower was expected to peak over Eastern Asia, the best viewing actually occurred in Europe.  On topic:  Stories covering scientists' forecasts for the shower; reports on its observation; concerns over the possibility of the meteors damaging artificial satellites (which proved to be unfounded).
Rule of Interpretation
Rule 7: Science and Discovery News
 

Topic Research

    In order to keep abreast of situations and events that may be confusing, complex, or extremely similar to an identified seminal event, please use the topic research provided for that topic.
Research for every topic is provided to aid you if you have trouble deciding whether a story is on topic for a particular event. The topic research provides situational and event timelines that track both the selected seminal event and any relevant peripheral information that would aid the decision-making process; this is particularly useful for complex political and financial events.
    Every annotator is required to gather and record the available research on a particular event, and to periodically update the group on any scenarios that may have evolved. It is the responsibility of the annotators to demonstrate above-average familiarity with the events that they have been assigned.

    The topic research serves the following purposes:
    i) Reduction of the possibility of false alarms, particularly in the case of similar, yet distinctly different events, for instance natural disasters, plane crashes and the like;
    ii) Provision of resources for new annotation staff, who may initially have difficulty absorbing the new or important dimensions of a topic regarding situational relevance, particularly in the annotation of political and financial news articles;
    iii) Provision of a framework for use in the later adjudication of results, particularly to monitor topic development and curb possible topic migration.

    In addition to this handbook, there will be weekly meetings. If you are unable to attend these, please review the minutes as well as the email archives.
 

Information Aids
Financial News

    Financial news is often difficult to segment, particularly if one lacks an understanding of how this information fits together. This document seeks to outline the general concepts, key terms, etc. and then looks at each of the major news sources to see how these programs commonly present such material. The hope is that while this doesn't provide a recipe as such, it does offer some insight into both the subject matter and the way in which these things are presented in various broadcasts and news sources. Most of the actual segmentation is still a judgement call, but this should give an idea of what concepts are related and why.

Key Terms:

Stocks - Stocks are marketable securities representing a residual interest in a corporation. They are also known as 'equities.' Over 50% of adult Americans own some shares of stock, either as a direct investment or (more commonly) through mutual funds, pension funds, and individual retirement accounts (IRAs). (That fact alone is what makes financial news of immediate relevance to the audience.)

Bonds - Bonds are debt and are issued for a period of more than one year. The U.S. government, local governments, water districts, companies and many other types of institutions sell bonds. When someone talks of the bond market, they mean the market for all bonds: federal, municipal, and corporate. Treasury Bonds and Treasury Bills are often quoted separately.

AMEX - American Stock Exchange. AMEX merged with the NASDAQ in November of 1998. "AMEX" is also the corporate trademark of "American Express," and occasionally the company's name comes up in financial news.

NYSE - New York Stock Exchange. This is the "big board" stock exchange on Wall Street. Occasionally, stories also mention trading volume as well as trading level for this exchange.

Dow Jones - The Dow Jones indices, of which the Dow Jones Industrial Average is the one commonly called "Dow Jones," are statistical fictions that average the top stocks in each sector. The stocks that compose the Dow Jones Industrial Average are known as "blue chip" stocks.

NASDAQ - National Association of Securities Dealers Automated Quotations. Not all securities are marketed on the NYSE or AMEX. The NASDAQ serves to market another entire tier of stocks. Usually, our news sources quote the "Dow Jones" and the NASDAQ. The NASDAQ is quoted as a composite index.

Pink sheets - Stocks which are not listed on an exchange or the NASDAQ may be traded by the clearinghouses. For historical reasons, this is called 'trading on the pink sheets.' Mention of this is relatively rare, and usually refers to how a stock was traded before listing on the NASDAQ.

ADR - American Depositary Receipt. These are receipts issued by American banks for foreign shares held by the bank or its branch in the country of issue. Until recently, this is how most foreign firms traded their stock in the U.S. markets. Many still do.

FOREX - Foreign Exchange Markets (General term). Often, sources will quote the foreign exchange rates for British pounds, Japanese yen, German Deutschmarks, and presumably soon the Euro. While stories which quote multiple rates should be segmented as one story, there may be extended explanation of a surprising rise/fall of a particular currency. That can be segmented separately if it seems prudent.

Gold & Silver - Gold and silver, and occasionally platinum, are quoted either immediately before or immediately after foreign exchange reports. These are collectively called 'precious metals.'

Commodities - These include things like lumber, pork bellies, barrels of oil, etc. Commodities markets consist of buying and selling both at 'spot' (current market) prices and through futures contracts.

Until recently, stock (and commodity) and bond markets were counter-cyclical. There may occasionally be mention of how these markets may or may not be moving in the same direction.

S&P Index - Standard & Poor's indices. Most of these are stock indices, though S&P has many others. The S&P 500 is an index of the stocks of 500 large companies.

Mutual funds - These are funds where investors buy shares of an entire portfolio. The fund managers select the actual investments. They have widely different rates of return.

News of relevance to financial markets:

After discussing the performance of national and global markets, most news reports (particularly MSNBC) also discuss noteworthy events, market trends, and announcements from Washington which may influence the market as a whole.

Employment statistics, housing sales, sales of durable goods, inflation, business failure rates, loan default rates, mortgage lending rates, etc. are all announced according to a monthly calendar.

Usually, the news sources comment only on unexpected announcements, or when expected ones don't jibe with expectations. These divergences tend to have noticeable effects on the market. These statistics are what the Federal Reserve Board uses to set monetary policy, including interest rates. All changes in interest rates, unless anticipated, will have some noteworthy effect on the markets.

In addition to a calendar of governmental statistics, all firms announce their earnings at predetermined intervals. Pre-releases of estimated earnings mean that the market usually knows what to expect. When actual earnings are different than projected earnings, the effect on share price is usually dramatic. Thus, the news sources may mention such information.

Mergers & Acquisitions are always of interest to investors, and often account for dramatic changes in the share price of the stocks of the companies involved and their principal competitors. These stories generally stand on their own, even if they are linked thematically with stories preceding or following them. Stock splits are also of significant interest to investors. A split occurs when a firm exchanges stock at a certain rate (2-for-1, 3-for-2, etc.) in order to bring down the share price so that lots of 100 or 1,000 shares are more accessible to individual investors.
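
The arithmetic behind a stock split can be sketched as follows. This is a simplified illustration that assumes the price adjusts exactly in proportion to the split and ignores any market reaction; the function name and the sample figures are hypothetical.

```python
from fractions import Fraction


def apply_split(shares, price, new_for_old):
    """Return (shares, price) after a split given as (new, old),
    e.g. (2, 1) for a 2-for-1 split or (3, 2) for a 3-for-2 split."""
    new, old = new_for_old
    ratio = Fraction(new, old)
    # The share count is multiplied by the ratio and the price divided by it,
    # so the total value of the holding is unchanged.
    return shares * ratio, price / ratio


shares, price = apply_split(100, 90, (2, 1))
print(shares, price)  # 200 45

shares, price = apply_split(100, 90, (3, 2))
print(shares, price)  # 150 60
```

In both cases the holding is still worth 9,000 in total. Only the per-share price has come down, which is why a report of a sharply lower share price after a split does not by itself indicate a loss in value.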

Lawsuits and settlements thereof also affect the share price of various stocks. Again, even when thematically related, these are generally best segmented as independent stories. FDA approvals and bans are of a similar nature.

Reports of greater length often develop around trends or surprising developments vis-a-vis one of the above.

ABC - The financial news section is called "On the Money." It usually has a brief mention of the Dow Jones Industrial Average at the close of trading, as well as the change (up/down) for the NASDAQ composite index. These should be segmented together. This is often followed by news about the markets themselves, then by business news (mergers, etc.) that influences the prices of various stocks. Each news item should be segmented separately.

CNN - Mentions markets in passing, except when broadcast corresponds with market closure/opening. Usually, their coverage (in a program called "Dollars & Sense") concentrates instead on corporate news, including acquisitions, profits, layoffs, expansions, etc. and their effects on the markets.

VOA - There are usually two to three instances in each broadcast. The first is a strict news report of the financial news. While not all broadcasts are the same, VOA does touch on many of the areas described above. Stories in these sections should be segmented separately by theme.

The second instance is normally a brief mention, summarizing market trends (about halfway through the broadcast), which the commentators often use as a point of departure for discussing economic news in general. The broadcasters prefer to speak in terms of 'share prices' instead of 'stock prices.' This mention discusses international markets as well as domestic ones, but the recap of market finishes should be segmented as one story. The trickiest to segment are those where the commentators offer multiple explanations for changes in stock market trends. Most examples should be segmented as one story.

In other instances, foreign exchange reports are used as a segue to news of trends in global markets. If this is brief, it should be included as part of the story, especially if it is introducing a recurring feature like the NASDAQ sponsored market report.

Often the broadcaster goes on to speak about key financial events, with more emphasis on governmental announcements than on those of individual companies.

The final instance is normally at the end of the broadcast. This time, they usually make an effort to give all the information pertaining to the closing of each market, and briefly summarize both economic and corporate stories which influenced the day's trading. Usually, this summary should be segmented together.

PRI - Halfway through the broadcast there is usually a brief mention of the Dow Jones and the NASDAQ. This should be segmented as one story. Often this is followed by stories explaining major shifts in these markets. These stories are ordinarily fairly lengthy, and each should be segmented separately. Since PRI is a combination of NPR and BBC, occasionally emphasis is on the London markets instead.

MSNBC - This news source concentrates on financial news more heavily than the others. In fact, its primary audience is individual investors and others who work in the financial industry. It does cover major news stories of consequence globally, with special attention on Capitol Hill, but the target viewers are consuming the financial news. The format of the program, however, is looser. Some days it has two or three major feature stories, with a number of interviews and even debate among various experts. In these broadcasts, which often run over, financial news is often truncated and re-visited the next hour. Otherwise, it is presented in enough detail to see clear demarcations of stories around core events.

The guidelines above should give some sense of how this information can be best grouped; however, because MSNBC does on-site reporting, it may often be hard to distinguish a teaser from a story immediately following it. In general, marking it as one story may be prudent.

NBC - Much like ABC, except that it doesn't have a dedicated financial news section because of the MSNBC broadcast. Usually mentions national news, and then notes effect on the markets or otherwise introduces a brief summary of market activity.