
Introduction
The Topic Detection and Tracking (TDT) project
is ongoing research to establish and enhance
state-of-the-art techniques in information retrieval and in the detection
and tracking of topics from a continuous stream of newswire or speech data.
The initial TDT pilot corpus
was created at UMASS in 1997. The Linguistic Data Consortium
has continued the development of the TDT corpora for TDT2 and TDT3.
The procedures for annotating the corpora have remained largely
consistent, with only slight variations.
The second annotation task involves the 'labeling' of identified news
stories. From a list of selected events, the task is to identify all news
stories that report on a specified event, as well as on the direct consequences
of that event.
All timestamps at segment boundaries must be correct: when you
listen to the portion of the recording that starts at the time stamp for
a boundary, you should hear all of the text that follows that boundary
without any clipping of the first word, and you should NOT hear any of
the text that precedes the boundary. If the first word following the boundary
is not fully audible, or if you hear text that comes before the boundary,
the time stamp needs to be shifted.
| Section Type | Meaning | Comment |
|---|---|---|
| sr | Section Report | Valid News Story |
| su | Section Un/Under Transcribed | Miscellaneous Text |
| sn | Section Non-News | Non-news Story |
Every segment must be categorized as either a news report,
using "<sr", or as non-news, using "<sn". In the original form of
the text (as produced from the closed-caption signal), all segment boundaries
are marked with "<sx" -- all the "x"'s must be changed.
Not all of these locations are correct; they merely
serve as guidelines to the approximate location of the breaks. One
CANNOT assume that the physical location of the mark is the true location
of the boundary within the transcript. Conversely, one cannot assume that
no boundary exists merely because an "<sx" has not been inserted. Listen carefully to the content
in order to make the correct determination. In the same light,
when you come to a topic boundary in the middle of a broadcast and find
that it clearly should NOT be a boundary (that is, the material following
the boundary is clearly a continuation of the story that precedes the boundary),
you may delete this boundary.
In the event that a report is encountered that has
not been transcribed with enough content to understand the story,
the segment boundary is marked with "<su" (for "untranscribed" or
"under-transcribed" segment).
Please note: when you come to a topic boundary in the middle of a broadcast
and find that text from a commercial has been included as the last portion
of a previous "<sr" (news story) segment, trace back in the text and
insert "<sn", with a correct time stamp, to mark where that previous
story ends and the commercial begins.
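The tagging rules above lend themselves to a mechanical sanity check before a file is handed on. The following is a minimal sketch, not part of any LDC tooling: it assumes each boundary appears in the text as `<s` followed by a single letter (`<sr`, `<sn`, `<su`, or the unresolved `<sx`), and the function name `check_section_tags` and the sample time stamps are hypothetical. The exact tag syntax, including how time stamps are attached, is not specified here, so the pattern may need adjusting.

```python
import re
from collections import Counter

# Assumption: a boundary tag is "<s" plus one lowercase letter,
# not followed by a further letter (so "<script" is ignored).
TAG_PATTERN = re.compile(r"<s([a-z])(?![a-z])")
VALID_TYPES = {"r", "n", "u"}  # sr = news report, sn = non-news, su = un/under-transcribed


def check_section_tags(text):
    """Count boundary tags per type and list (line number, tag) pairs still needing work."""
    counts = Counter()
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in TAG_PATTERN.finditer(line):
            kind = match.group(1)
            counts["s" + kind] += 1
            if kind not in VALID_TYPES:
                # "<sx" (or any other letter) means the boundary was never categorized
                problems.append((line_no, "<s" + kind))
    return counts, problems


# Illustrative input only; the time stamp format shown is invented.
sample = "<sr 12.5\nheadline story text\n<sx 48.0\nstay tuned\n<sn 60.2\ncommercial"
counts, problems = check_section_tags(sample)
print(dict(counts))
print(problems)  # [(3, '<sx')]
```

A check like this only finds tags that were never changed from "<sx". It cannot tell whether a boundary is in the right place or has the right time stamp, which still requires listening to the audio.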
News Story
- section news report (story)
News items - No phrasal boundary limits are imposed on
what is considered to be a valid report. All forms of
news stories (including those that are shorter than two declarative
clauses), as well as list items, are allowed into this category. ("List
items" include lists of sports scores, stock quotes, and weather reports
for various regions.)
News stories do, however, have to contain enough information to identify
what the event or subject matter is that the article discusses. Again,
if the article is very limited in the transcription provided, it is classified
as 'undertranscribed'.
Not News
There are several types of information that are
not considered to be news for segmentation purposes. 'Filler' items consist
of summaries of upcoming news items that will be repeated, and that will
be discussed in greater detail later in the broadcast. This category is
particularly evident at the start of a broadcast, during the introductions.
Other items that are not considered news
are commercials, station identifications, and pledge drives.
If you are in doubt about the classification of a segment, please
ask a project manager.
Untranscribed News
There are some instances where the transcript
is not complete all the way through.
Particularly in the case of closed-caption material,
partially transcribed sections are very apparent.
In most cases, the existence of a report boundary
is fairly evident. Often, however, it is difficult
to decide on a particular boundary when potentially related
stories are adjacent to one another. In these situations, annotators
are encouraged to use both audio cues (music, reporter pauses, and speaker
changes) and the informational content to determine where the story
boundaries lie.
NOTE - if enough has been transcribed to get some information from the report, the story is labeled as <sr, even though more text is missing. Make sure that the closing tag marks the end of the report proper, not just the end of the transcribed portion.
Make extensive use of the story cues that are present in each of the
broadcasts.
All files are to be second-passed by an individual other than the initial
annotator.
All second passes must incorporate:
* A full review of the entire file, with particular
attention paid to the start and end regions of the story boundaries.
* A complete re-examination of all existing stories, with
particular attention given to stories that appear complicated or that
appear to contain ambiguously related subject matter.
During the second pass phase, do not rely only on visual inspection
of the existing boundaries - there may be errors in the transcripts,
particularly in closed-captioned files.
| | English | Mandarin |
|---|---|---|
| Newswire | NYT, APW | Zaobao, Xinhua |
| Radio | PRI The World, VOA English | VOA Mandarin |
| Television | CNN Headline News, ABC World News Tonight, MSNBC News With Brian Williams, NBC Nightly News | |
The timespan of the stories is October 1 - December 31, 1998.
In total there will be 60 topics relating to news events, listed in 3 topic lists of 20 topics each. These topics have been selected by the LDC, according to the selection criteria outlined here.
Most stories will match none of the topic definitions; some will match one; a very few will match more than one. Your task is to apply all relevant topics to each story. You will label the story as NO (this is achieved by simply hitting the submit button in the interface), YES, BRIEF, or REJECT. Use the YES label for stories that discuss the topic in a substantial way. Use the BRIEF label for stories that make only a passing reference to the topic. To choose between YES and BRIEF, use the following criteria:
The listing for each topic includes helpful, but incomplete, information about the derivative event. Dates or date ranges are included where they are relevant. A link to each topic's seed story is also available. Use this story as a reference only, an example of the language or vocabulary used in discussing this topic. You may take advantage of the headline information associated with each article, but cannot rely on that to be complete. Topics may be mentioned in the middle of long articles.
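As an illustration of how the four labels relate to one another, here is a minimal sketch of a per-story record. It is not the annotation interface itself: the names `Label`, `record_label`, and `label_for`, the story identifiers, and topic number 3005 are hypothetical. The sketch assumes that NO is the implicit default, mirroring the fact that a bare submit records NO.

```python
from enum import Enum


class Label(Enum):
    NO = "NO"          # default: the story does not match the topic
    YES = "YES"        # the story discusses the topic in a substantial way
    BRIEF = "BRIEF"    # the story makes only a passing reference to the topic
    REJECT = "REJECT"  # listed in the guidelines; its meaning is not defined in this section


def record_label(annotations, story_id, topic_id, label):
    """Store a non-default label; NO stays implicit, like a bare 'submit'."""
    if label is not Label.NO:
        annotations.setdefault(story_id, {})[topic_id] = label
    return annotations


def label_for(annotations, story_id, topic_id):
    """Look up a story/topic pair, falling back to NO when nothing was recorded."""
    return annotations.get(story_id, {}).get(topic_id, Label.NO)


annotations = {}
record_label(annotations, "story-001", 3012, Label.YES)    # one story may carry several topics
record_label(annotations, "story-001", 3005, Label.BRIEF)  # hypothetical second topic number
print(label_for(annotations, "story-001", 3012))
print(label_for(annotations, "story-002", 3012))
```

Storing only the non-NO labels reflects the observation that most stories match none of the topics, so the record stays small.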
There is more specific information related to topic definition below. If you have questions about how to label a specific story, please address them at the weekly meetings, and post to the group mailing list.
2. Scandals/Hearings:
Examples - Monica Lewinsky, Kenneth Starr's investigations.
The event could be the investigation, independent counsels assigned
to a new case, the discovery of a potential scandal, the subpoena of political
figures. The topic would include all pieces of the scandal or the hearing
including the allegations or the crime, the hearings, the negotiations
with lawyers, the trial (if there is one), and even media coverage.
3. Legal/Criminal Cases:
Examples - crimes, arrests, cases.
The event might be the crime, the arrest, the sentencing, the arraignment,
or the search for a suspect. The topic is the whole package: crime, investigation,
searches, victims, witnesses, trial, counsel, sentencing, punishment and
other similarly related things.
4. Natural Disasters:
Examples - tornadoes, snow and ice storms, floods, droughts, mudslides,
volcanic eruptions.
The event would include causal activity (El Nino, in many cases this
year) and direct consequences. The topic would also include the declaration
of a Federal Disaster Area, victims and losses, rebuilding, any predictions
that were made, and evacuation and relief efforts.
5. Accidents:
Examples - plane, car, and train crashes, bridge collapses, accidental shootings,
boats sinking.
The event would be causal activities and unavoidable consequences like
death tolls, injuries, and loss of property. The topic includes mourners' pursuit
of legal action, investigations, and issues with responsible parties (like
drug and alcohol tests for drivers, etc.).
6. Ongoing violence or war:
Examples - terrorism in Algeria, crisis in Iraq, the Israeli/Palestinian
conflict.
In these cases the event might be a single act of violence, a series
of attacks based on a single issue or a retaliatory act. The topic would
expand to include all violence related to the same people, place, issue
and time frame. These are the hardest to define, since war is often so
complex and multi-layered. Consequences or causes often include (and would
therefore be topic relevant) preparations for fighting, technology, weapons,
negotiations, casualties, politics, underlying issues.
7. Science and Discovery News:
Examples - John Glenn being sent back into space, archaeological discoveries.
The event is the discovery, the decision, or the breakthrough. The
topic, then, would include the technology developed to make this event
happen, the researchers/scientists involved in the process, the impact
on everyday life, and all history and research that was involved in the discovery.
8. Finances:
Examples - Asian economy, major corporate mergers.
The topic here could include information about job losses, impacts
on businesses in other countries, IMF involvement and sometimes bailouts, and
NYSE reactions (heavy trading BECAUSE Tokyo closed incredibly low). Again,
anything that can be defined as a CAUSE of the event or a direct consequence
of the event is topic-relevant.
9. New Laws:
Examples - proposed amendments, new legislation passed.
While the event may be the vote to pass a proposed amendment, or the
proposal for new legislation, the topic includes the proposal, the lobbying
or campaigning, the votes (either public voting or House or Senate voting,
etc.), and consequences of the new legislation like protests or court cases
testing its constitutionality.
10. Sports News:
Examples - Olympics, Super Bowl, Figure Skating Championships, Tournaments.
The event is probably a particular competition or game, and the topic
includes the training for the game or competition, announcements of (medal)
winners or losers, injuries during the game or competition, stories about
athletes or teams involved and their preparations and stories about victory
celebrations.
11. MISC. News:
Examples - Dr. Spock's death, Madeleine Albright's trip to Canada,
David Satcher's confirmation.
These events are not easily categorized but might trigger many stories
about the event. In these cases, keep in mind that we are defining topic
as the seminal event and all directly related events and activities (including
causes and consequences). If the event is the death of someone, the
causes (illness) and the consequences (memorial services) will all be on
topic. For a diplomatic trip, plans made for the trip and results
of the trip (a GREAT relationship with Canada??) would be on topic.
3012. Leonid Meteor Shower
Seminal Event
In order to keep abreast of situations and events
which may be confusing, complex, or extremely similar to an identified
seminal event, please use the topic research provided for that topic.
Research for every topic is provided to aid you if you have trouble
deciding whether a story is on topic for a particular
event. The topic research provides situational and event timelines that
track and provide information on both the selected seminal event and any
relevant peripheral information that would aid in the decision-making
process; this is particularly useful for complex political and financial events.
Every annotator is required to gather and record
the available research on a particular event, and periodically to give
updates to the group on any scenarios which may have evolved.
It is the responsibility of the annotators to demonstrate above-average
familiarity with the events that they have been assigned.
This research serves the following purposes:
i) Reduction of the possibility of false alarms,
particularly in cases of similar yet distinct
events, for instance natural disasters, plane crashes and the
like;
ii) Provision of resources for new annotation
staff, who initially may have difficulty absorbing the new or important
dimensions of a topic regarding situational relevance, particularly
in the annotation of political and financial news articles;
iii) Provision of a framework
for use in the later adjudication of results, particularly to monitor topic
development and curb possible topic migration.
In addition to
this handbook, there will be weekly meetings. If you are unable to attend
these, please review the minutes as well as the email archives.
Information Aids
Financial News
Financial news is often difficult to segment, particularly if one lacks an understanding
of how this information fits together. This document seeks to outline
the general concepts, key terms, etc. and then looks at each of the
major news sources to see how these programs commonly present such material.
The hope is that while this doesn't provide a recipe as such, it does offer
some insight into both the subject matter and the way in which these things
are presented in various broadcasts and news sources. Most of the actual
segmentation is still a judgement call, but this should give an idea of what
concepts are related and why.
Key Terms:
Stocks - Stocks are marketable securities representing a residual interest in a corporation. They are also known as 'equities.' Over 50% of adult Americans own some shares of stock, either as a direct investment or (more commonly) through mutual funds, pension funds, and individual retirement accounts (IRA's). (That fact alone is what makes financial news of immediate relevance to the audience).
Bonds - Bonds are debt and are issued for a period of more than one year. The U.S. government, local governments, water districts, companies and many other types of institutions sell bonds. When someone talks of the bond market, they mean the market for all bonds: federal, municipal, and corporate. Treasury Bonds and Treasury Bills are often quoted separately.
AMEX - American Stock Exchange. AMEX merged with the NASDAQ in November of 1998. "AMEX" is also the corporate trademark of "American Express," and occasionally the company's name comes up in financial news.
NYSE - New York Stock Exchange. This is the "big board" stock exchange on Wall Street. Occasionally, stories also mention trading volume as well as trading level for this exchange.
Dow Jones - Of the Dow Jones indices, the Dow Jones Industrial Average is the one commonly called the "Dow Jones." It is a statistical fiction: an average of top stocks in each sector. The stocks that compose the Dow Jones Industrial Average are known as "blue chip" stocks.
NASDAQ - National Association of Securities Dealers Automated Quotations. Not all securities are marketed on the NYSE or AMEX. The NASDAQ serves to market another entire tier of stocks. Usually, our news sources quote the "Dow Jones" and the NASDAQ. The NASDAQ is quoted as a composite index.
Pink sheets - Stocks which are not listed on an exchange or the NASDAQ may be traded by the clearinghouses. For historical reasons, this is called trading on the 'pink sheets.' Mention of this is relatively rare, and usually refers to how a stock was traded before listing on the NASDAQ.
ADR - American Depositary Receipt. These are receipts issued by American banks for foreign shares held by the bank or its branch in the country of issue. Until recently, this is how most foreign firms traded their stock in the U.S. markets. Many still do.
FOREX - Foreign Exchange Markets (General term). Often, sources will quote the foreign exchange rates for British pounds, Japanese yen, German Deutschmarks, and presumably soon the Euro. While stories which quote multiple rates should be segmented as one story, there may be extended explanation of a surprising rise/fall of a particular currency. That can be segmented separately if it seems prudent.
Gold & Silver - Gold and silver, and occasionally platinum, are quoted either immediately before or immediately after foreign exchange reports. These are collectively called 'precious metals.'
Commodities - These include things like lumber, pork bellies, barrels of oil, etc. Commodities markets consist of the buying and selling of futures at 'spot' (current market) prices.
Until recently, stock (and commodity) and bond markets were counter-cyclical. There may occasionally be mention of whether these markets are moving in the same direction.
S&P's Index - Standard & Poor's indices. Most of these are stock indices, though S&P has many others. The S&P 500 is an index of 500 large U.S. companies.
Mutual funds - These are funds where investors buy shares of an entire portfolio. The fund managers select the actual investments. They have widely different rates of return.
News of relevance to financial markets:
After discussing the performance of national and global markets, most news reports (particularly MSNBC) also discuss noteworthy events, market trends, and announcements from Washington which may influence the market as a whole.
Employment statistics, housing sales, sales of durable goods, inflation, business failure rates, loan default rates, mortgage lending rates, etc. are all announced according to a monthly calendar.
Usually, the news sources comment only on unexpected announcements, or when expected ones don't jibe with expectations. These divergences tend to have noticeable effects on the market. These statistics are what the Federal Reserve Board uses to set monetary policy, including interest rates. All changes in interest rates, unless anticipated, will have some noteworthy effect on the markets.
In addition to a calendar of governmental statistics, all firms announce their earnings at predetermined intervals. Pre-releases of estimated earnings mean that the market usually knows what to expect. When actual earnings are different than projected earnings, the effect on share price is usually dramatic. Thus, the news sources may mention such information.
Mergers & Acquisitions are always of interest to investors, and often account for dramatic changes in share price of the stocks of the companies involved and their principal competitors. These stories generally stand on their own, even if they are linked thematically with stories preceding or following them. Stock splits are also of significant interest to investors, and occur when a firm exchanges stock at a certain rate (2-1, 3-2, etc) in order to bring down share price so that 100 or 1000 lot shares are more accessible to individual investors.
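To make the split ratios mentioned above concrete, here is a small worked sketch of the arithmetic. The function name `apply_split` and the figures (100 shares at $120) are illustrative assumptions, not drawn from any broadcast. The sketch reads a "2-1" split as two new shares for each old share.

```python
from fractions import Fraction


def apply_split(price, shares, new, old):
    """Return (price, shares) after a new-for-old split, e.g. new=2, old=1 for '2-1'."""
    ratio = Fraction(new, old)
    # The share count grows by the ratio and the price shrinks by it,
    # so the total value of the holding is unchanged.
    return Fraction(price) / ratio, shares * ratio


# Illustrative figures only: 100 shares at $120 each.
price, shares = apply_split(120, 100, 2, 1)
print(price, shares)  # 60 200

price, shares = apply_split(120, 100, 3, 2)
print(price, shares)  # 80 150
```

In both cases the holding is still worth $12,000. The split lowers the price per share, which is the point made above about making round lots more accessible to individual investors.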
Lawsuits and settlements thereof also affect the share price of various stocks. Again, even when thematically related, these are generally best segmented as independent stories. FDA approvals and bans are of a similar nature.
Reports of greater length often develop around trends or surprising developments vis-a-vis one of the above.
ABC - Their financial news section is called "On the Money." It usually has a brief mention of the Dow Jones Industrial Average at the close of trading, as well as the change (up/down) for the NASDAQ composite index. These should be segmented together. ABC often follows this with news about the markets themselves, then with business news (mergers, etc.) that influences the prices of various stocks. Each news item should be segmented separately.
CNN - Mentions markets in passing, except when the broadcast coincides with the market close or open. Usually, their coverage (in a program called "Dollars & Sense") concentrates instead on corporate news, including acquisitions, profits, layoffs, expansions, etc., and their effects on the markets.
VOA - There are usually two to three instances in each broadcast. The first is a strict news report of the financial news. While not all broadcasts are the same, VOA does touch on many of the areas described above. Stories in these sections should be segmented separately by theme.
The second instance is normally a brief mention, summarizing market trends (about halfway through the broadcast), which the commentators often use as a point of departure for discussing economic news in general. The broadcasters prefer to speak in terms of 'share prices' instead of 'stock prices.' This segment discusses international markets as well as domestic ones, but the recap of market finishes should be segmented as one story. The trickiest to segment are those where the commentators offer multiple explanations for changes in stock market trends. Most examples should be segmented as one story.
In other instances, foreign exchange reports are used as a segue to news of trends in global markets. If this is brief, it should be included as part of the story, especially if it is introducing a recurring feature like the NASDAQ sponsored market report.
Often the broadcaster goes on to speak about key financial events, with more emphasis on governmental announcements than on those of individual companies.
The final instance is normally at the end of the broadcast. This time, they usually make an effort to give all the information pertaining to the closing of each market, and briefly summarize both economic and corporate stories which influenced the day's trading. Usually, this summary should be segmented together.
PRI - Halfway through the broadcast there is usually a brief mention of the Dow Jones and the NASDAQ. This should be segmented as one story. Often this is followed by stories explaining major shifts in these markets. Each of these stories should be segmented separately; they are ordinarily fairly lengthy. Since PRI is a combination of NPR and BBC, occasionally emphasis is on the London markets instead.
MSNBC - This news source concentrates on financial news more heavily than the others. In fact, its primary audience is individual investors and others who work in the financial industry. It does cover major news stories of consequence globally, with special attention on Capitol Hill, but the target viewers are consuming the financial news. The format of the program, however, is looser. Some days it has two or three major feature stories, with a number of interviews and even debate among various experts. In these broadcasts, which often run over, financial news is often truncated and re-visited the next hour. Otherwise, it is presented in enough detail to see clear demarcations of stories around core events.
The guidelines above should give some sense of how this information can be best grouped; however, because MSNBC does on-site reporting, it may often be hard to distinguish a teaser from a story immediately following it. In general, marking it as one story may be prudent.
NBC - Much like ABC, except that it doesn't have a dedicated financial news section because of the MSNBC broadcast. Usually mentions national news, and then notes effect on the markets or otherwise introduces a brief summary of market activity.