Simple Named Entity Guidelines - Revised
Developed by Stephanie Strassel - Linguistic Data Consortium
For the TIDES Surprise Language Exercise - June 2003
(Based largely on the MUC-7 NE Guidelines)
An entity is some object in the world -- for instance, a place or a person. A named entity is a phrase that uniquely refers to that object by its proper name, acronym, nickname or abbreviation. Some examples of named entities follow:
Coca-Cola Bottling Co.
Bob Austin
the Eiffel Tower
IBM
the Yankees
Uganda
Bowdon, Georgia
Mt. Fuji
the Kremlin
the Kennedys
The annotation tool displays news stories, one document at a time. Read through the news story. When you encounter a named entity, highlight the text and assign it an entity type. Only material between <TEXTSTREAM> and </TEXTSTREAM> tags should be annotated.
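The restriction to material inside the TEXTSTREAM tags can be checked mechanically. The following is a minimal sketch, assuming documents are available as plain strings and that the tags appear exactly as written above; the function names are hypothetical and are not part of the annotation tool.

```python
import re
from typing import List, Tuple

# Assumption: the annotatable body is wrapped in one or more
# <TEXTSTREAM>...</TEXTSTREAM> pairs, spelled exactly like this.
TEXTSTREAM_RE = re.compile(r"<TEXTSTREAM>(.*?)</TEXTSTREAM>", re.DOTALL)


def annotatable_regions(document: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of the text inside each tag pair."""
    return [match.span(1) for match in TEXTSTREAM_RE.finditer(document)]


def is_annotatable(document: str, start: int, end: int) -> bool:
    """True if the span [start, end) lies entirely inside one TEXTSTREAM region."""
    return any(s <= start and end <= e for s, e in annotatable_regions(document))
```

A check like this could be run over finished annotations to flag any tagged span that falls in a headline or other material outside the TEXTSTREAM body.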
We will identify three types of named entities:
PERSON: Person entities are limited to humans identified by name, nickname or alias.
ORGANIZATION: Organization entities are limited to corporations, institutions, government agencies and other groups of people defined by an established organizational structure.
LOCATION: Location entities include names of politically or geographically defined places (cities, provinces, countries, international regions, bodies of water, mountains, etc.). Locations also include man-made structures like airports, highways, streets, factories and monuments.
Within this document, named entities are indicated by underlining. Red text refers to Persons; green text refers to Organizations; and blue text refers to Locations.
Other types of entities like animals, inanimate objects, monetary units, times and dates (including holiday names) will not be annotated.
People may be specified by name, nickname or alias. Family names should also be tagged as PERSON. Names of deceased people, as well as fictional human characters appearing in movies, television, books and so on, should be tagged as PERSON entities.
Titles such as "Mr." and role names such as "President" are not considered part of a person name and are not marked. For instance, in
Vice President Cheney visited the site.
only the name Cheney is marked. There is no markup for the title Vice President.
If a title contains within it the name of a taggable entity, tag that entity. For instance,
Microsoft Chairman Bill Gates stated that...
Some more examples:
GlobalCorp Vice President John Smith
Treasury Secretary Jackson
the U.S. Vice President
UN Secretary General Kofi Annan
Justice Minister Giovanni Maria Flick
Portuguese Navy Commdr. Miguel Oliveira
U.S. Ambassador Nicholas Burns
Mission Control Chief Vladimir Solovyov
Russian President Boris Yeltsin
Independent Counsel Kenneth Starr
You may occasionally encounter a name suffix such as "Jr.", "Sr." or "III". These are considered part of a person name and should be marked as part of the name, for instance:
Mr. Albert Franklin, Jr. was part of the research team.
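The title and suffix rules above can be illustrated with a small sketch. It assumes the candidate person mention has already been isolated as a string, and it uses a hypothetical, deliberately incomplete title list; neither the list nor the function name comes from these guidelines.

```python
# Hypothetical, incomplete list of titles and role words (lowercased).
# A real project would maintain its own list.
TITLES = {
    "mr.", "mrs.", "ms.", "dr.", "vice", "president",
    "chairman", "secretary", "minister", "ambassador",
}


def person_extent(mention: str) -> str:
    """Strip leading titles from a person mention.

    Only tokens at the front are removed, so suffixes such as
    "Jr.", "Sr." and "III" stay inside the tagged extent.
    """
    tokens = mention.split()
    i = 0
    # Never strip the final token: a mention must keep at least one word.
    while i < len(tokens) - 1 and tokens[i].lower() in TITLES:
        i += 1
    return " ".join(tokens[i:])
```

Under these assumptions, person_extent("Vice President Cheney") gives "Cheney", and person_extent("Mr. Albert Franklin, Jr.") gives "Albert Franklin, Jr.". The sketch does not handle entity names embedded in a title (such as Microsoft in "Microsoft Chairman Bill Gates"); those are tagged separately.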
Tag all proper name mentions of groups with a defined organizational structure. These include
Businesses: Bridgestone Sports Co. profits
Stock exchanges: NASDAQ shares
Multinational organizations: European Union representatives
Political parties: GOP hopeful
Unions: Machinists union
Non-generic government entities: the State Department issued a warning
Sports teams: the Phillies
Military groups: the Tamil Tigers
Proper names that refer to facilities/buildings that are primarily defined by their established organizational structure, and can do things like issue statements, make decisions, hire people, raise money and so on, should be classified as an ORGANIZATION. These include:
Churches: Trinity Lutheran Church
Hospitals: Finger Lakes Area Hospital Corp.
Hotels: Four Seasons Hotel Group
Museums: the Guggenheim Museum
Universities: the University of Chicago
Government offices: the White House
General entity mentions such as "the police" and "the government" should not be tagged, since these are not unique proper name references to specific entities.
When a place name (e.g., a city, state, country) is used to refer to an organization like a sports team or a branch of the government, the name should still be tagged as LOCATION. See Section 3.4.1 for a more complete discussion.
Also see Section 3.4.2 for discussion of how to handle facilities that have both organization and location aspects.
Examples of place-related strings that are tagged as LOCATION include named heavenly bodies, continents, countries, provinces, counties, cities, regions, districts, towns, villages, neighborhoods, airports, highways, street names, factories, manufacturing plants, street addresses, oceans, seas, straits, bays, channels, sounds, rivers, islands, lakes, national parks, mountains, fictional or mythical locations, and monumental structures, such as the Eiffel Tower and Washington Monument, that were built primarily as monuments. For instance:
the collapse of Idaho's newly-constructed Teton Dam
the dispute over votes in Dade County
The Walt Whitman Bridge remained closed
repairs began on a 10-mile stretch of the Alaskan Pipeline
The Garden State is known for its tomatoes.
There are several issues surrounding the expression of location names and which parts of a string to tag.
Compound expressions in which place names are separated by a comma in English should be tagged as separate instances of LOCATION.
Kaohsiung in Taiwan
Philadelphia in Pennsylvania
When a "designator" is customarily used as a regular part of a place name, that word should also be included in the extent of the LOCATION entity. For example, include in the tagged string the word "River" in the name of a river, "Mountain" in the name of a mountain, "City" in the name of a city, etc., if such words are contained in the string.
Mississippi River
Place names are often modified by words like "Southern", "lower", "West", "the former" and so on.
When these modifiers are part of a location's official name, they should be tagged as part of the LOCATION name. For instance:
Upper Volta
North Dakota
Even if the place name does not have "official" status but has an agreed-upon definition and is in very frequent use, the string should be tagged as a LOCATION, as in:
the Middle East
the West Bank
Eastern Europe
When these modifiers are not the official name of a place, or when the definition of the place might vary from person to person, do not tag the modifier as part of the LOCATION entity name.
Mississippi River west bank
former Soviet Union
Gaul (present-day France)
lower Manhattan
Northern California
These place names can sometimes be tricky. If you are not sure whether a modifier is part of an official name, you should include the modifier as part of the place name.
There are some situations where deciding what entity type to assign can be somewhat tricky.
Very often in the news, city, country and other place names are used to refer to organizations rather than the geographical places themselves. For instance:
Washington is rainy in the spring.
vs. Washington announced a new tax policy today.
Germany has some beautiful mountains.
vs. Germany invaded Poland in 1939.
Baltimore has a great aquarium.
vs. Baltimore defeated the Yankees by a score of 4 to 3.
The White House was built in 1792.
vs. The White House released the President's latest proposal.
When the name of a place is used to refer to a government, the name should still be labeled as a LOCATION, as in:
Washington came out with a new tax policy today.
Similarly, the name of a unique structure or building (which we label as a location) can often be used to refer to the government or other organization housed in that facility. In such cases, the name should still be labeled as a LOCATION.
The Pentagon issued a statement about the incident.
The same rule applies to place names referring to sports teams:
Philadelphia lost to Los Angeles last night.
If a facility is primarily defined by its established organizational structure, and can do things like issue statements, make decisions, hire people, raise money and so on, then the name should be classified as an ORGANIZATION. Churches, hospitals, hotels, museums, universities, restaurants and government offices fall into this category. (See Section 3.2 for examples.)
If a facility is primarily defined by its physical structures rather than the people who run it, it should be classified as a LOCATION. This includes transportation infrastructure and large storage structures like airports, streets, highways, ports, train stations, bridges, tunnels, parking garages, airplane hangars, factories and manufacturing plants. Monumental structures, such as the Eiffel Tower and Washington Monument, that were built primarily as monuments, should also be tagged as LOCATION. (See Section 3.3 for examples.)
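The two facility rules above amount to a lookup by facility kind. The sketch below encodes only the kinds listed in the two preceding paragraphs; the set names and the function are hypothetical, and any kind not listed is left for the annotator to decide.

```python
from typing import Optional

# Facility kinds the guidelines describe as defined by organizational structure.
ORGANIZATION_FACILITIES = {
    "church", "hospital", "hotel", "museum", "university",
    "restaurant", "government office",
}

# Facility kinds the guidelines describe as defined by physical structure.
LOCATION_FACILITIES = {
    "airport", "street", "highway", "port", "train station", "bridge",
    "tunnel", "parking garage", "airplane hangar", "factory",
    "manufacturing plant", "monument",
}


def facility_tag(kind: str) -> Optional[str]:
    """Return the entity type for a facility kind, or None if it is not listed."""
    kind = kind.lower()
    if kind in ORGANIZATION_FACILITIES:
        return "ORGANIZATION"
    if kind in LOCATION_FACILITIES:
        return "LOCATION"
    return None  # Not covered by the lists above; needs annotator judgment.
```

A table like this is only a starting point. Individual names can still call for judgment, as the examples in this section show.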
When a phrase refers to multiple named entities, mark each entity separately.
For instance,
China and South Korea signed the agreement.
contains two entities:
China
South Korea
Similarly,
Jimmy and Rosalyn Carter
North and South America
But be careful not to split apart proper names that contain a conjunction. For instance,
the Fish and Wildlife Service
is the name of one organization and should be tagged as a single named entity (it's not the Fish Service and the Wildlife Service as separate names).
No nested expressions will be marked. When the name of one entity contains within it another entity name, do not pull out the name of the other entity and mark it separately. Only tag the larger entity. For instance, the phrase:
the U.S. Customs Service
contains a single entity. The name "U.S." should not be pulled out of the larger phrase and marked as a separate entity. Similarly,
Arthur Anderson Consulting [no markup for Arthur Anderson alone]
Boston Chicken Corp. [no markup for Boston alone]
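The no-nesting rule can be expressed as a filter over candidate spans. This is a minimal sketch, assuming each candidate is a (start, end, type) tuple of character offsets with an exclusive end; the representation and the function name are assumptions, not part of the annotation tool.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset (exclusive), entity type)


def drop_nested(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not contained in a strictly longer span."""
    kept = []
    for start, end, label in spans:
        contained = any(
            o_start <= start and end <= o_end
            and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not contained:
            kept.append((start, end, label))
    return kept
```

For "the U.S. Customs Service", a LOCATION candidate covering "U.S." and an ORGANIZATION candidate covering "U.S. Customs Service" would reduce to the ORGANIZATION span alone. Spans that do not overlap, such as China and South Korea in a coordinated phrase, pass through unchanged.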
If an entity name modifies another word (even if that word is not a taggable entity type), you should still tag the entity name.
Bridgestone profits
the Clinton government
Treasury bonds and securities
U.S. exporters
Macintosh computers
Texas intermediate crude oil
China film festival
However, if the entity name occurs in the form of an adjective, you should not tag it because it doesn't constitute an entity name.
the American companies [no markup for American]
Cuban citizens [no markup for Cuban]
Chinese food [no markup for Chinese]
When you encounter a possessive construction, tag the two parts individually as two separate names. For instance:
Temple University's Graduate School of Business
Canada's Parliament
Aliases, acronyms, nicknames and abbreviations for proper names should be tagged as a name:
IBM [acronym for International Business Machines]
Big Blue [alias for International Business Machines]
Big Board [alias for New York Stock Exchange]
Mr. Fix-It [nickname for candidate for head of the CIA]
the Big Apple [nickname for New York City]
Red Sox [alias for the Boston Red Sox]
Sears [alias for Sears Roebuck and Co.]
Events
Do not tag event names, even if they refer to events that occur on a regular basis and are associated with institutional structures. However, the institutional structures themselves - steering committees, etc. - should be tagged.
the Pan-American Games [no markup]
vs. the US Olympic Committee [Organization]
Artifacts and products
Miscellaneous types of proper names that are not to be tagged as named entities include artifacts, other products, and plural names that do not identify a single, unique entity. For instance,
the Taurus is the latest car model [no markup]
Generics
Also, generic names that do not refer to a specific entity should not be tagged:
the Campbell Soups of the world [no markup]