Simple Named Entity Guidelines - Revised

Developed by Stephanie Strassel - Linguistic Data Consortium

For the TIDES Surprise Language Exercise - June 2003

(Based largely on the MUC-7 NE Guidelines)

 

1           Introduction

An entity is some object in the world -- for instance, a place or a person.  A named entity is a phrase that uniquely refers to that object by its proper name, acronym, nickname or abbreviation.  Some examples of named entities follow:

 

Coca-Cola Bottling Co.

Bob Austin

the Eiffel Tower

IBM

the Yankees

Uganda

Bowdon, Georgia

Mt. Fuji

the Kremlin

the Kennedys

2           How to annotate

The annotation tool displays news stories, one document at a time.  Read through the news story.  When you encounter a named entity, highlight the text and assign it an entity type.  Only material between <TEXTSTREAM> and </TEXTSTREAM> tags should be annotated.
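
As an illustration of the last rule, the following is a minimal Python sketch that finds the annotatable regions of a document and checks whether a proposed span falls inside one. It assumes the tags appear literally as <TEXTSTREAM> and </TEXTSTREAM>; the function names are hypothetical and are not part of the annotation tool.

```python
import re

# Assumption: the tags appear literally as <TEXTSTREAM> ... </TEXTSTREAM>.
TEXTSTREAM_RE = re.compile(r"<TEXTSTREAM>(.*?)</TEXTSTREAM>", re.DOTALL)


def annotatable_regions(document: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of the text inside TEXTSTREAM tags."""
    return [match.span(1) for match in TEXTSTREAM_RE.finditer(document)]


def is_annotatable(document: str, start: int, end: int) -> bool:
    """True if the span [start, end) lies entirely inside one TEXTSTREAM region."""
    return any(s <= start and end <= e for s, e in annotatable_regions(document))
```

A span that starts or ends outside every region, such as a name appearing in document header material, would be rejected by is_annotatable.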

3           Entity Types

We will identify three types of named entities:

 

PERSON: Person entities are limited to humans identified by name, nickname or alias.

 

ORGANIZATION: Organization entities are limited to corporations, institutions, government agencies and other groups of people defined by an established organizational structure.

 

LOCATION: Location entities include names of politically or geographically defined places (cities, provinces, countries, international regions, bodies of water, mountains, etc.).  Locations also include man-made structures like airports, highways, streets, factories and monuments.

 

Within this document, named entities are indicated by underlining.  Red text refers to Persons; green text refers to Organizations; and blue text refers to Locations.

 

Other types of entities like animals, inanimate objects, monetary units, times and dates (including holiday names) will not be annotated.
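
For readers who process the resulting annotations, one possible way to represent the three entity types and a single annotation is sketched below in Python. The class and field names are assumptions made for illustration; the guidelines do not prescribe a storage format.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    """The three entity types defined in these guidelines."""
    PERSON = "PERSON"
    ORGANIZATION = "ORGANIZATION"
    LOCATION = "LOCATION"


@dataclass(frozen=True)
class Annotation:
    start: int                # offset of the first character of the name
    end: int                  # offset one past the last character
    entity_type: EntityType


def annotate(text: str, phrase: str, entity_type: EntityType) -> Annotation:
    """Build an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Annotation(start, start + len(phrase), entity_type)
```

Because only these three types exist, anything else (dates, monetary units, animals and so on) simply has no annotation.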

3.1        Person Names

People may be specified by name, nickname or alias.  Family names should also be tagged as PERSON.  Names of deceased people, as well as fictional human characters appearing in movies, television, books and so on, should be tagged as PERSON entities.

3.1.1      Titles and appositives

Titles such as "Mr." and role names such as "President" are not considered part of a person name and are not marked. For instance, in

 

Vice President Cheney visited the site.

 

only the name Cheney is marked.  There is no markup for the title Vice President.

 

If a title contains within it the name of a taggable entity, tag that entity.  For instance,

 

Microsoft Chairman Bill Gates stated that...     [Microsoft is an ORGANIZATION; Bill Gates is a PERSON; no markup for Chairman]

 

Some more examples:

GlobalCorp Vice President John Smith

Treasury Secretary Jackson

the U.S. Vice President

UN Secretary General Kofi Annan

Justice Minister Giovanni Maria Flick

Portuguese Navy Commdr. Miguel Oliveira

U.S. Ambassador Nicholas Burns

Mission Control Chief Vladimir Solovyov

Russian President Boris Yeltsin

Independent Counsel Kenneth Starr

 

You may occasionally encounter an appositive like "Jr.", "Sr." or "III".  These are considered part of a person name and should be marked as part of the name, for instance:

 

Mr. Albert Franklin, Jr. was part of the research team.
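
The extent rule for person names can be sketched in Python as follows. This is only an illustration: the TITLE_WORDS list is a small hypothetical sample drawn from the examples above, and a real title list would be much longer.

```python
# Hypothetical sample of title and role words; not an exhaustive list.
TITLE_WORDS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Vice", "President", "Chairman"}


def person_name_extent(mention: str) -> str:
    """Drop leading title words; keep everything else, including Jr./Sr./III."""
    tokens = mention.split()
    while tokens and tokens[0] in TITLE_WORDS:
        tokens.pop(0)
    return " ".join(tokens)
```

Under these assumptions, "Vice President Cheney" is reduced to "Cheney", while "Mr. Albert Franklin, Jr." keeps the appositive and becomes "Albert Franklin, Jr.". The sketch does not handle titles that contain another taggable entity, such as an organization name before the title; those are separate annotations.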

3.2        Organization Names

Tag all proper name mentions of groups with a defined organizational structure.  These include:


Businesses                          Bridgestone Sports Co. profits

Stock exchanges                     NASDAQ shares

Multinational organizations         European Union representatives

Political parties                   GOP hopeful

Unions                              Machinists union

Non-generic government entities     the State Department issued a warning

Sports teams                        the Phillies

Military groups                     the Tamil Tigers

 

Proper names that refer to facilities/buildings that are primarily defined by their established organizational structure, and can do things like issue statements, make decisions, hire people, raise money and so on, should be classified as an ORGANIZATION.  These include:

 

Churches                            Trinity Lutheran Church

Hospitals                           Finger Lakes Area Hospital Corp.

Hotels                              Four Seasons Hotel Group

Museums                             the Guggenheim Museum

Universities                        the University of Chicago

Government offices                  the White House

3.2.1      General ORGANIZATION-like non-entities

General entity mentions such as "the police" and "the government" should not be tagged, since these are not unique proper name references to specific entities.

3.2.2      Organization vs. Location

When a place name (e.g., a city, state, country) is used to refer to an organization like a sports team or a branch of the government, the name should still be tagged as LOCATION.   See Section 3.4.1 for a more complete discussion.

 

Also see Section 3.4.2 for discussion of how to handle facilities that have both organization and location aspects.

3.3        Location Names

Examples of place-related strings that are tagged as LOCATION include named heavenly bodies, continents, countries, provinces, counties, cities, regions, districts, towns, villages, neighborhoods, airports, highways, street names, factories, manufacturing plants, street addresses, oceans, seas, straits, bays, channels, sounds, rivers, islands, lakes, national parks, mountains, fictional or mythical locations, and monumental structures, such as the Eiffel Tower and Washington Monument, that were built primarily as monuments.  For instance:

 

the collapse of Idaho's newly-constructed Teton Dam

the dispute over votes in Dade County

The Walt Whitman Bridge remained closed

repairs began on a 10-mile stretch of the Alaskan Pipeline

The Garden State is known for its tomatoes.

3.3.1      Extent of Location Names

There are several issues surrounding the expression of location names and which parts of a string to tag.

3.3.1.1     Compound expressions

Compound expressions in which place names are separated by a comma in English should be tagged as separate instances of LOCATION.

 

Kaohsiung in Taiwan

Philadelphia in Pennsylvania

3.3.1.2      Designators

When a "designator" is customarily used as a regular part of a place name, that word should also be included in the extent of the LOCATION entity. For example, include in the tagged string the word "River" in the name of a river, "Mountain" in the name of a mountain, "City" in the name of a city, etc., if such words are contained in the string.

 

Mississippi River

3.3.1.3     Location modifiers and "semi-official" place names

Place names are often modified by words like "Southern", "lower", "West", "the former" and so on.

 

When these modifiers are part of a location's official name they should be tagged as part of the LOCATION name.  For instance:

Upper Volta

North Dakota

 

Even if the place name does not have "official" status but has an agreed-upon definition and is in very frequent use, the string should be tagged as a LOCATION, as in:

the Middle East

the West Bank

Eastern Europe

 

When these modifiers are not part of the official name of a place, or when the definition of the place might vary from person to person, do not tag the modifier as part of the LOCATION entity name.

 

Mississippi River west bank      [no markup for west bank]

former Soviet Union              [no markup for former]

Gaul (present-day France)        [no markup for present-day]

lower Manhattan                  [no markup for lower]

Northern California              [no markup for Northern]

 

These place names can sometimes be tricky.  If you are not sure whether a modifier is part of an official name, you should include the modifier as part of the place name.

3.4        Deciding among entity types

There are some situations where deciding what entity type to assign can be somewhat tricky.

3.4.1      Place names that refer to organizations

Very often in the news, city, country and other place names are used to refer to organizations rather than the geographical places themselves.  For instance:

 

Washington is rainy in the spring.                 

vs.          Washington announced a new tax policy today.         

 

Germany has some beautiful mountains.             

vs.          Germany invaded Poland in 1939.                                          

 

Baltimore has a great aquarium.

vs.          Baltimore defeated the Yankees by a score of 4 to 3.

 

The White House was built in 1792.

vs.          The White House released the President's latest proposal.

 

When the name of a place is used to refer to a government, the name should still be labeled as a LOCATION, as in:

 

Washington came out with a new tax policy today.

              

Similarly, the name of a unique structure or building (which we label as a location) can often be used to refer to the government or other organization housed in that facility.  In such cases, the name should still be labeled as a LOCATION.

              

The Pentagon issued a statement about the incident.

 

The same rule applies to place names referring to sports teams:

 

Philadelphia lost to Los Angeles last night.

3.4.2      Organization vs. Location Facilities

If a facility is primarily defined by its established organizational structure, and can do things like issue statements, make decisions, hire people, raise money and so on, then the name should be classified as an ORGANIZATION.  Churches, hospitals, hotels, museums, universities, restaurants and government offices fall into this category.  (See Section 3.2 for examples.)

 

If a facility is primarily defined by its physical structures rather than the people that run it, it should be classified as a LOCATION.  This includes transportation infrastructure and large storage structures like airports, streets, highways, ports, train stations, bridges, tunnels, parking garages, airplane hangars, factories and manufacturing plants.  Monumental structures, such as the Eiffel Tower and Washington Monument, that were built primarily as monuments, should also be tagged as LOCATION.  (See Section 3.3 for examples.)
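
The two facility lists above amount to a lookup from facility category to entity type. A minimal Python sketch of that lookup follows; the category strings and the function name are assumptions chosen for illustration, and categories not listed in this section are deliberately left to annotator judgment.

```python
from typing import Optional

# Categories taken from the two paragraphs above; the spellings are illustrative.
ORGANIZATION_FACILITIES = {
    "church", "hospital", "hotel", "museum", "university",
    "restaurant", "government office",
}
LOCATION_FACILITIES = {
    "airport", "street", "highway", "port", "train station", "bridge",
    "tunnel", "parking garage", "airplane hangar", "factory",
    "manufacturing plant", "monument",
}


def facility_entity_type(category: str) -> Optional[str]:
    """Return the entity type for a facility category, or None if unlisted."""
    if category in ORGANIZATION_FACILITIES:
        return "ORGANIZATION"
    if category in LOCATION_FACILITIES:
        return "LOCATION"
    return None  # not covered by the guidelines; decide case by case
```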

4           Difficult Cases

4.1        Expressions that refer to multiple entities

 

When a phrase refers to multiple named entities, mark each entity separately.

 

For instance,

 

China and South Korea signed the agreement.

 

contains two entities:

 

China

South Korea

 

Similarly,

 

Jimmy and Rosalynn Carter

North and South America

 

But be careful not to split apart proper names that contain a conjunction.  For instance,

 

the Fish and Wildlife Service

 

is the name of one organization and should be tagged as a single named entity (it's not the Fish Service and the Wildlife Service as separate names).
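
The conjunction rule can be sketched in Python as below. The set KNOWN_CONJOINED_NAMES is a hypothetical stand-in for the annotator's knowledge of which proper names contain "and"; the sketch covers only simple coordination and makes no attempt at elliptical cases such as a shared surname.

```python
# Hypothetical list of proper names that themselves contain a conjunction.
KNOWN_CONJOINED_NAMES = {"Fish and Wildlife Service"}


def split_conjoined(phrase: str) -> list[str]:
    """Split a coordinated phrase into separate names unless it is one known name."""
    if phrase in KNOWN_CONJOINED_NAMES:
        return [phrase]  # a single proper name; do not split
    return [part.strip() for part in phrase.split(" and ")]
```

With these assumptions, "China and South Korea" yields two names, while "Fish and Wildlife Service" is returned whole.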

4.2        Nested Expressions

No nested expressions will be marked.   When the name of one entity contains within it another entity name, do not pull out the name of the other entity and mark it separately.  Only tag the larger entity.  For instance, the phrase: 

 

the U.S. Customs Service       

 

contains a single entity.  The name "U.S." should not be pulled out of the larger phrase and marked as a separate entity.  Similarly,

 

Arthur Andersen Consulting     no markup for Arthur Andersen alone

Boston Chicken Corp.             no markup for Boston alone
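
A simple consistency check for this rule is sketched below in Python. It assumes annotations are stored as (start, end, type) tuples of character offsets, which is an illustrative choice rather than the format used by the annotation tool, and it removes any span that sits inside a longer one.

```python
Span = tuple[int, int, str]  # (start offset, end offset, entity type)


def drop_nested(spans: list[Span]) -> list[Span]:
    """Keep only spans that are not contained within a longer span."""
    kept = []
    for span in spans:
        start, end, _ = span
        nested = any(
            other[0] <= start and end <= other[1]
            and (other[1] - other[0]) > (end - start)
            for other in spans
        )
        if not nested:
            kept.append(span)
    return kept
```

Given one span for "U.S." and one for "U.S. Customs Service", only the longer ORGANIZATION span survives.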

4.3        Entities as modifiers

If an entity name modifies another word (even if that word is not a taggable entity type), you should still tag the entity name.

 

Bridgestone profits

the Clinton government

Treasury bonds and securities

U.S. exporters

Macintosh computers

Texas intermediate crude oil

China film festival

 

However, if the entity name occurs in the form of an adjective, you should not tag it because it doesn't constitute an entity name.

 

the American companies          [no markup for American]

Cuban citizens                    [no markup for Cuban]

Chinese food                       [no markup for Chinese]

4.4        Possessives

When you encounter a possessive construction, tag the two parts individually as two separate names.  For instance:

Temple University's Graduate School of Business

Canada's Parliament
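
The possessive rule can be illustrated with a short Python sketch. It assumes the possessive is written with a straight apostrophe followed by "s", and it assumes the possessive marker itself belongs to neither name; the guidelines do not state where the marker goes, so treat that as an assumption of this example.

```python
def split_possessive(phrase: str) -> list[str]:
    """Split "X's Y" into the two names X and Y; return the phrase whole otherwise."""
    owner, marker, possessed = phrase.partition("'s ")
    if not marker:
        return [phrase]  # no possessive construction found
    # Assumption: the "'s" marker is excluded from both names.
    return [owner, possessed]
```

Each returned name is then assigned its own entity type, for example LOCATION for "Canada" and ORGANIZATION for "Parliament".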

4.5        Other types of names

Aliases, acronyms, nicknames and abbreviations for proper names should be tagged as a name:

 

IBM                   [acronym for International Business Machines]

Big Blue             [alias for International Business Machines]

Big Board           [alias for New York Stock Exchange]

Mr. Fix-It          [nickname for candidate for head of the CIA]

the Big Apple      [nickname for New York City]

Red Sox              [alias for the Boston Red Sox]

Sears                [alias for Sears Roebuck and Co.]

 

5           What NOT to tag

Events

Do not tag event names, even if they refer to events that occur on a regular basis and are associated with institutional structures. However, the institutional structures themselves - steering committees, etc. - should be tagged.

 

the Pan-American Games          [no markup]

vs.          the US Olympic Committee        [Organization]

 

Artifacts and products

Miscellaneous types of proper names that are not to be tagged as named entities include artifacts, other products, and plural names that do not identify a single, unique entity.  For instance,

 

the Taurus is the latest car model   [no markup]

 

Generics

Also, generic names that do not refer to a specific entity should not be tagged.

 

the Campbell Soups of the world      [no markup]