Named Entity Recognition & Applications in NLP

Introduction

An entity is a thing or being with a unique, independent existence. A named entity is an entity referred to by a proper noun that uniquely identifies it. Named Entity Recognition (NER) is the task of identifying named entities in text and classifying them into categories such as person names, organisation names, locations, dates, times and monetary expressions. NER is an integral part of Information Extraction (IE) in NLP.

Approaches

Approaches to NER can be broadly classified into two groups:

Rule-based (Linguistic) approach

This approach uses a set of hand-crafted rules derived from a language’s grammatical and syntactic features.
For example, consider a simple rule that identifies a person, an office and an organization in a text:

[person], [office] of [organization]
Vuk Draskovic, leader of the Serbian Renewal Movement
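
A rule like this can be approximated with a regular expression. The following is only a minimal sketch: the pattern, the NAME and RULE constants and the apply_rule helper are hypothetical, and treating any run of capitalised words as a name is a crude assumption that a real rule set would refine.

```python
import re

# Assumption: a "name" is a run of capitalised words. Real rule-based
# systems use far richer cues (gazetteers, part-of-speech tags, etc.).
NAME = r"[A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*"

# Hypothetical encoding of the rule "[person], [office] of [organization]".
RULE = re.compile(
    rf"(?P<person>{NAME}),\s+(?P<office>[a-z][a-z ]*?)\s+of\s+(?:the\s+)?(?P<organization>{NAME})"
)

def apply_rule(text):
    """Return (person, office, organization) triples matched by the rule."""
    return [
        (m.group("person"), m.group("office"), m.group("organization"))
        for m in RULE.finditer(text)
    ]

print(apply_rule("Vuk Draskovic, leader of the Serbian Renewal Movement"))
# [('Vuk Draskovic', 'leader', 'Serbian Renewal Movement')]
```

Such rules are precise on the phrasing they were written for but miss every variation they do not anticipate, which is one motivation for the statistical approaches below.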

Machine Learning (Statistical) approach

This approach relies on statistical models that predict the entities in a given sentence or text. It can be further divided into three categories:

Supervised learning
Supervised learning is a branch of Machine Learning (ML) that learns a functional relation between input and output from labelled data. The labelled data (commonly referred to as training data) is used to train the model so that it approximates this function.
Major supervised learning techniques such as Decision Trees, Maximum Entropy Models and Support Vector Machines have been applied to NER. The approach is to read a large annotated corpus and formulate disambiguation rules as functions of text features. H. Isozaki’s paper on Efficiently performing NER with SVMs and Decision trees provides useful insight into this idea.
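
To make "functions of text features" concrete, here is a small sketch that turns each token into a set of features and trains a classifier on a toy annotated corpus. It uses a plain perceptron instead of the SVMs, Decision Trees or Maximum Entropy Models named above, purely to stay dependency-free. The feature set, the tag names (PER, LOC, O) and the training sentences are all made up for illustration.

```python
from collections import defaultdict

def token_features(tokens, i):
    """Describe token i by a few simple, hypothetical text features."""
    word = tokens[i]
    return {
        "bias",
        f"lower={word.lower()}",
        f"is_title={word.istitle()}",
        f"is_digit={word.isdigit()}",
        f"suffix3={word[-3:].lower()}",
        f"prev={tokens[i - 1].lower() if i > 0 else '<s>'}",
        f"next={tokens[i + 1].lower() if i + 1 < len(tokens) else '</s>'}",
    }

def predict(weights, labels, feats):
    """Pick the label whose summed feature weights are highest."""
    return max(labels, key=lambda y: sum(weights[(f, y)] for f in feats))

def train_perceptron(sentences, epochs=10):
    """sentences: list of (tokens, tags) pairs from an annotated corpus."""
    labels = sorted({t for _, tags in sentences for t in tags})
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                feats = token_features(tokens, i)
                guess = predict(weights, labels, feats)
                if guess != gold:
                    # Reward the correct label, penalise the wrong guess.
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights, labels

# Toy annotated corpus (made up for this example).
train = [
    ("Vuk Draskovic visited Belgrade".split(), ["PER", "PER", "O", "LOC"]),
    ("Maria Lopez visited Madrid".split(), ["PER", "PER", "O", "LOC"]),
]
weights, labels = train_perceptron(train)
tokens = "Anna Novak visited Paris".split()
print([predict(weights, labels, token_features(tokens, i)) for i in range(len(tokens))])
```

With only two training sentences the predictions are not meaningful. The point is the shape of the pipeline: annotated tokens, then features, then a learned decision function.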
Unsupervised learning
In this branch of ML, the model works on unlabelled data, i.e. no true categorization is available to the model.
Unsupervised learning is typically used to cluster similar entities on the basis of context or lexical patterns. For example, E. Alfonseca and Manandhar proposed clustering to automate NE classification using WordNet. A topic signature is assigned to each entity by listing the words that frequently co-occur with it in a large corpus. Next, for a given input text, the words in a neighbouring window (of some fixed size) around an entity are matched against the topic signatures to classify it.
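
The topic-signature idea can be sketched roughly as follows. This is not the authors' implementation: a handful of hypothetical seed terms stands in for the WordNet concepts, the corpus is a toy one, and the score is a simple overlap count between a context window and each signature.

```python
from collections import Counter

def context_window(tokens, i, size=2):
    """Lower-cased words within `size` positions of token i, excluding token i."""
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return [w.lower() for w in left + right]

def build_signatures(corpus, seeds, size=2):
    """Count the words that co-occur with each class's seed terms."""
    signatures = {label: Counter() for label in seeds}
    for tokens in corpus:
        for i, word in enumerate(tokens):
            for label, terms in seeds.items():
                if word.lower() in terms:
                    signatures[label].update(context_window(tokens, i, size))
    return signatures

def classify(tokens, i, signatures, size=2):
    """Assign token i to the class whose signature best overlaps its context."""
    context = context_window(tokens, i, size)
    scores = {label: sum(sig[w] for w in context) for label, sig in signatures.items()}
    return max(scores, key=scores.get)

# Toy corpus and seed terms (assumptions standing in for a large corpus and WordNet).
corpus = [
    "the river Danube flows through the city".split(),
    "the river Rhine flows north".split(),
    "president Lincoln signed the bill".split(),
    "president Tito signed a treaty".split(),
]
seeds = {"LOC": {"danube", "rhine"}, "PER": {"lincoln", "tito"}}
signatures = build_signatures(corpus, seeds)

print(classify("the river Volga flows south".split(), 2, signatures))  # LOC
print(classify("president Havel signed it".split(), 1, signatures))    # PER
```

The window size plays the role of the fixed-size neighbouring window mentioned above. Larger windows capture more topical context but also more noise.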
Semi-supervised learning
This branch of ML lies between the supervised and unsupervised methods and typically combines a large amount of unlabelled data with a small amount of labelled data.
Bootstrapping, a common technique in semi-supervised learning, uses a model built from the small labelled set to generate tags for the large unlabelled set. For example, J. Knopp, in his paper Extending a multilingual Lexical Resource by bootstrapping Named Entity Classification, proposed the use of bootstrapping to classify named entities. This is a two-step process: identifying the various types of named entities present in Wikipedia, and then using bootstrapping algorithms to classify the named entities in the given text.
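
A generic bootstrapping loop, not the Wikipedia-based method from the paper, might look like the sketch below. It starts from a few labelled seed entities, learns which preceding words signal each class, uses those patterns to label new capitalised words, and repeats. The bootstrap function, the single-word context pattern and the toy sentences are all assumptions made for illustration.

```python
def bootstrap(sentences, seeds, rounds=3):
    """Grow an entity lexicon from a few seeds.

    sentences: list of token lists (the large unlabelled data)
    seeds: dict mapping an entity string to its class (the small labelled data)
    """
    lexicon = dict(seeds)
    for _ in range(rounds):
        # Step 1: learn context patterns (here just the preceding word)
        # from every entity that can already be labelled.
        patterns = {}
        for tokens in sentences:
            for prev, word in zip(tokens, tokens[1:]):
                if word in lexicon:
                    patterns[prev.lower()] = lexicon[word]
        # Step 2: use the patterns to label new capitalised words.
        added = False
        for tokens in sentences:
            for prev, word in zip(tokens, tokens[1:]):
                label = patterns.get(prev.lower())
                if label and word.istitle() and word not in lexicon:
                    lexicon[word] = label
                    added = True
        if not added:
            break  # nothing new was learned, so further rounds are pointless
    return lexicon

sentences = [
    "She flew to Paris".split(),
    "He flew to Berlin".split(),
    "They drove from Berlin".split(),
    "We drove from Madrid".split(),
]
print(bootstrap(sentences, {"Paris": "LOC"}))
# {'Paris': 'LOC', 'Berlin': 'LOC', 'Madrid': 'LOC'}
```

Here "Berlin" is picked up in the first round through the pattern "to", which in turn lets the second round learn the pattern "from" and label "Madrid". Practical systems also score patterns and filter low-confidence labels, since one wrong label can propagate through later rounds.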

Applications

Information Retrieval - NER was developed as a subtask of Information Extraction (IE) and remains an integral part of it. NER helps identify the context and focus of a given document. Hence, applications of IE are highly dependent on NER, for example: extraction of information from large textual data (such as articles or phone logs); language parsing (machine translation and speech recognition); NLP-based search engines and chat bots; and information extraction from emails (text filtering, spam detection, meta-data detection).
Medical Science - NER can also be used to categorise diseases, genes, proteins, organisms etc. and is thus used extensively in the medical industry. In molecular biology, genome and protein naming and identification can be a complex and tedious task; NER makes this identification easier. Moreover, NER models can also be used to formulate structural patterns in chromosome sequencing, and hence ease their understanding and replication.
Question Answering - NER helps detect fact-based answers to a question, simplifying the task of finding such answers. NER also helps identify the main subject and object of the question, thereby defining the search scope, especially in survey questions. In our course project, Auto generating answer Wikis for questions on Quora (by Aashay Mittal, Ojaswi Gupta & Tanya Chowdhury), we plan to use NER to understand the main object of the question.
