
Word Sense Disambiguation


Word Sense Disambiguation (WSD) is the task of identifying the best sense of a word in a particular context when the word has multiple meanings. It can be framed as a classification problem: given a word and its possible meanings (senses), classify each occurrence of the word into one of its sense classes based on evidence from the context and from external knowledge sources.
For example, consider the following two sentences:
a) The workers at the plant were overworked.
b) The gardener was watering the plant.

In the first sentence, the word 'plant' refers to an industrial plant, whereas in the second it refers to a living plant (vegetation).
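To see how many senses even a common word carries, the short sketch below lists the noun senses of 'plant' recorded in WordNet (assuming NLTK and its WordNet data are installed; the exact output depends on the WordNet version):

    from nltk.corpus import wordnet as wn

    # List the noun senses of "plant" recorded in WordNet.
    for synset in wn.synsets("plant", pos=wn.NOUN):
        print(synset.name(), "-", synset.definition())

    # The output includes both an "industrial building" sense and a
    # "living organism" sense, which is exactly the ambiguity a WSD
    # system has to resolve.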




Figure: A Word Sense Disambiguation system.



WSD is an important part of many applications such as Machine Translation, Information Retrieval, Information Extraction, Content Analysis, Word Processing (e.g. spelling correction) and the Semantic Web. It can help improve the relevance of search engines and support tasks such as anaphora resolution, coherence analysis and inference.

There are mainly two types of Word Sense Disambiguation, namely:

  • Lexical Sample (or targeted WSD) - The system is required to disambiguate only a restricted set of target words, usually one per sentence.
  • All-words WSD - The system needs to disambiguate all the words in a text.
A Word Sense Disambiguation system involves four main elements, namely:
  • Selection of Word Senses - The process of identifying the most appropriate sense of a word in a particular context. It is the key problem in WSD.
  • External Knowledge Sources - Knowledge resources are a fundamental part of WSD: they provide the information required to map a word to its appropriate senses. They range from labelled or unlabelled text corpora to machine-readable dictionaries, thesauri and ontologies.
  • Representation of Context - The text is converted into a structured format so that it can be given as input to an automatic method. This requires preprocessing steps such as tokenization, POS tagging, lemmatization, chunking and parsing (a short preprocessing sketch follows this list).
  • Choice of a Classification Method - The final step of WSD. There are many approaches to resolving the ambiguity, which are explained below.
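As a rough illustration of the preprocessing used to represent context, the sketch below tokenizes, POS-tags and lemmatizes a sentence with NLTK (assuming the tokenizer models, tagger and WordNet data have been downloaded):

    import nltk
    from nltk.stem import WordNetLemmatizer

    sentence = "The gardener was watering the plant."

    # Tokenization and POS tagging.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    # Lemmatization (the WordNet lemmatizer expects simplified POS tags).
    lemmatizer = WordNetLemmatizer()
    pos_map = {"J": "a", "N": "n", "V": "v", "R": "r"}
    lemmas = [lemmatizer.lemmatize(word, pos_map.get(tag[0], "n"))
              for word, tag in tagged]

    print(tagged)  # [('The', 'DT'), ('gardener', 'NN'), ('was', 'VBD'), ...]
    print(lemmas)  # ['The', 'gardener', 'be', 'water', 'the', 'plant', '.']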
The main approaches to Word Sense Disambiguation are:
  • Supervised WSD - Machine learning techniques are used to train a classifier that assigns words to their appropriate senses, using sense-labelled training sets.
  • Unsupervised WSD - Unlabelled corpora are used to induce and assign word senses.
These approaches can further be distinguished as knowledge-based and corpus-based: the former makes use of machine-readable dictionaries, ontologies and thesauri, whereas the latter makes use of (labelled or unlabelled) corpora for disambiguation.
Another way to categorize WSD approaches is as token-based or type-based. In a token-based approach, each occurrence of a word is assigned a meaning according to the context in which it appears, whereas a type-based approach assumes that a word keeps the same sense throughout a single text.

List of WSD Algorithms


Supervised Disambiguation Techniques are:
  • Decision Lists - An ordered set of rules for assigning an appropriate sense to a target word. The rules are sorted in decreasing order of score, so a decision list can be viewed as a list of weighted if-then-else rules (see the first sketch after this list).
  • Naive Bayes - A classification technique based on Bayes' theorem. It predicts the sense of a word w from the conditional probability of each sense Si given the features fj observed in the context; the sense with the maximum probability is chosen as the most appropriate one in that context (a second sketch after this list illustrates it).
  • Neural Networks - An interconnected group of artificial neurons that processes data with a computational model. The network is trained on pairs of input features and desired responses, and the weights are progressively adjusted until the output unit corresponding to the desired sense has a higher activation than any other output unit.
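To make the idea of weighted if-then-else rules concrete, here is a toy decision list for 'plant'. The rules and scores are invented purely for illustration; in a real system they would be learned (for example from log-likelihood scores) on a sense-tagged corpus:

    # A toy decision list for the target word "plant".
    # Each rule is (score, feature test, sense); rules are tried in
    # decreasing order of score and the first matching rule wins.
    rules = [
        (2.8, lambda ctx: "water" in ctx,  "plant_living"),
        (2.5, lambda ctx: "worker" in ctx, "plant_factory"),
        (1.9, lambda ctx: "garden" in ctx, "plant_living"),
        (1.2, lambda ctx: "power" in ctx,  "plant_factory"),
    ]
    rules.sort(key=lambda rule: rule[0], reverse=True)

    def disambiguate(context_words, default="plant_living"):
        for score, test, sense in rules:
            if test(context_words):
                return sense
        return default  # fall back to a default (e.g. most frequent) sense

    print(disambiguate({"the", "gardener", "water", "plant"}))
    # -> plant_living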
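Below is a minimal Naive Bayes WSD sketch using scikit-learn; the tiny sense-labelled training set is invented for illustration, and a real system would be trained on a sense-tagged corpus such as SemCor:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical sense-labelled contexts for the target word "plant".
    contexts = [
        "the workers at the plant were overworked",
        "the chemical plant was shut down for repairs",
        "the gardener was watering the plant",
        "the plant needs sunlight and water to grow",
    ]
    senses = ["factory", "factory", "living", "living"]

    # Bag-of-words features + multinomial Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(contexts, senses)

    print(model.predict(["she planted flowers and watered every plant daily"]))
    # -> ['living'], the sense with the highest posterior probability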
Unsupervised disambiguation techniques aim at identifying sense clusters rather than assigning sense labels. Some of them are:
  • Context Clustering - Each occurrence of a target word in a corpus is represented as a context vector. These vectors are then clustered into groups based on the contextual similarity between occurrences, each group identifying one sense of the target word (see the clustering sketch after this list).
  • Word Clustering - This method aims at clustering words that are semantically similar (synonyms).
  • Co-occurrence Graph - A graph-based approach in which the vertices correspond to words in a text and the edges connect pairs of words that co-occur in a syntactic relation, in the same paragraph or in a larger context. Each edge is weighted according to the relative co-occurrence frequency of the two words it connects: edges between words that co-occur most often get weights close to 0, edges between words that rarely co-occur get weights close to 1, and edges whose weight exceeds a certain threshold are discarded (a small graph-building sketch also follows this list).
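As a rough sketch of context clustering, the following example (hypothetical sentences, assuming scikit-learn is available) represents each occurrence of 'plant' by a TF-IDF vector of its context and groups the occurrences with k-means:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical occurrences (contexts) of the target word "plant".
    contexts = [
        "the workers at the plant were overworked",
        "the nuclear plant generates most of the city's power",
        "the gardener was watering the plant in the greenhouse",
        "this plant needs sunlight and water to grow",
    ]

    # One context vector per occurrence, clustered into two sense groups.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(contexts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for label, text in zip(labels, contexts):
        print(label, text)  # occurrences in the same cluster share a sense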
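And here is a small sketch of building a co-occurrence graph with networkx (assumed to be installed); the sentences, the weight formula and the threshold are illustrative, chosen so that strongly co-occurring pairs get weights near 0:

    from collections import Counter
    from itertools import combinations
    import networkx as nx

    sentences = [
        ["gardener", "water", "plant"],
        ["plant", "grow", "water"],
        ["worker", "plant", "shift"],
    ]

    word_freq = Counter(w for s in sentences for w in set(s))
    pair_freq = Counter(frozenset(p) for s in sentences
                        for p in combinations(set(s), 2))

    graph = nx.Graph()
    for pair, count in pair_freq.items():
        w1, w2 = tuple(pair)
        # Weight is near 0 for strongly co-occurring pairs, near 1 otherwise.
        weight = 1 - max(count / word_freq[w1], count / word_freq[w2])
        if weight <= 0.9:  # discard edges whose weight exceeds the threshold
            graph.add_edge(w1, w2, weight=weight)

    print(graph.edges(data=True))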
Knowledge-Based Disambiguation Techniques are:
  • Lesk Algorithm - Based on the assumption that the words in a sentence relate to a common topic. All possible dictionary definitions (glosses) of the ambiguous word are considered, and the sense whose definition overlaps most with the context is chosen as the appropriate one (see the sketch after this list).
  • Selectional Preferences - A selectional preference denotes a word's tendency to co-occur with words belonging to certain lexical sets. In this method, selectional preferences are used to restrict the number of possible meanings of a target word in context: senses that violate the constraints are discarded and senses that satisfy them are preferred.
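NLTK ships a simplified Lesk implementation, so a minimal example (assuming NLTK's tokenizer models and WordNet data are downloaded) looks like this:

    from nltk.wsd import lesk
    from nltk.tokenize import word_tokenize

    sent1 = word_tokenize("The workers at the plant were overworked.")
    sent2 = word_tokenize("The gardener was watering the plant.")

    # lesk() picks the WordNet synset whose definition overlaps most
    # with the words of the given context sentence.
    print(lesk(sent1, "plant", pos="n"))
    print(lesk(sent2, "plant", pos="n"))

    # Note: with such short contexts the simplified Lesk algorithm can
    # still pick an unexpected sense; it is a knowledge-based baseline,
    # not a state-of-the-art disambiguator.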
There are many challenges in Word Sense Disambiguation; some of them are:
  • Different Algorithms for Different Applications - Different applications need different algorithms. For example, Machine Translation requires the exact sense of a word, whereas Information Retrieval only needs confirmation that a word is used in the same sense in the query and in the retrieved documents, not the exact sense itself.
  • Representation of Word Senses - The choice of how to represent word senses and how to divide a word's meaning into senses is a fundamental problem in WSD. The ever-changing nature of senses poses a further problem for their representation.
  • Knowledge Acquisition Bottleneck - The effort of manually creating knowledge resources, and of revising them whenever the disambiguation scenario changes, is known as the knowledge acquisition bottleneck. It is one of the major problems in WSD because the task relies heavily on knowledge.
  • Task-Dependent Sense Inventory - The sense inventory is task-dependent: each task requires its own division of word meanings into the senses relevant to that task.






    


