
Semantic Similarity using Word Embeddings and Wordnet


Measuring semantic similarity between documents has varied applications in NLP and artificial intelligence, such as chatbots, voicebots, and communication across different languages. It refers to quantifying the similarity of sentences based on their literal meaning rather than only their syntactic structure. A semantic net such as WordNet, or word embeddings such as Google's Word2Vec and Doc2Vec, can be used to compute semantic similarity. Let us see how.


Word Embeddings


Word embeddings are vector representations of words. A word embedding maps a word to a numerical vector using a dictionary of words, i.e. words and phrases from the vocabulary are mapped into a vector space and represented as real-valued vectors. The closeness of the vector representations of two words in this space is a measure of the similarity between them. Word embeddings can be broadly classified into frequency-based methods (e.g. count vectors, TF-IDF, co-occurrence matrices) and prediction-based methods (e.g. continuous bag of words, skip-grams). A small example of the frequency-based representations follows.
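Here is a minimal sketch of the frequency-based representations mentioned above, using scikit-learn. The library choice and the two toy documents are assumptions for illustration, not something prescribed by this post:

```python
# A minimal sketch of frequency-based word/document vectors
# (scikit-learn is assumed; the two documents are toy examples).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the president greets the press in chicago",
    "obama speaks to the media in illinois",
]

# Count vectors: each document becomes a vector of raw term frequencies.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF vectors: term frequencies reweighted by how rare a term is
# across the whole document collection.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray())
```

Each document becomes a row vector over the shared vocabulary; comparing such rows only captures word overlap, which is exactly the limitation the rest of this post tries to address.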
Various models can be used to learn these word embeddings. One such model is Word2Vec, a neural-network-based model for learning vector space representations of words from a large corpus of text. It takes a corpus as input, produces a multi-dimensional vector space, and assigns a vector representation to each word in the corpus. Words that appear in similar contexts receive vectors that lie close together in this space. Similarly, Doc2Vec is a model that learns a vector representation of an entire document or paragraph.
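Below is a minimal sketch of learning word vectors with gensim's Word2Vec. The post does not prescribe a library, so gensim is an assumption, and the toy corpus is far too small to produce meaningful vectors; it only shows the shape of the API:

```python
# A minimal sketch of training Word2Vec with gensim (assumed library).
from gensim.models import Word2Vec

sentences = [
    ["the", "president", "greets", "the", "press", "in", "chicago"],
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# vector_size: dimensionality of the embedding space; window: context size;
# sg=1 selects skip-gram (sg=0 would be continuous bag of words).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Words appearing in similar contexts end up with nearby vectors.
print(model.wv.similarity("press", "media"))
print(model.wv["president"])   # the learned 50-dimensional vector
```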

Word2Vec can be used to calculate the Word Mover's Distance (WMD) for document similarity based on the context of words. WMD measures similarity between sentences that share the same context but may not share the same words. It finds the minimum travelling distance between two documents, i.e. the minimum cumulative distance required to move the distribution of words of one document to match that of the other.

For example: The President of the USA greets the press in Chicago.
Obama speaks to the media in Illinois.

These two sentences have the same context but different words, so a traditional bag-of-words approach may fail to detect their similarity because they lack common words. WMD, however, measures the similarity between them accurately, and the same idea can be used to compute document similarity. *[2]
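A minimal sketch of computing WMD between the two example sentences, assuming gensim and its downloadable pre-trained Google News vectors (the model name is gensim's, the download is large, and gensim's WMD additionally needs an optimal-transport backend installed):

```python
# A minimal sketch of Word Mover's Distance with gensim (assumed setup).
import gensim.downloader as api

# Any pre-trained word embeddings will do; this one is a large download.
wv = api.load("word2vec-google-news-300")

sentence_1 = "the president greets the press in chicago".split()
sentence_2 = "obama speaks to the media in illinois".split()

# Lower distance means the documents are semantically closer,
# even though they share almost no words.
distance = wv.wmdistance(sentence_1, sentence_2)
print(distance)
```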


Reference: Matt J. Kusner et al., "From Word Embeddings to Document Distances", http://proceedings.mlr.press/v37/kusnerb15.pdf

WordNet


WordNet is an English lexical database of synonyms. Words are grouped into synsets, i.e. sets of synonyms, which are linked together by the semantic relationships they define. This forms a graphical structure in which synsets are linked to other synsets, producing a hierarchy of concepts. As we go deeper down this hierarchy, the relationships become more specific, whereas at the top the linking of synsets is quite general: for example, organism is more general than plant.
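A minimal sketch of exploring this hierarchy with NLTK's WordNet interface (assuming the WordNet corpus has been downloaded via nltk.download; the chosen synsets are just illustrative):

```python
# A minimal sketch of browsing the WordNet hierarchy with NLTK (assumed library).
from nltk.corpus import wordnet as wn

plant = wn.synset("plant.n.02")       # the living-organism sense of "plant"
organism = wn.synset("organism.n.01")

# Walking up the hypernym chain moves from specific concepts to general ones.
print(plant.hypernym_paths()[0])

# Path-based similarity: synsets that are closer in the graph score higher.
print(plant.path_similarity(organism))
print(wn.synset("dog.n.01").path_similarity(wn.synset("cat.n.01")))
```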

Traditionally, the co-occurrence of words within documents has been used to measure similarity: the more common words two documents share, the more similar they are considered to be. However, this ignores synonyms of a word and the relationships between entities. The graphical structure of WordNet allows us to find the shortest path between two words and hence their semantic similarity. Some WordNet-based features used for measuring sentence similarity are:
  1. Path length (L) and depth (D) *[1]: Similarity is a function of the path length and the depth of the two words. Path length is the shortest path between the two words in the WordNet graph, and depth is the depth of their deepest subsumer (common ancestor). So similarity(w1, w2) = f(L) · f(D).
  2. Scaling depth *[1]: Concepts at the upper levels of the hierarchy are more general and those at the lower levels are more specific, so sim(w1, w2) should be scaled down when the subsumer lies at an upper level and scaled up when it lies at a lower level.
  3. Sentence-level similarity *[1]: A joint word set is formed from the distinct words of the two sentences; for each word in the joint set, the similarity score of the most similar word in each sentence is taken, giving vectors s1 and s2. The cosine similarity between these two vectors then gives the sentence similarity score, as sketched below.
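A minimal sketch of the whole pipeline, assuming the exponential forms for f(L) and f(D) used in reference [1]; the alpha and beta constants and all helper names here are illustrative assumptions, not part of the original post:

```python
# A minimal sketch of WordNet-based sentence similarity: word-to-word
# similarity is f(L)*f(D), sentence vectors are built over the joint word
# set, and cosine similarity compares them. Constants and helpers are assumed.
import math
from nltk.corpus import wordnet as wn

ALPHA, BETA = 0.2, 0.45   # assumed scaling constants

def word_similarity(w1, w2):
    """f(L) * f(D): decay with path length L, scale with subsumer depth D."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            L = s1.shortest_path_distance(s2)
            subsumers = s1.lowest_common_hypernyms(s2)
            if L is None or not subsumers:
                continue
            D = max(s.max_depth() for s in subsumers)
            f_L = math.exp(-ALPHA * L)
            f_D = math.tanh(BETA * D)   # (e^{bD} - e^{-bD}) / (e^{bD} + e^{-bD})
            best = max(best, f_L * f_D)
    return best

def sentence_vector(words, joint_words):
    # For every word in the joint set, take the similarity of its most
    # similar word in this sentence.
    return [max((word_similarity(jw, w) for w in words), default=0.0)
            for jw in joint_words]

def sentence_similarity(s1, s2):
    w1, w2 = s1.lower().split(), s2.lower().split()
    joint = sorted(set(w1) | set(w2))
    v1, v2 = sentence_vector(w1, joint), sentence_vector(w2, joint)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

print(sentence_similarity("the dog chased the cat", "a hound pursued a kitten"))
```

Words that WordNet does not cover (articles, rare names) simply contribute zero here; a fuller implementation would combine this score with word-order or corpus statistics as reference [1] does.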

Using the above measures, we can compute document similarity based on the context and meaning of a document rather than relying only on corpus-level statistical measures such as frequency of occurrence.


References:
1) http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1644735
2) https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
3) https://en.wikipedia.org/wiki/WordNet


