Word prediction models that use a bag-of-words representation or an n-gram model can be limiting. A bag-of-words representation of a given text disregards the linguistic context of each word; the semantics of a given word are not taken into consideration. N-gram models can capture dependencies and relations over short distances but fail to capture them over long distances. In a bag-of-words representation, words such as “small”, “little” and “white” are all treated the same, that is, they are considered equidistant from each other; but according to linguistic context, “small” and “little” should be considered closer than “small” and “white”.
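As a minimal sketch of why this is limiting (the vocabulary and distance measure are illustrative), one-hot bag-of-words vectors place every pair of distinct words at exactly the same distance:

```python
import numpy as np

# In a bag-of-words / one-hot scheme, each word gets its own dimension,
# so every pair of distinct words ends up equally far apart.
vocab = ["small", "little", "white"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def distance(a, b):
    return np.linalg.norm(one_hot[a] - one_hot[b])

print(distance("small", "little"))  # 1.414...
print(distance("small", "white"))   # 1.414... -- identical: no semantics captured
```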
Word Vectors
Word Vectors are distributed representations of words in a given text. Each word is represented by a unique vector, and each vector is a set of features. Word vectors keep the linguistic dependencies and semantic structure of words intact: the vectors for “small” and “little” are much closer in the vector space than the vectors for “small” and “white”.
Concept of distributed vector representation of words:
- Each word is mapped to a unique vector, represented by a column in a word matrix; the column is indexed by the position of the word in the vocabulary.
- A neural network is used to generate this representation of words as vectors. Stochastic gradient descent with back-propagation is used to learn the feature values of the word vectors. When training converges, similar words are mapped close to each other. These neural nets take the underlying dependencies between words into account while training. Word2Vec is a class of algorithms used to generate word vectors (see the sketch after Figure-a below).
- An aggregation/combination function is used to combine the different word vectors. A softmax multi-class classifier is then used to assign output probabilities to possible next words. See Figure-a (Source: [1] Distributed Representations of Sentences and Documents; Quoc Le, Tomas Mikolov).
Figure-a
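A minimal sketch of training word vectors with the gensim library (gensim 4.x API; the toy corpus and hyperparameters are illustrative, and a much larger corpus is needed for the learned similarities to behave as described):

```python
from gensim.models import Word2Vec

# Toy corpus; real word vectors need a large training corpus.
sentences = [
    ["the", "small", "dog", "ran"],
    ["a", "little", "dog", "ran"],
    ["the", "white", "cat", "slept"],
]

# A shallow neural net trained with stochastic gradient descent and
# back-propagation (sg=1 selects the skip-gram variant).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# After training, semantically related words should end up closer
# in the vector space.
print(model.wv.similarity("small", "little"))
print(model.wv.similarity("small", "white"))
```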
Paragraph Vectors
Paragraph Vectors are an unsupervised framework for representing paragraphs as vectors, and an improvement over the word vector model's representation of text. A paragraph vector is a continuous distributed vector representation of a given piece of text. A unique vector represents a unique text; these vectors are sets of features. The term “paragraph” emphasizes that the length of the text can vary: sentences, paragraphs, documents, etc. Paragraph vectors are trained by stochastic gradient descent with back-propagation.
Paragraph vectors can be thought of as vectors that capture semantic properties and dependencies that are lost in the word vector representation.
Paragraph vector framework:
- Each paragraph is mapped to a unique vector.
- Each word is also mapped to a unique word vector.
- The paragraph and word vectors are concatenated.
- The next word is predicted by a classifier that assigns probabilities to possible next words.
- Paragraph vectors and the word vectors are trained using back-propagation with stochastic gradient descent.
- Refer to Figure-b (Source: [1] Distributed Representations of Sentences and Documents; Quoc Le, Tomas Mikolov). A minimal sketch of this framework follows below.
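A minimal sketch of this framework using gensim's Doc2Vec (gensim 4.x API; corpus, tags and hyperparameters are illustrative). Here dm=1 selects the paragraph-vector model described above, and dm_concat=1 concatenates the paragraph vector with the word vectors rather than averaging them:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each paragraph gets a unique tag, i.e. a unique paragraph vector.
corpus = [
    TaggedDocument(words=["the", "small", "dog", "ran"], tags=["para_0"]),
    TaggedDocument(words=["a", "little", "dog", "ran"], tags=["para_1"]),
]

# dm=1: predict the next word from the concatenated paragraph and word
# vectors; both kinds of vectors are trained with SGD + back-propagation.
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1,
                dm=1, dm_concat=1, epochs=100)

# Trained paragraph vector for a seen paragraph:
print(model.dv["para_0"])

# Vector for an unseen paragraph, inferred by gradient descent while
# the word vectors are held fixed:
print(model.infer_vector(["a", "white", "cat", "slept"]))
```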
All contexts generated from/for the same paragraph share that paragraph's vector. Paragraph vectors are not shared across different paragraphs, since two paragraph vectors represent different paragraphs; the word vectors, however, are shared amongst paragraphs, because multiple paragraphs can have words in common.
After training the paragraph and word vectors, learning models such as SVMs, logistic regression, K-means, etc. can be applied, as sketched below.
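For instance, a minimal sketch of feeding trained paragraph vectors into a downstream classifier (the labelled corpus, tags and hyperparameters are illustrative; logistic regression stands in for any standard learner):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy labelled corpus: class 0 = dog sentences, class 1 = cat sentences.
train = [
    (["the", "small", "dog", "ran"], 0),
    (["a", "little", "dog", "ran"], 0),
    (["the", "white", "cat", "slept"], 1),
    (["a", "big", "cat", "slept"], 1),
]

corpus = [TaggedDocument(words=tokens, tags=[str(i)])
          for i, (tokens, _) in enumerate(train)]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=100)

# Use the learned paragraph vectors as fixed-length features.
X = [model.dv[str(i)] for i in range(len(train))]
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

# Classify an unseen paragraph via its inferred paragraph vector.
print(clf.predict([model.infer_vector(["the", "little", "dog", "ran"])]))
```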
Alternate framework:
- The paragraph vector is trained to predict words randomly sampled from the paragraph, ignoring the context words in the input.
- At each stochastic gradient descent iteration, a portion of text is chosen from a paragraph and random words are sampled from this portion; the model then predicts these words given the paragraph vector.
- Refer to Figure-c (Source: [1] Distributed Representations of Sentences and Documents; Quoc Le, Tomas Mikolov). A minimal sketch follows below.
Figure-c
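A minimal sketch of this alternate framework with gensim's Doc2Vec (dm=0 selects the bag-of-words style variant; corpus and hyperparameters are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "small", "dog", "ran"], tags=["para_0"]),
    TaggedDocument(words=["a", "white", "cat", "slept"], tags=["para_1"]),
]

# dm=0: at each gradient descent step, the paragraph vector alone is
# trained to predict words randomly sampled from a window of that
# paragraph (no word-order input is used).
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, dm=0, epochs=100)

print(model.dv["para_0"])
```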