Neural networks are a set of algorithms, modeled loosely after
the human brain, that are designed to recognize patterns. They interpret
sensory data through a kind of machine perception, labeling or clustering raw
input. The patterns they recognize are numerical, contained in vectors, into
which all real-world data, be it images, sound, text or time series, must be
translated.
Neural networks help us cluster and classify. You can think of
them as a clustering and classification layer on top of data you store and
manage. They help to group unlabeled data according to similarities among the
example inputs, and they classify data when they have a labeled dataset to
train on. (To be more precise, neural networks extract features that are fed to
other algorithms for clustering and classification; so you can think of deep
neural networks as components of larger machine-learning applications involving
algorithms for reinforcement learning, classification, and regression.)
What kinds of problems does deep learning solve, and more importantly, can it solve yours? To know the answer, you need to ask yourself a few questions. What outcomes do I care about? Those outcomes are labels that could be applied to data: for example, spam or not spam in an email filter, good guy or bad guy in fraud detection, angry customer or happy customer in customer relationship management. Then ask: do I have the data to accompany those labels? That is, can I find labeled data, or can I create a labeled dataset, where spam has been labeled as spam, in order to teach an algorithm the correlation between labels and inputs?
There has been a lot of advancement in using neural networks and other deep learning algorithms to obtain high performance on a variety of NLP tasks. Traditionally, the bag-of-words model, along with classifiers that use it, such as the Maximum Entropy classifier, has been successfully leveraged to make very accurate predictions in NLP tasks such as sentiment analysis. However, with the advent of deep learning research and its applications to NLP, discoveries have been made that improve the accuracy of these methods in two primary ways: using a supervised neural network to run your input through several layers of classification, and using an unsupervised neural network to optimize feature selection as a pre-training step.
The Motivation for Neural Networks and Deep Learning in NLP
At its core, deep learning (along with neural networks) is about giving the computer some data and letting it figure out how to use that data to come up with features and models that accurately represent complex tasks, such as analyzing a movie review for its sentiment. With more common machine learning algorithms, human-designed features are generally used to model the problem, and prediction becomes a task of optimizing weights to minimize a cost function. However, hand-crafting features is time-consuming, and these human-made features tend either to become too specific to the problem at hand or to be incomplete over the entire problem space.
Supervised Learning: From Regression to a Neural Network
The Maximum Entropy classifier, commonly abbreviated as the MaxEnt classifier, is a common probabilistic model used in NLP. Given some contextual information in a document (in the form of multisets of unigrams, bigrams, etc.), this classifier attempts to predict the document's class label (positive, negative, neutral). This classifier also appears in neural networks, where it's known as the softmax layer: the final (and sometimes only) layer in the network used for classification. So, we can model a single neuron in a neural network as computing the same function as a max entropy classifier:
$$h_{w,b}(x) = f(w^\top x + b)$$

Here, $x$ is our vector of inputs; the neuron computes the function $f$ with parameters $w$ (the weights) and $b$ (the bias) and outputs a single result in $h$.
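As a concrete illustration, here is a minimal sketch in Python with NumPy (the variable names and numbers are made up for illustration) of a single neuron computing $f(w^\top x + b)$ with a sigmoid, along with the softmax generalization over several class scores:

```python
import numpy as np

def sigmoid(z):
    # Squashes a real-valued score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Normalizes a vector of scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])   # input feature vector
w = np.array([0.1, 0.4, -0.3])   # one neuron's weights
b = 0.2                          # one neuron's bias

h = sigmoid(w @ x + b)           # single output: h = f(w.x + b)

# The softmax layer is the multi-class version: one weight row per
# class (e.g. positive, negative, neutral) and one probability each.
W = np.random.randn(3, 3)
probs = softmax(W @ x + np.zeros(3))
```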
Then, a neural network with multiple neurons can simply be thought of as feeding the same input to several different classification functions at the same time. The neural network is nothing more than running a given vector of inputs ($x$ above) through many functions (as opposed to a single one), where each neuron represents a different regression function. As a result, we obtain a vector of outputs:

$$h = f(Wx + b)$$

where each row of the weight matrix $W$ holds one neuron's weights.
You can then feed this vector of outputs to another layer of logistic regression functions (or a single function), until you obtain your output: the probability that your input belongs to a certain class,

$$P(y = c \mid x) = \frac{e^{w_c^\top h + b_c}}{\sum_{c'} e^{w_{c'}^\top h + b_{c'}}}.$$
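Putting those pieces together, here is a minimal sketch of that stacked forward pass (NumPy again; the layer sizes and random weights are illustrative assumptions, and in practice the weights would come from training):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.random.randn(4)         # input feature vector

# Hidden layer: five neurons, each its own "regression function".
W1, b1 = np.random.randn(5, 4), np.zeros(5)
h = sigmoid(W1 @ x + b1)       # vector of outputs, one per neuron

# Softmax (MaxEnt) layer over three classes.
W2, b2 = np.random.randn(3, 5), np.zeros(3)
probs = softmax(W2 @ h + b2)   # P(class | x); entries sum to 1
print(probs)
```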
Applying Neural Networks to Unsupervised Problems in NLP
In NLP, words and their surrounding contexts are
pretty important: a word surrounded by relevant context is valuable, while a word
surrounded by seemingly irrelevant context is not very valuable. Each word is
mapped to a vector defined by its features (which in turn relate to the word’s
surrounding context), and neural networks can be used to learn which features
maximize a word vector’s score.
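One way to make this concrete (a sketch in the spirit of window-based scoring models such as Collobert and Weston's, not necessarily the exact architecture implied here): embed each word in a small context window, concatenate the vectors, and have a network produce a single score, so that genuine windows from a corpus can be trained to score higher than corrupted ones:

```python
import numpy as np

def score_window(window_vecs, W, b, u):
    # Concatenate the window's word vectors into one input,
    # pass through a hidden layer, and reduce to a scalar score.
    z = np.concatenate(window_vecs)
    hidden = np.tanh(W @ z + b)
    return u @ hidden

dim, window = 4, 3                     # embedding size, window length
W = np.random.randn(8, dim * window)   # hidden-layer weights
b = np.zeros(8)
u = np.random.randn(8)                 # scoring vector

# A "real" window vs. one with the center word replaced at random:
real = [np.random.randn(dim) for _ in range(window)]
corrupt = list(real)
corrupt[1] = np.random.randn(dim)
print(score_window(real, W, b, u), score_window(corrupt, W, b, u))
# Training would push the real window's score above the corrupted one's.
```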
A valuable pre-training step for any supervised learning task in NLP (such as classifying restaurant reviews) is to generate feature vectors that represent words well. As discussed at the beginning of this post, these features are often human-designed; instead, a neural network can be used to learn them. The input to such a neural network would be a matrix defined by, for example, a sentence's word vectors.
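As a sketch of what that input might look like (the toy vocabulary, sentence, and dimensions below are invented for illustration):

```python
import numpy as np

# Toy vocabulary; in practice this is built from a large corpus.
vocab = {"the": 0, "food": 1, "was": 2, "great": 3}
embedding_dim = 4

# Randomly initialized word vectors; pre-training nudges them toward
# features that represent each word well.
embeddings = np.random.randn(len(vocab), embedding_dim)

sentence = ["the", "food", "was", "great"]

# The network's input: one row per word of the sentence.
sentence_matrix = np.vstack([embeddings[vocab[w]] for w in sentence])
print(sentence_matrix.shape)   # (4, 4): words x embedding_dim
```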
Our neural network can then be composed of several layers, where each layer sends the previous layer's output to a function. Training is achieved through backpropagation: taking derivatives with respect to the weights using the chain rule in order to optimize those weights. From this, the ideal weights that define our function (which is a composition of many functions) are learned. After training, we have a method of extracting the ideal feature vector that a given word is mapped to.
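To spell out the backpropagation step (this is the standard formulation, not something specific to this post): for a network $h(x) = f_n(f_{n-1}(\cdots f_1(x)))$ with cost $J$, the chain rule gives the gradient of $J$ with respect to layer $k$'s weights $W_k$, and gradient descent moves the weights a small step $\eta$ against it:

$$\frac{\partial J}{\partial W_k} = \frac{\partial J}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdots \frac{\partial f_k}{\partial W_k}, \qquad W_k \leftarrow W_k - \eta \, \frac{\partial J}{\partial W_k}$$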
This unsupervised neural network is powerful, especially when considered alongside traditional supervised softmax models. Running this unsupervised network on a large text collection allows the input features to be learned rather than human-designed, which often yields better results when those features are fed into a traditional, supervised neural network for classification.