In the current internet age, civilization has undergone rapid change, and NLP research has produced several remarkable artificial-intelligence applications, e.g., Google, IBM’s Watson, and Apple’s Siri. In this blog, we will discuss the evolution of NLP research and present it as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics curves.
Poising on the Syntactics Curve (Bag of Words):
Syntax-centered NLP is still widely used for tasks such as information retrieval and extraction, topic modeling, and auto-categorization. It can be broadly grouped into three main categories: keyword spotting, lexical affinity, and statistical methods.
Keyword spotting is the most popular approach due to its accessibility and cost-effectiveness: text is classified into categories based on the presence of fairly unambiguous keywords (a minimal sketch follows the list below). Some of the most popular projects related to keyword spotting include:
(a) Ortony’s Affective Lexicon: groups words into affective categories
(b) Penn Treebank: a corpus consisting of over 4.5 million
words of American English annotated for part-of-speech (POS) information
(c) PageRank: Google’s famous ranking algorithm for web pages
(d) LexRank: a stochastic graph-based method for computing
relative importance of textual units for NLP
(e) TextRank: a graph-based ranking model for text processing, based on two unsupervised methods for keyword and sentence extraction.
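To make keyword spotting concrete, here is a minimal Python sketch that classifies text by counting category keywords. The LEXICON and its two categories are hypothetical toy data invented for illustration; real systems rely on curated resources such as Ortony’s Affective Lexicon.

# Minimal keyword-spotting sketch: pick the category whose
# keywords appear most often in the text.
# The LEXICON below is a hypothetical toy resource.
LEXICON = {
    "sports": {"goal", "match", "tournament", "coach"},
    "finance": {"stock", "market", "dividend", "inflation"},
}

def classify(text: str) -> str:
    tokens = set(text.lower().split())
    # Score each category by how many of its keywords appear.
    scores = {cat: len(tokens & words) for cat, words in LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("The coach praised the team after the match"))  # -> sports

Note how brittle this is: a sentence that expresses the same idea without any listed keyword falls straight into "unknown", which is exactly the weakness the next two approaches try to address.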
Lexical affinity is a slightly more sophisticated mechanism than keyword spotting: rather than just detecting obvious words, it assigns arbitrary words a probabilistic ‘affinity’ for a particular category. For example, ‘accident’ might carry a 75% probability of indicating a negative event, with such probabilities estimated from linguistic corpora. This approach performs better than keyword spotting, but it has several problems. The sentence “I met with an accident” does point to a negative event, yet “I met my girlfriend by accident” describes a pleasant, unplanned surprise, so fixed word-level probabilities can mislead. Another problem is that the affinities are biased by the corpus the model is trained on, which makes the model hard to reuse and a truly domain-independent model difficult to build. A toy affinity estimator is sketched below.
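As a rough illustration of how such affinities might be estimated, the following Python sketch computes a maximum-likelihood estimate of P(negative | word) from a tiny, made-up labeled corpus; any serious estimate would of course require a large annotated corpus.

from collections import Counter

# Toy labeled corpus (hypothetical) for estimating word affinities.
corpus = [
    ("I met with an accident on the highway", "negative"),
    ("The accident blocked traffic for hours", "negative"),
    ("I met my girlfriend by accident", "positive"),
    ("What a pleasant surprise", "positive"),
]

seen = Counter()      # how many sentences contain each word
negative = Counter()  # ...of which are labeled negative
for text, label in corpus:
    for word in set(text.lower().split()):
        seen[word] += 1
        if label == "negative":
            negative[word] += 1

def negative_affinity(word: str) -> float:
    # Maximum-likelihood estimate of P(negative | word).
    word = word.lower()
    return negative[word] / seen[word] if seen[word] else 0.0

print(negative_affinity("accident"))  # ~0.67 on this toy corpus

Even on this toy corpus, ‘accident’ gets a sizeable negative affinity despite appearing in the girlfriend sentence, which illustrates how word-level probabilities blur over context.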
Statistical NLP uses language models based on popular machine-learning algorithms such as maximum likelihood, expectation maximization, support vector machines, and conditional random fields. Statistical models are semantically weak, so they achieve acceptable accuracy only when given sufficiently large text input.
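As one possible illustration (not any specific system from the literature), a bag-of-words SVM text classifier can be sketched with scikit-learn; it assumes scikit-learn is installed, and the four training sentences are toy data.

# Sketch of statistical NLP: a bag-of-words SVM text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "the team won the match",
    "stocks fell as the market dipped",
    "the coach announced the lineup",
    "inflation pushed bond yields higher",
]
train_labels = ["sports", "finance", "sports", "finance"]

# TF-IDF turns text into word-count features; the SVM learns a
# linear decision boundary over them.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["the market rallied today"]))  # ['finance']

With only four training sentences the model will misclassify anything outside their vocabulary, echoing the point above that statistical models are semantically weak and need large text input.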
Surfing the Semantics Curve (Bag of Concepts):
Semantics-based NLP focuses on the meaning associated with text, rather than just processing documents at the syntax level. Semantics-based NLP approaches can be broadly grouped into two main categories: techniques that leverage external knowledge, e.g., ontologies (taxonomic NLP) or semantic knowledge bases (noetic NLP), and methods that exploit only the intrinsic semantics of documents (endogenous NLP).
Taxonomic NLP includes initiatives that aim to build universal taxonomies or Web ontologies for grasping the subsumptive, or hierarchical, semantics associated with natural language expressions. Attempts to build taxonomic resources are countless and include both resources crafted by human experts or community efforts, such as WordNet and Freebase, and automatically built knowledge bases (a small WordNet example follows the list below). Examples of such knowledge bases include:
(a) WikiTaxonomy: a taxonomy extracted from Wikipedia’s
category links.
(b) YAGO: a semantic knowledge base derived from WordNet,
Wikipedia, and GeoNames
(c) NELL (Never-Ending Language Learning): a semantic machine-learning system that acquires knowledge from the Web every day
(d) Probase: a research prototype that aims to build a unified taxonomy of worldly facts from 1.68 billion Web pages in the Bing repository.
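As a small, concrete taste of such subsumptive knowledge, the sketch below queries WordNet’s is-a hierarchy through NLTK; it assumes nltk is installed and the wordnet corpus has been downloaded.

# Querying WordNet's is-a (hypernym) hierarchy via NLTK.
# Assumes: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]   # first sense of "car"
print(car.hypernyms())       # direct parents, e.g. motor_vehicle
# Follow one path up the hierarchy to the root concept ("entity").
path = car.hypernym_paths()[0]
print(" -> ".join(s.name() for s in path))

This is precisely the kind of hierarchical knowledge (car is-a motor vehicle is-a ... is-a entity) that taxonomic NLP tries to build automatically and at Web scale.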
Noetic NLP embraces all the mind-inspired approaches to NLP that attempt to compensate for the lack of domain adaptivity and implicit semantic-feature inference in traditional algorithms, e.g., first-principles modeling or explicit statistical modeling. Noetic NLP differs from taxonomic NLP in that it does not focus on encoding subsumption knowledge but rather attempts to collect idiosyncratic knowledge about objects, actions, and events. Noetic NLP, moreover, performs reasoning in an adaptive and dynamic way, e.g., by generating context-dependent results or by discovering new semantic patterns that are not explicitly encoded in the knowledge base.
Foreseeing the Pragmatics Curve (Bag of Narratives):
Narrative understanding and generation are central for
reasoning, decision-making, and ‘sensemaking’. Besides being a key part of
human-to-human communication, narratives are the means by which reality is
constructed and planning is conducted. Decoding how narratives are generated
and processed by the human brain might eventually lead us to truly understand
and explain human intelligence and consciousness. Computational modeling
is a powerful and effective way to investigate narrative understanding. A lot
of the cognitive processes that lead humans to
understand or generate narratives have traditionally been
of interest to AI researchers under the umbrella of knowledge representation,
common-sense reasoning, social cognition, learning, and NLP. There are already a few pioneering works that attempt to understand narratives by leveraging discourse structure, argument-support hierarchies, plan graphs, and common-sense reasoning.
Conclusion:
Word- and concept-level approaches to NLP are just a first step towards natural language understanding. The future of NLP lies in biologically and linguistically motivated computational paradigms that enable narrative understanding and, hence, ‘sensemaking’. Computational intelligence has great potential to play an important role in NLP research.