Introduction
We have seen exponential growth in the number of users on Online Social Networks (OSN) in the Indian subcontinent over the past few years. This has prompted several stakeholders and first responders to turn to OSNs to make decisions and plan their next moves. In multilingual societies such as India, it is common to find large volumes of code-mixed online discourse, tweets, and posts. Performing language and text analysis on such data is not a trivial task, as traditional NLP tools are built for English and do not work well on code-mixed data. In this blog post, I will focus specifically on Hindi-English (Hi-En) code-mixed data, although most of the concepts apply to other forms of code-mixing as well.
What is Code-Mixing?
Code-mixing is a natural phenomenon of embedding linguistic units such as phrases, words, or morphemes of one language into an utterance of another (Muysken, 2000; Duran, 1994; Gysels, 1992). It is widely observed in multilingual societies like India, which has 22 official languages, the most popular of which are Hindi and English. With over 375 million Indians online, the usage of Hindi on the Internet has been steadily increasing.
Hi-En CS gives users a tool for easier communication, as it increases the variety of phrases and expressions available to them. Romanization is a very common form of CS; in linguistics, romanization refers to the conversion of writing from a different writing system to the Roman (Latin) script, or a system for doing so. Romanization gives users the freedom to write Devanagari-script Hindi in Latin characters on OSNs, but it makes the task of developing linguistic tools for Hi-En CS content that much more difficult.
Obstacles while building Linguistic tools for Code-Mixed Data
Romanization
Since Hindi is typed phonetically, the same word can have multiple spellings when romanized, leading to many distinct tokens that share the same contextual meaning. The table below lists some examples, followed by a small sketch of how such variants can be collapsed.
| Word | Meaning | Appearing Variations |
|---|---|---|
| नहीं | No | nahin, nahi, nai, nahii |
| बहुत | Very | bahut, bahot, bohot |
| मैं | Me | main, mai, mae |
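To make the many-to-one relationship concrete, here is a toy sketch (my own illustration, not part of any released tool) that collapses the romanized variants from the table above into a single canonical form:

```python
# Toy mapping from romanized variants to one canonical Devanagari form,
# built from the table above. A real system would need many thousands
# of entries, or a model that learns the mapping.
VARIANT_TO_CANONICAL = {
    "nahin": "नहीं", "nahi": "नहीं", "nai": "नहीं", "nahii": "नहीं",
    "bahut": "बहुत", "bahot": "बहुत", "bohot": "बहुत",
    "main": "मैं", "mai": "मैं", "mae": "मैं",
}

def canonicalize(token):
    """Collapse a known romanized variant to its canonical form."""
    return VARIANT_TO_CANONICAL.get(token.lower(), token)

print(canonicalize("bohot"))  # -> बहुत
```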
Language Identification
OSNs in India have an abundance of romanized Hindi text, which raises the problem of token-level language identification: it is very difficult to decide whether a word belongs to the English vocabulary or is a transliterated Hindi word. For example, "main" is an English noun but also the romanized Hindi word for "me".
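A minimal token-level tagger makes the ambiguity explicit. This is a sketch under the assumption that we have English and romanized-Hindi wordlists; real systems typically use character n-gram models rather than plain lookups:

```python
# Toy wordlists; in practice these would be full dictionaries.
ENGLISH_WORDS = {"main", "road", "is", "closed"}
HINDI_ROMAN_WORDS = {"main", "nahi", "bahut", "hai"}

def tag_language(token):
    """Tag a token as EN, HI, AMBIGUOUS, or UNK by dictionary lookup."""
    tok = token.lower()
    in_en, in_hi = tok in ENGLISH_WORDS, tok in HINDI_ROMAN_WORDS
    if in_en and in_hi:
        return "AMBIGUOUS"  # e.g. "main": English noun vs. Hindi "me"
    return "EN" if in_en else "HI" if in_hi else "UNK"

print([(w, tag_language(w)) for w in "main road is closed".split()])
# [('main', 'AMBIGUOUS'), ('road', 'EN'), ('is', 'EN'), ('closed', 'EN')]
```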
Lack of Core NLP Tools
In order to build any higher-level linguistic tool, such as a sentiment analyzer, named entity recognizer (NER), language identifier, or topic modeller, accurate core NLP tools such as shallow parsers and Part-of-Speech (POS) taggers are essential. Few such tools exist for Hi-En code-mixed data, and the ones that do provide very poor precision.
Data Collection
There is no well-defined method for identifying Hi-En CS data, which makes it very difficult to collect from OSNs. Instead, thresholds and heuristics have to be set to identify Hi-En code-mixed text.
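As an illustration, one plausible heuristic (my own sketch, not an established method) is to keep a tweet only if a minimum fraction of its tokens is tagged as each language, reusing the toy `tag_language` function from above:

```python
def is_code_mixed(tokens, threshold=0.2):
    """Keep a tweet if at least `threshold` of its tokens are tagged
    Hindi AND at least `threshold` are tagged English."""
    tags = [tag_language(t) for t in tokens]
    n = len(tags) or 1
    return (tags.count("HI") / n >= threshold
            and tags.count("EN") / n >= threshold)

print(is_code_mixed("main road is closed".split()))    # False: no HI tags
print(is_code_mixed("road bahut kharab hai".split()))  # True: 2 HI, 1 EN
```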
Current Research
Code-mixing has been studied by linguists, and subsequently by computer scientists, for some time [1][2]. However, studies pertaining to Hi-En are fairly new and scarce. Begum et al. (2016) explored the pragmatic and semantic reasons why users switch between Hindi and English [3]. One of the first computational approaches, by Sharma et al. (2015), used text normalization and the Hindi SentiWordNet: they normalized the different variations of romanized Hindi tokens and non-standard spellings, then looked each word up in the SentiWordNet, which contains its positive and negative scores, thereby assigning it a sentiment [4]. Rudra et al. (2016) explored the language preferences of users for expressing opinion and sentiment on Twitter, concluding that bilingual Hi-En users tend to use Hindi more than English to express negative opinions. They used a large feature set containing intensifiers, lexicons, negation, slang, subjective words, modal verbs, and more to obtain the sentiment of a sentence [5]. Another recent approach, by Prabhu et al. (2016), uses a sub-word-level LSTM to obtain the sentiment of a sentence; according to their analysis, it provides much better results than any existing sentiment analysis tool [6].
Code-Mixed Sentiment Analyzer: The nuts and bolts
The system presented here is an amalgamation of multiple existing systems and my own additions; it is one possible way to compute sentiment for code-mixed data. It is a supervised system, so some annotated data is required. The system computes a collection of features which are fed to a classifier, and each text is classified into one of three classes: positive, negative, or neutral. The pre-processing steps and feature descriptions are given below:
Pre-Processing
- Cleaning Symbols: Tweets often include symbols such as "RT", "@", "#", and website links, which would hinder the analyzer; all of these are removed. For example, "RT: @narendramodi is doing a great job #acchedin" becomes "narendramodi is doing a great job acchedin".
- Tokenization: Tokenization breaks a document down into its basic building blocks, along with some cleaning steps. Here, the document is converted to lower case, all numbers are removed, and each word is separated out.
- Removal of Stop Words: Stop words are words that occur very often in English corpora, such as "the" and "we". They do not add any specific meaning or sentiment to a sentence, so they are of no benefit to the system and are removed.
- Text Normalization of Out-of-Vocabulary Tokens: OSN text includes large amounts of slang, wordplay, and abbreviations, none of which are recognized by existing dictionaries, SentiWordNets, or language corpora. To overcome this, text normalization is required. For each token, the system checks whether it exists in either the English or the Hindi dictionary; if it does not, the system computes the edit distance between the token and each dictionary word, and the word with the least edit distance replaces the token in the sentence (see the sketch after this list). For example, "hai how are you?" becomes "hi how are you?".
- Transliteration: One of the features uses the Hindi SentiWordNet to generate the sentiment scores of Hindi words. The Hindi SentiWordNet stores its words in the Devanagari script, whereas OSN text contains phonetically romanized Hindi words. The transliterator converts these romanized Hindi words into the Devanagari script so their sentiment scores can be looked up.
- Language Identification: Each token is tagged as English or romanized Hindi, so that the later steps know which dictionary, transliterator, and SentiWordNet to apply to it.
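Below is a minimal sketch of the pre-processing pipeline described above (cleaning, tokenization, stop-word removal, and edit-distance normalization). The word lists are toy placeholders, and transliteration and language identification are left out since they need external resources:

```python
import re

STOP_WORDS = {"the", "we", "is", "a", "are"}                # toy list
DICTIONARY = {"hi", "how", "you", "great", "job", "doing"}  # toy EN+HI vocab

def clean(tweet):
    """Strip RT markers, links, and the '@'/'#' symbols."""
    tweet = re.sub(r"\bRT\b:?", "", tweet)
    tweet = re.sub(r"https?://\S+", "", tweet)
    return re.sub(r"[@#]", "", tweet)

def tokenize(tweet):
    """Lower-case, drop numbers and punctuation, split on whitespace."""
    tweet = re.sub(r"\d+", "", tweet.lower())
    tweet = re.sub(r"[^\w\s]", "", tweet)
    return tweet.split()

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalize(token):
    """Replace an out-of-vocabulary token with the closest dictionary
    word (ties broken arbitrarily in this toy version)."""
    if token in DICTIONARY:
        return token
    return min(DICTIONARY, key=lambda w: edit_distance(token, w))

def preprocess(tweet):
    tokens = [t for t in tokenize(clean(tweet)) if t not in STOP_WORDS]
    return [normalize(t) for t in tokens]

print(preprocess("hai how are you?"))  # ['hi', 'how', 'you']
```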
Feature Selection
- Bag of Words (BOW): The BOW model represents each word in the tweet as a feature for the classifier and assigns each word a weight based on its occurrence count.
- Hindi SentiWordNet: This feature uses the Hindi SentiWordNet built at IIT Bombay, which contains a limited set of Hindi words with their corresponding positive and negative scores. Since it stores words in the Devanagari script and the tweets contain romanized Hindi words, the transliteration step described above is applied before the lookup.
- Tweet Length: The number of words in the tweet after removing stop words and symbols was used as one of the features.
- English SentiWordNet: The English SentiWordNet is the English counterpart of the Hindi SentiWordNet; it contains an extensive list of English words with their positive and negative scores.
- Negation: A negated context is a tweet segment that starts with a negation word and ends with a clause-level punctuation mark. Each word inside a negated context is appended with "_NEG", so it is treated as a separate, negated version of the original word. Negation marking is done before the BOW model is created, so the plain and negated versions of a word become separate features (see the sketch after this list). For example: "No one likes this." -> "No one_NEG likes_NEG this_NEG."
- Lexicons: Various sets of lexicons could be added, such as emoticons, slang, intensifiers, and abuse lists.
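Here is a sketch of how the negation marking and BOW features can feed a classifier. It uses scikit-learn (an assumed dependency; the exact classifier of the system above is not specified), and the two "training" tweets are placeholders standing in for a real annotated corpus:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy negation-word list; the real system would use a fuller lexicon.
NEGATION_WORDS = {"no", "not", "never", "nahi", "nahin", "mat"}
CLAUSE_PUNCT = re.compile(r"[.,;:!?]")

def mark_negation(tokens):
    """Append _NEG to every token between a negation word and the next
    clause-level punctuation mark, as described above."""
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.search(tok):
            negated = False
            out.append(tok)
        elif tok.lower() in NEGATION_WORDS:
            negated = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
    return out

# Placeholder training data: two hand-labelled tweets.
train = ["yeh movie bahut acchi hai", "no one likes this movie ."]
labels = ["positive", "negative"]
train = [" ".join(mark_negation(t.split())) for t in train]

# BOW features (token counts) feeding a Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), MultinomialNB())
clf.fit(train, labels)
print(clf.predict([" ".join(mark_negation("no good .".split()))]))
```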
Conclusion
The system presented above may seem trivial, but the challenge lies in the group of systems that must be built before you can get to sentiment analysis: the transliterator, the language identifier, and the text normalizer. The system could be further improved with a POS tagger to allow for word sense disambiguation. The lack of these core systems is what creates the obstacles; once they are in place, sentiment analysis itself is a trivial task. Such systems do exist, but none of them has reached the accuracy of its English counterpart. Sentiment analysis on OSNs has also been used to gain sociolinguistic insights in India. The primary findings were that people often switch to Hindi when they want to express extreme emotions (primarily sadness), to emphasize a statement, or to take part in Indian campaigns, which are often based on Hindi, such as "acche din aayenge". In conclusion, there is a lot of scope for sentiment analysis in India, and current research is building towards better models, but the lack of core NLP tools will remain a hurdle.
References
[1] Muysken, P. (2000). Bilingual speech: A typology of code-mixing.
[2] Kachru, B. (1978). Toward structuring code-mixing: An Indian perspective.
[3] Begum et al. (2016). Functions of code-switching in tweets: An annotation scheme and some initial experiments.
[4] Sharma et al. (2015). Text normalization of code mix and sentiment analysis.
[5] Rudra et al. (2016). Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter?
[6] Prabhu et al. (2016). Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text.