
Applications of NLP in Cyber Security

You wake up one fine morning, open your laptop to check your mail, and find out that you have won a lottery worth $1 million. “Wow”, you say. Excited, you open the email. It asks for some of your personal details, but who cares about tiny details when you are getting a million dollars? Or you might have received an email about security updates. There, my friend, you have just become another victim of cyber criminals. This is a very basic example of cyber crime.
How do we solve such problems? Here Natural Language Processing (NLP) comes to our rescue. NLP has various applications in cyber security. Let's break it down into the three main areas where NLP can solve infosec problems.

1. Domain Generation Algorithm classification

(A DGA, or domain generation algorithm, is a method used by cyber attackers to generate a large number of domains that can serve as points (command and control servers) for propagating malicious code.)

The goal is to detect Advanced Persistent Threats (APTs) in DNS.
(Solving Domain Generation Algorithm Classification)

Analysis of past APT attacks showed that the APT domains were lexically similar. Generally, the attackers used domains resembling those of a popular software company or product (Microsoft, Adobe, Java, Firefox, Gmail, etc.) and asked the user to install software updates. The domain names were carefully crafted to look legitimate. The attackers also made sure to use words like “updates”, “login”, “billing” and “register”.

Examples:
      adobe-update[.]net
      adobeupdates[.]com
      microsoft-xpupdates[.]com
      microsoft-update-info[.]com

An example of a lexically similar Paypal domain.
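
The lexical pattern in these examples (a well-known brand name combined with a lure word) can be sketched in a few lines of Python. This is only a rough illustration, not the method used by any particular product: the word lists, the `LEGITIMATE` set and the function names are all assumptions made up for this example.

```python
# Hypothetical word lists for illustration; a real system would use much larger ones.
BRANDS = {"adobe", "microsoft", "java", "firefox", "gmail", "paypal"}
LURE_WORDS = {"update", "updates", "login", "billing", "register"}
# Hypothetical allow-list of genuine domains.
LEGITIMATE = {"adobe.com", "microsoft.com", "paypal.com"}


def normalize(domain):
    """Lower-case the domain and undo the defanged '[.]' notation."""
    return domain.lower().replace("[.]", ".")


def lexical_flags(domain):
    """Return (brands, lure_words) that occur as substrings of the domain."""
    # Drop the top-level domain and match against the rest.
    body = normalize(domain).rsplit(".", 1)[0]
    brands = sorted(b for b in BRANDS if b in body)
    lures = sorted(w for w in LURE_WORDS if w in body)
    return brands, lures


def looks_suspicious(domain):
    """Flag domains that mix a brand name with a lure word and are not allow-listed."""
    if normalize(domain) in LEGITIMATE:
        return False
    brands, lures = lexical_flags(domain)
    return bool(brands) and bool(lures)


for d in ["adobe-update[.]net", "microsoft-xpupdates[.]com", "adobe.com"]:
    print(d, looks_suspicious(d))
```

Plain substring matching like this produces false positives (for example, "java" inside "javascript"), so it is best seen as a first filter before stronger checks.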
Cisco Umbrella, along with OpenDNS Investigate, conducted an investigation to find interesting patterns and lexical similarity in the malicious domains used by APT groups like Anunak, Carbanak and DarkHotel. From this they came up with NLPRank, an algorithm based on natural language processing that identifies malicious behavior in network traffic. It uses the edit distance algorithm on substrings to measure the distance between a typo-squatting domain and the legitimate one. Edit distance quantifies the similarity between two strings by counting the number of operations (edits) required to convert one string into the other. The operations allowed are:
1. Substitution
2. Insertion
3. Deletion

With these operations in hand, our goal is to minimize the number of edits.

Eg1: google.com   -  gooogle.com         1 insertion    => 1 edit.

Eg2: linkedin.com -  linkedln.com        1 substitution => 1 edit.
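
The two examples above can be checked with a short Python sketch of the classic Levenshtein edit distance. This is the textbook dynamic-programming version, not the NLPRank implementation itself (which, as described, applies edit distance to substrings); the function name `edit_distance` is just chosen for this example.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("google.com", "gooogle.com"))
print(edit_distance("linkedin.com", "linkedln.com"))
```

A small distance to a well-known domain, combined with the fact that the domain is not the well-known one, is the signal a typo-squatting detector would look for.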

The algorithm extracts features from the dataset to identify potential typo-squatting phishing domains. The aim is to block malicious domains and so prevent further phishing activity.

2. Vulnerability Research

Research at Symantec found that, on average, zero-days exist “in the wild” for over 300 days before identification. In 2016, malware platforms were known to stay on a target system for a minimum of 146 days without detection. Sounds scary, no? We therefore need a solution that applies NLP techniques to proactively find vulnerable code segments, using knowledge of the function patterns seen in known vulnerabilities.
Words are powerful, and details of sentence structure convey subtle changes in meaning. A structured NLP solution for capturing intent and capability from the web, especially in foreign languages, is of great value when a fixed vulnerability title or CVE number is present. Such a solution can help vulnerability management teams respond quickly and patch vulnerabilities.
In order to apply NLP, we need a collection of documents, or corpus. For normal web data this is straightforward, but when we apply NLP to vulnerabilities or malware we have to extract the data using specific techniques, namely static and dynamic analysis.

[Figure: disassembled code of a specific binary]
The image shows why the normal approach can't be used to build a corpus of malicious code. Malware family analysis and malicious language processing are still young fields, where research is ongoing to find better ways to apply NLP and classify malware.

3. Identification of Phishing and spam mails


Phishing is basically a social engineering technique used to tempt people into replying with sensitive information. Spam and phishing emails share some common patterns:
  • A promising offer, or luring the user with money: Such emails mention money and use it to try to extract personal information from the user.
  • Not mentioning the receiver’s name: Greetings like “Dear Friend” or “Dear beneficiary” are more likely to be used than the actual name of the receiver.
  • Some sense of urgency: This is a social engineering tactic where the scammer plays mind games with the user by describing an urgent situation, such as a ‘need to transfer money outside the country’, and asks for the user's bank details to transfer the money into.
  • The displayed name or email address is not the same as the actual email ID of the sender.
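
The patterns listed above can be turned into simple yes/no signals. The sketch below is a minimal illustration using regular expressions and the standard library's `email.utils.parseaddr`. The regular expressions, the `KNOWN_BRANDS` set and the function name `phishing_signals` are all assumptions invented for this example, and real filters use far richer features.

```python
import re
from email.utils import parseaddr

# Hypothetical patterns, chosen only to illustrate the four signals.
MONEY_RE = re.compile(r"\$\s?\d|\b(?:usd|million|lottery|prize)\b", re.I)
GREETING_RE = re.compile(r"\bdear\s+(?:friend|beneficiary|customer|user)\b", re.I)
URGENCY_RE = re.compile(r"\b(?:urgent|urgently|immediately|as soon as possible)\b", re.I)
# Hypothetical list of brands that scammers like to impersonate.
KNOWN_BRANDS = {"paypal", "microsoft", "adobe"}


def phishing_signals(from_header, body):
    """Return a dict of boolean signals for one email."""
    display_name, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    name_words = set(re.findall(r"[a-z]+", display_name.lower()))
    # The display name claims a brand that the sending domain does not contain.
    claimed = name_words & KNOWN_BRANDS
    mismatch = any(brand not in domain for brand in claimed)
    return {
        "money_mention": bool(MONEY_RE.search(body)),
        "generic_greeting": bool(GREETING_RE.search(body)),
        "urgency": bool(URGENCY_RE.search(body)),
        "sender_mismatch": mismatch,
    }


signals = phishing_signals(
    "PayPal Support <help@secure-mail.example>",
    "Dear Friend, you won a lottery worth $1 million. "
    "Reply immediately with your bank details.",
)
print(signals)
```

Each signal on its own is weak, so in practice they would be fed into a classifier as features instead of being used as hard rules.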

By modeling an email as a bag of words and applying string/pattern matching algorithms, we can determine whether the mail contains phishing or spam elements. Initially, detection was done by blacklisting malicious or reported websites. Currently, it is done in different ways, such as:
- Analysing the URL and the phishing content of the webpage.
- Keyword extractor - It uses n-grams to tokenize the messages.
- Statistical classifier - Classifiers trained on a certain set of features take keywords as input and use the naive Bayes algorithm to detect phishing emails.
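
To make the last two items concrete, here is a minimal pure-Python sketch of an n-gram tokenizer feeding a multinomial naive Bayes classifier with add-one smoothing. The four training messages are made up for illustration, and the names `ngrams` and `NaiveBayes` are hypothetical. A real system would train on thousands of labelled emails and would probably use an established library.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Tokenize text into word n-grams of length 1..n_max."""
    words = re.findall(r"[a-z0-9$]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


class NaiveBayes:
    """Multinomial naive Bayes over n-gram counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)   # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            # log prior + sum of log likelihoods of the known n-grams
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are ignored
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
train = [
    ("dear friend you won a lottery send your bank details urgently", "phish"),
    ("urgent transfer money outside the country dear beneficiary", "phish"),
    ("meeting moved to monday please see the agenda attached", "ham"),
    ("thanks for the report see you at lunch tomorrow", "ham"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("dear friend urgent money transfer"))
print(model.predict("see you at the meeting tomorrow"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together, and the smoothing keeps an n-gram that is missing from one class from driving that class's probability to zero.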


The above illustrations are based on industry-standard applications of NLP to cyber security.
We can view NLP as a powerful resource in a data scientist’s toolkit, one that lets us apply data science to security problems such as separating malicious code from normal code.
