Applications of NLP in Cyber Security

You wake up one fine morning, open your laptop to check your mail, and find that you have won a lottery worth $1 million. “Wow”, you say, and excitedly open the message. It asks for some personal details, but who cares about tiny details when a million dollars is on the way? Or perhaps the email claims to be a security update. Either way, my friend, you have just become another victim of cyber criminals. This is a very basic example of cyber crime.

How do we solve such problems? This is where Natural Language Processing (NLP) comes to the rescue. NLP has various applications in cyber security. Let's break it down into three key areas where NLP can solve infosec problems.

1. Domain Generation Algorithm classification

A domain generation algorithm (DGA) is a method used by cyber attackers to generate a large number of domains that can serve as command-and-control (C&C) servers for propagating malicious code.

Classifying such domains helps detect Advanced Persistent Threats (APTs) in DNS traffic.

Solving Domain Generation Algorithm classification

Analysis of past APT attacks showed that the domains they used were lexically similar. Typically they imitated the domains of popular software companies and products (Microsoft, Adobe, Java, Firefox, Gmail, etc.) and prompted for software updates. The domain names were carefully crafted to look legitimate, and often included words like “update”, “login”, “billing”, and “register”.

Examples: adobe-update[.]net

adobeupdates[.]com

microsoft-xpupdates[.]com

microsoft-update-info[.]com

Figure: an example of a lexically similar PayPal domain.
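
The pattern described above (a well-known brand name combined with a lure word) can be checked with plain substring matching. The sketch below is only an illustration: the `BRANDS` and `KEYWORDS` lists and the `lexical_flags` helper are hypothetical, and a real system would also need an allowlist so that a vendor's own update subdomains are not flagged.

```python
# Hypothetical word lists, chosen for illustration only.
BRANDS = {"adobe", "microsoft", "paypal", "google"}
KEYWORDS = {"update", "login", "billing", "register"}

def lexical_flags(domain):
    """Report which brand names and lure keywords appear in a domain name."""
    # Undo the "[.]" defanging, lowercase, and drop the top-level domain.
    name = domain.lower().replace("[.]", ".").rsplit(".", 1)[0]
    brands = sorted(b for b in BRANDS if b in name)
    keywords = sorted(k for k in KEYWORDS if k in name)
    # Flag only when a brand and a lure keyword appear together.
    return {"brands": brands, "keywords": keywords,
            "suspicious": bool(brands and keywords)}

print(lexical_flags("adobe-update[.]net"))
# {'brands': ['adobe'], 'keywords': ['update'], 'suspicious': True}
```

Substring matching also catches "updates" and glued forms such as "xpupdates", which a whole-word match would miss.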

Cisco Umbrella, working with OpenDNS Investigate, looked for patterns and lexical similarity in the malicious domains used by APT groups such as Anunak, Carbanak, and DarkHotel. The result was NLPRank, an algorithm based on natural language processing that identifies malicious behavior in network traffic. It applies the edit distance algorithm to substrings to measure how far a typo-squatting domain is from the legitimate one. Edit distance quantifies the similarity between two strings by counting the number of operations (edits) required to convert one string into the other. The allowed operations are:

1. Substitution

2. Insertion

3. Deletion

With these operations in hand, the edit distance is the smallest number of edits that turns one string into the other.

Eg 1: google.com -> gooogle.com: 1 insertion => 1 edit.

Eg 2: linkedin.com -> linkedln.com: 1 substitution => 1 edit.
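
To make the two examples concrete, here is a minimal sketch of the classic Levenshtein edit distance using dynamic programming. It is the textbook algorithm, not NLPRank's actual implementation, and the function name `edit_distance` is our own.

```python
def edit_distance(a, b):
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to turn string a into string b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("google.com", "gooogle.com"))     # 1
print(edit_distance("linkedin.com", "linkedln.com"))  # 1
```

A small distance to a well-known domain, without being identical to it, is what makes a name a typo-squatting candidate.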

The algorithm extracts features from the dataset to identify potential typo-squatting phishing domains. The aim is to block these malicious domains and prevent further phishing activity.

2. Vulnerability Research

Research at Symantec found that, on average, zero-days exist “in the wild” for over 300 days before identification. In 2016, malware platforms were known to stay on a target system for a minimum of 146 days without detection. Sounds scary, no? We therefore need NLP techniques that proactively find vulnerable code segments, using knowledge of function patterns from known vulnerabilities.

Words are powerful, and details of sentence structure convey subtle changes in meaning. A structured NLP solution for capturing intent and capability from the web, especially in foreign languages, is of great value when a fixed vulnerability title or CVE number is present. Such a solution can greatly help vulnerability management teams respond to and patch vulnerabilities quickly.

To apply NLP, we need a collection of documents, or corpus. For ordinary web data this is straightforward, but to apply NLP to vulnerabilities or malware we must first extract the data using specific techniques, namely static and dynamic analysis.

Figure: disassembled code of a specific binary.

The image shows why the usual approach can't be applied to build a corpus of malicious code. Malware family analysis and malicious language processing are still emerging fields, where research continues into better ways to apply NLP and classify malware.
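
One common way to turn disassembled code into something NLP tools can work with is to treat instruction mnemonics as "words" and count n-grams over them. The sketch below assumes a made-up listing in an `address: mnemonic operands` layout. Real disassembler output varies, so both the format and the helper names are assumptions.

```python
from collections import Counter

# A made-up disassembly listing; the layout is an assumption.
LISTING = """\
401000: push ebp
401001: mov ebp, esp
401003: sub esp, 0x10
401006: push ebp
401007: mov ebp, esp
"""

def opcodes(listing):
    """Keep only the mnemonic of each line, dropping addresses and operands."""
    ops = []
    for line in listing.splitlines():
        _, _, instruction = line.partition(":")
        parts = instruction.split()
        if parts:
            ops.append(parts[0])
    return ops

def ngrams(tokens, n=2):
    """Count runs of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngrams(opcodes(LISTING)).most_common(1))  # [(('push', 'mov'), 2)]
```

The resulting counts can play the role that word frequencies play in an ordinary text corpus.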

3. Identification of Phishing and spam mails

Phishing is essentially a social engineering technique used to tempt people into replying with sensitive information. Spam and phishing emails tend to share common patterns, such as:

- A promising offer, or luring the user with money: Such emails mention a sum of money and use it to extract personal information from the user.
- Not mentioning the receiver’s name: Greetings like “Dear Friend” or “Dear beneficiary” are more likely to be used than the receiver’s actual name.
- Some sense of urgency: This is a social engineering tactic in which the scammer plays mind games with the user by describing an urgent situation, such as a ‘need to transfer money outside the country’, and asks for the user’s bank details to transfer the money into.
- The displayed name or email address does not match the actual sender address.
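
The four patterns above can be approximated with a few regular expressions. This is a rough sketch with hypothetical, deliberately tiny pattern lists. The sender check assumes one interpretation of the last point: the display name shows an email address that differs from the real one.

```python
import re

# Hypothetical patterns for illustration; a real filter needs far broader lists.
MONEY = re.compile(r"\$\s?\d[\d,]*|\b(million|lottery)\b", re.I)
GENERIC_GREETING = re.compile(r"\bdear\s+(friend|beneficiary|customer|user)\b", re.I)
URGENCY = re.compile(r"\b(urgent|urgently|immediately)\b", re.I)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def phishing_signals(display_name, address, body):
    """Return one boolean per pattern for a single message."""
    shown = EMAIL.search(display_name)  # an address embedded in the display name
    return {
        "money_lure": bool(MONEY.search(body)),
        "generic_greeting": bool(GENERIC_GREETING.search(body)),
        "urgency": bool(URGENCY.search(body)),
        "sender_mismatch": bool(shown) and shown.group(0).lower() != address.lower(),
    }

signals = phishing_signals(
    "support@paypal.com", "win@lottery.example",
    "Dear Friend, you won $1,000,000. Reply urgently with your bank details.",
)
print(signals)  # every flag is True for this message
```

Rules like these are easy to read but easy to evade, which is why they are usually combined with the statistical methods described next.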

By modeling an email as a bag of words and applying string or pattern matching algorithms, we can determine whether it contains phishing or spam elements. Initially, detection relied on blacklisting malicious or reported websites. Today it is done in several ways, such as:

- Analysing the URL and the content of the linked webpage for signs of phishing.

- Keyword extractor: uses n-grams to tokenize the messages.

- Statistical classifier: classifiers trained on a set of features take the keywords as input and use the naive Bayes algorithm to detect phishing emails.
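
The keyword extractor and the statistical classifier fit together as sketched below: messages are tokenized into words and word bigrams, and a multinomial naive Bayes model with add-one smoothing scores each class. The four training messages are invented for illustration. A usable classifier needs a large labelled corpus, and the class and function names here are our own.

```python
import math
import re
from collections import Counter

def tokens(text, n=2):
    """Lowercase words plus word n-grams (bigrams by default)."""
    words = re.findall(r"[a-z0-9$]+", text.lower())
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return words + grams

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.counts = {}             # label -> Counter of token frequencies
        self.docs = Counter(labels)  # label -> number of training messages
        for text, label in zip(texts, labels):
            self.counts.setdefault(label, Counter()).update(tokens(text))
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.docs[label] / total_docs)  # log prior
            for t in tokens(text):
                if t in self.vocab:  # skip tokens never seen in training
                    score += math.log((counter[t] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Toy, made-up training data.
TRAIN = [
    ("Dear friend, you won the lottery, send your bank details", "phish"),
    ("Urgent: verify your account login to claim your prize money", "phish"),
    ("Meeting moved to 3pm, see the agenda attached", "ham"),
    ("Here are the notes from the project review", "ham"),
]
model = NaiveBayes().fit(*zip(*TRAIN))
print(model.predict("Dear friend, claim your lottery money"))  # phish
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.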

The examples above are based on industry-standard applications of NLP to cyber security.

We can view NLP as a powerful resource in a data scientist’s toolkit, one that lets us apply data science to security problems and tell malicious code apart from normal code.

References:

https://umbrella.cisco.com/blog/2017/03/28/domain-names-watching-closely/

https://www.helpnetsecurity.com/2015/03/05/nlprank-an-innovative-tool-for-blocking-apt-malicious-domains/

https://umbrella.cisco.com/blog/2015/03/05/nlp-apt-dns/

https://www.broadwayworld.com/bwwgeeks/article/OpenDNS-Unveils-NLPRank-a-New-Model-for-Advanced-Threat-Detection-20150305

http://delivery.acm.org/10.1145/2660000/2659691/p217-Aggarwal.pdf?ip=103.25.231.102&id=2659691&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2E9B70FA1BECDE5FE7%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=801286546&CFTOKEN=90225682&__acm__=1503596476_ae8328cb3007745e7960628a5d4ce3a5

https://link.springer.com/chapter/10.1007/978-3-319-06483-3_33

https://www.researchgate.net/publication/290624055_An_automatic_method_for_CVSS_score_prediction_using_vulnerabilities_description
