Natural Language Processing and the Identification of Malicious Domains Using Domain Generation Algorithms
Author: Fabiano Barreira
One of the premises of defending against threats is knowing and mapping the attack vectors we are vulnerable to. Being aware of a given malicious behavior lets us guard against it, either through active monitoring or through static and dynamic restrictions.
One way to map and prevent attacks is to use indicators of compromise (IOCs). However, relying on these indicators can be tricky, because IOCs generally depend on prior detection by some security source or on some kind of contextualization. The method makes sense but is not always effective: maintaining IOCs requires storage infrastructure and constant updates to keep the IOC database sound. Ongoing maintenance is also needed to reduce the false positives that will inevitably appear.
The approach described in this article uses Natural Language Processing (NLP) and Machine Learning to identify malicious patterns in generated domain names without contextualization, without a subscription to a security feed, and without a connection to external infrastructure for updating IOCs. The intent is to explore the power of NLP and Machine Learning in Information Security by addressing an existing problem from a different perspective. The approach does not exclude other existing solutions; it should be seen as an additional layer of intelligence in the security architecture.
This text describes the logic of the algorithm and shows it working, with results obtained from test runs. How the algorithm is deployed is not covered, because it has no direct bearing on how the algorithm works: it can be applied to historical data, to real-time data, or be scripted into a SIEM solution. The key point is that its operating logic is the same regardless of how it is deployed.
What is DGA?
A Domain Generation Algorithm (DGA) is a routine for the pseudorandom generation of domain names. In the context of malware, generating domains at runtime makes it possible to change and update the destinations of connections between the malware and its command and control servers (C&C servers) without having them “hardcoded” in the malware beforehand.
Previously, malware carried in its code the addresses of the C&C servers it was supposed to communicate with.
This approach proved inefficient in some contexts: once these domains were identified, either by reverse engineering the malware binary or by monitoring network traffic, it was easy to cut off the communication by adding the domains to blacklists and sharing them with other users through feeds. This dramatically reduced the malware’s impact and performance.
With the introduction of DGAs, both individual malware and botnet management gained a powerful mechanism: the domains used to reach the C&C servers are generated in a coordinated, pseudorandom way by the malware, using an algorithm known only to the malware and to the attacker on the other end. The algorithm is embedded in the malware’s code, so the attacker can predict which domains the malware will try to contact on a given date, register those domains in advance, and wait for the connection.
In general, the attacker registers only a few domains out of a much larger generated set.
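To make the mechanism concrete, here is a minimal and harmless sketch of a date-seeded generator. It is not the algorithm of any real malware family: the function name `toy_dga`, the seed string, the SHA-256 construction, the 16-character label length, and the TLD pool are all assumptions chosen for illustration. It shows only the property described above, namely that two parties sharing the routine can derive the same candidate list for a given date without exchanging it.

```python
import hashlib
from datetime import date


def toy_dga(day: date, seed: str = "example-seed", count: int = 5) -> list[str]:
    """Derive `count` pseudorandom domain names from a date and a shared seed.

    Hypothetical illustration only; not taken from any real malware.
    """
    tlds = ["com", "net", "org"]  # assumed TLD pool
    domains = []
    for i in range(count):
        # The same (seed, date, index) always hashes to the same digest.
        material = f"{seed}|{day.isoformat()}|{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        label = digest[:16]  # first 16 hex characters become the name
        tld = tlds[int(digest[16:18], 16) % len(tlds)]
        domains.append(f"{label}.{tld}")
    return domains


if __name__ == "__main__":
    # Both ends can run this independently and obtain the same list.
    for name in toy_dga(date(2024, 1, 1)):
        print(name)
```

Because the output depends only on the date and the seed, a defender who recovers the routine can predict the list as well, which is why extracting the algorithm usually requires reverse engineering.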
Mitigation Attempts (Static Blacklists)
Machines infected with DGA-based malware generate a large number of DNS resolution attempts for unregistered or non-existent domains, which the DNS server answers with NXDOMAIN. Depending on the logging and audit settings of your DNS server, this information may be available.
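If such logs are available, counting NXDOMAIN answers per client is a simple first signal. The sketch below assumes the log has already been parsed into `(client_ip, queried_name, rcode)` tuples; that record shape, the sample data, the function names, and the threshold of 3 are hypothetical and would need to be adapted to the DNS server in use.

```python
from collections import Counter
from typing import Iterable

# Hypothetical, already-parsed DNS log records: (client_ip, queried_name, rcode).
SAMPLE_RECORDS = [
    ("10.0.0.5", "t3622c4773260c097e2e9b26705212ab85.ws", "NXDOMAIN"),
    ("10.0.0.5", "u83ccf36d9f02e9ea79a9d16c0336677e4.to", "NXDOMAIN"),
    ("10.0.0.5", "v02bec0c090508bc76b3ea81dfc2198a71.in", "NXDOMAIN"),
    ("10.0.0.9", "company1d4faf4.com", "NOERROR"),
    ("10.0.0.9", "typo-exampel.com", "NXDOMAIN"),
]


def nxdomain_counts(records: Iterable[tuple[str, str, str]]) -> Counter:
    """Count NXDOMAIN responses per client IP."""
    counts: Counter = Counter()
    for client_ip, _queried_name, rcode in records:
        if rcode.upper() == "NXDOMAIN":
            counts[client_ip] += 1
    return counts


def noisy_clients(records: Iterable[tuple[str, str, str]], threshold: int = 3) -> list[str]:
    """Return clients with at least `threshold` NXDOMAIN answers (assumed cutoff)."""
    counts = nxdomain_counts(records)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    print(noisy_clients(SAMPLE_RECORDS))
```

A count like this only points at hosts worth a closer look; it says nothing about which names are malicious, which is the gap the rest of the article addresses.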
New DGA-generated domains are discovered every day. Several security vendors, using different technologies and detection mechanisms, can map this type of communication and make it possible to block these domains through feeds and precompiled lists applied to filters on security equipment.
The crucial limitation of this type of blocking is that the vendor must already know the malicious domain in order to block it, which is made harder by the random nature of these domains (there can be thousands of them in a given time window). This creates an immediate need to update the block list constantly to keep it consistent and effective.
Diagnosis takes considerable effort and generally involves thorough monitoring of network traffic to discover indicators of compromise, plus reverse engineering of the malware to extract the algorithm itself. Both activities take time.
The Use of Machine Learning
The approach described in this article applies an NLP algorithm to identify malicious DGA-generated domains without feeds, online content updates, blacklists, or reverse engineering of malware artifacts.
So far, we have seen what a DGA is, what it is used for, and why blacklists are not the most effective way to block the resulting domains: they are generated in high volume, they are random in nature, and they are hard to map (each malware family uses its own algorithm, which is generally accessible only by reverse engineering the binary).
Thus, the methodology needs to meet the following premises:
- Identify potentially suspicious domains from the random pattern of DGA-generated names.
- Apply no runtime check, contextualization, or external feed; analyze only the pattern of characters used in these domains.
In a real-world scenario, we should also minimize the following:
- False positives for domains whose names are genuinely random, possibly produced by a DGA routine, but that have no malicious context.
In my approach, I use a few deterministic features to train the Machine Learning algorithm. They are as follows:
TLDs usually related to malicious contexts: Some top-level domains (TLDs) are commonly used to register malicious DGA domains, because they allow registration anonymously or with very little background information. This makes them extremely attractive to attackers.
A sample of 155 top-level domains linked to malicious DGA domains was selected and is used to train the algorithm.
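As a rough illustration, the TLD feature can be expressed as a set-membership test. The article does not publish its list of 155 TLDs, so `SUSPICIOUS_TLDS` below is a placeholder holding only the three TLDs that appear in the Dyre test domains later in this article; the function names are likewise assumptions.

```python
# Placeholder subset only: the article's real list of 155 TLDs is not published.
SUSPICIOUS_TLDS = {"ws", "to", "in"}


def tld_of(domain: str) -> str:
    """Return the last label of a domain name, lowercased, ignoring a trailing dot."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def has_suspicious_tld(domain: str, tlds: set[str] = SUSPICIOUS_TLDS) -> bool:
    """Binary feature: True when the domain's TLD is in the supplied set."""
    return tld_of(domain) in tlds


if __name__ == "__main__":
    for name in ("u83ccf36d9f02e9ea79a9d16c0336677e4.to", "company1d4faf4.com"):
        print(name, has_suspicious_tld(name))
```

On its own this feature is weak, since the same TLDs also host many legitimate sites; it is meant to be combined with the other features.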
Abnormal Lexicography: The primary, genuine function of a hostname is to map a name to its IP address. The name is meant to be easy to memorize, or at least to refer objectively to the context and content of the device or resource it stands for.
Most domains generated by a DGA, in contrast, are random by design.
A first look at a DGA-generated domain quickly reveals the absence of words or meaning in its characters.
This “search for absence” of meaningful words adds another feature for training the algorithm.
An English dictionary containing 274,739 words was used. The algorithm can be improved by extending the training with dictionaries in other languages. For testing and execution purposes, only valid domains with English names are considered.
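One possible way to compute this feature is to scan the leftmost label for substrings that appear in a word list. The sketch below is an assumption about how such a check might look, not the author's implementation: `TINY_DICTIONARY` stands in for the 274,739-word dictionary, and the minimum word length of 4 is a hypothetical parameter.

```python
# Stand-in for the article's 274,739-word English dictionary.
TINY_DICTIONARY = {"company", "google", "video", "mail"}


def words_found(label: str, dictionary: set[str], min_len: int = 4) -> set[str]:
    """Return every dictionary word of at least `min_len` characters inside `label`."""
    label = label.lower()
    found = set()
    for start in range(len(label)):
        for end in range(start + min_len, len(label) + 1):
            if label[start:end] in dictionary:
                found.add(label[start:end])
    return found


def lacks_meaningful_words(domain: str, dictionary: set[str], min_len: int = 4) -> bool:
    """Binary feature: True when the leftmost label contains no dictionary word."""
    first_label = domain.rstrip(".").split(".")[0]
    return not words_found(first_label, dictionary, min_len)


if __name__ == "__main__":
    for name in ("company1d4faf4.com", "t3622c4773260c097e2e9b26705212ab85.ws"):
        print(name, lacks_meaningful_words(name, TINY_DICTIONARY))
```

The minimum length matters with a full dictionary: very short words would match random strings by chance, so a real implementation would need to tune it.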
Number of Characters: Because it relies on randomness, a DGA generates domains long enough to ensure a good level of variation and to avoid collisions. There is no fixed rule for this length, so an average value was computed from a sample of 50 malicious DGA-generated domains; it populates the last feature used to train the algorithm.
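The article does not publish the 50-domain sample or say how the average enters the model, so the following is only one plausible reading: compute the mean length of the leftmost label over known DGA domains, then express a candidate's length relative to that mean. The three Dyre domains from the test in this article are used as a stand-in sample, and the function names are assumptions.

```python
from statistics import mean

# Stand-in sample: the three Dyre domains from this article's test, not the 50-domain set.
KNOWN_DGA_SAMPLE = [
    "t3622c4773260c097e2e9b26705212ab85.ws",
    "u83ccf36d9f02e9ea79a9d16c0336677e4.to",
    "v02bec0c090508bc76b3ea81dfc2198a71.in",
]


def first_label(domain: str) -> str:
    """Return the leftmost label of a domain name."""
    return domain.rstrip(".").split(".")[0]


def mean_label_length(domains: list[str]) -> float:
    """Average length of the leftmost label across a sample of domains."""
    return mean(len(first_label(d)) for d in domains)


def length_ratio(domain: str, reference_mean: float) -> float:
    """Numeric feature: candidate label length divided by the reference mean."""
    return len(first_label(domain)) / reference_mean


if __name__ == "__main__":
    reference = mean_label_length(KNOWN_DGA_SAMPLE)
    print(reference)  # each sample label has 34 characters, so the mean is 34
    print(round(length_ratio("company1d4faf4.com", reference), 2))
```

A ratio near 1 means the candidate is about as long as the known DGA names; the shorter fictitious company domain scores well below that.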
To illustrate how the algorithm operates, a test run is performed. The data submitted to the algorithm consists of 5 previously selected domains, four of them real. The algorithm runs on a Core i3 computer with 8 GB of RAM.
Three of these domains have been reported and linked to the Dyre Trojan, one is a legitimate Google domain, and one is a fictitious domain representing a company’s legitimate page, but with an unusual character pattern:
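t3622c4773260c097e2e9b26705212ab85.ws | Dyre banking Trojan
u83ccf36d9f02e9ea79a9d16c0336677e4.to | Dyre banking Trojan
v02bec0c090508bc76b3ea81dfc2198a71.in | Dyre banking Trojan
2axgfcf4n-bpbe.googlevideo.com | Google legitimate URL
company1d4faf4.com | Fictitious company’s domain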
The test consisted of two stages:
- Submit each domain from the list above to the algorithm.
- Train the algorithm to reduce any false positives.
Phase 1 – First Execution of the Algorithm
The content listed below is the output screen after the algorithm runs. Each row is the result for one domain. The value in the “Status” column is the risk identified by the algorithm.
t3622c4773260c097e2e9b26705212ab85.ws | Status: suspicious
u83ccf36d9f02e9ea79a9d16c0336677e4.to | Status: suspicious
v02bec0c090508bc76b3ea81dfc2198a71.in | Status: suspicious
2axgfcf4n-bpbe.googlevideo.com | Status: suspicious
company1d4faf4.com | Status: possible_safe
Analysis runtime: less than 1 second
Result: The algorithm correctly categorized the domains related to the Dyre malware as “suspicious” and, despite its unusual pattern, categorized the company’s fictitious website as “possible_safe”: although it identified some unusual traits, it also found other relevant metrics that classify the domain as safe. Google’s domain was categorized as “suspicious” because its construction pattern is compatible with a DGA.
Phase 2 – Second Execution of the Algorithm: Fine-Tuning
Servers that handle a high number of requests may use a DGA to name the hosts responsible for distributing the request load, but this does not necessarily indicate malicious behavior. This is the case with the Google domain in this test.
To reduce the false positives generated by this approach, we train the algorithm to recognize domains that come from a trusted source.
This refinement creates a point of distinction in the algorithm that helps it separate DGA domains that are malicious from those that are not.
t3622c4773260c097e2e9b26705212ab85.ws | Status: suspicious
u83ccf36d9f02e9ea79a9d16c0336677e4.to | Status: suspicious
v02bec0c090508bc76b3ea81dfc2198a71.in | Status: suspicious
2axgfcf4n-bpbe.googlevideo.com | Status: safe
company1d4faf4.com | Status: possible_safe
Analysis runtime: less than 1 second
Result: The algorithm classified all tested domains correctly. To work around the false positive involving Google’s domain, the parent domain *.googlevideo.com was explicitly added to the algorithm’s training as a trusted source. The algorithm needs this “trigger” so that legitimate, authorized domains that use DGA-style names for their hosts or subdomains can be considered trusted. In this second run, Google’s domain was classified as “safe”.
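The article trains the trusted parent domain into the model. A simpler stand-in for the same idea, shown here only as a sketch, is a suffix check against a list of trusted parent domains that runs before or after classification. The set `TRUSTED_PARENTS` and the function name `is_trusted` are assumptions, not part of the author's implementation.

```python
# Assumed allowlist of trusted parent domains; the article names only googlevideo.com.
TRUSTED_PARENTS = {"googlevideo.com"}


def is_trusted(domain: str, trusted: set[str] = TRUSTED_PARENTS) -> bool:
    """True when `domain` equals a trusted parent or is a subdomain of one."""
    name = domain.rstrip(".").lower()
    # Matching on "." + parent prevents look-alikes such as "evilgooglevideo.com".
    return any(name == parent or name.endswith("." + parent) for parent in trusted)


if __name__ == "__main__":
    for name in ("2axgfcf4n-bpbe.googlevideo.com", "evilgooglevideo.com"):
        print(name, is_trusted(name))
```

Matching on a label boundary rather than a plain string suffix is the important design choice here, since an attacker can register names that merely end with or contain a trusted string.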
In this test, Machine Learning and NLP proved effective at identifying domains whose names are generated by a DGA. With additional features and further training, the same approach could also be applied to detect other patterns related to malicious or suspicious activity.
The speed of the analysis is a strong point, given that no extra runtime contextualization or external feed checks are needed.