This post was originally published here by Christopher McCubbin.
In a recent Boston Bsides talk, David Bianco and I briefly mentioned the use of isolation forests to find unusual behavior in cybersecurity log files. Today, we will take a deeper dive into the techniques that we experimented with. These experiments were run in collaboration with Dimitar Karev, our RSI intern. The results we present here are also discussed in a paper that Dimitar wrote for CompSysTech’17. In our experiments, we look at HTTP log data to explore isolation forests’ capability to find malicious outliers under various conditions. We also explore tuning the algorithm parameters and feature space to produce optimal results.
Anomaly Detection and iForest
An Isolation Forest (iForest) is an unsupervised algorithm, which was designed in 2008 by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou. Despite being intended for the purposes of anomaly detection, iForests have not often been used for threat hunting. In view of their effectiveness in other anomaly detection domains, we wanted to explore iForests for use in cyber threat detection.
iForests were presented as a novel type of technique for multidimensional anomaly detection. Previous techniques have used model creation and detection of points outside the model, or angle-based approaches. See this link for a good survey of anomaly detection techniques. iForests leveredge the fact that anomalies are by definition few and different. These characteristics make anomalies more sensitive to isolation than the normal instances, which is the main idea behind the iForest structure.
iForests measure the outlierness of a point by determining how long it takes to isolate that point. To do this, we want the average depth at which the point is isolated in a set of data structures called isolation trees. This is similar to the the notion of random forests, with some key differences. An Isolation Tree (or iTree) is a binary decision tree. Each node in the iTree represents a collection of vectors (n-dimensional points), which are a subset of the original dataset. To construct the tree, at each vertex two random values are chosen: the first one represents a random choice for determining which dimension should be considered and the second represents a random value between the min and max value of the subset. This value is then used to divide the subset. Vectors smaller than this random value in the chosen dimension populate the left child, while larger entries form the right child. The iTree is built recursively for each of the node’s children until there is only one vector in each leaf node. For any given point, we can measure the depth of its personal leaf node in this tree. The assertion here is that the sooner the point is isolated, the more of an outlier the point is. So smaller depth numbers mean the point is more of an outlier.
To smooth out possible bad random choices, we will want to make many of these trees, an isolation forest, and average the depths for a point across many trees. The lower this average value is, the more outliery the algorithm considers that point. We also wish to avoid ‘swamping’ and ‘masking’ effects. Swamping is an effect where outliers are close to, but distinct from, normal data. Masking is when a set of anomalies is formed by a rare process and aren’t random but form a small distribution of their own. In this case the outliers may all be clumped together and artificially increase the isolation depth. One way to mitigate these effects is to train the forests on samples of the original data. Details of this technique are available in the original iForest paper.
Testing iForests on HTTP Data
Many cyber attacks leave traces in HTTP data. For these experiments, we restrict the dataset to HTTP. We tested the concept of using iForests to detect unusual HTTP records by using Bro HTTP logs of normal and known-malicious behavior. For a deeper dive on Bro logging, look at this piece from Josh Liburdi.
We conducted several experiments using two sets of data: one of normal user activities, and one of malicious activities. The malware file was created from web examples of known malicious attacks extracted from two websites: ContagioDump and Malware Domain List. The normal data was recorded using Sqrrl’s own attack lab cluster. The cluster is a mid-sized enterprise hosted by Simspace which contains endpoint machines, services, and network hardware. A multi-day simulation of normal business activity was registered using intelligent agents and the data of network activity was recorded on an internal switch and on a border router. Both normal and malicious datasets consist of Packet Capture (PCAP) data, which are a record of network activity. We decoded the PCAP files through Bro and used the default Bro HTTP log for these experiments. For this test, there were 37,978 malicious entries and 271,129 normal entries in the Bro logs. In real life, the ratio of normal to malicious behavior would typically be lower.
iForests work exclusively on numeric dimensions so we need to transform the non-numeric fields of Bro HTTP logs into numbers for use as a feature. The process to do this most effectively is an open question; we chose to use one-hot representations for these features but there are probably better ways to represent this data. Perhaps these features could be processed through Word2Vec to produce a latent space that would be more amenable to analysis with the iForests.
The table below contains the list of the Bro HTTP columns that we considered. The non-numeric features that were processed through a bag of words or N-grams can produce more than one resulting dimension in the vector space. For bag of words features, each word turns into a new “one-hot” dimension where the HTTP sessions which have this word, receive value of 1 for the dimension, and all other receive 0. The N-gram features are produced by getting all possible n-consecutive symbol sequences from a string and processing them similarly to the bag of words features. The entropy features use the standard Shannon entropy formula. Certain features that are extracted from other fields were made based on our in-house expertise of malware trends; e.g., malicious requests tend to be longer than the normal requests. We were interested in finding a subset of features from this large list, which maximizes the performance of the algorithm.
Despite being unsupervised, we can measure the performance of the algorithm against our data since it is labeled. Given some cut-off in outlier factor, we can measure how many of the malicious records the algorithm identifies (true positives) vs. how many normal records the algorithm identifies as malicious (false positives) and how many malicious records are identified as normal (false negatives).
iForest Parameters
The iForest as described by Liu is dependent on two parameters: number of iTrees in the forest and number of samples in each iTree. Liu et al. showed empirically for their data that 100 iTrees and 256 samples in an iTree is enough for the data they tested. They suggested that the same parameters will be close to optimal for any type of data, meaning the accuracy of the model coverages very quickly at around ψ = 256 and t = 100. We tested their conjecture on our data. To evaluate and compare the results, we use receiver operating characteristic (ROC) curve which is a complete empirical description of decision threshold effect. The ROC curve plots the true positive rate against the false positive rate at various threshold values. To measure the accuracy of the algorithm, we calculate the Area Under the Curve (AUC) in the interval between 0.1% and 1% false positive rate. We really only want to optimize for these low FP rates because a greater false positive rate would be impractical to use as a discriminator. Hence, the maximal result for AUC that can be obtained is 0.009 if we label the data perfectly. The figure below presents an example ROC curve created during our testing. We are interested in maximizing the area under the red segment.
We expected that the optimal parameters in our algorithm will be different from the parameters suggested by Liu et al. because the algorithm implemented in the context of Cyber Threat Detection uses data which are very different from those used in the original paper. Therefore, our first experiment measured both the AUC and the running time required for different values of ψ and t. From our experiments, we can see that the AUC is maximized at ψ = 8192 and t=200 for this dataset, which is quite different from the original paper’s results.
From these initial experiments, the results are promising. Given these values of the AUC, we can definitely distinguish malicious data from normal data, at least in these contrived datasets. We realized that unimportant features may lead to excessive swamping and masking using this technique. This led us to ask, can we do even better using careful feature selection?
Features – Extraction and Selection
A naive approach to finding the best feature set would try every possible subset of features. This is impractical however since testing every possible combination of features requires an exponential number of experiments. We could use any number of stochastic optimization techniques to find a ‘good enough’ set of features. For this experiment we chose to use an off the shelf genetic algorithm (Distributed Evolutionary Algorithms in Python (Deap)) to explore the hyperparameter search space.
Each individual hyperparameter choice is represented by 20 booleans – if a feature is presented in the individual the corresponding value is 1, otherwise it is 0. We ran the algorithm with 25 individuals in each generation and a total number of 20 generations. There is 75% chance of mutation which represents a flip in a random bit, and 75% chance of two-point crossover. For controlling the size of the population we use tournament selection with size of 5. After running the evolutionary algorithm driven by the AUC as a fitness function, we can make the following observations:
- The fields – method, user_agent, URIParamsTokenEntropy, domainNameLength, domainNameDots, uriSlashes, userAgentLength, and referrerPresent – are present in every top 5 individual. Hence, it is likely that they can differentiate normal from malicious data, which implies that they contain important information about the malicious activities. It’s interesting to note that many of our SME-produced features are present in this list, validating their hypotheses that these features were important.
- The fields – host, uri, referrer, status_code, URIparams, subdomain, and tld – are not present in any of the top 5 individuals. It’s interesting to note that all these fields are n-gram or bag-of-words features, lending evidence to the hypothesis that this is not a great way to represent this information for iForests.
- All other fields are present more than once and less than five times in the top 5 individuals. Some of these features may contain some discriminatory information, but it’s hard to tell.
Algorithm’s Performance
We want to measure the algorithm’s performance, so we compare the accuracy and the efficiency before and after applying the found results. The initial settings of the algorithm were 100 iTrees, 256 samples per iTree and using all features described in Section 3. After conducting a number of experiments, we found new parameters – 200 iTrees, 8192 samples in each iTree, and we use only the features from the best individual – resp_p, method, user_agent, response_body_len, URIParamsTokenEntropy, domainNameLength, domainNameDots, uriSlashes, userAgentLength, referrerPresent, numURIparams.
The figures below show the ROC curves for the algorithm’s performance under both the initial and the modified parameters. In the first case, the AUC is 0.00279, whereas in the second case, the AUC is 0.00828. The time required for the program to execute is respectively 836 seconds and 448 seconds, resp. Therefore, we can conclude that the accuracy increases three times, whereas the required time decreases by half.
Some questions for this technique remain. Perhaps we are overfitting to this dataset with this feature selection. More experiments are required with more data to see if these parameters are as effective in other situations. Also, as stated earlier an open problem still remains how to process nonnumeric HTTP features. There’s little extant research on how to do this most effectively in any type of data.
Furthermore, single blog-entry learning has proven to have many issues with false positives that are mitigated by sessionization and entity behavior analysis over time. Could we apply iForest-based techniques to sessions and behaviors over time? Despite these open questions, these initial experiments using the iForest technique are promising. If you’d like to try some of these experiments on your own data, the iForest code is included in our Clearcut distribution. Feel free to download the library and try it out!