Audit your correlation searches against your own Best Practices automatically
Updated: Feb 28
I did a talk at Splunk .conf21 about how to maintain correlation searches: pdf/mp4. One of the topics is Correlation Searches Best Practices. Having them really makes it easier to maintain correlation searches and ensure a consistent level of quality. Here are the 3 steps:
Define your own best practices. What do you want your correlations to look like?
Find a way to automate the assessment of said best practices. It’s usually possible with rest commands, logic and regular expressions.
Make a dashboard to audit your correlations against your own Best Practices. Now you know where you stand.
Having an agreed set of criteria your correlations should all pass will help drive up quality.
I go into more detail in my talk. The dashboard is available for free as part of the ES Choreographer app on Splunkbase.
Before we go over a list of ideas for best practices, let’s mention the Best Practices Evolution dashboard. It keeps track of what the best practices compliance was yesterday evening and shows you whether any rule has improved or degraded since. It is particularly useful if you are tweaking the macro that evaluates the best practices (effectively moving the goalposts) and you want to check the impact.
The following are only suggestions; you do you. Feel free to comment with your own suggestions!
Remember the steps: the first step is to define the best practices. You could do only that if you want. The next steps are about automating the assessment of compliance. Make sure you watch the demo in my talk and check out the dashboard for yourself.
Every enabled correlation search should be related to one or more of the MITRE ATT&CK techniques in an annotation.
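As a minimal sketch of how this could be automated outside Splunk, the Python below checks an annotations string for technique IDs. It assumes the annotations arrive as a JSON string with a `mitre_attack` key (for example exported from the saved search's `action.correlationsearch.annotations` setting); the function names and the ID pattern are illustrative, not what the dashboard itself uses.

```python
import json
import re

# Assumed shape of a technique ID: T1110 or T1078.004
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")


def mitre_techniques(annotations_json):
    """Return the technique IDs found under 'mitre_attack' in a JSON string."""
    try:
        annotations = json.loads(annotations_json or "{}")
    except json.JSONDecodeError:
        return []
    if not isinstance(annotations, dict):
        return []
    values = annotations.get("mitre_attack", [])
    if isinstance(values, str):
        values = [values]
    if not isinstance(values, list):
        return []
    return [v for v in values if isinstance(v, str) and TECHNIQUE_ID.match(v)]


def has_mitre_annotation(annotations_json):
    """True when at least one valid-looking technique ID is present."""
    return len(mitre_techniques(annotations_json)) > 0
```

A search with an empty, malformed, or non-MITRE annotation would simply fail the check.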
Don’t use real time searches (as in the earliest and latest start with "rt") as they are bad for performance. Unfortunately that is the default for a number of correlation searches in ES out of the box.
Without using “real” real time searches, you have the choice between “real time” schedule and “continuous” schedule. This is only relevant when the platform is under heavy stress and searches are being skipped. In security, I would say we are afraid of missing stuff and therefore should choose the “continuous” schedule.
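Both checks are easy to express as plain logic. The sketch below assumes you have already pulled each search's `dispatch.earliest_time`, `dispatch.latest_time` and `realtime_schedule` settings (where 0 means the continuous schedule); the function name and the wording of the findings are made up for illustration.

```python
def real_time_findings(earliest, latest, realtime_schedule):
    """List the real-time related problems of one saved search."""
    findings = []
    # "Real" real-time searches have a time range starting with "rt"
    if any((t or "").strip().lower().startswith("rt") for t in (earliest, latest)):
        findings.append("real-time search window")
    # realtime_schedule=1 is the "real time" schedule, 0 is "continuous"
    if str(realtime_schedule).strip().lower() in ("1", "true"):
        findings.append("real-time schedule (runs may be skipped under load)")
    return findings
```

An empty list means the search passes both checks.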
The search duration (difference between latest and earliest) should be at least as big as the time between runs. E.g. it's ok to look back one hour every 5 minutes, but it's not ok to look back 5 minutes every hour. It goes without saying but I’ve seen more embarrassing bugs in my time.
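Here is a rough sketch of that comparison. It assumes simple relative time modifiers with the short unit names only (`s`, `m`, `h`, `d`, `w`), ignores snapping such as `@m`, and expects the time between runs to have been worked out from the cron schedule separately; all names are illustrative.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
RELATIVE = re.compile(r"^-(\d+)(s|m|h|d|w)(@\w+)?$")


def seconds_ago(modifier):
    """Convert '-70m@m' or 'now' to an offset in seconds (snapping ignored)."""
    modifier = modifier.strip().lower()
    if modifier == "now":
        return 0
    match = RELATIVE.match(modifier)
    if not match:
        raise ValueError(f"unsupported time modifier: {modifier!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def covers_interval(earliest, latest, interval_seconds):
    """True when the search window is at least as long as the time between runs."""
    window = seconds_ago(earliest) - seconds_ago(latest)
    return window >= interval_seconds
```

With this, looking back one hour every 5 minutes passes and looking back 5 minutes every hour fails.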
You don't want to overwhelm the platform, so decide for yourself how much delay you can tolerate: don’t run a search every 2 minutes if the alert can wait 5 minutes. Or 10 minutes. Or half an hour.
If your search is taking 2 seconds to run, maybe it’s ok to run it more often. If it takes minutes and a lot of resources, maybe you should run it every half an hour.
When it comes to choosing the latest time, I consider that there are two types of searches.
For searches that rely on a threshold (i.e. more than X times during Y minutes), you have no choice but to make sure latest_time is never later than -10m@m for raw searches and -15m@m for data model searches (possibly further back for Network_Traffic or Endpoint). This allows time for logs to make it into Splunk and for data models to be accelerated; otherwise the count for that period would be incomplete and skew the perception of what happened. This is of course not so important if the period of time is big (bigger than an hour).
For searches that simply trigger on the occurrence of some event, there is a choice. If earliest_time is far enough in the past that there is enough overlap with the next run of the correlation search to allow for indexing and data model acceleration delays, then you should set latest to "now" to avoid any unnecessary delays: if the info is there, the alert will be raised as early as possible. If it’s not there yet, oh well, the next run will catch it. However, if the rule is a bit heavy (using join for instance) and you don't want to run it too often, then it might be best to run it without overlap and with some delay. E.g.: -70m to -10m every hour. No overlap between searches, but enough delay. The delay needed depends on the relevant data model’s typical acceleration lag.
All correlation searches whose consecutive runs overlap should be throttled. You should throttle based on sensible fields, but the best practice has no way of automatically judging what fields should be used. The throttle duration should be at least as big as the search time window. If a search looks back 2 hours every hour but is throttled for only one hour, the same event will be raised twice: once by the run that first sees it and again by the next run, once the throttle has expired.
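The rule boils down to two comparisons, sketched below. The inputs are assumed to be already converted to seconds (search window, time between runs, throttle period); the function name and verdict strings are illustrative only.

```python
def throttle_verdict(window_s, interval_s, throttled, throttle_s=0):
    """Judge the throttling of a search from its window, interval and throttle period."""
    overlap = window_s - interval_s
    if overlap <= 0:
        # Consecutive runs never see the same event twice
        return "ok: no overlap, throttling optional"
    if not throttled:
        return "fail: overlapping runs without throttling"
    if throttle_s < window_s:
        return "fail: throttle shorter than search window"
    return "ok"
```

For the example above, `throttle_verdict(7200, 3600, True, 3600)` fails because the one-hour throttle is shorter than the two-hour window.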
This best practice figures out whether the search is an accelerated data model search (tstats summariesonly=t), a plain tstats search not using any data model, a search based on an inputlookup, a raw search over ironport data (allowed because of lack of alternatives!), or a raw search over Splunk internal logs (index=_internal OR index=main sourcetype=*splunkadmin*).
All of these are allowed and make the best practice green unless the search uses some heavy commands (such as join). Anything else is considered a raw search and should be avoided for obvious performance reasons.
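A simplified sketch of this kind of classification, using nothing but regular expressions on the search string, could look like the following. It leaves out the site-specific exceptions (ironport, splunkadmin), only treats `join` as heavy, and adds an "unaccelerated data model" bucket for tstats over a data model without summariesonly; that last bucket and all names are my assumptions for the example.

```python
import re

HEAVY_COMMANDS = ("join",)  # extend with whatever you consider heavy


def search_kind(spl):
    """Classify a search string into a coarse type."""
    text = spl.strip().lower()
    first = re.match(r"\|?\s*(\w+)", text)
    command = first.group(1) if first else ""
    if command == "tstats":
        accelerated = re.search(r"\bsummariesonly\s*=\s*(t|true)\b", text)
        datamodel = re.search(r"\bdatamodel\s*=", text)
        if datamodel and accelerated:
            return "accelerated data model"
        if not datamodel:
            return "plain tstats"
        return "unaccelerated data model"
    if command == "inputlookup":
        return "inputlookup"
    if re.search(r"\bindex\s*=\s*_internal\b", text):
        return "internal logs"
    return "raw"


def uses_heavy_command(spl):
    """True when any command from HEAVY_COMMANDS appears after a pipe."""
    return any(
        re.search(rf"\|\s*{re.escape(c)}\b", spl, re.IGNORECASE)
        for c in HEAVY_COMMANDS
    )


def is_green(spl):
    """Allowed search type and no heavy command."""
    allowed = search_kind(spl) not in ("raw", "unaccelerated data model")
    return allowed and not uses_heavy_command(spl)
```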
Correlation searches should be formatted nicely and consistently to help maintenance. To achieve this: press ctrl-\ and make sure to add a line break before the where and by clauses of a long tstats (short one-liners should remain on one line though). Also add a line break before the from if there is a lot of stuff before it. E.g. `| tstats summariesonly=t count from datamodel=...` is fine but `| tstats summariesonly=t count values(Whatever.something) as blah latest(_time) as lastTime avg(Whatever.foo) as bar from datamodel=...` needs a line break before the from.
To validate this, the best practice looks for some of the most common SPL keywords and checks that they have a line break before their pipe. It also checks the various bits of a tstats command.
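The keyword part of that check can be sketched as below. The list of keywords is an arbitrary sample, and the sketch does not try to be clever about pipes inside quoted strings, so treat it as an approximation rather than the actual macro.

```python
import re

# An arbitrary sample of common SPL commands to check
KEYWORDS = ("eval", "stats", "where", "search", "lookup", "rename", "table", "fields")
PIPED = re.compile(r"\|\s*(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def commands_missing_line_break(spl):
    """Return the piped commands that do not start their own line."""
    offenders = []
    for match in PIPED.finditer(spl):
        # Text between the start of the line and the pipe must be blank
        line_start = spl.rfind("\n", 0, match.start()) + 1
        if spl[line_start:match.start()].strip():
            offenders.append(match.group(1).lower())
    return offenders
```

An empty result means every checked command sits at the start of its own line.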
If there is a "user" field in the search, this best practice checks if the field "is_leaver" is set and affects the notable title. Of course, this is only relevant if the search raises a notable.
One disadvantage of using tstats is that identities and assets are no longer automagically looked up by ES. So you need to explicitly use the get_asset and get_identity4events macros. I would recommend creating wrappers around these to improve them, for instance using a DHCP timed lookup to transform DHCP IPs into hostnames. It all depends on your setup.
Looking entities up is useless if you end your correlation with | table user signature dest. A good way to ensure all the potentially useful stuff is there is to use * at the end of the same table command.
These lookups generate a lot of fields and in order to reduce the amount of visual pollution and horizontal scrolling, you should use the remove_empty_or_null_fields macro.
To provide a maximum of information and context to the analysts, whether in a notable or a risk, it is important that the search returns as much information as possible. However, this cannot be checked automatically as the best practice doesn't know what is relevant to the analyst. For instance if your correlation ends with | table _time dest signature it will think it's all good, but you might be missing useful fields like "user" etc.
For notables, another thing that is important is that the fields returned by the search are displayable in IR. For instance you might have: | table destination_host logged_in_user what_happened at the end of the correlation search. These presumably useful fields won't be displayed in the notable event in IR because they are not listed as displayable in the IR configuration. You would need to either update this configuration (check out the Incident Review Fields dashboard) or, much better if you can, rename your fields to use CIM compliant field names, such as | table dest_host user signature.
The best practice tries to get a list of relevant fields your search might be returning by looking for fields mentioned in the last table command if any and the last stats command if any. If you have both, it'll use their positions in the search to figure out which one is the closest to the end of the search and retain that as the list of key fields. Then it checks these are displayable in IR. If there are any bad fields and the detection raises a notable, the best practice isn't met.
"Bad" fields might be fields you want in the correlation search but not in the notable event (workflow action fields would be a good example). The workaround is to end your correlation search with something like this: | table key-fields-in-an-order-that-makes-sense *. The * ensures all the other fields are there without confusing the best practice, and the list of fields specified indicates the important fields and ensures the best practice will check for you that they are displayable. It also means that if an analyst runs the correlation in a search bar, the key fields are listed first in the results.
The notion of key fields for a correlation can be re-used in other things, for instance for Risk-Based Alerting (RBA) analysis. I want to do a post about this one day, if not another conf talk.
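To make the key fields idea more concrete, here is an approximate sketch of the extraction described above: take whichever of the last table or last stats command is closest to the end of the search, list its fields (ignoring `*`), and compare them with the set of fields displayable in IR. The parsing is deliberately naive (for stats it only keeps `as` aliases and `by` fields), and the set of displayable fields is something you would have to supply yourself.

```python
import re

LAST_COMMAND = re.compile(r"\|\s*(table|stats)\s+([^|\]]+)", re.IGNORECASE)


def key_fields(spl):
    """Fields named by the last table or stats command, whichever comes last."""
    last = None
    for match in LAST_COMMAND.finditer(spl):
        last = match
    if last is None:
        return []
    command, args = last.group(1).lower(), last.group(2).strip()
    if command == "table":
        return [f for f in re.split(r"[\s,]+", args) if f and f != "*"]
    # stats: keep the aliases and the fields of the by clause
    aliases = re.findall(r"\bas\s+([\w.]+)", args, flags=re.IGNORECASE)
    by = re.search(r"\bby\s+(.+)$", args, flags=re.IGNORECASE | re.DOTALL)
    by_fields = re.split(r"[\s,]+", by.group(1).strip()) if by else []
    return aliases + by_fields


def undisplayable_fields(spl, displayable):
    """Key fields that are not in the supplied set of IR-displayable fields."""
    return [f for f in key_fields(spl) if f not in displayable]
```

Ending a search with `| table dest_host user signature *` yields the three named fields and leaves everything brought in by `*` unchecked, which is exactly the point of the workaround.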
Every correlation should have an analyst guide on a wiki page that explains the basics of how to handle the alert. The workflow action best practice (see below) will ensure there is a workflow action for it, but it’s impossible to assess the quality (or indeed the existence) of the wiki page. This therefore needs to be set manually.
Every correlation should have at least one dashboard that helps investigate it. As for the analyst guide, it’s impossible to automate the assessment of the quality or presence of such a dashboard, so that’s another manual best practice.
Workflow actions are typically defined to be offered only when a certain field is present. Also, to distinguish them from boilerplate stuff, workflow actions specific to a use case should have a label starting with a "*". Check out the Workflow Actions dashboard. A correlation search that wants to use a workflow action should simply create that unique field in its results. Workflows should be used for several things:
there should be at least one analyst guide associated with the use case (see the guide best practice) and it should be linked with one of the workflow_analyst_guide* workflows.
if the search relies on one or more lookups (say for exceptions) and it raises a notable event, there should be a workflow_edit_lookup* workflow offering to edit each of them.
finally, there should be at least one workflow leading to an investigation dashboard (see the dashboard best practice).
The best practice cannot automatically check what the non-guide and non-lookup workflows are for, or if they do work and provide value, hence the need for a couple of manual best practices: guide and dashboard.
Note: workflows are not just for rules raising notables and are relevant to risk-only rules, via the risk overview dashboard magic. I want to do a post about this one day, if not another conf talk.
Ideally, every search should have a harmless way to get triggered; if possible this should be part of the morning checks script performed daily; and the success of the check (as in a notable or risk actually being raised) should be monitored on the morning_checks_checker dashboard. Of these, the best practice can only automatically check the last one.
In addition the correlation search should handle the morning check by setting is_morning_check to "yes" and tweaking actions accordingly: for a notable, set severity to "informational" and for a risk tweak the score to 0.
Similar to morning checks, but for automated red team exercises. Regular automated red team activity should be identified and is_redteam set to "yes". Just like for morning checks, the search should tweak actions accordingly: for a notable, set severity to "informational" and for a risk tweak the score to 0 or 1 (tweaks_risk_for_redteam).
You could then have a notable event suppression rule based on is_redteam being "yes" and severity being "informational".
Note: The best practice cannot automatically know whether or not the detection is testable with automatic red team exercises, or whether or not this is monitored appropriately in a dashboard somewhere (which it really should be).
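In a real rule this tweaking would live in the SPL itself, but the intended behaviour is simple enough to spell out as plain logic. The sketch below treats one search result as a dictionary; the function names are made up, and it picks a score of 0 where you might prefer 1 for red team activity.

```python
def adjust_for_tests(result):
    """Return a copy of a result with actions toned down for test activity."""
    adjusted = dict(result)
    flags = (result.get("is_morning_check"), result.get("is_redteam"))
    if "yes" in flags:
        adjusted["severity"] = "informational"
        adjusted["risk_score"] = 0
    return adjusted


def matches_redteam_suppression(result):
    """The generic suppression rule: red team activity at informational severity."""
    return result.get("is_redteam") == "yes" and result.get("severity") == "informational"
```

Genuine detections pass through untouched and are never matched by the suppression rule.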
Notables raised by correlation searches should either use defaults for the default_owner and default_status or assign a default_owner and set default_status to "Under Test". Anything else is bad.
There shouldn’t be any suppressions in IR that are specific to the correlation. This is because users and developers have little visibility of the logic, or even the existence, of such a suppression rule.
Instead, the correlation search should set is_suppressed="yes" in its SPL and there should be a generic suppression for that. That way the logic of the suppression is plainly visible in the rule.
All searches that raise a notable should have a drilldown. Drilldowns are a bit of a maintenance headache.
I used to think a good approach was to have tstats in the rule and a raw search in the drilldown, as that can potentially give more info. However, we've been improving the content of the data models and the fields shown inside the notable events so much that I don't think this is very important any more. Plus we have the workflow actions that are more powerful anyway. And if we had this approach, any tweak to the correlation search would require more careful thinking to figure out how to port it to the drilldown to ensure they don't clash together and confuse the analyst.
So I now believe the drilldown should be an identical copy of the correlation rule. This allows the user to re-run the rule and potentially tweak the time range or even the search filters to try to get further context. It also provides the user a way to bypass any throttling that might have taken place in IR.
Much less important is the name of the drilldown. The best practice will only be 100% happy if the name is exactly `Run the correlation search again`. Again this reinforces the fact that the drilldown is the same as the correlation search and allows consistency through the various notables in IR.
Note: recent changes in ES might make this section slightly out-of-date.
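For what it's worth, the drilldown check itself is a simple comparison. The sketch below assumes you have the search string plus the notable's drilldown name and drilldown search (the `action.notable.param.drilldown_name` and `action.notable.param.drilldown_search` settings); ignoring whitespace differences when comparing is my own simplification, and it would also hide differences inside quoted strings.

```python
EXPECTED_NAME = "Run the correlation search again"


def normalize(spl):
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(spl.split())


def drilldown_verdict(search, drilldown_name, drilldown_search):
    """Compare a notable's drilldown with the correlation search it belongs to."""
    if not drilldown_search or not drilldown_search.strip():
        return "fail: no drilldown"
    if normalize(drilldown_search) != normalize(search):
        return "fail: drilldown differs from the correlation search"
    if drilldown_name != EXPECTED_NAME:
        return "warn: unexpected drilldown name"
    return "ok"
```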
Most correlations should raise a risk score. This helps paint a picture of what happens on a host or for a user over time, where seemingly unrelated events can be correlated by the risk framework.
If the rule has provisions to detect morning checks and/or automated red team activity, these provisions should be extended to tweak the risk score (to zero or one, presumably).
When a risk is raised, the key metadata is:
the risk type: user or system
the risk object: what field holds the thing to raise a risk about, e.g. "dest" or "src_user"
the risk score
the correlation search's name
the correlation search's description
and last but not least the risk's description itself.
The risk object must be as consistent as possible: use the identities macro to help.
Consider having the risk_score specified explicitly in the SPL and then using a custom macro to tweak it based on your own criteria (e.g. user_priority, whether they are leavers, etc).
Descriptions are very important to make the risk framework helpful and readable.
The description at the very top of the correlation search definition is used as the "savedsearch_description" field of the risk score added by that rule (if it does add a risk score). Keep this in mind when writing it. For instance "Looks for excessive quantities of email attachments and applies risks." doesn't read well in this context. Prefer something like "Excessive quantities of email attachments".
Beyond that, it is much better to forget about the savedsearch_description and include specific details in a dynamic field called "description" in the correlation. For instance `| eval description="AV '" . signature . "' file: '" . file_name . "' process: '" . process . "'"` is going to be much more useful than the standard "malware detected" description that comes from the correlation search.
If the correlation search is also raising a notable event, then it should set the "risk_object" field explicitly. This will allow the notable to benefit from the risk overview workflow action.