Establishing a Detection Engineering Program from the Ground Up

Sohan G
Feb 8, 2023

This is a concise writeup summarising my perspective on establishing a detection engineering program effectively and efficiently with available resources.

Although I have been creating detection content for quite some time, I have primarily focused on building and deploying detections to achieve maximum ATT&CK coverage. Along the way, I realised I had overlooked the processes, frameworks, and implementations of Detection Engineering workflows. To gain a better understanding, I took a step back and adopted a bird’s-eye view, which helped me see how the whole process could have been more structured, reducing the time needed to reach maturity.

Detection engineering is a new approach to threat detection. More than just writing detection rules, detection engineering is a process that applies systems thinking and engineering to detect threats more accurately. It involves the research, design, development, testing, documentation, deployment and maintenance of detections/analytics and metrics.

Why might Detection Engineering be the Need of the Hour?

There are many reasons why this may be the “need of the hour”. The list below is not all-encompassing, but it covers the most important reasons why a structured process needs to be implemented.

  1. The current process is half-baked or unstructured, which in turn causes problems for the response actions performed by response teams (SOC, CSIRT) and for the maintenance of detections.
  2. The need to know the current security posture with respect to Defense-in-Depth and Detection-in-Depth capabilities.
  3. The need for effective use of time, so the team matures quickly by avoiding duplicated work and channelling effort into peer review and feedback sessions.
  4. The need for effective collaboration with various internal and external teams, without which the funnel gets clogged. (Refer to the Blue Team Funnel below.)
  5. The need for accountability and change management for the detection content created by detection engineers and for the triage/hunt responsibilities of response analysts.
  6. The need to effectively share the documented context of each detection with internal and external teams, so they are aware of the response process.

Now that we have a brief understanding of why a Detection Engineering Program matters, let’s look at the important concepts, frameworks and maturity models that help a team reach maturity with minimal time and effort.

  1. Detection Engineering Workflow
    It is important to start by designing a workflow that fits the internal architecture and makes the tasks at hand easier to operationalise. The SIEM Use Cases Development Workflow by Alex Teixeira and Detection-as-Code from RSA Conference are both excellent workflows that can be used as references for creating a custom workflow for the internal environment. Such workflows help streamline the provisioning, development, testing, delivery, deployment and maintenance of detection content in production.
  2. The Detection Engineering Maturity Matrix by Kyle Bailey helps us understand the team’s current maturity level with respect to People, Process and Technology (the three pillars of cybersecurity) and helps drive conversations on where and how to spend resources, time and effort to move the program towards the “Optimised State”.
  3. The Alerting and Detection Strategies (ADS) Framework by Team Palantir provides a granular documentation template that can be used as a reference for building a custom template, suited to internal needs, for documenting detections and response workflows. The resulting documentation can be shared with response teams for review and feedback.
    (NOTE: The frameworks and knowledge-bases are discussed below within the Blue Team Funnel section)
  4. Build a Detection Lab for detection development and testing, since testing on individual dedicated hosts/VMs hinders collaboration. The lab should be as close as possible to the real production environment, with the full security tool stack installed. It should also include defensive and offensive tools for testing, which keep us informed of security gaps, detection coverage and the coverage of other security tools. The lab can be used for purple teaming exercises, adversary emulations and simulations, and atomic tests.
  5. Funnel of Fidelity, Capability Abstraction, Detection Spectrum and Detection-in-Depth are concepts developed by Team SpecterOps. They help us create rules efficiently and effectively by explaining the various abstraction layers and how to fine-tune detections for high fidelity along with the broadest possible coverage.
  6. I designed the Blue Team Funnel (inspired by the Funnel of Fidelity) as a mapping that encompasses various verticals, roles, lifecycles, frameworks and stages. It provides a visual representation for those seeking blue teaming roles in cybersecurity and showcases the collective and collaborative effort required from defensive teams to accept, reduce, detect and mitigate the risks to an organisation.

Blue Team Funnel

The stages of the funnel are well explained in the Funnel of Fidelity. Let us focus on the automation levers, the frameworks and the collaborative responsibilities of Detection Engineers, mapped to the stages along with their lifecycles.

There are three main automation levers, placed between the first four stages of the Blue Team Funnel (Collection, Detection, Triage and Investigation), which, if deployed and maintained, can help reduce the time spent across the funnel from left to right. This also helps address alert fatigue in the SOC, one of the most discussed concerns and one that clogs the funnel. More time invested in the early stages of the funnel speeds up the response (detection, mitigation and remediation) to risks from adversaries, insiders and similar factors.

1st automation lever: This sits between the Collection and Detection stages and covers the automations that can be performed in the log management lifecycle. For example, automating the ingestion of new data sources by following a standard correlation/normalisation of fields across sources, along with possible log enrichments.
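To make the field normalisation idea concrete, here is a minimal Python sketch. The source names (`edr`, `proxy`), the field names and the common schema are all hypothetical and only illustrate the pattern; a real pipeline would follow whatever schema the log management platform defines.

```python
# Hypothetical per-source field mappings onto a common schema.
# Real field names depend on the log sources and the SIEM in use.
FIELD_MAPS = {
    "edr": {"hostname": "host", "user_name": "user", "cmdline": "command_line"},
    "proxy": {"src_host": "host", "login": "user"},
}


def normalise(source: str, event: dict) -> dict:
    """Rename source-specific fields to the common schema and tag the source."""
    mapping = FIELD_MAPS[source]
    # Fields without a mapping keep their original name.
    normalised = {mapping.get(key, key): value for key, value in event.items()}
    normalised["log_source"] = source
    return normalised


edr_event = normalise(
    "edr", {"hostname": "ws-01", "user_name": "alice", "cmdline": "whoami"}
)
proxy_event = normalise(
    "proxy", {"src_host": "ws-01", "login": "alice", "url": "example.com"}
)
```

Once both events expose `host` and `user`, a single detection or correlation can be written against the common fields instead of once per source.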

2nd automation lever: This sits between the Detection and Triage stages and covers the automations that can be performed within a detection-as-code CI/CD pipeline: rule sanity checks for fields, metadata tagging, testing, simulations and emulations. It also covers automated triage of alerts as part of the response workflow. For example, end-user confirmations, process tree correlations, software installation checks, code signing checks, reputation checks, severity increments, etc.
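As an illustration of automated triage, the sketch below adjusts an alert's severity from a few enrichment checks. The `Alert` fields, the 1 to 5 severity scale, the increments and the step names are assumptions made for this example, not part of any specific product.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Alert:
    name: str
    severity: int         # 1 (low) to 5 (critical); the scale is an assumption
    signed_binary: bool   # outcome of a code signing check
    reputation: str       # "known_good", "unknown" or "known_bad"
    user_confirmed: bool  # the end user confirmed the activity as their own


def auto_triage(alert: Alert) -> Tuple[int, str]:
    """Return (adjusted severity, next step) based on simple enrichment checks."""
    severity = alert.severity
    if not alert.signed_binary:
        severity += 1  # unsigned binaries raise the severity
    if alert.reputation == "known_bad":
        severity += 2  # a bad reputation raises it further
    severity = min(severity, 5)

    # Only clearly benign alerts skip the queue, and they still get a human review.
    if alert.user_confirmed and alert.signed_binary and alert.reputation == "known_good":
        return severity, "close_as_benign_pending_review"
    return severity, "send_to_analyst"


suspicious = auto_triage(Alert("Unsigned binary in temp dir", 2, False, "known_bad", False))
benign = auto_triage(Alert("New software installed", 1, True, "known_good", True))
```

Note that even the benign path ends in a review step rather than a silent closure, which keeps a human in the loop.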

3rd automation lever: This sits between the Triage and Investigation stages and covers the automations that can be performed to collect artifacts and supporting information for the DFIR workflow. For example, network connections, running processes, host and owner details, websites visited, file transfers (SaaS apps, external devices), UEBA scores, etc.

NOTE
The automations must include human intervention wherever necessary in the response workflow.

The automations need continual maintenance, failure/bypass testing, dependency testing and health checks.

When the root cause of a problem can be solved by correcting a process, procedure or workflow instead of automating, do that and avoid automating for fun.

Whenever possible, automate from the initial source rather than from an intermediate/alternate source. This minimises dependencies (if any) and shortens response closure time.

Always try to create reusable automation workflows that can be applied to multiple scenarios/use-cases with minimal effort.

Maintaining the automation workflows, building them to be reusable, including the right human interventions and checking abuse cases are as important as tracking the time a new automation workflow saves compared to manual response/triage. The gross time reduction is therefore not the right measure on its own. The right measure is the difference between the time reduction created by the new automation and the time required to maintain it, validate/test abuse cases, handle human interventions, etc. That difference is the actual time saved.
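The net-saving idea can be written down as simple arithmetic. The sketch below is one possible way to compute it; the function name, the parameters and the monthly figures are hypothetical and only illustrate the calculation.

```python
def net_minutes_saved(
    alerts_per_month: int,
    manual_minutes_per_alert: float,
    automated_minutes_per_alert: float,
    maintenance_minutes: float,
    abuse_case_testing_minutes: float,
    human_review_minutes_per_alert: float,
) -> float:
    """Gross time saved by the automation minus the time needed to keep it running."""
    gross_saved = alerts_per_month * (
        manual_minutes_per_alert - automated_minutes_per_alert
    )
    overhead = (
        maintenance_minutes
        + abuse_case_testing_minutes
        + alerts_per_month * human_review_minutes_per_alert
    )
    return gross_saved - overhead


# Hypothetical month: 200 alerts, 15 min manual vs 2 min automated handling,
# 240 min of maintenance, 120 min of abuse-case testing, 3 min of review per alert.
saved = net_minutes_saved(200, 15, 2, 240, 120, 3)
```

With these made-up numbers the gross saving is 2600 minutes, the overhead is 960 minutes, and the net saving is 1640 minutes. A negative result would suggest the automation costs more than it saves.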

The frameworks, knowledge-bases and lifecycles respective to the stages of the Blue Team Funnel:

Collection(Preparation)

  1. NIST SP 800-92: The “Guide to Computer Security Log Management” from NIST provides guidance and best practices for managing security event logs in information systems.
  2. CIS Benchmarks are a set of best practice guidelines for securing various systems and software such as operating systems, cloud environments, applications, and network devices.
  3. DeTT&CT(Detect Tactics, Techniques & Combat Threats) is a framework that assists blue teams using MITRE ATT&CK to score and compare data log source quality, visibility coverage and detection coverage.

Lifecycle — Log Management Lifecycle

Detection(Preparation and Detection)

  1. MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge) is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations.
  2. MITRE D3FEND(Detection, Denial, and Disruption Framework Empowering Network Defense) is a knowledge graph of cybersecurity countermeasures mapped to respective tactics and techniques from MITRE ATT&CK.
  3. MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers us to engage the adversaries and achieve the cybersecurity goals.
  4. MITRE Insider Threat Framework helps organisations improve their insider threat capabilities and make use of the evidence observed, monitored or collected.
  5. Threat Detection Maturity Framework helps measure and convey the maturity of the threat detection function to leadership in a structured format. It is similar to the Detection Engineering Maturity Matrix. Both frameworks can be taken as references and customised to help design, document and report a roadmap for program maturity.
  6. ADS Framework(Alerting and Detection Strategies) is a set of documentation templates, processes, and conventions concerning the design, implementation, and roll-out of detections.
  7. Insider Threat Program Maturity Framework is a framework designed to help organisations assess their maturity in detecting, preventing, and responding to insider threat. It provides a set of best practices and a common language for organisations to use when addressing insider threat.
  8. NIST SP 800-171: “Protecting Controlled Unclassified Information in Nonfederal Systems and Organizations” from NIST provides security requirements for protecting controlled unclassified information (CUI) in nonfederal information systems and organisations. This can be used as a reference for building an Insider Threat Program.
  9. NIST SP 800-172: “Enhanced Security Requirements for Protecting Controlled Unclassified Information” from NIST provides enhanced requirements for securing controlled unclassified information (CUI) in nonfederal systems and organisations. It is a supplement to NIST SP 800-171 and can also be used as a reference for building an Insider Threat Program.

Lifecycle — Detection Engineering Lifecycle

Triage(Detection and Analysis)

  1. OTHF (Open Threat Hunting Framework) gives organisations guidance on implementing the core organisational, operational and technical components needed to launch and mature a threat hunting operation.
  2. TaHiTI (Targeted Hunting integrating Threat Intelligence) Threat Hunting Methodology provides organisations with a standardised and repeatable approach to their hunting investigations.
  3. MaGMa (Management, Growth and Metrics & assessment) Use Case Framework is a framework and tool for use case management and administration that helps organisations operationalise their security monitoring strategy.

Lifecycle - Triage Lifecycle: This is something I jotted down as a simple, proven process that can be followed:

  1. Initial Assessment: The analyst assesses the incoming alert and tries to understand it with respect to the documented detection context and the abuse cases of the procedure/technique.
  2. Workflow Check: The analyst familiarises themselves with the documented response workflow for the triggered alert, including the automated response flow if one is operationalised, and understands the potential risks of performing an improper triage action.
  3. Response Actions: The analyst follows the response workflow and resolves the alert within the agreed SLA (Service Level Agreement), providing appropriate closure by tagging the alert according to the alert categorisation.
  4. Escalate/Hunt: The analyst may escalate the alert to CSIRT by following the SOC-CSIRT escalation workflow, or perform a threat hunt if the detection that triggered the alert is of type TH (Threat Hunt), meaning it has broad coverage with high recall.

Investigation and Remediation(Analysis, Containment, Eradication, Remediation and Lessons Learned)

  1. NIST SP 800-61: The “Computer Security Incident Handling Guide” from NIST provides guidelines and best practices for incident response and management.
  2. NIST SP 800-86: The “Guide to Integrating Forensic Techniques into Incident Response” provides guidelines and best practices for incorporating forensic techniques into the incident response process.
  3. RE&CT Framework is designed for accumulating, describing and categorising actionable Incident Response techniques.

Lifecycle — Incident Response Lifecycle(mapped to stages vertically and horizontally 😇)

The collaborative work responsibilities of a Detection Engineer with the neighbouring teams:

Log Management Team

  • The detection engineer is responsible for validating the quality of the metadata and the correlations performed across log sources. If something can be improved, or if the incoming enriched or correlated fields are inconsistent, they need to give feedback and discuss the observed issues with the Log Management Team.
  • The detection engineer is responsible for identifying logging gaps in new/existing sources that, once closed, would enable detections for specific use-cases. They are also responsible for identifying visibility gaps in specific sources/tooling that hinder the development of certain use-cases. These are essentially security gaps and therefore reduce detection-in-depth capabilities.

Response Team

  • The detection engineer has to act on feedback from the Response Team, whether it concerns queries/issues with existing detections or requests for new detections.
  • The detection engineer has to help resolve any misunderstandings or contextual issues regarding the detections and their response process, and work collaboratively with the Response Team on documentation, severity scores, etc.

Higher Management

  • The detection engineer is responsible for generating metrics that showcase the impact of the created detections: how they overlap with alerts from other security tooling as a defense-in-depth capability, or how they cover the security gaps left by other security tools.
  • The detection engineer is responsible for conveying how efficiently the existing security tooling is used and for helping visualise important tooling or procedural gaps that require a decision or call for action.

Roles and responsibilities of a Detection Engineer

  • Create detections for known adversary behaviours (TTPs) by following the Detection Engineering Workflow and the frameworks, concepts and knowledge-bases mentioned above. The detections need to be fine-tuned to the environment, and they have to be continuously tested and maintained.
  • Create metrics to help with reporting to higher management and to help the SOC visualise, hunt, correlate and/or prioritise alerts.
  • Correlate the detections with alerts from the rest of the security tool stack, and document overlapping alerts from various tools/detections as part of the testing performed on existing/new detections. This helps to understand the detection-in-depth and defense-in-depth capabilities of the defensive team and to work on security gaps.
  • Perform peer reviews of the detections created by other engineers to reduce the possibility of bias.
  • Work collaboratively with other teams such as Log Management, Security Operations, CSIRT and Automation Engineering, and understand their lifecycles or workflows in order to give and receive essential feedback. This improves the overall team’s progress towards maturity with respect to detection/defensive controls.
  • Prepare reports to convey the growth and impact of the detection engineering program to higher management.
  • Be available for rotations or collaborative sync-ups across the other stages of the Blue Team Funnel. This helps immensely in understanding the challenges faced in other workflows and responsibilities, and in seeing how detections can be created so that those workflows function effectively. Make the resulting changes/corrections to the workflow to reduce the time spent by teams in the later stages of the funnel.
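The overlap documentation mentioned in the list above (correlating detections with alerts from other tools) can start out very simply. The sketch below uses Python sets keyed by ATT&CK technique ID; the tool names and the coverage data are hypothetical test results, invented for illustration.

```python
# Hypothetical results of a test run: which techniques each source alerted on.
coverage = {
    "edr": {"T1059.001", "T1003.001", "T1547.001"},
    "custom_detections": {"T1059.001", "T1053.005", "T1003.001"},
}

# Techniques exercised during testing (for example via atomic tests).
tested = {"T1059.001", "T1003.001", "T1547.001", "T1053.005", "T1021.002"}

# Techniques seen by both sources: the detection-in-depth overlap.
overlap = coverage["edr"] & coverage["custom_detections"]

# Techniques that were tested but alerted nowhere: the security gaps.
gaps = tested - set().union(*coverage.values())

# Techniques covered only by the custom detections, i.e. gaps left by the EDR.
only_custom = coverage["custom_detections"] - coverage["edr"]
```

Here `overlap` contains the two techniques both sources caught, `gaps` contains `T1021.002`, and `only_custom` contains `T1053.005`. The same three sets map directly onto what gets reported: overlap, residual risk and the added value of the custom detections.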

Types of detections

  1. Host-based detections (Behaviour-based and Signature-based): These help detect TTPs related to command line executions, script executions, parent-child process relations, API calls, etc. We can refer to Sigma rules and other available open source detections (use existing rules and contribute new ones).
  2. File-based detections (Signature-based): These help detect malware IOCs via Yara rules.
  3. Network-based detections (Behaviour-based and Signature-based): These help detect network-related TTPs and/or IOCs via Snort/Suricata rules.
  4. Cloud traffic detections: These involve DLP and Threat policies that monitor real-time and API-based activities within SaaS and IaaS applications.
  5. Insider Threat detections: These help monitor and detect possible insider threats through detections and analytics for use-cases such as collection, exfiltration, user monitoring, etc., correlated with UEBA (User and Entity Behavior Analytics) scores.
  6. Miscellaneous: Many other alerts need to be set up as part of audit/compliance, health checks, etc.

Alert Categorisations and Metrics

Alert Categorisations (Ref: MITRE Engenuity)
  • True Positives (TP): Alerts that detect true occurrences of malicious activity.
  • False Positives (FP): Alerts that flag benign activity as malicious.
  • True Negatives (TN): Benign activities that are correctly not detected.
  • False Negatives (FN): Malicious activities that are not detected.

There are two main metrics for evaluating detections with respect to Signal-to-Noise Ratio (SNR):

  • Precision — The ratio of True Positives to Total Positives (True Positives + False Positives)
  • Recall — The ratio of True Positives to Total Malicious Events (True Positives + False Negatives)

The above two metrics tend to pull in opposite directions, which means:

  • If we tighten a detection to increase its precision (⬇FPs), its recall tends to drop (⬆FNs).
  • If we broaden a detection to increase its recall (⬇FNs), its precision tends to drop (⬆FPs).
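A small worked example may help here. The sketch below computes both metrics for a narrow and a broad version of a detection; the alert counts are made up purely to illustrate the trade-off.

```python
def precision(tp: int, fp: int) -> float:
    """True Positives divided by all alerts raised (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """True Positives divided by all malicious events (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for 20 malicious events seen by two versions of a detection.
narrow = {"tp": 8, "fp": 2, "fn": 12}   # tightly tuned rule
broad = {"tp": 18, "fp": 72, "fn": 2}   # broad, hunt-style rule

narrow_scores = (precision(narrow["tp"], narrow["fp"]), recall(narrow["tp"], narrow["fn"]))
broad_scores = (precision(broad["tp"], broad["fp"]), recall(broad["tp"], broad["fn"]))
```

With these numbers the narrow rule has a precision of 0.8 and a recall of 0.4, while the broad rule has a precision of 0.2 and a recall of 0.9: fewer misses, but far more noise for the response team.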

There are a couple of ways to strike a good balance between Precision and Recall scores:

  • Each detection should focus on one specific use-case/abuse-case (procedure), which makes fine-tuning much easier and eases maintenance of the detection in the long run. This helps attain good Precision and Recall scores with reduced response overhead (by integrating automations into the response workflow).
  • Detections/analytics of type TH (Threat Hunting) can be written broadly so that they catch activity the precise detections would miss (⬇FNs). They act as an overlapping layer on top of the precise detections, but they also match a lot more benign activity (⬆FPs alongside the TPs). This gives broader coverage and adds another layer of analytics.

I have created an impact track-sheet that maps the testing performed. It helps document and track the team’s progress towards the desired maturity level and helps report the program’s growth and impact to higher management.

Detection Engineering Impact TrackSheet Template (adapted from Keith McCammon’s post)

How to create the best possible detection

  • Learnt it the hard way 💪: always try to avoid combining multiple rules/detections that span procedures, sub-techniques or techniques. It may initially seem like the best way to maintain things and have alerts triggered from a single detection, but it actually makes life harder in the long run for both the Detection Engineer and the Response Team, because the detection becomes too broad in scope. It would probably catch everything that would otherwise be a “False Negative”, but the resulting mix of “False Positives” and “True Positives” makes for a rough response due to the low-fidelity nature of the alerts. It is recommended to make each detection specific to a particular procedure (not a sub-technique, technique or multiple procedures). That way it is easier to fine-tune (high-fidelity alerts), respond to and maintain the detection, which in turn reduces the detection’s overall query time.
  • For detections that are low fidelity or have a low SNR (Signal-to-Noise Ratio), create them with only the absolutely necessary rule exemptions and no more (in other words, never overdo the fine-tuning). Any further fine-tuning can be done in the automated response, which can correlate, look up and match the required conditions and then send a report to the analysts (human intervention), who either let the automated response proceed or take control of the response process.
  • Detections should always carry meaningful and necessary metadata, along with comments in the detection itself and throughout the rule creation process. The metadata gives response analysts the context they need to understand the incoming alert and its response process (e.g. detection name, detection description, MITRE TTP details, schedule duration, frequency, other workflow integrations, etc). Comments within the detection help the response team understand the nuances of how the detection works.
  • Run the detections against as many atomic tests, emulations and simulations as possible. This strengthens our understanding of how durable the detections are against a variety of procedures. It also helps us recognise the underlying security gaps and decide whether to address them via detection-in-depth/defense-in-depth techniques or to note them in the residual risks track-sheet.
  • Lastly, while going through the Detection Engineering Workflow for a new detection, try to imagine how valuable the alert would be, as opposed to being mere noise in the response pipeline (try putting on the SOC/IR hat 😅).
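The metadata point in the list above lends itself to an automated check in a detection-as-code pipeline. The sketch below is one possible shape for such a check; the required field names and the example rule are assumptions for illustration, not a standard.

```python
# Hypothetical set of metadata fields every detection must carry.
REQUIRED_FIELDS = (
    "name",
    "description",
    "attack_technique",
    "schedule",
    "response_workflow",
)


def missing_metadata(rule: dict) -> list:
    """Return the required metadata fields that are absent or empty in a rule."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(rule.get(field, "")).strip()
    ]


# Example rule with an empty description and no linked response workflow.
rule = {
    "name": "Suspicious PowerShell download cradle",
    "description": "",
    "attack_technique": "T1059.001",
    "schedule": "every 15 minutes",
}

problems = missing_metadata(rule)
```

For this rule, `problems` is `["description", "response_workflow"]`. A pipeline could fail the build whenever the list is non-empty, so that incomplete detections never reach the response team.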

Conclusion

In this blog, we have delved into a multitude of subjects. Hopefully the post provided value on how to establish and grow a Detection Engineering Program from the ground up with available resources, reaching maturity efficiently and effectively with minimal effort and time, along with a few pointers on how to be effective as a Detection Engineer.
Thanks for reading!
