A good portion of the security research done at Cygilant revolves around alerting. For us, an alert occurs when a data point in a log message contains a value we were waiting to see. These data points are usually values such as IP addresses, authentication statuses, network protocols, or error codes. This work is ongoing because there are continually new and better ways to determine whether something unique or nefarious is occurring on a system. The log messages we parse come from devices and applications deployed within the environment and are commonly referred to as SIEM (security information and event management) data. Most of the hardware and software you are familiar with produces SIEM data, which makes it useful for determining what is happening on the systems you are monitoring.
One of the difficulties with checking SIEM data for values is that there is no standardized format for the information these messages contain. Therefore, we develop parsers that read these messages and normalize the data into a standard model. This allows us to create alert rules that check for correlation and aggregation across multiple devices or applications. The standardized data model also helps us notice specific occurrences of values on particular devices or applications, which is useful when two SIEM messages contain the same error code but come from different sources.
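To illustrate the idea, here is a minimal normalization sketch. The two vendor log formats, the field names in the standardized model (`src_ip`, `status`, `event_code`), and the status vocabulary mapping are all hypothetical stand-ins, not our actual parser definitions:

```python
import re

# Hypothetical patterns for two vendor-specific log formats.
VENDOR_PATTERNS = {
    "vendor_a": re.compile(
        r"src=(?P<src_ip>[\d.]+) action=(?P<status>\w+) code=(?P<event_code>\d+)"
    ),
    "vendor_b": re.compile(
        r"(?P<event_code>\d+)\|(?P<status>\w+)\|(?P<src_ip>[\d.]+)"
    ),
}

def normalize(raw: str, vendor: str) -> dict:
    """Map a raw log line into the standardized data model."""
    match = VENDOR_PATTERNS[vendor].search(raw)
    if not match:
        raise ValueError(f"unparseable {vendor} message: {raw!r}")
    fields = match.groupdict()
    # Fold each vendor's status vocabulary into one shared value.
    fields["status"] = {
        "deny": "denied", "blocked": "denied",
        "allow": "allowed", "permit": "allowed",
    }.get(fields["status"].lower(), fields["status"].lower())
    return fields

# Two different formats, the same event, one normalized record.
a = normalize("src=10.0.0.5 action=Deny code=4625", "vendor_a")
b = normalize("4625|BLOCKED|10.0.0.5", "vendor_b")
assert a == b
```

Once every message lands in the same shape, a single alert rule can match the same error code regardless of which source produced it.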
Figure: Each of these devices produces SIEM data that can be analyzed and used to trigger alerts.
When an alert is triggered, it is assigned a priority based on node criticality, the alert rule, and whether any threat indicators are present. Node criticality is determined when the device or application is set up to send its SIEM data to our platform. The alert rule's priority is set when the rule is created, using regular expressions to check for values in the standardized data model. Threat intelligence checks for known-malicious values such as file hashes or IP addresses. Together, these values help the SOC team determine which alert should be investigated first.
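A sketch of how those three inputs might combine into a single triage score follows. The weights, scales, and the boost for threat-intel hits are illustrative assumptions, not our actual scoring model:

```python
def alert_priority(node_criticality: int, rule_priority: int,
                   threat_indicators: int) -> int:
    """Combine the three inputs into one triage score (higher = look first).

    node_criticality:  1-5, assigned when the device is onboarded (assumed scale).
    rule_priority:     1-5, assigned when the alert rule is written (assumed scale).
    threat_indicators: count of known-malicious values (hashes, IPs) matched.
    """
    score = node_criticality * 2 + rule_priority
    if threat_indicators:
        # Any threat-intel match jumps the queue, and more matches rank higher.
        score += 5 + threat_indicators
    return score

# A critical node with no intel hits still outranks a low-value node.
assert alert_priority(5, 3, 0) > alert_priority(1, 1, 0)
```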
What is interesting about researching alerts is the variety of constraints we have to overcome while still creating effective strategies for identifying system attacks and anomalies. Since alerts are the base on which security monitoring is built, one of the constant challenges when writing them is balancing the goals of reducing false positives and preventing inundation while still alerting on all suspicious events. From a security research perspective, it is imperative that we continuously look for opportunities to improve alerts and reduce the false positive rate.
One way of accomplishing this is by writing alerts with better conditions. To see how, consider a Windows brute force alert. A rudimentary approach would be to raise an alert whenever a log contains Windows event code 4625 (a failed logon). Since users mistyping their passwords a couple of times is a normal and valid event in most environments, we would not want to alert every time that happens. One solution is to alert only when we see multiple failures, which is where the idea of thresholding comes in. Thresholding allows us to make adjustments based on the client's environment: for a smaller client, 5 failed logins within a few seconds may be worth looking at, but for a larger client, where these mistakes are expected to happen more often, it is useful to set the threshold higher.

There is other information that can help reduce noise when monitoring. To create a better alert, we need to define what we consider a brute force. A decent definition would be something along the lines of: "A brute force is many failed logins followed by one successful login." Although this is an acceptable definition, and certainly better than alerting on each failed login, there is still more we can gather from the logs. For example, we could check whether the user's password was changed recently before a potential brute force; maybe the user forgot to update the password on their mobile device, which is causing the failed login attempts. Further, we can investigate what actions the user took after gaining access, such as attempting to elevate their privileges or install software. These kinds of indicators could be used to automatically assign a criticality rating to the alert based on contextual evidence.
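The "many failures followed by one success" definition with a per-client threshold can be sketched as a small sliding-window detector. The event codes 4625 (failed logon) and 4624 (successful logon) are real Windows codes; the class name and default tunables are illustrative:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

FAILED_LOGON = "4625"   # Windows failed-logon event code
SUCCESS_LOGON = "4624"  # Windows successful-logon event code

class BruteForceDetector:
    """Flag `threshold` failures for one account followed by a success
    inside `window`. Both knobs are tuned per client environment."""

    def __init__(self, threshold=5, window=timedelta(minutes=2)):
        self.threshold = threshold
        self.window = window
        self.failures = defaultdict(deque)  # account -> failure timestamps

    def ingest(self, account: str, event_code: str, ts: datetime) -> bool:
        """Return True when this event completes a brute-force pattern."""
        recent = self.failures[account]
        # Expire failures that fell out of the sliding window.
        while recent and ts - recent[0] > self.window:
            recent.popleft()
        if event_code == FAILED_LOGON:
            recent.append(ts)
            return False
        if event_code == SUCCESS_LOGON and len(recent) >= self.threshold:
            recent.clear()
            return True
        return False
```

Raising the threshold for larger clients is then a one-argument change at construction time, and the contextual checks described above (recent password change, post-login activity) would layer on top of a detection like this rather than replace it.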
Another area we have been focusing on lately is how to leverage the way different devices work in order to reduce noise. For example, if we are monitoring traffic on a firewall, certain packets will be denied and others allowed based on a wide variety of criteria set by the firewall vendor. Denied traffic is not always interesting from an alerting perspective, because it is typically an indication that the technology is doing its job. Suspicious traffic that was allowed by the firewall is far more interesting, as it could indicate a successful exploit attempt or other nefarious activity. We can use this additional information to distinguish between alerts that require investigation and informational alerts that provide a trail for future investigations if needed.
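The allowed-versus-denied distinction can be sketched as a simple triage function. The field names follow the normalized-model idea above, and the suspicious-port list is a toy stand-in for real detection content:

```python
# Toy stand-in for real suspicious-traffic criteria: externally reachable
# telnet, SMB, and RDP ports.
SUSPICIOUS_PORTS = {23, 445, 3389}

def triage(event: dict) -> str:
    """Route a normalized firewall event to an alerting outcome."""
    suspicious = event["dst_port"] in SUSPICIOUS_PORTS
    if event["action"] == "denied":
        # The firewall did its job; keep a record, but don't page anyone.
        return "informational"
    if event["action"] == "allowed" and suspicious:
        # Suspicious traffic that got through deserves a human's attention.
        return "investigate"
    return "ignore"

assert triage({"action": "allowed", "dst_port": 3389}) == "investigate"
assert triage({"action": "denied", "dst_port": 3389}) == "informational"
```

The informational records still land in the platform, so they remain available as a trail when a later investigation needs them.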
Another unique challenge we face is clients who do not have the resources to monitor their entire network, or who have not prioritized which nodes to cover during deployment. In determining criticality, it is paramount to understand the client's network architecture, which makes this process very specific to their business use cases. To get the best security coverage without monitoring everything, we recommend monitoring the devices responsible for security controls, such as firewalls, IDS/IPS, and domain controllers.
These are some of the techniques that help us overcome the hurdles of crafting alerts from SIEM data. It is not always as easy as collecting SIEM data and creating a notification when some value occurs. Often we have to consider the environment's context and what truly nefarious activity would look like. What excites us most about our future development is implementing guided machine learning to establish clients' alert thresholds and detect anomalies. By using machine learning to understand what normal activity on a system looks like, we will be able to better detect indicators of an attack. We will cover this in more detail in future blog posts.
Tags: Incident Management