
Information Security Is Becoming a Big Data Analytics Problem
Information security is becoming a big data analytics problem, where massive amounts of data will be correlated, analyzed and mined for meaningful patterns. Investments in additional tools, processes and skills will be required.

Overview
Key Findings
- Security monitoring generates big data, but big data is only a means to an end. The end goal is improved, risk-based information security decision making based on prioritized, actionable insight derived from the data.
- One of the primary drivers of security analytics will be the need to identify when an advanced targeted attack has bypassed traditional preventative security controls and has penetrated the organization.
- Markets for security analytics platforms and for security patterns/algorithms providers will emerge.
- A new role for a "security analytics analyst" or "security data scientist" will emerge in many Type A enterprises during the next five years.
Recommendations
- Type A enterprises should plan on testing security analytics tools and approaches in 2012 with pilot deployments in 2013. Type B and Type C enterprises should look to their security vendors, including security information and event management (SIEM) providers, for this capability.
- Pressure solution providers to deliver a context-aware, risk-based view of IT, combining threat intelligence, vulnerability knowledge, compliance and business impact.
- Don't plan on a single megarepository to solve all needs, because relevant data will be derived from and stored in a number of repositories optimized for different use cases.
- Integrate with operational data repositories, and leverage operational context, such as configuration, inventory, dependency mapping and business value of the asset, to deliver prioritized, risk-based actionable insight.
What You Need to Know
The amount of data required for information security to effectively detect advanced attacks and, at the same time, support new business initiatives will grow rapidly over the next five years. This growth presents unique challenges when looking for patterns of potential risk across diverse data sources. However, "big data," in and of itself, is not our goal. Delivering risk-prioritized actionable insight is. To support the growing need for security analytics, changes in information security people, technologies, integration methods and processes will be required, including security data warehousing and analytics capabilities, and an emerging role for security data analysts within leading-edge enterprise information security organizations.
Analysis
Why Big Data? Why Now?
Gartner contends that big data creates business value by enabling organizations to uncover previously unseen patterns and to develop sharper insights about their businesses and environments, including information security. The volume, velocity, variety and complexity of information relevant to security analytics are growing rapidly (see Figure 1).

Figure 1. Source: Gartner (March 2012)
Gartner defines "big data analytics" as the practices and technology used to pursue emerging and divergent pattern detection as well as enhance the use of previously disconnected information assets. There are multiple interrelated drivers for the adoption of big data analytics approaches within information security:
- Detecting advanced targeted attacks and advanced persistent threats. Big data analytics will be needed to detect successful advanced targeted attacks (ATAs; see "Best Practices for Mitigating Advanced Persistent Threats"). ATAs are designed to bypass traditional prevention and blocking controls, such as anti-malware scanning systems and intrusion prevention systems (IPSs). Once established, they will attempt to acquire credentialed access, making them extremely difficult to detect.
- How do we detect ATAs? There are two primary ways to determine if something is "bad." If we have a good idea (model) of what something bad looks like, then we look for similarities, as anti-malware and intrusion prevention system signatures do. However, this approach doesn't work for ATAs (see Note 1). As an alternative to looking for "badness," we must have a solid idea (model) of what something good looks like (baselining), and then look for meaningful deviations from it (also referred to as anomaly detection). Such models are prone to false positives, in which a legitimate request is flagged as malicious. Detecting a successful ATA with minimal false positives will require linking and analyzing large amounts of data to find meaningful anomalous behavior. Increasingly sophisticated models of both "good" and "bad" are needed. Simply stated, better results from models require more relevant data, including additional context-related data.
- The shift to context-aware security and using context to improve security monitoring. The results of security analytics can be improved by linking additional context information. There are many sources of context-aware security-related information (see "The Future of Information Security Is Context Aware and Adaptive") that can be gathered to supplement monitoring data and improve models. This context includes not only environmental context, such as time of day and location, but also application, identity and content awareness (see Note 2). Increased context information from all layers will increase the amount and richness of the data and improve the ability of big data analytics to discern meaningful patterns (see "Effective Security Monitoring Requires Context").
- The need to deliver risk-based security intelligence. The goal of security analytics projects is improved information security decision making as a result of the analytics, not simply to gather more data. To do this effectively, we must distill down vast amounts of data into security intelligence — prioritized, actionable insight. To prioritize actions, there must be linkages to the business value of the assets and an improved understanding of the risk they represent. Specifically, there must be an ability to link vast amounts of security-related data, distill it down and present it in such a way (likely using visualization) so that the information security professionals can understand and address risk in a way that reflects business value and impact (see Figure 2).
- Delivering risk-based, prioritized insight will require information security organizations to incorporate other types of data into the analysis, such as IT operations data (see Note 3). This will require that IT look across its own organizational silos and link to relevant data to better understand and prioritize the risk in its systems.

Figure 2. Source: Gartner (March 2012)
- Increased monitoring to compensate for the loss of direct control. The shift to heavily virtualized data centers, public cloud computing and the adoption of consumer devices and applications means that, increasingly, IT doesn't own or control most elements (network, hardware, end-user device or applications) of the IT services its users consume. To compensate for this loss of direct control, IT must increase security and operational controls to provide end-to-end visibility and monitoring (see Note 4) across virtualized and cloud-based workloads, including hybrid on-premises and cloud-based computing environments.
- Observed relationships versus explicit security definitions. Security analytics will also be used to augment our visibility and knowledge as to the actual state of production systems versus what systems of record show them to be. Detailed observation of how systems are used in production will provide useful insight as to the relationships of users to data, data to systems and systems to processes, to answer questions that systems of record typically don't answer well — for example, sensitivity of the data, ownership and usage patterns. This notion of entity link and relationship analytics is already used for social network analytics and identity and access intelligence (IAI). The same approaches can be used more broadly on enterprise entities and their relationships to systems and data. To augment static, predefined security policies, relationships determined from observed behaviors will be used to support adaptive, risk-based access control.
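The baselining approach described in the list above (model what "good" looks like, then flag meaningful deviations) can be illustrated with a minimal sketch. It assumes a hypothetical history of daily login counts for one user and a simple z-score threshold. The data, the function name and the threshold are illustrative assumptions, not part of this research, and a production model would draw on far more data and context.

```python
from statistics import mean, stdev

def find_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value, z_score) for observations that deviate from the baseline.

    baseline:  historical values representing "good" behavior (hypothetical data).
    observed:  new values to compare against the baseline.
    threshold: standard deviations treated as a meaningful deviation (an assumption).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs at least two points
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, v, float("inf")) for i, v in enumerate(observed) if v != mu]
    anomalies = []
    for i, v in enumerate(observed):
        z = abs(v - mu) / sigma
        if z > threshold:
            anomalies.append((i, v, z))
    return anomalies

# Hypothetical daily login counts for one user.
baseline_logins = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
observed_logins = [11, 10, 45, 12]

flagged = find_anomalies(baseline_logins, observed_logins)
print(flagged)  # only the third observation (45 logins) is flagged
```

With the default threshold, only the 45-login day is flagged. Lowering the threshold to 1.0 also flags the 12-login day, which sits inside the baseline range. That is the false-positive trade-off the text describes: a tighter model catches more, but also flags more legitimate activity.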
Big Data Is Addressing Real Information Security Issues Today
Enterprise security providers already use big data and analytics approaches to address information security problems, taking advantage of their visibility and context across large communities:
- Threat intelligence networks. Most large security vendors and others have an extensive back-end set of worldwide databases against which correlation and analytics are performed [1]. The results of this analysis are used to identify new threats and to feed this intelligence back into their on-premises security solutions in the form of rules, signatures and reputation scores.
- Interenterprise threat correlation. Cloud-based security service providers are able to gain visibility into threats across a large population of enterprise customers as a benefit of their cloud-based models [2]. Through big data analytics across multiple enterprises and regions, these providers are able to identify interenterprise threats, including vertical and geographic-specific threats.
- Community-based reputation scoring and malware detection. Anti-malware vendors and several application control providers are compensating for the ineffectiveness of signature-based protection with executable file reputation systems that analyze and score millions of executable files in cloud-based repositories. In addition, many of these vendors link IP address and URL reputation as a part of their analytics [3]. In a few cases, vendors have shifted entirely to cloud-based big data analytics for malware detection [4].
However, big data analytics will also be needed inside of organizations as the need to understand and baseline behaviors of our internal systems becomes critical:
- Network flow and packet analysis for anomaly detection and visibility. As organizations increase monitoring of network traffic in their own data centers, this will greatly increase the amount of security-related data being analyzed. This may range from network flow analysis all the way down to network packet capture [5].
- Sensitive data monitoring and flow mapping. One of the problems with traditional content-aware data loss prevention (DLP) solutions is establishing ownership and sensitivity for data that hasn't been explicitly tagged. An emerging approach is to observe how data is being accessed (by which users, from which locations and for which types of data) and analyze this data for patterns that imply sensitivity and ownership [6].
- User activity monitoring and anomaly detection. As more detailed monitoring is performed across the entire IT stack, end-user activities are also being monitored and analyzed for anomalous behaviors [7], including privileged accounts under the control of IT.
- Identity and access intelligence. IAI enables actionable, context-specific insight, as well as the documentation, review and approval of access controls based on analytical models of identity-related data. IAI improves identity and access management (IAM) and security processes, and it assists business professionals in managing risk associated with access [8].
- Fraud detection. Fraud detection and prevention solutions are used to protect customer and enterprise information, assets, accounts, and transactions through the real-time, near-real-time, or batch analysis of activities by users, Web application logs, and other defined entities against accounts and records. Fraud detection examines users' and other defined entities' access and behavior patterns, and compares this to a profile of expected "normal" behavior [9].
- Prioritized, real-time risk dashboards. Enterprises need a way to prioritize the risk in their systems. Big data analytics will be used to combine vulnerability knowledge, topology context, business value of the IT asset, and mitigating control awareness (such as firewall and IPS signatures), as well as other forms of context, to provide an overall risk "heat map" visualization of IT systems risk [10] (see Figure 3).

Figure 3. Source: Gartner (March 2012)
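The risk dashboard item above describes combining vulnerability knowledge, topology context, business value and mitigating control awareness into a prioritized view. The following is a minimal sketch of one way such inputs might be combined. The scales (a 0 to 10 severity, a 1 to 5 business value), the multipliers, the asset names and the function names are all hypothetical choices made for illustration; this research does not prescribe a scoring formula.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    vuln_severity: float  # 0-10, assumed severity of the worst known vulnerability
    business_value: int   # 1-5, assumed rating taken from operational data
    exposed: bool         # topology context: reachable from untrusted networks
    mitigated: bool       # a compensating control (e.g., an IPS signature) is in place

def risk_score(asset: Asset) -> float:
    """Combine the inputs into one number; the multipliers are illustrative only."""
    score = asset.vuln_severity * asset.business_value
    if asset.exposed:
        score *= 1.5  # exposure raises the score
    if asset.mitigated:
        score *= 0.5  # a mitigating control lowers it
    return score

def prioritize(assets):
    """Return assets ordered from highest to lowest risk score."""
    return sorted(assets, key=risk_score, reverse=True)

# Hypothetical inventory.
assets = [
    Asset("test-server", vuln_severity=9.8, business_value=1, exposed=True, mitigated=False),
    Asset("hr-db", vuln_severity=7.5, business_value=5, exposed=False, mitigated=False),
    Asset("web-frontend", vuln_severity=9.0, business_value=3, exposed=True, mitigated=True),
]

ranked = [a.name for a in prioritize(assets)]
print(ranked)
```

In this example the asset with the most severe vulnerability (test-server) ranks last because its business value is low, while the internal HR database ranks first. This is the kind of reordering that business context is meant to produce.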
Implications for Information Security Departments and Infrastructure
The increasing amount and types of data to be analyzed for information security purposes will require changes in information security people, processes and systems:
- Significantly more data storage and processing power will be needed to support the volume, velocity, variety and complexity of context, packet, flow and event data that need to be gathered, linked and analyzed.
- There is an emerging market for security analytics platforms. Enterprise Hadoop-based infrastructure (such as that offered by Zettaset) is one option receiving attention. However, there are many alternatives. For example, RSA acquired NetWitness, in part, to gain a scalable analytics platform.
- There is a separate and emerging market for security pattern and analytics providers that will sell and differentiate on their algorithms and their ability to derive actionable insight from large datasets. Some of these, like Palantir Technologies and Red Lambda, will include their own scalable underlying data store, and others will run against the security data warehousing platforms of others, including Hadoop. In addition, established providers [11] will apply generalized big data analytics expertise to security use cases.
- There is a need to receive meaningful results from the analysis of large datasets in a matter of minutes, not hours. Security analytics platforms should provide a way to distribute and optimize the analysis using techniques such as MapReduce or HBase/Cassandra on Hadoop.
- SIEM doesn't go away and is still needed to perform critical real-time correlation of incoming normalized data streams. Further, Gartner has clients that have expanded the role of SIEM to successfully address postcapture analytics use cases against large datasets (see Figure 4).
- However, this does not mean that all SIEM systems will capably act as security data warehouses to support ad hoc query and historical analytics use cases. Consider the organization's SIEM provider as a low-friction alternative to expand into security analytics, but don't assume your SIEM will take on this role by default. Require SIEM vendors to demonstrate the capability to support very large datasets with reasonable query times and licensing models.

Figure 4. Source: Gartner (March 2012)
- Licensing models will be a factor in the evaluation of security analytics platforms. Favor security analytics vendors that don't charge based on the number of entities that data is collected from and that provide reasonable licensing models for potentially hundreds of terabytes of data.
- To simplify the ongoing administration of the security analytics platform, favor scale-out architectures where additional storage can be added without requiring reconfiguration of the analytics platform or event sources.
- When considering a pilot of the use of big data analytics in information security, partner with Web and e-commerce fraud analytics projects that may already exist in your organization, or look internally to business intelligence analysts.
- Skills for information security data analytics will be a challenge. Security data scientists are needed who have a mind-set of looking for malicious intent and security-relevant anomalies where patterns haven't been predefined and where data mining will be aided by machine learning techniques. In most cases, organizations will not have the skills internally and should look first for security analytics providers that provide a base library of algorithms and enable the enterprise to tune these for its specific use cases.
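The list above mentions MapReduce as one technique for distributing analysis across a large dataset. The sketch below shows only the shape of the map, shuffle and reduce steps, run in a single process rather than on a Hadoop cluster. The log format, the sample lines and the function names are hypothetical, and counting failed logins per source IP is just one example of a security query that fits this pattern.

```python
from collections import defaultdict

# Hypothetical log lines: timestamp, source IP, event type, user.
LOG_LINES = [
    "2012-03-01T10:00:01 10.0.0.5 LOGIN_FAILED alice",
    "2012-03-01T10:00:02 10.0.0.5 LOGIN_FAILED alice",
    "2012-03-01T10:00:03 10.0.0.9 LOGIN_OK bob",
    "2012-03-01T10:00:04 10.0.0.5 LOGIN_FAILED root",
    "2012-03-01T10:00:05 10.0.0.7 LOGIN_FAILED carol",
]

def map_phase(line):
    """Map: emit (source_ip, 1) for each failed login, nothing otherwise."""
    _timestamp, source_ip, event, _user = line.split()
    if event == "LOGIN_FAILED":
        yield source_ip, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

mapped = (pair for line in LOG_LINES for pair in map_phase(line))
failed_by_ip = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(failed_by_ip)
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, both steps can be spread across many machines. That property is what lets this style of analysis scale to very large datasets.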
Bottom Line
Big data analytics are needed to solve the next generation of information security problems and are already in use at large security providers. However, the need for increased monitoring and ATA detection will require that most enterprises implement their own security analytics capabilities as an evolution of their SIEM or as a stand-alone capability.
The amount of data analyzed by enterprise information security organizations will double every year through 2016.
By 2016, 40% of enterprises will actively analyze at least 10 terabytes of data for information security intelligence, up from less than 3% in 2011.
By 2016, 40% of Type A enterprises will create and staff a security analytics role, up from less than 1% in 2011.
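To make the doubling projection concrete, the arithmetic can be sketched as follows. The sketch assumes the doubling starts from a 2011 baseline, which the other two projections use but this one does not state explicitly, and the starting volume of 1 TB is a hypothetical figure chosen only to show the growth factor.

```python
def projected_volume(start_volume, start_year, end_year, annual_factor=2.0):
    """Compound a starting volume by annual_factor once per elapsed year."""
    years = end_year - start_year
    return start_volume * annual_factor ** years

# Hypothetical organization analyzing 1 TB in 2011 (assumed baseline year).
growth = projected_volume(1.0, 2011, 2016)
print(growth)  # five doublings: 2 ** 5 = 32 times the starting volume
```

Under these assumptions, five annual doublings mean an organization analyzes 32 times its 2011 volume by 2016.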
Evidence
[1] All of the major security platform providers (network and host) perform cloud-based analytics on massive datasets as a part of their efforts to improve detection capabilities. Examples include the large, nondedicated security vendors, such as Cisco, HP, IBM, Juniper Networks and Microsoft, as well as the lab capabilities of security-focused platform providers, such as Kaspersky, Sophos, Sourcefire, Symantec and Trend Micro.
[2] Cloud-based security-as-a-service providers, such as Qualys, WhiteHat Security and Veracode, analyze massive amounts of data as a part of their cloud-based services. By leveraging visibility across a large number of customers to establish meaningful patterns, these providers are able to improve their offerings for all subscribers.
[3] Application control/whitelisting vendors, such as Bit9, CoreTrace, Lumension Security and McAfee, are developing cloud-based file reputation databases that look across large user populations to establish relative trustworthiness of executable code. In addition, Kaspersky, McAfee, Microsoft, Sophos, Symantec and Trend Micro all track IP, URL and file reputation, and incorporate this visibility to improve their enterprise offerings.
[4] Sourcefire's FireAMP technology (acquired with Immunet) and the technology from Prevx (see "Cool Vendor in Security and Privacy, 2006"; acquired by Webroot in 2010) are examples of security providers that determine malicious intent by analyzing vast amounts of observed executable behaviors and metadata.
[5] Vendors, such as NetWitness (acquired by RSA), Global DataGuard, Narus (acquired by Boeing), Solera and Fidelus Technologies, and network behavior analysis solutions, such as Lancope, collect large amounts of network packets and/or flows to support the analysis for anomalous activities. In addition, some SIEM vendors, such as Q1 Labs (acquired by IBM) and HP ArcSight, can directly consume and analyze NetFlow data.
[6] For on-premises file activity monitoring, Imperva, STEALTHbits, Symantec and Varonis Systems offer solutions. For cloud-based data within Google Apps, CloudLock (see "Cool Vendors in Cloud Security Services, 2011") provides detailed monitoring and analytics.
[7] Example user-activity-monitoring vendors include Overtis, Raytheon (which acquired Oakley Systems), SpectorSoft and Verdasys. Securonix is an example of one vendor that uses analytics of user behaviors to identify anomalous behavior that might be indicative of a threat.
[8] IAI is used within IAM to refine an identity data model, a logging model for IAM activities and events as part of the data model, a collection and correlation engine for multiple IAM data sources, and an analytics tool for deriving intelligence from the IAM data stored and collected. There are many vendors providing domain-specific analytics of identity-related data. Example vendors include Aveksa, Brainwave, CA Technologies, CrossIdeas, Deep Identity, Oracle, SailPoint, Securonix, Tuebora, UpperVision, Whitebox Security and Veriphyr (see "Magic Quadrant for Identity and Access Governance" and "Identity and Access Intelligence: Making IAM Relevant to the Business").
[9] Example fraud detection and analytics vendors include Palantir Technologies, Silver Tail Systems, RSA and Guardian Analytics (see "Magic Quadrant for Web Fraud Detection").
[10] Example vendors providing risk-based dashboards include Core Security, FireMon, McAfee, and RedSeal Networks. In addition, HP announced in March 2012, at the U.S. RSA conference, the availability of its EnterpriseView, which it calls a "Security Intelligence and Risk Management" platform that provides a risk dashboard using information gathered across security and IT operations technology.
[11] Examples of larger vendors with more generalized business intelligence and analytics capabilities looking at bringing these capabilities to information security use cases are IBM, SAS, and EMC, as well as startups such as Centrifuge Systems.
Note 1. ATAs, by definition, evade traditional protection mechanisms because we don't know in advance what signature to look for. Further, the attacker may have gained a foothold and may be masquerading as a legitimate user with credentialed access, further complicating detection efforts.
Note 2. Monitoring and behavioral baselining data (such as network packet data, context data and application monitoring) originates from within the enterprise and potentially represents a significant amount of data (and corresponding large amount of bandwidth to send). This will make it difficult to use cloud-based service providers to process the data, and solutions and service providers that offer an on-premises collector and analysis option will be favored.
Note 3. Examples include operational intelligence and context, such as inventory information, identity information, configuration, vulnerability information, dependency mappings and business value of the asset.
Note 4. This includes performance monitoring, SLA monitoring, compliance monitoring, security monitoring and access management.

