Core Functions of a Threat Intelligence Platform - Part 1: Intelligence Aggregation
Posted by Wayne Chiang
To support threat intelligence management inside a Security Operations Center (SOC), a threat intelligence platform (TIP) has core functions that it must deliver. In this series of blog posts we will take a deeper look at these functions and share some of the logic and thought that went into designing them into the ThreatQ Threat Intelligence Platform.
The first of these capabilities we will look at is the aggregation of threat intelligence from multiple sources. While on the surface this sounds like a simple task, true intelligence aggregation means more than just importing lots of indicators into a single indicator store – it must process all of the intelligence and link it together to build a centralized dataset.
Wayne Chiang, ThreatQuotient co-founder and Chief Architect, talks us through it.
Making sense of source types
The first challenge of threat intelligence management is finding a way to approach the wide variety of intelligence source types that are available. While retrieving content from a URL is relatively straightforward, the level of complexity increases once you start dealing with the different methods by which data is provided, such as REST APIs, compressed files, email inboxes, and site scraping. With the diversity in how intel is delivered, we devised a system of “connectors” to handle the various data flows.
The simplest of these connectors is a basic HTTP GET script to pull from a remote web source. These types of scripts can be quickly expanded to handle different API capabilities and query for specific types of information (date ranges, IOC types, searching). Date ranges are especially important, as users of a TIP frequently want to take a historical view of the intelligence available from a data source and not just import data from ‘today’. Depending on how the information is presented by the intelligence feed, a connector will also need to perform a number of post-processing steps, including file unarchiving or even IOC extraction/scraping from an unstructured page.
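As a rough illustration of the HTTP GET connector described above, here is a minimal Python sketch using only the standard library. The feed URL and the query parameter names (`since`, `until`, `type`) are assumptions made up for this example; every real feed defines its own API.

```python
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen


def build_feed_url(base_url, since, until, ioc_type=None):
    """Build a query URL covering a date range and an optional IOC type.

    The parameter names are hypothetical; real feeds use their own.
    """
    params = {"since": since.isoformat(), "until": until.isoformat()}
    if ioc_type:
        params["type"] = ioc_type
    return f"{base_url}?{urlencode(params)}"


def fetch_feed(url, timeout=30):
    """Perform a plain HTTP GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Request a historical window rather than only today's data.
url = build_feed_url(
    "https://feed.example.com/iocs",  # placeholder address
    since=date(2015, 1, 1),
    until=date(2015, 1, 31),
    ioc_type="ip",
)
print(url)
```

Keeping URL construction separate from the network call makes the date-range logic easy to check without touching the remote source.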
Another category of connectors is one that connects to email inboxes. This type of connector can perform a number of functions, from creating intelligence from spear phish samples that have been forwarded to a mailbox (using header analysis/indexing) to extracting IOCs from an email body or attachment. It all depends on the nature of the emails residing in the inbox. Correct processing of files found inside a mailbox depends on the intention behind the file being there.
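To make the mailbox case a little more concrete, here is a small Python sketch that parses a raw message, indexes a couple of headers, and pulls IPv4 addresses and MD5 hashes out of the plain-text body. The regular expressions are deliberately simple, the sample message is invented, and attachment handling is left out; treat it as an illustration rather than a complete email connector.

```python
import re
from email import message_from_string

# Simple patterns for illustration; they find candidates, not validated IOCs.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")


def extract_from_email(raw_message):
    """Return selected headers plus IOC candidates found in text/plain parts."""
    msg = message_from_string(raw_message)
    headers = {"from": msg.get("From"), "subject": msg.get("Subject")}

    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload:
                body_parts.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(body_parts)

    return {
        "headers": headers,
        "ips": sorted(set(IPV4_RE.findall(body))),
        "md5s": sorted({m.lower() for m in MD5_RE.findall(body)}),
    }


SAMPLE = (
    "From: analyst@example.com\n"
    "Subject: Fwd: suspicious invoice\n"
    "\n"
    "Callback seen to 203.0.113.7 and dropper hash\n"
    "D41D8CD98F00B204E9800998ECF8427E attached.\n"
)
result = extract_from_email(SAMPLE)
print(result)
```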
A key design goal for these connectors is that they can be templatized so that they can be easily reused to handle similar source types. There are potentially thousands of intel sources available, and these connectors need to be designed so that an analyst can quickly deploy new connectors for data sources that they come across.
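One way to picture a templatized connector is as a reusable configuration object that is copied and tweaked for each new source. The sketch below is only an assumption about what such a template might hold; the field names (`transport`, `parser`, `poll_minutes`) and the feed URLs are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConnectorTemplate:
    name: str
    transport: str           # e.g. "http_get" or "imap"
    parser: str              # e.g. "csv", "json", "plaintext"
    poll_minutes: int = 60   # default polling interval
    url: str = ""


# A template shared by every source that serves CSV over plain HTTP.
CSV_OVER_HTTP = ConnectorTemplate(
    name="csv-over-http", transport="http_get", parser="csv"
)


def from_template(template, name, url, **overrides):
    """Create a connector for a new source by copying a template."""
    return replace(template, name=name, url=url, **overrides)


feed_a = from_template(CSV_OVER_HTTP, "feed-a", "https://feed-a.example.com/iocs.csv")
feed_b = from_template(
    CSV_OVER_HTTP, "feed-b", "https://feed-b.example.com/iocs.csv", poll_minutes=15
)
print(feed_a)
print(feed_b)
```

With this shape, adding a new source of a known type comes down to supplying a name and a URL, plus any settings that differ from the template.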
Scheduling when data is pulled
Now that we have a toolbox of connectors, it becomes quite apparent that we will need to build a scheduling system. We don’t want these connectors to run all at once and we also need to be able to control how often they are initiated. Many times, intel sources will have restrictions limiting how often you can pull their data to prevent requests from exhausting their resources (API polling limits). Additionally, understanding how often/when the data is updated is important for resource management so that the connectors are not constantly pulling already processed intel.
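The scheduling concerns above can be sketched in a few lines of Python. In this illustration each connector carries the interval we would like to poll at and a minimum interval imposed by the source; the scheduler waits for the larger of the two. The class, its fields, and the connector names are assumptions for the example, not a description of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduledConnector:
    name: str
    interval: timedelta        # how often we would like to poll
    min_interval: timedelta    # polling limit imposed by the source
    last_run: Optional[datetime] = None

    def is_due(self, now):
        """A connector is due if it has never run or has waited long enough."""
        if self.last_run is None:
            return True
        wait = max(self.interval, self.min_interval)
        return now - self.last_run >= wait


def due_connectors(connectors, now):
    """Return the names of the connectors that should run now."""
    return [c.name for c in connectors if c.is_due(now)]


now = datetime(2015, 6, 1, 12, 0)
connectors = [
    ScheduledConnector("hourly-feed", timedelta(hours=1), timedelta(minutes=5),
                       last_run=datetime(2015, 6, 1, 10, 30)),
    ScheduledConnector("daily-report", timedelta(hours=24), timedelta(hours=1),
                       last_run=datetime(2015, 6, 1, 6, 0)),
    # We ask for 5 minutes, but the source only allows one pull per hour.
    ScheduledConnector("rate-limited", timedelta(minutes=5), timedelta(hours=1),
                       last_run=datetime(2015, 6, 1, 11, 30)),
    ScheduledConnector("new-feed", timedelta(hours=1), timedelta(minutes=5)),
]
print(due_connectors(connectors, now))
```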
Once you have a system of scheduled connectors running, the next piece of the framework is a processing system to handle the inevitable flood of threat intel streaming in. What kind of data is being created and how do we handle it? The most basic of concepts here include:
- Normalization – How do we structure IOCs into a predictable/expected structure so that the detection infrastructure can expect well-formed data? For example, should we keep port numbers at the end of an IP address, or do we include the query strings at the end of URLs? How do the different sources provide the same type of data?
- Validation – Is the IOC well formed? Many times, threat intel is manually processed, which introduces an element of error. I’m sure we’ve all seen an MD5 hash with extra characters/spaces, or even an IP address with an invalid octet.
- Whitelist – Is it really a good idea to send 18.104.22.168 to the detection/blocking grid? Or an even scarier thought, what happens if you block a /1 CIDR block?
- Intelligence linking and association – Is this new indicator or intelligence from one source related to intelligence from another or some pre-existing data? For example, has this new IOC been seen in a PDF threat report you imported into the TIP a few weeks back? This is one of the core value propositions of a TIP but is frequently misunderstood. Having all your intelligence linked together and related automatically allows an analyst to have immediate access to multiple opinions and viewpoints of any indicator from a single screen.
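The four concepts above can be strung together in a short Python sketch. It handles IPv4 addresses and MD5 hashes only, and the whitelist entries, the /8 threshold for overly broad CIDR blocks, and the source names are all assumptions chosen for the example. Linking is reduced to its simplest form: an index from each indicator to the set of sources that reported it.

```python
import ipaddress
import re
from collections import defaultdict

MD5_RE = re.compile(r"^[a-f0-9]{32}$")
# Hypothetical whitelist of ranges that must never reach the blocking grid.
WHITELIST = [ipaddress.ip_network("10.0.0.0/8"),
             ipaddress.ip_network("192.0.2.0/24")]
MIN_PREFIX_LEN = 8  # assumed policy: reject CIDR blocks broader than a /8


def normalize_ip(raw):
    """Strip whitespace and a trailing ':port' (IPv4 only in this sketch)."""
    value = raw.strip()
    if value.count(":") == 1:
        value = value.split(":")[0]
    return value


def validate_ip(value):
    """True if the value parses as an IP address (rejects invalid octets)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def validate_md5(raw):
    """True if the value is exactly 32 hex characters once trimmed."""
    return bool(MD5_RE.match(raw.strip().lower()))


def is_whitelisted(value):
    addr = ipaddress.ip_address(value)
    return any(addr in net for net in WHITELIST)


def is_safe_cidr(raw):
    """Reject overly broad blocks such as a /1."""
    return ipaddress.ip_network(raw, strict=False).prefixlen >= MIN_PREFIX_LEN


def process(raw_ip, source, index):
    """Normalize, validate, and whitelist-check an IP, then link it to its source."""
    value = normalize_ip(raw_ip)
    if not validate_ip(value):
        return "invalid"
    if is_whitelisted(value):
        return "whitelisted"
    index[value].add(source)
    return "accepted"


index = defaultdict(set)  # indicator -> set of sources that reported it
print(process(" 203.0.113.7:8080 ", "feed-a", index))
print(process("203.0.113.7", "pdf-report", index))
print(process("10.1.2.3", "feed-a", index))
print(process("192.168.1.256", "feed-b", index))
print(sorted(index["203.0.113.7"]))
```

Because both sightings normalize to the same value, the indicator ends up linked to both the feed and the earlier report, which is the pivot an analyst wants to see.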
Data Format and Schema
The last and perhaps the most important piece of the puzzle is the format of the intel. Are we processing a flat file of unstructured data, or are we dealing with a complex schema of XML/JSON? Parsing well-formed structures can be relatively easy due to existing library support, but that’s not always the case.
The harder part is understanding what to extract from the data. Careful thought and design should be applied to parsing only relevant/valuable information so that you don’t end up indexing a bunch of useless information that can’t be leveraged. A TIP’s internal schema architecture should focus on a strategy to link key pieces of data together such that security and intelligence analysts can quickly jump to what they’re looking for and pivot to related intel. Even once you’ve built a fancy schema, consideration should be given to how the data is presented to the end user, whether that is an actual person or a security tool. There needs to be a fine balance of complexity and simplicity to make the system cohesive and usable.
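As a small illustration of parsing only the relevant fields from a structured feed, the Python sketch below keeps a fixed set of keys from each JSON record and skips records that carry no indicator. The field names and the sample feed are invented for the example; a real feed's schema will differ.

```python
import json

# Hypothetical set of fields worth indexing; everything else is dropped.
WANTED_FIELDS = ("indicator", "type", "first_seen")


def extract_relevant(feed_json):
    """Keep only the wanted fields and skip records with no indicator."""
    slim = []
    for record in json.loads(feed_json):
        if "indicator" not in record:
            continue
        slim.append({k: record[k] for k in WANTED_FIELDS if k in record})
    return slim


SAMPLE_FEED = json.dumps([
    {"indicator": "203.0.113.7", "type": "ip", "first_seen": "2015-06-01",
     "vendor_blob": {"x": 1}},
    {"type": "ip", "note": "record with no indicator value"},
    {"indicator": "bad.example.com", "type": "fqdn", "ui_color": "red"},
])
slim = extract_relevant(SAMPLE_FEED)
print(slim)
```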
Icon Credit: Gregor Črešnar, Kanda Euatham, and Nicole Portantiere from The Noun Project.