Threat Hunting Security Automation with Python, Selenium and Cisco Talos Blog

Introduction

In this article we will see how to significantly streamline a threat hunting process by implementing a custom, yet simple, security automation script. To do that, we will use Python, a powerful scripting language that runs on all platforms; Selenium, a Python package for browser automation; and the Cisco Talos blog series, from which we will extract fresh threat hunting information, namely the latest Indicators of Compromise (IOCs).

The peculiarity of this approach is that we use Selenium to consume, programmatically, information that was created for humans (the Talos blog), so the script we build lets you automate the gathering of threat hunting information. Automation is the key part of this article: we show how to set up a script that extracts IOC information from a blog post and can then ingest it into a cybersecurity platform (an XDR system, a SIEM, and so on) without user interaction. You could, for instance, run this script, extract the IOCs, and push them to your security platform via REST API, so that your system stays updated with the latest indicators from the Talos blog without you having to copy them by hand.

Obviously, you could also schedule the script to run on a weekly basis, which would remove the burden from your side entirely, but for now we will stick with the basics: how to programmatically extract valuable IOC information from a Cisco Talos blog post that is designed to be consumed by a human being. Indeed, all the content is loaded via JavaScript, so if you try to curl the page you will get no IOCs back, just a bare, useless HTML shell.
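
As a quick sanity check of that claim, here is a minimal sketch (assuming the third-party requests package is installed; the URL is a placeholder, not a real post) that fetches the page statically:

import requests

# Hypothetical URL of a Talos post, just for illustration.
url = "https://blog.talosintelligence.com/some-threat-roundup-post/"

# A plain HTTP GET only returns the initial HTML shell; the IOC
# sections are injected later by JavaScript, so they are not here.
html = requests.get(url, timeout=10).text
print(len(html))          # some markup comes back...
print("Hashes" in html)   # ...but the IOC sections are missing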

You can use pip to install Selenium: open a shell and type

pip install selenium


We will need the help of a browser to carry out the automation. For us, this basically boils down to loading a dynamic webpage, letting the browser execute its JavaScript, and only then, once the page has fully loaded, scraping content from it. We will use Chrome as the browser. To link the Chrome execution to our Python script we need ChromeDriver, the WebDriver implementation for Chrome, which you can download from the official ChromeDriver site (recent Selenium releases, 4.6 and later, can also fetch a matching driver automatically via Selenium Manager). The driver executable must be findable: the easiest way is to copy it into the same folder as your script or somewhere on your PATH, or you can point to its location explicitly:

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('C:/path/to/chromedriver.exe'))
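
If you plan to schedule the script later, you may also want Chrome to run without opening a window. A small sketch using Chrome's modern headless mode:

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')   # run Chrome without a visible window
driver = webdriver.Chrome(options=options)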

Now let's start writing the script. Suppose we want to programmatically extract the IOC hashes from a Talos blog post.
We need to import the webdriver module and the WebDriverWait class, to open a browser and give it some time to load the page before we start scraping data.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

Now we will use the webdriver.Chrome class to drive the browser and the WebDriverWait class to wait.
Since we want to scrape data only after the page has fully loaded, and we have noticed that, once loaded, the HTML contains a div with id="page_wrapper", we insert the following lines (the By locator class comes from selenium.webdriver.common.by) to tell Selenium to return only after that element appears:

from selenium.webdriver.common.by import By

myElem = WebDriverWait(browser, timeout).until(lambda x: x.find_element(By.ID, "page_wrapper"))
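
Equivalently, Selenium ships ready-made wait conditions in its expected_conditions module; a slightly more idiomatic sketch of the same wait:

from selenium.webdriver.support import expected_conditions as EC

# Wait (up to timeout seconds) until the element with id="page_wrapper"
# is present in the DOM, then return it.
myElem = WebDriverWait(browser, timeout).until(
    EC.presence_of_element_located((By.ID, "page_wrapper"))
)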

At this point the dynamically generated HTML has been loaded and we can scrape the web content.
For this goal we could use dedicated modules like Scrapy, but since the task here is quite simple we do it by hand: the inner function parse_hashes does exactly this, and returns the list of IOC hashes present in the post:

def parse_hashes():
    all_hashes = []
    # Each IOC section of the post is introduced by the word "Hashes",
    # and the hashes themselves sit inside the first <code> block after it.
    text_divided_by_hashes = blog_text.split("Hashes")
    for segment in text_divided_by_hashes[1:]:
        code_block = segment.split('</code>')[0].split('<code>')[1]
        hashes = [_hash.strip() for _hash in code_block.split('\n')]
        all_hashes.extend(_hash for _hash in hashes if _hash != '')
    return all_hashes
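
To see the parsing logic in action, here is a quick standalone test on an invented, stripped-down fragment of rendered HTML (the real markup is richer; if you lift parse_hashes to module level, blog_text resolves as a global):

# Invented fragment, just for illustration.
blog_text = """Hashes
<code>
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
</code>"""

print(parse_hashes())   # prints the two dummy hashes as a list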

So at this point we have programmatically retrieved the list of hashes: we can simply feed them to our security service via API, and in this way we have created an automation that updates the security feeds without user intervention.
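
As a rough sketch of that last step (the endpoint, token, and payload shape below are hypothetical; a real XDR or SIEM defines its own ingestion API), assuming the requests package:

import requests

# Hypothetical endpoint and token: replace with your platform's
# actual IOC-ingestion API and credentials.
API_URL = "https://siem.example.com/api/v1/indicators"
API_TOKEN = "YOUR-API-TOKEN"

def push_hashes(hashes):
    # Submit each file hash as an indicator of compromise.
    for _hash in hashes:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={"type": "file_hash", "value": _hash, "source": "Talos blog"},
            timeout=10,
        )
        response.raise_for_status()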


extract_ioc.py

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def parse_talos_webpage(url, parse_hashes=False):
    # The inner helper must not share the name of the parse_hashes
    # flag, otherwise the def statement would shadow it.
    def _parse_hashes():
        all_hashes = []
        # Each IOC section is introduced by the word "Hashes", with the
        # hashes inside the first <code> block that follows it.
        text_divided_by_hashes = blog_text.split("Hashes")
        for segment in text_divided_by_hashes[1:]:
            code_block = segment.split('</code>')[0].split('<code>')[1]
            hashes = [_hash.strip() for _hash in code_block.split('\n')]
            all_hashes.extend(_hash for _hash in hashes if _hash != '')
        return all_hashes

    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    browser = webdriver.Chrome(options=options)
    browser.get(url)
    timeout = 20  # seconds
    try:
        # Block until the dynamically generated content is in the DOM.
        WebDriverWait(browser, timeout).until(
            lambda x: x.find_element(By.ID, "page_wrapper"))
        blog_text = browser.find_element(By.XPATH, '//html').get_attribute('innerHTML')
        hashes = []
        if parse_hashes:
            hashes = _parse_hashes()
        return {'hashes': hashes}
    except TimeoutException:
        print("Loading took too much time!")
        return {'hashes': []}
    finally:
        browser.quit()

url = input("Insert the url of the talos webpage whose IOCs you want to parse \n\n")
parsed_data = parse_talos_webpage(url, parse_hashes=True)
print(parsed_data['hashes'])

Indeed, we can also run this script periodically, once per week: although the URL changes from week to week, there are regularities in the Talos blog that can be exploited to identify which new article is suitable for this process.
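
For instance, on a Linux box you could schedule the run with cron (a sketch; the paths are placeholders, and you would still need logic to pick the current week's URL instead of the interactive prompt):

# m h dom mon dow  command  -- run every Monday at 09:00
0 9 * * 1  /usr/bin/python3 /path/to/extract_ioc.py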

That's enough for now. I encourage you to try this script; if you want, you can extend it to extract other IOC indicators, like domains or IP addresses, by exploiting the structure we have already developed, as sketched below.
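
As a hint of how that extension could look, here is a sketch that generalizes the split-based parsing to an arbitrary section title (the titles "IP Addresses" and "Domain Names" are assumptions: check the headings used in the actual post you are parsing):

def parse_section(blog_text, section_title):
    # Generalization of _parse_hashes: collect the lines of every
    # <code> block that follows an occurrence of section_title.
    values = []
    for segment in blog_text.split(section_title)[1:]:
        if '<code>' not in segment:
            continue  # no IOC block after this occurrence
        code_block = segment.split('</code>')[0].split('<code>')[1]
        values.extend(v.strip() for v in code_block.split('\n') if v.strip())
    return values

# Hypothetical section titles: verify them against the real page.
ips = parse_section(blog_text, "IP Addresses")
domains = parse_section(blog_text, "Domain Names")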

In a future technical article, we will see how this information can be fed automatically to a security system (an XDR, a SIEM, and so on) to improve and update its feeds without user intervention.

Demo of the tool

As usual, below you can find a demo of the tool. I chose one of the 'Threat Roundup' articles from the Talos blog, which summarize the threats observed during the past week: you can feed the tool any article of this kind.

More info about the tool can be found on the GitHub page.

(animated demo of the tool)