An OSINT tool to crawl a list of publicly available emails from websites

In this article I will introduce a simple tool that you can use to gather emails from one or more websites. This way, you can check which email addresses are published on your own website. This matters because published addresses can be crawled and used in spam campaigns. Along the way, I will show that obfuscating your email by replacing the @ with other characters is not enough to prevent crawling.


[Image: email scraper]
What is email crawling and why should we care?

Email crawling is the practice of downloading webpages to extract valid email addresses. The collected addresses can be organized into lists, Excel sheets, or databases, and grouped into mailing lists that are often sold on the dark web. The buyers of these lists use them to carry out spam campaigns. In some cases the goal is advertising, but unfortunately these mailing lists are often used for phishing campaigns. An attacker writes an email containing a malicious link, trying to trick you into clicking it. If you do, you download malware onto your machine; if you then execute the downloaded file, your machine is infected. The effects vary with the kind of malware, but in any case you have just opened the door to your attacker. In other cases, the malware is embedded in an attachment delivered with the email: for example, a PDF document that embeds malicious code, or a Microsoft Word document that carries a malicious VBA macro.
So, a malicious email can be the first step an attacker takes to compromise the security posture of your organization. This is why it is worth knowing which email addresses we publish on the Internet: they are part of our organization's attack surface, and we can pay special attention to them, for instance by imposing proper policies.

Why is obfuscating the email not sufficient?

The first times I saw strange addresses like name.surname(AT)university.com or name.surname[AT]company.com, I believed it was a matter of encoding, that maybe some old browsers were unable to render the @ character. That is, I thought it was a matter of usability. Instead, it was a matter of security: people wanted to prevent their emails from being crawled, so they replaced the @ character with some combination of parentheses and 'AT'. In other words, they sacrificed usability for security. And if you had to write an email to such people, you could not just copy and paste: you had to type (at least part of) the address by hand, paying attention not to introduce any typos.

We are no longer in the days of static pages and Web 1.0: pages, or parts of pages, can be generated dynamically, and crawlers are certainly smart enough to understand that you are replacing your @ with AT.
Still, you would be surprised to see how many people carry out this 'smart' substitution even today. What I want to show, in a few words, is that this is absolutely meaningless, and you should consider other ways to avoid being crawled, if this is important to you.

RFC 822 specifies the valid format of email addresses, and there are several discussion threads on Stack Overflow about parsing or validating them.
I don't want to validate emails, so I can take the easy way: for me, a valid email consists of a group of characters (lower and upper case letters, digits, and some symbols), possibly with dots. This group forms one part of the regular expression and may be repeated until we find the special symbol @. After that comes the final part of the email, the suffix, where we can still find characters from the same group, possibly with dots and possibly repeated; but this right-hand part must contain at least one dot, and after the (last) dot a short sequence of letters, like com, eu, and so on. Formalizing this idea as a regular expression, we catch the vast majority of email addresses, although we may miss some. The one important extension is to allow an alternative to the @ symbol: we say that @ is equivalent to (AT), to [AT], and in general to any occurrence of the sequence 'AT' surrounded by non-word symbols (like (), [], and so on). At this point we are able to catch the disguised emails of the smart guys, although we may introduce some false positives. These false positives are mitigated by the shape of the regular expression itself, and by a further restriction that we impose on the legal email suffixes (for instance, .com is a valid email suffix whereas .pdf is not).

All that said, we are able to catch a lot of email addresses, including the 'hidden' ones, with the following Python regular expression:

[\w\.]+(?:@|\W{1}AT\W{1})[\w\.-]+\.[a-zA-Z]{2,3}

You can run the tool against a webpage containing such a 'hidden' email address to see that it catches it!
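As a quick sanity check, here is a minimal sketch (standard library only) that compiles the expression above and runs it against a few sample strings, including an obfuscated address. The suffix whitelist and the `find_emails` helper are illustrative assumptions of mine, not part of the tool itself; they show the suffix restriction mentioned earlier.

```python
import re

# The regular expression from the article: a plain @, or an 'AT'
# surrounded by single non-word characters, e.g. (AT) or [AT].
EMAIL_RE = re.compile(r"[\w\.]+(?:@|\W{1}AT\W{1})[\w\.-]+\.[a-zA-Z]{2,3}")

# Hypothetical whitelist of acceptable suffixes to cut false positives.
VALID_SUFFIXES = {"com", "org", "net", "eu", "it", "edu"}

def find_emails(text):
    """Return matched addresses, normalized back to the @ form."""
    results = []
    for match in EMAIL_RE.findall(text):
        # Undo the (AT)/[AT] obfuscation so results are real addresses.
        normalized = re.sub(r"\W{1}AT\W{1}", "@", match)
        if normalized.rsplit(".", 1)[-1].lower() in VALID_SUFFIXES:
            results.append(normalized)
    return results

sample = ("Write to name.surname(AT)university.com or admin@example.org; "
          "see file(AT)setup.pdf.")
print(find_emails(sample))
# → ['name.surname@university.com', 'admin@example.org']
```

Note that file(AT)setup.pdf is matched by the regular expression but discarded by the suffix check, which is exactly the false-positive mitigation described above.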

Prevention and Remediation

If we want to publish our email address on our website while preventing it from being crawled, it is definitely a bad idea to perform substitutions of this kind. I was able to unveil them with one line of code and in very little time (even though my solution is far from optimal), so imagine what a determined attacker could do with more time and resources!

So if you want to publish your email but don't want it to be crawled, you could:

  • Use images.

  • Write your email address with a graphics program, export it as a PNG or JPEG image, and upload the image to the website. This makes it harder for a generic crawler to detect the address, although it is technologically easy (but computationally expensive) to run text-recognition tools that can extract the text from your image.

  • Dynamically generate the email address (FRONTEND)

  • Avoid writing your email in a static page; instead, write a JavaScript function, perhaps triggered by a user click, that generates the email address. This is more robust than a static page, but it still does not guarantee that the address will not be crawled.

  • Dynamically generate the email address (BACKEND)

  • This is the strongest solution. Implement a mechanism that obliges a user (not a bot!) to click a button and solve a CAPTCHA. Then, only if the CAPTCHA was solved by a human being, generate the email address server-side and send it back to the user's browser.
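To make the backend idea concrete, here is a rough Python sketch. Everything in it is a hypothetical illustration: `verify_captcha` stands in for a real server-side call to a CAPTCHA provider's verification API, and `reveal_email` plays the role of the endpoint handler; an actual deployment would wire this into a web framework.

```python
# Hypothetical sketch of a CAPTCHA-gated email endpoint.
CONTACT_EMAIL = "name.surname@example.com"  # never embedded in the page itself

def verify_captcha(token: str) -> bool:
    """Placeholder: a real implementation would POST the token to the
    CAPTCHA provider's verification service and inspect the response."""
    return token == "valid-demo-token"

def reveal_email(captcha_token: str) -> dict:
    """Return the address only after the CAPTCHA check passes."""
    if not verify_captcha(captcha_token):
        return {"status": 403, "body": "CAPTCHA failed"}
    return {"status": 200, "body": CONTACT_EMAIL}

print(reveal_email("valid-demo-token"))  # address revealed to a human
print(reveal_email("bot-guess"))         # a crawler gets a 403, not the address
```

The key point is that the address never appears in any HTML or JavaScript a crawler can download; it only leaves the server in response to a verified human interaction.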



Demo of the tool

After all these words, it is time to run the tool! As you can see, we have set the limit on the pages to be crawled to 10 and fed a page of The Times as the starting point of the crawling process. After downloading and parsing the web pages, the tool extracts and prints the list of gathered email addresses.
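The crawling loop itself can be sketched in a few lines of standard-library Python. This is only an illustrative sketch of the idea (breadth-first visit with a page limit), not the tool's actual implementation; the page fetcher is passed in as a callable so the logic can also be exercised offline with a dictionary of canned pages.

```python
import re
from collections import deque
from html.parser import HTMLParser

# Same expression as in the article, including the (AT)/[AT] alternative.
EMAIL_RE = re.compile(r"[\w\.]+(?:@|\W{1}AT\W{1})[\w\.-]+\.[a-zA-Z]{2,3}")

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of up to max_pages pages, returning the raw
    (possibly still obfuscated) addresses found. `fetch` is any callable
    mapping a URL to its HTML."""
    queue, visited, emails = deque([start_url]), set(), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url) or ""
        emails.update(EMAIL_RE.findall(html))
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link.startswith("http") and link not in visited:
                queue.append(link)
    return emails

# Offline demo with two canned pages standing in for real websites.
pages = {
    "http://site/a": '<a href="http://site/b">next</a> admin@site.com',
    "http://site/b": 'boss(AT)site.com',
}
print(sorted(crawl("http://site/a", pages.get)))
# → ['admin@site.com', 'boss(AT)site.com']
```

Against a live site you would pass something like `lambda url: urllib.request.urlopen(url).read().decode(errors="ignore")` as the fetcher, and then normalize the (AT) variants back to @ as shown earlier.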

More info about the tool can be found on the GitHub page.

[Animation: host discovery]