What is email crawling and why should we care?
Email crawling is the practice of downloading web pages to extract valid email addresses. The collected addresses can be organized in lists or Excel sheets, or stored in databases.
They are grouped into mailing lists that are usually sold on the dark web. Whoever buys these lists uses them to carry out spam campaigns. In some cases the goal of these
campaigns is advertising, but unfortunately in many cases the lists are used for phishing. An attacker writes an email containing a malicious link and tries to trick you
into clicking it. If you do, you download malware onto your machine, and if you also execute the downloaded file, your machine is infected. The effects vary with the kind of malware,
but in any case you have just opened the door to your attacker. In other cases, the malware is embedded in an attachment delivered with the email, for example a PDF document
with malicious code inside, or a Microsoft Word document that carries a malicious VBA macro.
So a malicious email can be the first step an attacker takes to compromise the security posture of your organization. This is why it is worth knowing which email addresses we
publish on the Internet: they are part of the attack surface of our organization, and we can pay special attention to them, for example by imposing proper policies.
Why is obfuscating the email not sufficient?
The first times I saw strange email addresses like name.surname(AT)university.com or name.surname[AT]company.com, I believed it was a matter of encoding, that maybe some old browsers were not
able to render the @ character. That is, I thought it was about usability. Instead, it was a matter of security: people wanted to keep their addresses from being crawled, so they
replaced the @ character with some combination of brackets and 'AT'. In other words, they sacrificed usability for security. If you had to write to such people, you could not
just copy and paste the address: you had to type at least part of it by hand, taking care not to introduce any typos.
Now, we are no longer in the days of static pages and Web 1.0: pages, or parts of pages, can be generated dynamically, and crawlers are surely smart enough to understand that you are replacing
your @ with AT.
Still, you would be surprised to see how many people carry out this smart substitution even today. What I want to show you, in a few words, is that this is absolutely meaningless, and that you should
think of other ways to avoid being crawled, if this is important for you.
RFC 822 (since superseded by RFC 5322) specifies the valid format of email addresses, and there are several discussion threads on Stack Overflow about parsing or validating them.
I don't want to validate addresses, so I can take the easy way: for me a valid address is made of characters (lower- and upper-case letters, digits and some symbols), possibly with dots. This
forms a group of the regular expression. The group may be repeated several times, until we find the special symbol @. After that comes the final part of the address, the suffix, where we can still
find characters from the group identified earlier, possibly with dots, possibly both repeated. This right-hand part must contain at least one dot, and after the
(last) dot a short sequence of letters, like com, eu and so on. Formalizing this idea with a regular expression catches the vast majority of email addresses, although it can miss some.
The only other important thing is to give an alternative to the @ symbol: we say that @ is equivalent to (AT), to [AT], and in general to anything where the
sequence of characters 'AT', in upper or lower case, is surrounded by non-alphabetic symbols (like (), [], and so on). At this point we are able to catch the disguised addresses of the
smart guys, but we may introduce some false positives. These false positives are mitigated by the shape of the regular expression itself, and by a further restriction that we will impose on the legal
suffixes (for instance, .com is a valid email suffix whereas .pdf is not).
All that said, we are able to catch a lot of email addresses, including the 'hidden' ones, with the following Python regular expression:
[\w\.]+(?:@|\W{1}AT\W{1})[\w\.-]+\.[a-zA-Z]{2,3}
You can run the tool against a web page containing such a 'hidden' email address to see that it is able to catch it!
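To make this concrete, here is a minimal sketch of how the expression above could be applied to a piece of text. It is not the tool's actual code: the function name find_emails, the sample string and the tiny ALLOWED_SUFFIXES set are all made up for illustration. Note that the pattern is compiled here without flags, so only the upper-case AT is recognised; passing re.IGNORECASE would also accept a lower-case at, at the cost of more false positives.

```python
import re

# The expression from the text, unchanged.
EMAIL_RE = re.compile(r"[\w\.]+(?:@|\W{1}AT\W{1})[\w\.-]+\.[a-zA-Z]{2,3}")

# Hypothetical allow-list of suffixes, kept tiny for the example.
# A real list would be much longer.
ALLOWED_SUFFIXES = {"com", "org", "net", "eu", "it"}


def find_emails(text):
    """Return candidate addresses found in text, normalised to use '@'."""
    found = []
    for match in EMAIL_RE.finditer(text):
        candidate = match.group(0)
        # Turn an obfuscated separator such as (AT) or [AT] back into '@'.
        candidate = re.sub(r"\WAT\W", "@", candidate, count=1)
        # Keep only candidates whose last label is an accepted suffix.
        suffix = candidate.rsplit(".", 1)[-1].lower()
        if suffix in ALLOWED_SUFFIXES:
            found.append(candidate)
    return found


sample = (
    "Write to name.surname(AT)university.com or name.surname[AT]company.com, "
    "or to plain@example.org. The report is in draft@v2.pdf."
)
print(find_emails(sample))
# ['name.surname@university.com', 'name.surname@company.com', 'plain@example.org']
```

The two disguised addresses are recovered just like the plain one, while draft@v2.pdf matches the expression but is dropped by the suffix check.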
Prevention and Remediation
If we want to publish our email address on our website and keep it from being crawled, performing substitutions of any kind is definitely a bad idea. I was able to unveil them with a single line of code
and in very little time (although the solution is far from optimal), so imagine what a determined attacker could do with more time and resources!
So if you want to publish your email but don't want it to be crawled, you could: