02 Apr 2023 15 min read data_science

Detecting bot traffic

How can we differentiate bot from human traffic? What are some typical features?

This article forms part of the notes from Week 2 of the Data Science for Security and Fraud online course.

How to identify (bad) bot traffic?

With sufficient volume, it is almost always possible to detect bad bot traffic.

Aggregating traffic via dimensions

Grouping traffic by certain attributes (or dimensions) lets us summarize it: we can measure and compute properties for each group, then compare them across groups.
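As a minimal sketch of this kind of grouping, the snippet below aggregates a handful of made-up log records by one dimension and computes a few per-group properties. The `LogRecord` fields, the `aggregate_by` helper, and the chosen properties (request count, distinct paths, error rate) are illustrative assumptions, not something prescribed by the course notes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical fields; real web traffic logs will look different.
    ip: str
    user_agent: str
    path: str
    status: int


def aggregate_by(records, dimension):
    """Group records by one attribute and compute simple per-group properties."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, dimension)].append(record)

    summary = {}
    for key, items in groups.items():
        summary[key] = {
            "requests": len(items),
            "distinct_paths": len({r.path for r in items}),
            # Share of requests in the group with a 4xx/5xx status.
            "error_rate": sum(r.status >= 400 for r in items) / len(items),
        }
    return summary


# Made-up sample data using documentation-reserved IP ranges.
records = [
    LogRecord("203.0.113.7", "curl/8.0", "/login", 401),
    LogRecord("203.0.113.7", "curl/8.0", "/login", 401),
    LogRecord("203.0.113.7", "curl/8.0", "/login", 200),
    LogRecord("198.51.100.1", "Mozilla/5.0", "/", 200),
    LogRecord("198.51.100.1", "Mozilla/5.0", "/pricing", 200),
]

by_ip = aggregate_by(records, "ip")
for key, stats in by_ip.items():
    print(key, stats)
```

The same helper works for any other dimension, for example `aggregate_by(records, "user_agent")`, which makes it easy to compare groups along several dimensions.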

We will first discuss some frequently used dimensions and then move on to common traits for bots and for legitimate human traffic.

These are some common dimensions that we can derive purely from web traffic logs:

IP address

Use this with caution. While a device generally keeps the same IP address over a short session or series of transactions, this is not necessarily the case over longer periods. Each IP address may also map to one device or to thousands -- for example, a corporate network may have thousands of (very similar!) devices behind a single IP address. Even a single mobile IP address, such as one belonging to AT&T's mobile network, may represent multiple users simultaneously.

Also, watch out for IPv6 addresses: because the address space is so large, attackers can cycle through a huge number of IPv6 addresses in short order. As an example, a persistent attacker on a popular social network sent hundreds of millions of requests over several days -- using a unique IPv6 address for every 1-2 requests!
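One way to keep the IP dimension usable when an attacker rotates IPv6 addresses is to group IPv6 traffic by network prefix rather than by full address. This is a sketch under assumptions: the article does not prescribe this approach, the `ip_group_key` helper is hypothetical, and the default /64 prefix length is a commonly used convention that may not suit every network.

```python
import ipaddress
from collections import Counter


def ip_group_key(raw_ip: str, v6_prefix_len: int = 64) -> str:
    """Return a grouping key: the address itself for IPv4,
    the enclosing network prefix for IPv6 (assumed /64 by default)."""
    addr = ipaddress.ip_address(raw_ip)
    if addr.version == 4:
        return str(addr)
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{addr}/{v6_prefix_len}", strict=False)
    return str(network)


# Made-up requests: three distinct IPv6 addresses in one /64, plus one IPv4.
request_ips = [
    "2001:db8:0:1::a",
    "2001:db8:0:1::b",
    "2001:db8:0:1::c",
    "203.0.113.7",
]

requests_per_key = Counter(ip_group_key(ip) for ip in request_ips)
for key, count in requests_per_key.items():
    print(key, count)
```

Grouping by prefix carries the same caveat as grouping by IPv4 address: a single prefix can cover many unrelated, legitimate users.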

It is tempting to block obviously bad traffic from specific IP addresses, but this is a cat-and-mouse game. IP addresses are cheap, and it is trivial for attackers to switch to new ones. There is also a very real risk of accidentally blocking legitimate users who share an IP address with an attacker. This risk can be reduced somewhat by letting blocks on IP addresses expire, but choosing a good expiration period is itself challenging.
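The idea of expiring blocks can be sketched as a small in-memory blocklist with a time-to-live. The `ExpiringBlocklist` class and the 60-second value below are hypothetical choices for illustration only; as noted above, picking a good expiration period is the hard part, and a production system would likely need shared storage rather than a per-process dictionary.

```python
import time


class ExpiringBlocklist:
    """In-memory IP blocklist whose entries lapse after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._expires_at = {}

    def block(self, ip: str) -> None:
        # Blocking again refreshes the expiry time.
        self._expires_at[ip] = self._clock() + self._ttl

    def is_blocked(self, ip: str) -> bool:
        expiry = self._expires_at.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # The block has lapsed; drop it so shared users are let back in.
            del self._expires_at[ip]
            return False
        return True


# Usage with a fake clock (a one-element list we can advance by hand).
fake_now = [0.0]
blocklist = ExpiringBlocklist(ttl_seconds=60, clock=lambda: fake_now[0])
blocklist.block("203.0.113.7")
print(blocklist.is_blocked("203.0.113.7"))  # still inside the 60-second window
fake_now[0] = 61.0
print(blocklist.is_blocked("203.0.113.7"))  # window has passed
```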
