Detecting bot traffic
How can we differentiate bot from human traffic? What are some typical features?
How to identify (bad) bot traffic?
It is always possible to detect bad bot traffic, as long as there is sufficient volume.
Aggregating traffic via dimensions
Grouping traffic together based on certain attributes (or dimensions) allows us to abstract the traffic so that we can measure and compute certain properties for each group, and compare across groups.
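As a rough illustration of this kind of grouping, the sketch below aggregates a handful of log records by one dimension and computes a few per-group properties. The record fields (ip, user_agent, path, status), the sample data, and the chosen metrics are all assumptions made for the example, not a prescribed log schema or feature set.

```python
from collections import defaultdict

# Hypothetical, hand-written log records; field names are assumptions.
LOGS = [
    {"ip": "203.0.113.7", "user_agent": "curl/8.0", "path": "/login", "status": 401},
    {"ip": "203.0.113.7", "user_agent": "curl/8.0", "path": "/login", "status": 401},
    {"ip": "203.0.113.7", "user_agent": "curl/8.0", "path": "/login", "status": 200},
    {"ip": "198.51.100.4", "user_agent": "Mozilla/5.0", "path": "/", "status": 200},
    {"ip": "198.51.100.4", "user_agent": "Mozilla/5.0", "path": "/products", "status": 200},
]


def aggregate(records, dimension):
    """Group records by one dimension and compute simple per-group properties."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec)

    summary = {}
    for key, recs in groups.items():
        errors = sum(1 for r in recs if r["status"] >= 400)
        summary[key] = {
            "requests": len(recs),
            "distinct_paths": len({r["path"] for r in recs}),
            "error_rate": errors / len(recs),
        }
    return summary


if __name__ == "__main__":
    # Compare groups along one dimension; swap "ip" for "user_agent" to regroup.
    for key, stats in aggregate(LOGS, "ip").items():
        print(key, stats)
```

The same function can be called with any other field name as the dimension, which makes it easy to compare how the groups look when the traffic is sliced in different ways.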
We will first discuss some frequently used dimensions and then move on to common traits for bots and for legitimate human traffic.
These are some common dimensions that we can derive purely from web traffic logs:
IP address
Use this with caution. While a device generally keeps the same IP address over a short session or series of transactions, this is not necessarily true over longer periods. Each IP address may also map to one device or to thousands: a corporate network, for example, may have thousands of (very similar!) devices behind a single IP address. Even a single mobile IP address, such as one belonging to AT&T's mobile network, may represent multiple users simultaneously. Also, watch out for IPv6 addresses: because the address space is so large, attackers can cycle through an enormous number of IPv6 addresses in short order. As an example, a persistent attacker on a popular social network sent hundreds of millions of requests over several days, using a unique IPv6 address for every 1-2 requests!
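The text above does not prescribe a remedy for IPv6 address rotation. One possible approach, offered here only as an assumption, is to use a network prefix rather than the full address as the grouping key for IPv6 traffic. The sketch below uses Python's standard ipaddress module; the function name ip_dimension_key and the default /64 prefix length are illustrative choices, and the right prefix length depends on how the networks in question allocate addresses.

```python
import ipaddress
from collections import Counter


def ip_dimension_key(address, v6_prefix_len=64):
    """Return a grouping key: the full address for IPv4, a prefix for IPv6.

    The /64 default is an assumption for illustration, not a universal rule.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 6:
        # strict=False masks off the host bits instead of raising an error.
        network = ipaddress.ip_network(f"{ip}/{v6_prefix_len}", strict=False)
        return str(network)
    return str(ip)


if __name__ == "__main__":
    # Hypothetical requests: three rotating IPv6 addresses and one IPv4 address.
    requests = [
        "2001:db8:1:2::abcd",
        "2001:db8:1:2:ffff::1",
        "2001:db8:1:2:1234:5678:9abc:def0",
        "192.0.2.10",
    ]
    counts = Counter(ip_dimension_key(addr) for addr in requests)
    for key, count in counts.items():
        print(key, count)
```

Grouping this way trades precision for robustness: rotating addresses within one prefix collapse into a single group, but so do any legitimate users who share that prefix.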
It is tempting to block obviously bad traffic from specific IP addresses, but this is a cat-and-mouse game. IP addresses are cheap, and it is trivial for attackers to switch to new ones. There is also a very real risk of accidentally blocking legitimate users who share an IP address with an attacker. This risk can be reduced somewhat by letting blocks on IP addresses expire, but choosing the optimal expiration time is also challenging.
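A minimal sketch of expiring blocks might look like the following. The class name ExpiringBlocklist, the single fixed time-to-live, and the injectable clock are assumptions made for illustration. The sketch does not address the harder question raised above of how long a block should last.

```python
import time


class ExpiringBlocklist:
    """In-memory IP blocklist whose entries lapse after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._expires_at = {}        # ip -> time at which the block lapses

    def block(self, ip):
        """Block an IP, or restart the timer if it is already blocked."""
        self._expires_at[ip] = self._clock() + self._ttl

    def is_blocked(self, ip):
        """Return True while the block is active; drop it once expired."""
        expiry = self._expires_at.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires_at[ip]
            return False
        return True


if __name__ == "__main__":
    blocklist = ExpiringBlocklist(ttl_seconds=600)  # 10 minutes, arbitrary
    blocklist.block("203.0.113.7")
    print(blocklist.is_blocked("203.0.113.7"))
    print(blocklist.is_blocked("198.51.100.4"))
```

A short time-to-live limits the damage when a legitimate user shares the blocked address, at the cost of letting a persistent attacker back in sooner.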