Bot Traffic Detection Method Teases Apart Real and Fake Website Traffic

Early this year, our client was in a panic. Their direct web traffic dropped more than 35 percent over the course of a single month—and their board wanted answers. Yesterday. The marketing team was under intense pressure to explain the dip in traffic and its impact on the company’s sales pipeline.

We rolled up our sleeves and began a deep dive into the analytics data. It wasn’t long before bot traffic bubbled up as the likely cause of the mysterious and alarming drop in website traffic, and our investigation quickly pivoted into a search for a bot traffic detection tool.

We needed a way to distinguish human users from zombie users, and along the way our search revealed insights that were at once exciting, confusing, and frightening.

More importantly, our search prompted us to develop a new bot traffic detection tool to answer our client’s pressing questions.

What follows is a summary of our investigation and that new tool.

Uncovering oddities in the website traffic data

We approached the challenge like detectives evaluating a crime scene. First, we laid out all the facts:

  • From Jan 8 to Jan 28, we saw a decrease of 5,579 sessions (out of 23,902)
  • The decline in traffic was largely the result of a drop in direct traffic (users manually entering the URL into the address bar or using bookmarked links)
  • The client’s site had an unreasonably high rate of direct traffic (could that many people really have been typing in the URL directly?)

After spending hours looking for patterns in the traffic, we found our first smoking gun.

We noticed that almost 14 percent of direct traffic hits on the client’s website in the first quarter of 2017 came from a suspiciously old version of Flash — Flash 11.5 r502.

Why so weird?

Google’s Chrome browser automatically updates both itself and its built-in Flash player, so an up-to-date Chrome should never be paired with a Flash player that old. Many of these visits were reporting Chrome/Flash combinations that were never actually published together.

Analytics was telling us that an old version of Chrome, one used by less than 0.01 percent of Internet users worldwide, was bringing in almost 6 percent of the hits to our client’s website.

This disparity was another instant red flag.

Perhaps these out-of-date versions of Flash and Chrome contained vulnerabilities that impersonator bots could exploit?

Testing our hunch: Impersonator bots were at fault

We developed a hypothesis that was later confirmed by our research and experiments: the sudden decrease in traffic to our client’s site was not due to people suddenly losing interest in their product, but rather to a sudden drop in visits from website traffic generator bots.

What are fake traffic bots, exactly? They’re malware tools that disguise themselves as human users to avoid detection by cybersecurity tools.

To test this hypothesis, we needed a way to distinguish traffic generated by real people from traffic driven by fake traffic bots.

Here’s how we approached the problem:

We started by filtering out all users of the outdated Flash and Chrome versions that were generating suspiciously high traffic volume. This gave us a rough ballpark estimate of how much traffic the traffic generator bots could account for: running this simple filter showed us that up to 21.31 percent of the client’s hits in Q1 were potentially fake website traffic.
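If you want to try a similar first pass on your own exported data, here is a minimal sketch of the idea in JavaScript. The version lists, the shape of the session records, and the helper names are hypothetical placeholders, not the exact filter we applied in Google Analytics.

```javascript
// Minimal sketch of a legacy-signature filter. Version lists and field names
// are hypothetical placeholders, not the exact values we filtered on.
var SUSPICIOUS_FLASH_VERSIONS = ['11.5 r502'];      // version observed in our data
var SUSPICIOUS_BROWSER_PATTERNS = [/Chrome 2\d\b/]; // assumption: an example legacy Chrome range

function looksLikeLegacySignature(session) {
  var flashMatch = SUSPICIOUS_FLASH_VERSIONS.indexOf(session.flashVersion) !== -1;
  var browserMatch = SUSPICIOUS_BROWSER_PATTERNS.some(function (re) {
    return re.test(session.browser + ' ' + session.browserVersion);
  });
  return flashMatch || browserMatch;
}

// Example: estimate what share of sessions the filter would flag as
// potentially fake traffic, e.g. suspectedFakeShare(exportedSessions).
function suspectedFakeShare(sessions) {
  var flagged = sessions.filter(looksLikeLegacySignature).length;
  return sessions.length > 0 ? flagged / sessions.length : 0;
}
```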

We had been telling our clients for months, even years, that they were likely looking at inflated web traffic numbers. This time, we were able to deliver good news supported by evidence.

When the potentially fake website traffic from outdated browsers was removed, the remaining traffic showed an increase.

If our hypothesis was correct, this meant that real website traffic from human users was actually on the rise.

To help the client avoid future crises caused by traffic generator bot spikes or declines, we needed to give them a better way to distinguish real website traffic from fake. Filtering on old technology signatures was a start, but it could also have screened out plenty of real people who just happened to be using old technology (*Cough, Grandma. Cough*).

There had to be a way to evaluate the type of activity each user engaged in.

After many more hours of research and testing, we found what we were looking for: an activity signature. Or, more accurately, an inactivity signature.

When people visit a web page, they tend to move the mouse around, scroll the page, and click a few links. At the very least, people will usually move the mouse to exit out of the page.

In Q1, our client had tens of thousands of visits from users who showed absolutely no interaction. There was no mouse movement, no clicking, no keypresses, no touch screen events, and no scrolling. But somehow, some of these visitors managed to load multiple pages within the site.

Obviously, it’s highly unlikely that a real person would behave this way. Even someone who walked away from the computer to get coffee immediately after opening a page would likely register at least one mouse movement when they returned.

This kind of activity pointed to one answer: the bots’ scripts opened a series of web pages, sat on them for a predetermined amount of time (usually 30 or 60 seconds), and then closed them without any interaction.

Our answer to impersonator bots

Using Google Tag Manager and JavaScript, we built a tool that checks user activity at regular intervals, labels inactive users as traffic generator bots, and lets us filter them out. After 15 seconds, 30 seconds, 90 seconds, and so on, the JavaScript sends a report to Google Analytics documenting the number of interactions the user has had with the page.

Users that show zero interaction get tagged as traffic generator bots. Even one mouse movement classifies the user as a human under this system.
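For the curious, here is a minimal sketch of the approach, assuming Google Tag Manager is already installed on the page. The event name, the checkpoint schedule, and the field names are illustrative rather than our exact production code; a GTM trigger listening for the custom event would forward the interaction count to Google Analytics.

```javascript
// Minimal sketch: count user interactions and report the running total to
// Google Tag Manager at fixed checkpoints. Names and timings are illustrative.
(function () {
  window.dataLayer = window.dataLayer || [];

  var interactions = 0;
  var interactionEvents = ['mousemove', 'click', 'keydown', 'touchstart', 'scroll'];

  // Any of these events counts as evidence of a real, active human.
  interactionEvents.forEach(function (name) {
    document.addEventListener(name, function () {
      interactions += 1;
    }, { passive: true });
  });

  // At each checkpoint, push the interaction count into the dataLayer. A GTM
  // trigger on the 'interaction_check' event can forward it to Google
  // Analytics; sessions that only ever report zero can then be segmented out
  // as suspected traffic generator bots.
  [15, 30, 90].forEach(function (seconds) {
    setTimeout(function () {
      window.dataLayer.push({
        event: 'interaction_check',
        secondsOnPage: seconds,
        interactionCount: interactions
      });
    }, seconds * 1000);
  });
})();
```

Because these reports fire on a timer rather than from anything the visitor does, the corresponding Google Analytics event tag should be set as a non-interaction hit so the check itself doesn’t distort bounce rate.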

This particular filter can’t be used to evaluate bot website traffic from the past, but it can be used on all current and future traffic. So we implemented the code to track user interactions in April, and reevaluated the traffic a month later.

When we filtered our client’s traffic numbers using this more sophisticated inactivity signature, we caught even more bot traffic.

From April 18, 2017 to May 18, 2017:

  • 14,347 sessions were flagged as fake traffic using the old filters for removing old Flash and legacy browser versions
  • 16,791 sessions were flagged as non-interacting users, some of which visited multiple pages and demonstrated a median session duration of 39 seconds and a bounce rate of 0%

We now had a partial answer: we could explain the sudden drop in their traffic.

But we didn’t have a good answer for why.

Why do these traffic generator bots exist?

Why are these impersonator bots visiting websites at all, especially when they arrive in numbers too low to really impact your bottom line?

It was disappointing, but not at all surprising to learn that 20 percent (or more) of this global company’s direct website traffic was bot traffic. It wasn’t surprising, because it fit with the global trend.

In 2016, 51.8 percent of all web traffic came from bots, according to Incapsula, and at least 28.9 percent of all traffic came from confirmed “bad bots.”

But most of the time, bad bots do bad things. They overload websites with bot traffic so that legitimate users can’t use them (you may know this as a DDoS attack), or they crawl websites looking for vulnerabilities in order to break into secure databases. The bad bots on our client’s website? They don’t appear to do anything, except screw up our analytics.

Unfortunately, we have to admit that we still don’t have a great answer to the why question. Perhaps this bot traffic was:

  • Coming from a botnet host performing routine performance checks in an effort to see how many websites its network was able to reach
  • Probing for vulnerabilities to create a list of exploitable websites to sell on the dark web
  • A subtle form of corporate subterfuge

Maybe there was no motive at all.

The important thing to note is that this can and does happen to everyone.

Our client may be a well-known service provider in its field, but size does not seem to be a determining factor in whether a website will attract fake bot traffic. For reference, we also evaluated the traffic of a small, local marketing agency and found that at least 7.5 percent of their direct traffic came from non-interacting users.

We don’t believe these are targeted attacks. It now seems to be just another fact of life on the Internet.

Why should you care about website traffic generator bots?

In the grand scheme of cybersecurity concerns demanding your attention, this one does fall below major threats to your business, like ransomware infections and true DDoS attacks.

But bot traffic can still impact your bottom line.

The extra traffic does incur incremental increases in hosting costs, but for most companies, their traffic is low enough that a 20 percent boost won’t have a huge impact on their operating expenses.

Of course, there’s also the concern that if your site isn’t secure, these bots might stumble upon your vulnerabilities and sell that intel on the dark web, leaving you open to an attack.

But the real, tangible impact is on your marketing strategy.

As we saw with our client, the marketing team came under a lot of heat from their board to explain the sudden drop in traffic. That drop turned out to be a fluke of analytics, but without the research we did and the filtering tools we came up with, the marketing team might still be scrambling to answer the board’s questions.

This sort of thing can become very expensive for a company when they have to spend labor-hours investigating traffic drops and developing solutions for problems they don’t fully understand.

It can also threaten the job security of those on the marketing team.

At the end of the day, to plan and execute successful digital marketing strategies, you need to have an accurate read on your website analytics. Letting impersonator bots run wild on your site distorts that critical data.

Why does Google allow this kind of bot traffic?

We wish we had a better answer to this question as well. Google, in all its wisdom, still doesn’t seem to know how to filter these bots out of its analytics. That may surprise you if you know that Google does filter out the “good” bots (according to Incapsula, good bots make up 22.9 percent of all web traffic). But these impersonator bots do a very good job of mimicking human behavior online, and a cursory reading of your Google Analytics reports wouldn’t surface any suspicious patterns.

Many cybersecurity service providers offer DDoS attack protection. CloudFlare, Akamai, and others can scan your website’s traffic for suspicious patterns, but these services rarely pick up the kind of bot traffic activity we’re concerned about.

These bots elude detection because they don’t display the kind of abusive behavior that DDoS protection is programmed to stop.

They’re not coming in at a high enough volume to shut down your site, and they don’t appear to do anything else malicious.

Since this fake website traffic appears to come from real (likely compromised) people’s computers, blocking traffic from every computer with an old technology signature is unwise, because it would inevitably exclude some human users.

At this point, there doesn’t seem to be a practical way to prevent impersonator bots from visiting your site and eating up your server resources. You could add a CAPTCHA to verify human users, but this would have a seriously negative impact on your user experience. You might also be able to block or limit traffic from IP addresses that have been identified as compromised, but this would make only a small dent in the problem: these bots tend not to reuse IP addresses, so blocking them one by one is futile.

What we learned from sleuthing around in our client’s Google Analytics

After we came back to our client and showed them their real website traffic had not suddenly fallen off a cliff, and was instead on the rise, we shared a few other takeaways.

First, this investigation underscored for our clients that traffic alone should never be used as a key performance indicator. Traffic generator bots are just one of many factors that make traffic an inaccurate indication of the success of your marketing efforts. Whether it’s your website or your social media efforts, performance should be measured by conversions.

Viral levels of traffic are useless to you without conversions.

Second, we should all be more committed than ever to deeply analyzing our data. Without the sudden traffic slump drawing our attention, we wouldn’t have looked into this problem. But it still would’ve been there.

A curious mind and a keen eye for unusual patterns will help you uncover suspicious activity before it becomes a major problem for your website or your brand as a whole.

And third, we realized an unexpected benefit from implementing user interaction event tracking. Yes, it allowed us to separate real users from zombie bot traffic users, but it also gave us a whole new set of performance data to track. By capturing scrolling and mouse movements, we can see whether people are actively reading our content or just skimming through it in search of a specific answer or link.
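As a rough example of the kind of engagement data this unlocks, here is a hedged sketch of scroll-depth tracking. The milestone values and the event name are illustrative choices, not our exact setup.

```javascript
// Minimal sketch: report how far down the page a visitor actually scrolls.
// Milestones and event names are illustrative, not our production setup.
(function () {
  window.dataLayer = window.dataLayer || [];

  var milestones = [25, 50, 75, 100]; // percent of page height
  var reported = {};

  window.addEventListener('scroll', function () {
    var scrollable = document.documentElement.scrollHeight - window.innerHeight;
    if (scrollable <= 0) { return; }
    var depth = (window.scrollY / scrollable) * 100;

    // Fire each milestone only once per page view.
    milestones.forEach(function (m) {
      if (depth >= m && !reported[m]) {
        reported[m] = true;
        window.dataLayer.push({ event: 'scroll_depth', percent: m });
      }
    });
  }, { passive: true });
})();
```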

Is user-interaction event tracking a permanent solution?

Probably not.

A botnet could easily change the instructions it sends to its zombie users, telling them to fake human-like interactions (instructing each zombie user to move the mouse just once would fool our filter into thinking it was a real person).

We expect cybercriminals to keep making their software more sophisticated, so in the future we may need to take a machine-learning approach to this problem, tracking more information like mouse acceleration and direction, or fall back on a tool like CAPTCHA.
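To give a sense of what that richer signal might look like, here is a purely hypothetical sketch that samples mouse movements so features like speed (and, by extension, acceleration and direction changes) could later be fed to a classifier. The sample cap and field names are assumptions; this is not a tool we have actually built.

```javascript
// Hypothetical sketch: sample mouse positions so movement features could be
// computed and fed to a classifier later. Nothing here is production code.
(function () {
  var samples = [];
  var MAX_SAMPLES = 500; // cap memory use; the value is arbitrary

  document.addEventListener('mousemove', function (e) {
    if (samples.length >= MAX_SAMPLES) { return; }
    samples.push({ x: e.clientX, y: e.clientY, t: performance.now() });
  }, { passive: true });

  // Example feature: average pointer speed in pixels per millisecond.
  function averageSpeed() {
    var total = 0;
    var pairs = 0;
    for (var i = 1; i < samples.length; i++) {
      var dx = samples[i].x - samples[i - 1].x;
      var dy = samples[i].y - samples[i - 1].y;
      var dt = samples[i].t - samples[i - 1].t;
      if (dt > 0) {
        total += Math.sqrt(dx * dx + dy * dy) / dt;
        pairs += 1;
      }
    }
    return pairs > 0 ? total / pairs : 0;
  }

  // averageSpeed() could be reported alongside the interaction counts above.
  window.__mouseFeatureSketch = { samples: samples, averageSpeed: averageSpeed };
})();
```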

Alas, we can never outsmart some robots.

For anyone who wants to implement their own robot tracking and traffic removal segments — we’re looking at you, data nerds and advanced marketers — we’ll be releasing a white paper explaining exactly how we did it. Stay tuned!