How to Identify (and Remove) Spam and Bot Traffic From Google Analytics

Analytics are only as valuable as their accuracy. Anything that adds inaccurate information to the collected data makes future decision making that much harder. How can you trust that your premises and conclusions are sound if the data you see may not be real? One of the most common ways for data to be corrupted is through hits from bots and spam sources.

Spam traffic artificially inflates sessions and bounce rate while artificially deflating pages per session. Big and small accounts alike are susceptible to spam bots. While most are benign in intent (e.g. web crawlers), many still log their visits in your analytics and heavily skew performance data. If this traffic is not properly identified, you may think your site is generating more traffic and less per-user engagement than it actually is. For Google Analytics, there are three typical levels of spam bots, each with its own level of difficulty to identify and remove from future data.

1. The first, and easiest, bots to identify are the major data center crawlers. These typically show up as disproportionate traffic from a single city. The most common examples are Ashburn, Virginia, and Coffeyville, Kansas. Some sites may see more users from these locations than actually live there.

How do we fix this? At the “view settings” level, enable “Exclude all hits from known bots and spiders.”

2. The second type of spam traffic shows up as hits from invalid sources. These are “referrals” that exist only to lure the analyst into visiting the referring site. Common examples include darodar.com, seo-buttons.com, monetize-traffic.com and numerous domains ending in .xyz.

The best way to fix these is with a custom filter that excludes the source. The offending sources will vary from site to site, but a good starting place is this pattern:

\.xyz|social\-|\-social|button(.*)\-|\-button|\-seo|seo\-|monetize\-|video\-|\-video|darodar
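Before applying a pattern like this as an exclude filter, it can help to run it against a list of your own referral sources and see what it would catch. The following is a minimal Python sketch, not part of Google Analytics itself. It assumes the filter behaves like an unanchored, case-insensitive regex search (check your own filter settings), and the sample sources other than the three spam domains named above are hypothetical.

```python
import re

# Pattern copied verbatim from the article.
SPAM_SOURCE = re.compile(
    r"\.xyz|social\-|\-social|button(.*)\-|\-button|\-seo|seo\-"
    r"|monetize\-|video\-|\-video|darodar"
)


def is_spam_source(source: str) -> bool:
    """Return True if the pattern matches anywhere in the source string.

    Assumption: the analytics filter is unanchored and case-insensitive,
    so the source is lowercased and searched rather than fully matched.
    """
    return SPAM_SOURCE.search(source.lower()) is not None


# Sample sources: the first three come from the article, the rest are
# hypothetical placeholders for your own referral list.
samples = [
    "darodar.com",
    "seo-buttons.com",
    "monetize-traffic.com",
    "example.xyz",
    "google.com",
    "news.example.org",
]

flagged = [s for s in samples if is_spam_source(s)]
for source in flagged:
    print("would be excluded:", source)
```

Note that fragments such as `video\-` are broad: a legitimate referrer with a name like my-video-blog.com would also match, so review the flagged list before turning the pattern into a filter.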

3. The hardest type of spam traffic to identify is the bot network. These may hit a site from a variety of locations, landing pages and device types, ultimately showing a relatively normal distribution. However, they are identifiable as spam bots because they create spikes of “Direct” traffic with high bounce rates that hit a large number of landing pages (most legitimate direct traffic lands on the homepage or top-level category pages).

Because their traditional user data is distributed so normally, they are also difficult to remove. Excluding a broad set of cities can end up excluding legitimate traffic. The solution lies in finer data segmentation using dimensions you likely do not use often within Google Analytics: browser size, browser version, network domain, screen resolution and Flash version. We frequently see outdated or “(not set)” values in these dimensions for bot sources.

Create custom reports for Direct traffic with these secondary dimensions. The goal is to find rows that have high traffic volumes but a 100% bounce rate and no engagement; those rows can then be excluded using filters. It may take a half dozen combinations before you are comfortable that you have identified most of the bots. If this is still insufficient, then you may have to identify specific user agents (we will address this in a future post).
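If you export those custom reports, the search for suspicious rows can be partly automated. The sketch below is one possible approach under stated assumptions: the rows, field names (`browser_version`, `screen_resolution`, `sessions`, `bounces`) and the `min_sessions` threshold are all hypothetical and would need to match your own export.

```python
from collections import defaultdict

# Hypothetical rows exported from a Direct-traffic custom report.
rows = [
    {"browser_version": "(not set)", "screen_resolution": "(not set)",
     "sessions": 420, "bounces": 420},
    {"browser_version": "98.0.4758.102", "screen_resolution": "1920x1080",
     "sessions": 310, "bounces": 140},
    {"browser_version": "(not set)", "screen_resolution": "800x600",
     "sessions": 95, "bounces": 95},
    {"browser_version": "15.3", "screen_resolution": "390x844",
     "sessions": 12, "bounces": 12},
]


def suspect_segments(rows, dimensions, min_sessions=50):
    """Group rows by the given dimensions and return the groups that have
    at least min_sessions sessions and a 100% bounce rate, largest first."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sessions, bounces]
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += row["sessions"]
        totals[key][1] += row["bounces"]
    suspects = [
        (key, sessions)
        for key, (sessions, bounces) in totals.items()
        if sessions >= min_sessions and bounces == sessions
    ]
    return sorted(suspects, key=lambda item: item[1], reverse=True)


suspects = suspect_segments(rows, ["browser_version", "screen_resolution"])
for key, sessions in suspects:
    print(key, sessions)
```

The volume threshold keeps small rows out of the list: a handful of real visitors who all happened to bounce is not evidence of a bot, while hundreds of sessions with no engagement at all is worth a closer look.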

For this third type, first make sure that any spike in Direct traffic comes from bots and not from untagged marketing. If you are running marketing campaigns without UTM tags, those visits may well be logged as Direct traffic.
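Tagging campaign links consistently is easier with a small helper. The following is a minimal sketch using only the Python standard library. The function name `add_utm`, the example URL and the campaign values are hypothetical; `utm_source`, `utm_medium` and `utm_campaign` are the standard campaign parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url, source, medium, campaign):
    """Append utm_source, utm_medium and utm_campaign to a URL,
    keeping any query parameters that are already present."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical landing page and campaign values.
tagged = add_utm(
    "https://www.example.com/landing?ref=1",
    "newsletter",
    "email",
    "spring_sale",
)
print(tagged)
```

Links built this way should be attributed to the named campaign rather than falling into Direct, which makes a genuine bot spike easier to tell apart from a marketing push.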

Please note that while the steps above are critical to maximizing data relevancy for your reporting views, we always recommend keeping an Unfiltered/Raw View for every Analytics property. Do not apply any of these solutions to that view, which should never be used for reporting, so that you can always compare against the inbound data exactly as Google receives it.
