How to Identify Cities with Bot, Crawler, VPN, and Spam Traffic in GA4
Image Credit: Famartin on Wikipedia
Have you seen a large amount of traffic from Ashburn, Virginia, in your Google Analytics? You may have also noticed a ton of visits from other cities that aren’t part of your usual market, such as Boardman, Oregon or San Jose, California. In this article, we’ll help you identify which cities are actually sources of bot, crawler, spam, VPN (virtual private network), or other misleading traffic in your GA4 and build filters to find where your most valuable users are coming from.
Why Are Locations in Google Analytics Not Always Accurate?
The shortest answer to why location information in GA4 is skewed is that internet infrastructure, VPN usage, hackers, and large tech companies cause inconsistencies. For instance, Ashburn, VA is known as “Data Center Alley” and is often cited as handling 70% of the world’s internet traffic, much of it through Amazon Web Services (AWS). For a fuller picture, we explore how both bot and human traffic affect Google Analytics’ location data.
Robotic Traffic: Beneficial to Criminal
As the internet has grown, so has the amount of spam and the demand for information — with bot traffic increasing alongside human traffic. Bots have various objectives, from helpful indexing by search engines and AI to more malicious hacking or spamming. Google Analytics has gotten better at identifying and filtering out much of this bot traffic, with GA4 doing so automatically. Of course, some of these slip through the cracks.
Human Traffic: Yearning for Privacy
Of course, not everything is because of bots. Usually in the pursuit of privacy or security, many people choose to use tools like VPNs to hide their true locations, making it seem like they’re browsing from somewhere else. This could be a data center in Ashburn, Virginia or even Amsterdam, Netherlands. Similarly, since 2022, Apple’s Safari browser automatically masks IP addresses, so GA4 often shows users in a more general area, such as a state but not a city.
Common Cities in the United States that Skew Google Analytics
There are many hotspots that skew GA4 location data, and they’re constantly changing. It’s important to note that bot traffic can originate from anywhere and legitimate traffic can come from these cities, too. However, here are cities that TwoSix Digital has found to be consistently associated with discrepancies in Google Analytics:
- Ashburn, Virginia
- Coffeyville, Kansas
- Cheyenne, Wyoming
- Boardman, Oregon
- Moses Lake, Washington
- Abbeville, Louisiana
- Council Bluffs, Iowa
- Hampton, Arkansas
- Pembroke Pines, Florida
- Hialeah Gardens, Florida
- Chicago, Illinois
- Seattle, Washington
- Los Angeles, California
- San Jose, California
- Miami, Florida
- Denver, Colorado
- Kansas City, Missouri
- Dallas, Texas
- San Antonio, Texas
- New York, New York
- Columbus, Ohio
- Atlanta, Georgia
- Charlotte, North Carolina
- Salt Lake City, Utah
- Boston, Massachusetts
- Phoenix, Arizona
- Washington DC
If you look at this list and think that almost every area in the United States is represented, you’re not wrong. It can be very difficult to home in on exactly where users are coming from. Luckily, there is a way.
How to Identify Skewed Location Data in Google Analytics
Since there are various causes of skewed traffic, identifying such traffic will also need to be broken down into a few different types. And, like everything so far, these are more guidelines and trends than hard-and-fast rules.
Identifying Bots & Crawlers in GA4
Generally speaking, bot, crawler, or spider traffic will come in via specific types of tech. Usually, this type of traffic will have very low Engagement rates and Average Engagement Time in GA4. Note that these methods should all be tools in your toolbelt and used in conjunction.
1. Using Screen Resolution to Identify Location Anomalies in Google Analytics
Employing Screen Resolutions is a trending tactic to identify bot traffic. You can apply it as a “secondary dimension” in GA4 and see the sizes that have large amounts of traffic, but very low engagement numbers compared to other screen resolutions. The most common offenders are:
- 1280×720
- 800×600
- 2000×2000
- 1024×768
- 1024×2160
- 1220×1280
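As a rough illustration of this check, the sketch below flags watch-list resolutions that carry meaningful volume but engage far below the site-wide average. It assumes you have exported a report of sessions and engagement rate by screen resolution; the `ResolutionRow` type, the thresholds, and the sample numbers are hypothetical, and resolutions are written with a plain "x".

```python
from dataclasses import dataclass

# Resolutions listed above as common offenders, written with a plain "x".
SUSPICIOUS_RESOLUTIONS = {
    "1280x720", "800x600", "2000x2000", "1024x768", "1024x2160", "1220x1280",
}

@dataclass
class ResolutionRow:
    resolution: str
    sessions: int
    engagement_rate: float  # 0.0 to 1.0

def flag_resolutions(rows, min_sessions=100, rate_factor=0.5):
    """Return watch-list resolutions with enough sessions whose engagement
    rate is below rate_factor times the session-weighted average."""
    total = sum(r.sessions for r in rows)
    if total == 0:
        return []
    overall = sum(r.engagement_rate * r.sessions for r in rows) / total
    flagged = []
    for r in rows:
        normalized = r.resolution.replace("×", "x")  # tolerate the typographic sign
        if (normalized in SUSPICIOUS_RESOLUTIONS
                and r.sessions >= min_sessions
                and r.engagement_rate < rate_factor * overall):
            flagged.append(normalized)
    return flagged

# Hypothetical export rows, for illustration only.
rows = [
    ResolutionRow("1920x1080", 1200, 0.62),
    ResolutionRow("390x844", 900, 0.58),
    ResolutionRow("1280x720", 700, 0.04),  # high volume, almost no engagement
    ResolutionRow("800x600", 40, 0.02),    # on the list, but too few sessions
    ResolutionRow("1024x768", 300, 0.55),  # on the list, but engaged
]
flagged = flag_resolutions(rows)
print(flagged)
```

Comparing against the site-wide average rather than a fixed cutoff keeps the check relative, so an engaged resolution that happens to be on the list is left alone.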
2. Using Operating System to Find Skewed Geos in GA4
Utilizing Operating System is another common method to identify such traffic. The most common offenders are:
- Linux
- Windows (less so)
However, make sure it’s relative. For instance, excluding all Linux traffic isn’t the best idea. As we see in the example, traffic from Boardman using Linux is actually fairly engaged, while the Macintosh and Windows operating systems lag behind.
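One way to keep the comparison relative is to measure each operating system against its own city's overall engagement rate instead of excluding an operating system outright. The sketch below assumes an export of sessions and engaged sessions by city and operating system; the function name, the 0.5 factor, and every number are hypothetical.

```python
from collections import defaultdict

def lagging_os_by_city(rows, factor=0.5):
    """rows: (city, operating_system, sessions, engaged_sessions) tuples.
    Returns (city, operating_system) pairs whose engagement rate is below
    factor times the overall engagement rate of that city."""
    totals = defaultdict(lambda: [0, 0])  # city -> [sessions, engaged_sessions]
    for city, _os_name, sessions, engaged in rows:
        totals[city][0] += sessions
        totals[city][1] += engaged

    flagged = []
    for city, os_name, sessions, engaged in rows:
        city_sessions, city_engaged = totals[city]
        if sessions == 0 or city_engaged == 0:
            continue  # nothing to compare against
        city_rate = city_engaged / city_sessions
        if engaged / sessions < factor * city_rate:
            flagged.append((city, os_name))
    return flagged

# Hypothetical numbers, loosely mirroring the Boardman pattern described above.
rows = [
    ("Boardman", "Linux", 200, 120),
    ("Boardman", "Windows", 500, 20),
    ("Boardman", "Macintosh", 300, 15),
    ("Detroit", "Linux", 50, 30),
    ("Detroit", "Windows", 400, 230),
]
lagging = lagging_os_by_city(rows)
print(lagging)
```

With these sample rows, Linux is never flagged, while Windows and Macintosh in Boardman are, which is the kind of result a blanket Linux exclusion would miss.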
3. Using Channel Group to Isolate Inaccurate Cities in GA4
One of the easiest day-to-day approaches is to use Session default channel groups. A quick tip here is to look out for when the number of users and new users is almost exactly the same. Typically, that indicates bots arriving as brand-new users on every visit. The usual suspects are:
- Direct
- Unassigned
Again, identify in relation to other channel groups. Coffeyville has very poor Unassigned traffic — 435 users, all new, spending an average of 0 seconds on a website is very improbable. With Boardman, we see a similar situation for Direct traffic.
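The users-versus-new-users tip can be sketched as a simple ratio check. The example assumes an export by city and channel group with users, new users, and average engagement time. The Coffeyville figures come from the example above; the other rows, the field names, and both thresholds are hypothetical.

```python
WATCHED_CHANNELS = {"Direct", "Unassigned"}

def suspicious_channel_rows(rows, new_user_ratio=0.95, max_engagement_seconds=1.0):
    """Flag (city, channel) pairs in watched channels where nearly every
    user is new and average engagement time is close to zero."""
    flagged = []
    for row in rows:
        if row["users"] == 0:
            continue
        ratio = row["new_users"] / row["users"]
        if (row["channel"] in WATCHED_CHANNELS
                and ratio >= new_user_ratio
                and row["avg_engagement_seconds"] <= max_engagement_seconds):
            flagged.append((row["city"], row["channel"]))
    return flagged

rows = [
    # From the article: 435 users, all new, 0 seconds of engagement.
    {"city": "Coffeyville", "channel": "Unassigned", "users": 435,
     "new_users": 435, "avg_engagement_seconds": 0.0},
    # Hypothetical rows for contrast.
    {"city": "Boardman", "channel": "Direct", "users": 300,
     "new_users": 298, "avg_engagement_seconds": 0.4},
    {"city": "Detroit", "channel": "Direct", "users": 400,
     "new_users": 250, "avg_engagement_seconds": 35.0},
    {"city": "Detroit", "channel": "Organic Search", "users": 800,
     "new_users": 500, "avg_engagement_seconds": 52.0},
]
suspects = suspicious_channel_rows(rows)
print(suspects)
```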
Identifying Human Traffic that Skews Data in GA4
While detecting bot, crawler, and spider traffic is nuanced, recognizing types of human traffic that can skew data in Google Analytics is even more so. The primary culprits are users who browse through a VPN and those who use Apple’s Safari browser.
1. Using Nearby Cities to Identify VPN Location Inconsistencies in Google Analytics
Generally, VPN (Virtual Private Network) traffic detection is very complex. However, it can potentially stand out because it’ll meet the following three criteria:
- A city that you don’t have any connection with
- A city in the “Common Cities in the United States that Skew Google Analytics” list above
- A city’s traffic is out of proportion to its population
For instance, if you’re an organization in Michigan and you see a lot of traffic coming from Texas, then investigate it further. In the example above, we see 38% of sessions attributed to Dallas, Texas (part of the list) — while it’s the 3rd largest city in Texas, comprising about 4.3% of the state’s population. On the other hand, Houston has 7% of traffic and comprises 7.7% of the state’s population.
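The proportion check in this example is plain arithmetic: divide a city's share of traffic by its share of population. The sketch below uses the Dallas and Houston percentages quoted above and assumes both shares are measured against the same base (the state of Texas). The function name and the 3x threshold are hypothetical.

```python
def representation_ratio(traffic_share_pct, population_share_pct):
    """How many times larger a city's traffic share is than its population share."""
    if population_share_pct <= 0:
        raise ValueError("population share must be positive")
    return traffic_share_pct / population_share_pct

# (traffic share %, population share %) as quoted in the example above.
cities = {
    "Dallas": (38.0, 4.3),
    "Houston": (7.0, 7.7),
}

ratios = {
    city: round(representation_ratio(traffic, population), 1)
    for city, (traffic, population) in cities.items()
}
over_represented = [city for city, ratio in ratios.items() if ratio >= 3.0]

print(ratios)            # {'Dallas': 8.8, 'Houston': 0.9}
print(over_represented)  # ['Dallas']
```

A ratio near 1 means traffic roughly tracks population. Dallas at almost 9 times its population share is worth a closer look, while Houston is in line with expectations.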
2. Using Browser and Device Brand to Identify Apple Privacy Location Irregularities in Google Analytics
In essence, spotting locations masked by Apple’s privacy measures will work in the same manner as detecting VPN traffic. GA4 has adapted to utilize other signals for location data. Yet, some data may still not be accurate, so be on the lookout for traffic that meets these criteria:
- A city that you don’t have any connection with
- A city in the “Common Cities in the United States that Skew Google Analytics” list above
- A city’s traffic is out of proportion to its population
- A city’s traffic is primarily from Browser of Safari or Device Brand of Apple
Looking at the example above, Atlanta has 90% of users utilizing Safari browsers. While Safari is one of the most used browsers in the US, with a 31.2% share across all devices and 55.2% on mobile devices, it probably isn’t dominating Atlanta by this large of a margin.
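The Safari comparison can be sketched the same way: compute a city's Safari share from users by browser and compare it with the national figures quoted above. The 90% Atlanta share is from the example, but the user counts behind it, the Detroit row, and the 15-point margin are hypothetical.

```python
# US Safari shares quoted in the article.
US_SAFARI_SHARE_ALL_DEVICES = 0.312
US_SAFARI_SHARE_MOBILE = 0.552

def safari_share(users_by_browser):
    """Fraction of a city's users whose browser is Safari."""
    total = sum(users_by_browser.values())
    return users_by_browser.get("Safari", 0) / total if total else 0.0

def looks_safari_heavy(users_by_browser, margin=0.15):
    """True when Safari's share exceeds even the mobile benchmark by the margin."""
    return safari_share(users_by_browser) > US_SAFARI_SHARE_MOBILE + margin

# Hypothetical counts chosen to reproduce the 90% share from the example.
atlanta = {"Safari": 900, "Chrome": 80, "Edge": 20}
# Hypothetical city with a more typical mix.
detroit = {"Safari": 340, "Chrome": 560, "Edge": 100}

print(looks_safari_heavy(atlanta), looks_safari_heavy(detroit))
```

Using the higher mobile benchmark as the baseline is a deliberately conservative choice, so a mobile-heavy audience is not flagged by mistake.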
How to Filter Out Inaccurate Locations in GA4
By applying these tactics, you won’t be able to filter out all the inaccurate location data in Google Analytics. Instead, the aim is to filter out enough to be able to make informed strategic choices.
The Easiest Filter in Google Analytics to Rule Out Location Inaccuracies
In GA4, it’s possible to build filters that apply to the report you’re on. At the top of each report, under its name, there’s a button that says, “Add filter +”. The easiest filter we’ve found is very oversimplified but can be useful. It’s formulated in the following way:
- Screen Resolution doesn’t exactly match: 1280×720, 800×600, 2000×2000, 1024×768, 1024×2160, 1220×1280
- City doesn’t exactly match: Ashburn, Coffeyville, Cheyenne, Boardman, Moses Lake, Abbeville, Council Bluffs, Hampton, Pembroke Pines, and Hialeah Gardens
While this will certainly filter out some genuine traffic, you’ll be able to make better strategic and tactical decisions by using it.
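If you export report rows and want to apply the same exclusions outside the GA4 interface, the logic is a pair of membership tests. This is a minimal sketch in plain Python, not GA4's own filter syntax; the field names `screen_resolution` and `city` and the sample rows are hypothetical.

```python
EXCLUDED_RESOLUTIONS = {
    "1280x720", "800x600", "2000x2000", "1024x768", "1024x2160", "1220x1280",
}
EXCLUDED_CITIES = {
    "Ashburn", "Coffeyville", "Cheyenne", "Boardman", "Moses Lake",
    "Abbeville", "Council Bluffs", "Hampton", "Pembroke Pines", "Hialeah Gardens",
}

def keep_row(row):
    """Keep a row only if neither its resolution nor its city is excluded."""
    resolution = row["screen_resolution"].replace("×", "x")
    return resolution not in EXCLUDED_RESOLUTIONS and row["city"] not in EXCLUDED_CITIES

# Hypothetical exported rows.
rows = [
    {"city": "Ashburn", "screen_resolution": "1920x1080", "sessions": 950},
    {"city": "Detroit", "screen_resolution": "800x600", "sessions": 120},
    {"city": "Detroit", "screen_resolution": "390x844", "sessions": 640},
]
kept = [row for row in rows if keep_row(row)]
print(kept)
```

Matching on city name alone is part of why this filter is oversimplified: "Hampton" would also drop traffic from any other city named Hampton.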
The Most Precise Method to Exclude Distortions in GA4 Traffic
On the other hand, using Google Analytics in Looker Studio to form more precise filters is recommended, as GA4 has very limited functionality in this respect. Looker Studio is much more flexible and allows layering of filters. An example of layered filters you could use is:
- Filter 1: Low Engagement and Suspicious Location
  - Include “Average Engagement Time per Session” < 5 seconds
  - AND Exclude “City” exactly matches (list of suspicious cities)
- Filter 2: Suspicious Technology and Low Engagement
  - Exclude “Operating System” exactly matches “Linux”
  - AND Exclude “Screen Resolution” exactly matches (list of suspicious resolutions)
  - AND Exclude “Average Engagement Time per Session” < 5 seconds
  - AND Exclude “Events per Session” < 1
- Filter 3: High New User Rate from Specific Channels
  - Exclude “Session default channel group” exactly matches “Direct” OR “Unassigned”
  - AND Exclude “New users” / “Users” > 90% (adjust as needed)
  - AND Exclude “Average Engagement Time per Session” < 5 seconds
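To make the layering concrete, here is one possible reading of these filters as plain Python predicates over exported rows. It is not Looker Studio syntax. It assumes a row is dropped when every condition within at least one filter holds, and the field names, abbreviated city and resolution lists, and sample rows are all hypothetical.

```python
SUSPICIOUS_CITIES = {"Ashburn", "Coffeyville", "Boardman"}  # abbreviated list
SUSPICIOUS_RESOLUTIONS = {"1280x720", "800x600"}            # abbreviated list
LOW_ENGAGEMENT_SECONDS = 5

def low_engagement(row):
    return row["avg_engagement_seconds"] < LOW_ENGAGEMENT_SECONDS

def rule_location(row):
    # Filter 1: low engagement AND suspicious city.
    return low_engagement(row) and row["city"] in SUSPICIOUS_CITIES

def rule_technology(row):
    # Filter 2: Linux AND suspicious resolution AND low engagement AND few events.
    return (row["operating_system"] == "Linux"
            and row["screen_resolution"] in SUSPICIOUS_RESOLUTIONS
            and low_engagement(row)
            and row["events_per_session"] < 1)

def rule_new_users(row):
    # Filter 3: Direct or Unassigned AND more than 90% new users AND low engagement.
    if row["users"] == 0:
        return False
    return (row["channel"] in {"Direct", "Unassigned"}
            and row["new_users"] / row["users"] > 0.9
            and low_engagement(row))

RULES = {"location": rule_location, "technology": rule_technology,
         "new_users": rule_new_users}

def matched_rules(row):
    """Names of every filter whose conditions all hold for this row."""
    return [name for name, rule in RULES.items() if rule(row)]

# Hypothetical rows: likely bot, ordinary visitor, engaged Ashburn visitor.
bot = {"city": "Ashburn", "operating_system": "Linux", "screen_resolution": "1280x720",
       "channel": "Direct", "users": 100, "new_users": 99,
       "avg_engagement_seconds": 0.5, "events_per_session": 0.8}
visitor = {"city": "Detroit", "operating_system": "Windows", "screen_resolution": "1920x1080",
           "channel": "Organic Search", "users": 200, "new_users": 120,
           "avg_engagement_seconds": 48.0, "events_per_session": 6.2}
engaged_ashburn = dict(visitor, city="Ashburn", avg_engagement_seconds=65.0)

kept = [row for row in (bot, visitor, engaged_ashburn) if not matched_rules(row)]
print(len(kept))
```

Because each rule requires low engagement alongside another signal, the engaged Ashburn visitor is kept, which is the main advantage of layering over a blanket city exclusion.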
Mastering Your GA4 Data for Accurate Insights
By addressing the various nuances of bot traffic, VPN usage, and privacy measures, it’s possible to use your GA4 data more accurately to understand your website visitors, especially in terms of location information. The recommendations offered are generalizations, so be sure to regularly review your data, look for abnormalities in engagement metrics, and update your process as needed. With more precise data, you’ll be able to effectively measure your marketing success and make informed decisions based on a genuine understanding of your audience.