While analyzing data in Google Analytics, you may come across strange and suspicious traffic: a flood of visits with a 100% bounce rate, coming from referrers you have never heard of.

Most likely this is referral spam (also called referrer spam), a technique used by some black-hat SEO sites to try to trick search engines. A more detailed description can be found in the Wikipedia article on referrer spam.

A typical view in your Analytics console looks like this:

[Image: spam]

This phenomenon is annoying because it buries the interesting data under thousands of fake visits.

Several solutions have been suggested, from WordPress plugins to filters applied directly in Google Analytics. The approach we chose is a little more technical and works at the Apache level.

The idea is to use mod_rewrite to identify requests whose referrer is a known spammer and block them with a 403 Forbidden response.


<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer2\.org.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer1\.net.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer\.com.*$ [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>
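To get a feel for what these conditions match, the same patterns can be tried out in Python. This is only a rough sketch: it assumes that Python's `re` module behaves like Apache's regex engine for these simple patterns, the helper name `is_spam_referrer` is made up for illustration, and the domains are the placeholder names used in this article.

```python
import re

# Same patterns as the RewriteCond lines; re.IGNORECASE plays the role of [NC].
# The domains are placeholders from this article, not real spammers.
SPAM_PATTERNS = [
    r"^http(s)?://(www\.)?.*thespammer2\.org.*$",
    r"^http(s)?://(www\.)?.*thespammer1\.net.*$",
    r"^http(s)?://(www\.)?.*thespammer\.com.*$",
]
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in SPAM_PATTERNS]


def is_spam_referrer(referrer):
    """Return True if any pattern matches, like conditions chained with [OR]."""
    return any(pattern.search(referrer) for pattern in COMPILED)
```

Note that the `.*` before the domain makes the match quite broad: a referrer such as `http://legit.com/?q=thespammer.com` also matches, because the spammer's name may appear anywhere after the scheme.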

For a constantly updated list of spammers, you can use the public piwik/referrer-spam-blacklist repository on GitHub. Below you'll find a small template and a Python script that generate a configuration file for Apache.

Template to create .conf for Apache

This is a minimal template, meant to be included from another configuration file. Alternatively, you can prepare templates based on your own virtual hosts.

<IfModule mod_rewrite.c>
  RewriteEngine On
  $spammerList
  RewriteRule ^(.*)$$ - [F,L]
</IfModule>

Script

The script fills the template with the data downloaded from the repository. It is written in Python and can easily be scheduled.

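#!/usr/bin/env python3
from string import Template
from urllib.request import urlopen

SPAMMER_SOURCE = "https://raw.githubusercontent.com/piwik/referrer-spam-blacklist/master/spammers.txt"

# Load the Apache template that contains the $spammerList placeholder
with open('template', 'r') as template_file:
    template = Template(template_file.read())

# Download the blacklist (one domain per line) and drop empty lines
raw_list = urlopen(SPAMMER_SOURCE).read().decode('utf-8')
spammers = [line.strip() for line in raw_list.splitlines() if line.strip()]

# In string.Template, "$$" produces a literal "$" (the regex end anchor)
RULE = r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*$domain.*$$ [NC,OR]"
RULE_TEMPLATE = Template(RULE)
# The last condition must not carry the OR flag
LASTRULE = r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*$domain.*$$ [NC]"
LASTRULE_TEMPLATE = Template(LASTRULE)

formattedLines = []

for (i, line) in enumerate(spammers):
    # Escape dots so they match literally in the regular expression
    line = line.replace('.', r'\.')
    if i == len(spammers) - 1:
        formattedLines.append(LASTRULE_TEMPLATE.substitute(domain=line))
    else:
        formattedLines.append(RULE_TEMPLATE.substitute(domain=line))

output = template.substitute(spammerList="\n".join(formattedLines))
print(output)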
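The rule generation can also be tried without downloading anything. The following sketch isolates that step in a function; the name `build_conditions` and the extra `$flags` placeholder are assumptions made for illustration and are not part of the script above.

```python
from string import Template

# In string.Template "$$" is a literal "$", so each generated regex ends
# with a single "$". The $flags placeholder is an addition of this sketch.
CONDITION = Template(
    r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*$domain.*$$ [$flags]"
)


def build_conditions(domains):
    """Return one RewriteCond line per domain; all but the last carry OR."""
    domains = [d.strip() for d in domains if d.strip()]
    lines = []
    for i, domain in enumerate(domains):
        flags = "NC" if i == len(domains) - 1 else "NC,OR"
        escaped = domain.replace(".", r"\.")  # match dots literally
        lines.append(CONDITION.substitute(domain=escaped, flags=flags))
    return lines
```

Calling `build_conditions(["thespammer.com", "thespammer1.net"])` returns two lines, the first flagged `[NC,OR]` and the second `[NC]`, which is the shape mod_rewrite needs for a chain of alternative conditions.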

To verify that the configuration works, whichever protection method you chose, you can use wget.

Example of blocked call

Simulate an HTTP request whose referrer is a site on the spammer list; the response must be 403 Forbidden.

wget \
  --server-response \
  --spider \
  --referer='http://thespammer.com/' \
  https://www.opengate.biz

...
HTTP request sent, awaiting response... 
  HTTP/1.1 403 Forbidden
...

Example of successful call

Simulate an HTTP request whose referrer is a legitimate site (one not on the spammer list); the response must be 200 OK.

wget \
  --server-response \
  --spider \
  --referer='http://legitsite.com/' \
  https://www.opengate.biz

...
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
...
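If you prefer to run the same two checks from Python, for example next to the generator script, something along these lines should work. It is a minimal sketch using only the standard library; the helper names `build_request` and `status_for` are hypothetical, and it sends a HEAD request on the assumption that this is close enough to what wget does in spider mode.

```python
import urllib.error
import urllib.request


def build_request(url, referer):
    """Build a HEAD request carrying the given Referer header."""
    return urllib.request.Request(url, method="HEAD", headers={"Referer": referer})


def status_for(url, referer, timeout=10):
    """Return the HTTP status code the server answers with."""
    try:
        request = build_request(url, referer)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still the answer we want
        return error.code


# Example (performs real network calls, so it is left commented out):
# print(status_for("https://www.opengate.biz", "http://thespammer.com/"))
# print(status_for("https://www.opengate.biz", "http://legitsite.com/"))
```

With the blocking rules in place, the first call is expected to report 403 and the second 200, mirroring the wget output above.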