HOWTO Apply uegosdeben1ogrande to Web Analytics

Google Fixed Bing’s Work Product

Who is checking your work?

Google and Bing are in the midst of an intense dispute: Google alleges that Bing has been copying Google search results via the Internet Explorer 8 toolbar. What is relevant to web analytics is that we could see similar effects in our own work unless we take care to check it and watch how we work; potential solutions are below.

Nate Silver’s unbiased refereeing of the Bing and Google disagreement is the best I have come across; the most relevant portion reads:

Microsoft has not really disputed the claim — instead, they’ve said that it’s not such a big deal. Yes, Google search results — which it obtains by tracking the behavior of people who use its Internet Explorer 8 tool bar, perhaps along with other means — are one of the inputs that Bing uses, Microsoft says. But it also uses more than 1,000 others. The reason that Google’s results appear verbatim in response to nonsense queries like “uegosdeben1ogrande” is because in such a case, only the Google results are “relevant”; the other 999 variables don’t shed any light on the problem.

Besides the fact that Bing apparently trusts Google blindly, more than its own employees’ hard work, the most glaring omission is that Bing never bothered to check the facts on the internet. Even without being an expert in search, natural language processing or some other search specialty, common sense should prevail.

If a result appears for a query such as “uegosdeben1ogrande,” a string of characters with no basis to ever return a result, the machine that is allegedly learning should have thrown a flag for further investigation.

The machine should have been pinged that an outlier had been detected, and some sort of process should then have crawled the site to verify the relevancy of the search term. Instead, Google found out about the situation, set Bing up and outed them publicly.

You, the web analyst, do not have the luxury of a Google coming by your work station, checking the dashboards, reports and metrics you are sending out to ensure that they are optimized for both precision and accuracy.

Web Analytics

The online measurement and influence sector is essentially trying to do what Google does: process some sort of behavioral data from the visitor and optimize the ad or site for their utility. In the end, we hope to make it easier for them to buy more stuff faster; Google and Bing want to increase the utility and speed of their search in a similar fashion.

There are some amazing tools built by some amazing people in the web analytics space, but your job as a web analyst is to always remember:

Just because you can segment the sample, doesn’t mean you should.

While there may not be people using your site just to mess up your data (although that is not out of the question), the same principles Bing should have applied apply here:

  • When you are presented with outliers, what is the context of the outlier?
  • Was that outlier activity entirely within the WIDE range of the variation of human behavior?
  • Is the outlier statistically valid?
  • How much precision are you trading away for how much accuracy?
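
One way to start answering the first two questions is to flag candidate outliers mechanically and then look at each one in context. The sketch below is only an illustration, not a method from this post: it uses the modified z-score based on the median absolute deviation, with 3.5 as a commonly used cutoff, and the function name `flag_outliers` and the sample data are hypothetical.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that extreme values barely move.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no measurable spread, so nothing can be flagged this way
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical pages-per-visit counts for one small segment.
pages_per_visit = [3, 4, 2, 5, 3, 4, 6, 2, 3, 250]
print(flag_outliers(pages_per_visit))
```

A flagged value is a prompt to investigate, not a verdict: a 250-page visit may be a crawler, or it may sit within the wide range of human behavior.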

The further a web analyst runs down the rabbit hole chasing every minute detail of a slice of the population they are studying, the closer they get to returning “uegosdeben1ogrande” to someone.

Solutions

Knowing when you should continue to segment is a touch easier than knowing when your results are becoming untrustworthy.

If you are using the previously covered Analysis ToolPak for Excel, the bias and variance of your estimator should approach zero. If you are segmenting your sample and the variance goes up, take a minute to digest the problem and give it another go.
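
To see why variance climbs as a segment narrows, here is a minimal sketch, assuming the metric is a simple conversion rate and using the standard error of a proportion as the measure of uncertainty. The function name `conversion_rate_with_error` and all of the numbers are hypothetical.

```python
from math import sqrt

def conversion_rate_with_error(conversions, visits):
    """Return (rate, standard error) for a conversion rate."""
    rate = conversions / visits
    # Standard error of a proportion: sqrt(p * (1 - p) / n).
    std_error = sqrt(rate * (1 - rate) / visits)
    return rate, std_error

# Hypothetical numbers: the full sample, then an ever narrower slice of it.
samples = [
    ("all visits", 400, 20000),
    ("paid search", 90, 4000),
    ("paid search, mobile, weekend", 3, 120),
]
for name, conversions, visits in samples:
    rate, err = conversion_rate_with_error(conversions, visits)
    print(f"{name}: {rate:.3%} +/- {err:.3%}")
```

The narrowest slice reports a rate similar to the others, but its standard error is more than ten times that of the full sample, which is the signal to stop and reconsider the segment.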

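In Python, no disrespect to R, we can implement the Gini impurity and the entropy with a trivial amount of effort. Note that the functions below compute the Gini impurity and Shannon entropy of a list of labels, which are not the same measures as the Gini coefficient and the generalized entropy index: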

The Gini impurity:

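def giniimpurity(l):
    # Probability that two items drawn at random from l have different labels.
    total = len(l)
    # Count how many times each distinct label occurs.
    counts = {}
    for item in l:
        counts.setdefault(item, 0)
        counts[item] += 1
    imp = 0
    # Loop over the distinct labels (the keys of counts), not over every item in l.
    for j in counts:
        f1 = float(counts[j]) / total
        for k in counts:
            if j == k:
                continue
            f2 = float(counts[k]) / total
            imp += f1 * f2
    return imp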

The entropy:

from math import log

def entropy(l):
    # Shannon entropy of the labels in l, in bits.
    log2 = lambda x: log(x) / log(2)
    total = len(l)
    # Count how many times each distinct label occurs.
    counts = {}
    for item in l:
        counts.setdefault(item, 0)
        counts[item] += 1
    ent = 0.0
    for i in counts:
        p = float(counts[i]) / total
        ent -= p * log2(p)
    return ent

Both are widely available; I first came across them coded in Python in the excellent "Programming Collective Intelligence" by Toby Segaran. Written entirely in Python, it is a must-read for analysts looking for inspiration on a project or study.

The goal is to reduce the noise in a segment while maintaining statistical significance: lower the Gini impurity, the entropy, or both, and you are on your way to a better study of your data.
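
As a rough illustration of "lowering the entropy", the sketch below compares the entropy of an unsegmented sample with the size-weighted entropy of its segments. It defines its own compact `entropy` so that it runs by itself; `weighted_entropy` and the visit data are hypothetical, and this is one possible way to score a segmentation rather than the only one.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def weighted_entropy(segments):
    """Average entropy of the segments, weighted by segment size."""
    total = sum(len(s) for s in segments)
    return sum(len(s) / total * entropy(s) for s in segments)

# Hypothetical outcome per visit, before and after two different segmentations.
all_visits = ["buy"] * 4 + ["bounce"] * 4
by_source = [["buy"] * 4, ["bounce"] * 4]                # separates the outcomes
by_browser = [["buy", "buy", "bounce", "bounce"]] * 2    # explains nothing

print(entropy(all_visits))
print(weighted_entropy(by_source))
print(weighted_entropy(by_browser))
```

A segmentation that leaves the weighted entropy where it started, as the by-browser split does here, has added slices without removing any noise.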

I’ll cover how to test a segment for statistical significance sometime next week.

eMetrics Contest

Don’t forget to spread the word about the free eMetrics pass, and please enter if you could use the pass yourself.

Submittal Instructions

  • Send an email to emetrics@michaeldhealy.com
    • With the subject “Submittal”
    • ATTACH IMAGE FILE AND TEXT DOCS
    • On or before Friday February 11, 2011 12:00 Midnight PST
    • Winner notified Monday February 14 at some point BY EMAIL
  • Required In The Email:
    • Your Name
    • Your Email
    • Your Phone Number
    • Your Physical Address
    • FEEDBACK YES or NO
  • Optional:
    • Your Personal Website
    • Your Twitter Account
    • Your LinkedIn Profile

Questions?

Email me at emetrics@michaeldhealy.com about the contest, mdh@michaeldhealy.com otherwise.

