Let me start with a confession: I'm not a political wonk. I don't watch cable news. I don't follow political pundits on Twitter. For the most part, I keep my political opinions to myself. So while I tend to shield myself from the presidential campaign rhetoric, I am fascinated by one element. Probably because it appeals to the I-O Psychologist in me (and no - it isn't trying to understand the personality profile of a person driving the late 90's Dodge Caravan plastered with Gore/Lieberman, Kerry/Edwards, Franken '08, and "Vote Nader" bumper stickers – although I have pondered that).
What fascinates me is the competition among data scientists to predict the outcome of the election.
In 2008, a colleague of mine told me about Nate Silver's FiveThirtyEight blog. Not for the politics, for the stats! If you aren’t familiar with Silver’s work, let me summarize what caught my attention:
Silver successfully predicted the winners in 49 of the 50 states in the 2008 presidential election.
In 2012, he correctly predicted the winners in all 50 states plus the District of Columbia.
If that doesn’t impress you, you should know that hundreds of individual polls in both elections were all over the (electoral) map, and no individual poll came remotely close to being as accurate as Silver and his team at FiveThirtyEight.
Many polls even predicted the wrong candidate to win, and those that correctly picked the winner still flubbed many of the actual state outcomes.
Did Nate Silver have some magical political foresight enabling him to predict the election outcome? Was he just lucky? Were the other pollsters using flawed research methods? The answer to these questions is a resounding "No."
Nate Silver isn't some political mastermind who knows more about the complex factors that contribute to how voters will actually vote. The individual polls (or most of them, anyway) use very sophisticated, solid research methodologies with representative samples. And while some of these incorrect forecasts may be manufactured political spin used to motivate people to get to the polls, the reason that even objective, non-partisan pollsters fail comes down to sampling and the margin of error inherent in any sample of a population.
Why? Simply stated, humans are complicated, and predicting human behavior is extremely complicated.
By the way, wouldn’t, “I’m complicated” be a great excuse to use when you forget to do something? Here is an example of a recent conversation I had:
My (lovely) wife: “Did you call the guy about the diseased tree in our yard?”
Me: “Oops! No, but I’ll do it tomorrow, I promise.”
My (incredibly forgiving and patient) wife: “Ok, but please don’t forget, because we have 30 days to remove it or the city will fine us.”
Me (what I could have said): “I didn’t forget; my behavior is just really complicated.”
You see? We’re complicated. Even in a situation where I should be motivated to do something or face a consequence (like call the tree guy or face a fine from the city), and I even committed to making the call, I didn’t. My behavior may be a little more challenging to predict than most, but I think you get the point.
Polls face countless challenges when predicting election outcomes, but I’ll focus on one of the biggest challenges: Margin of Error.
Suppose a poll reports that 48% of voters are likely to vote for Candidate A, while 45% of voters are likely to vote for Candidate B. That’s a 3-point lead for Candidate A, right? Wrong.
Because the poll only used a sample of the electorate, there’s a margin of error in its results. Depending on a poll’s specific sample, that margin of error could be 5 or more percentage points for each candidate! It is financially and logistically impossible for a single poll to predict the voting outcome with a margin of error small enough that a 3-point lead in the poll actually meant a 3-point lead in the whole population. You’d need to poll tens or hundreds of thousands of voters, yet polling organizations struggle to get even 500 voters in their sample.
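To make that concrete, here’s a minimal sketch of the standard margin-of-error calculation for a proportion (the numbers are hypothetical; real polls also adjust for weighting and design effects, which this ignores):

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# A typical poll of 500 voters where 48% favor Candidate A:
moe = margin_of_error(0.48, 500)
print(f"+/-{moe:.1%}")  # roughly +/-4.4 percentage points
```

With a margin of error near 4.4 points, the "48% vs. 45%" poll above can't distinguish a real 3-point lead from a dead heat.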
This is where Silver's methodology excels:
Rather than do his own sampling research (which would have its own margin of error), he aggregates data from other polls and creates meta-samples that reduce the margin of error in his predictions.
Of course, there’s more to his method than simply aggregating data, but let’s focus on this element.
In the end, his accuracy doesn’t come from some kind of political prescience, but from reducing the margin of error in his estimate by including larger, more representative and diverse samples.
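A toy illustration of why aggregation helps, using made-up numbers: pooling ten independent polls of 500 voters each behaves roughly like one poll of 5,000, shrinking the margin of error. (Silver's actual model does far more, weighting polls by quality and adjusting for house effects; this sketch shows only the sample-size effect.)

```python
import math

def moe(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical: ten independent polls of 500 voters each, all near 48%.
polls = [(0.48, 500)] * 10

single = moe(0.48, 500)                          # one poll alone
pooled_n = sum(n for _, n in polls)              # meta-sample of 5,000
pooled_p = sum(p * n for p, n in polls) / pooled_n
pooled = moe(pooled_p, pooled_n)
print(f"one poll: +/-{single:.1%}, pooled: +/-{pooled:.1%}")
```

The pooled estimate's margin of error drops to roughly a third of a single poll's, which is why a 3-point lead that is noise in one poll can be a real signal in an aggregate.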
So, what can this teach us about employee assessments?
A question I'm often asked is, "Can we put our top employees through the assessment so we can prove that the assessment works?" While I can appreciate the request, there’s actually very little we can learn from a study with such a small sample size. Remember when I said predicting human behavior (including job performance) is very difficult?
Just as in election polls, there’s a margin for error in studying how assessments predict job performance. When we develop assessments, our studies use very large sample sizes (200+ individuals per organization) from several organizations. Even if we saw results from an individual study demonstrating that our tests are highly predictive of job performance in that sample, we don't rely on that single study. Why? Margin of error! After a few high fives, we continue conducting several studies with multiple clients, each with samples of several hundred employees, and then examine the pattern of prediction across those studies. In this way, we're following Silver's methodology.
Aggregating multiple studies into a meta-analysis tells a more accurate and complete story about the true predictive power of these assessments. Just as Silver can be more confident in the story his data tells through aggregated polling, we know our story will hold up when our clients use our assessments to make critical talent decisions.
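The first step of that kind of meta-analysis can be sketched in a few lines: a sample-size-weighted mean of each study's validity coefficient (the correlation between assessment score and job performance). The studies below are hypothetical, and a full Hunter-Schmidt meta-analysis would also correct for range restriction and measurement unreliability, which this omits:

```python
# Hypothetical (r, N) pairs: validity coefficient and sample size per study.
studies = [(0.28, 250), (0.35, 310), (0.22, 205), (0.31, 480)]

# Sample-size-weighted mean validity across the studies.
total_n = sum(n for _, n in studies)
mean_r = sum(r * n for r, n in studies) / total_n
print(f"N = {total_n}, weighted mean validity r = {mean_r:.2f}")
```

The weighting matters: a lucky result from a 3-person "study" would barely budge an estimate built on over a thousand employees.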
I realize this isn't the response most of my clients want to hear, but I challenge them to examine a vendor's tests by looking at the pattern of prediction across multiple studies (or meta-analyses). Any test vendor can get lucky and show a single study with a small sample “proving” that their test works. Heck, we can probably get lucky and go a perfect 3-for-3 in proving the assessment correctly identified the 3 employees you put through the test as high potential.
I’ve spent my entire career in the talent assessment industry, and my job security relies on successfully predicting job performance with assessments. Even so, I’ll be the first to admit that no test can be 100% accurate. If test vendors speak in absolutes, telling you their tests are always 100% accurate, they're lying (or they're about to become really rich). And if they tell you they can prove it by putting a few of your employees through the assessment, remember this: all studies have a margin of error, and results from tiny samples can be misleading. The best way to know that the tests will work for you is to see aggregated proof across multiple studies, so you can be confident that the results you're seeing are likely to be replicated in your organization.
I look forward to seeing how the statisticians do this year. Several new stats wizards have come up with their own version of FiveThirtyEight's poll aggregation, so it promises to be a fierce competition to see whose model reigns supreme.
And seriously, why did the man driving the 90's Dodge Caravan stop putting bumper stickers on his car in 2008? Did he buy the car with the bumper stickers already in place? I may never know, but it gives me something else to think about during this election season. Oh, and I need to call the tree guy!