These are undoubtedly interesting times politically. Many of us followed the lead-up to the Brexit vote; many more continue to follow the ongoing US presidential election process. And here in South Africa, our local government elections are right around the corner.
With all that in mind, one prominent part of these political processes is polling: the process whereby voters are asked about their preferred political parties or candidates. Polls are conducted by a number of organisations, and this information is compiled and often reported in the news media and online. As an example, a recent RealClearPolitics average of polls shows Hillary Clinton with support from 43.8% of voters, against Donald Trump’s 41.1%.
Poll results are always interesting to follow, but many people seem to misunderstand what they actually represent and how credible they are. The point of this article is to give an intuitive understanding of how the process works and what polling results actually mean.
Surveys and samples
Imagine there are 100 people in your class (or, if you prefer, a social club, or place of work). For whatever reason, you want to find out what proportion of people in the group likes pizza. There are some problems, though: you’re never able to get everyone together in one place to ask them – and when you do ask people whether they like pizza, you never seem to remember exactly who said what.
You decide to randomly choose a sample of 10 people to survey at every opportunity you get, and record their responses. In the first sample, 5 people (50%) say they like pizza. In the next sample, 6 people (60%) say they like pizza. In the next three samples, there are 7, 6, and 8 people who like pizza.
Later on, someone who managed to interview all 100 people tells you that exactly 60 people like pizza; 60% is the true proportion that you were trying to estimate.
So we see here that not every sample of people interviewed accurately reflected the overall group – there was variation in the sampling results.
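The pizza survey is easy to imitate on a computer. The snippet below is only a rough sketch: it assumes a group of 100 people of whom exactly 60 like pizza, and it repeatedly draws random samples of 10. The function name and the seed are made up for illustration, and the exact numbers it prints depend on the seed, so they will not match the samples described above.

```python
import random


def draw_sample_proportions(group_size=100, pizza_lovers=60,
                            sample_size=10, num_samples=5, seed=0):
    """Return the share of pizza lovers found in each random sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # True = likes pizza, False = does not (assumed group of 100, 60 fans)
    group = [True] * pizza_lovers + [False] * (group_size - pizza_lovers)
    proportions = []
    for _ in range(num_samples):
        # Pick sample_size different people at random from the group
        sample = rng.sample(group, sample_size)
        proportions.append(sum(sample) / sample_size)
    return proportions


proportions = draw_sample_proportions()
print(proportions)
```

Each number printed is one sample's estimate of the true 60%. Changing the seed gives a different set of estimates, which is exactly the sampling variation described above.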
Cue the statistics
Here’s a good rule of thumb: whenever someone is studying or analysing something that exhibits some form of variation, statistics will almost certainly be involved. By ‘statistics’ I am referring not to the figures quoted in reference to sports players, the economy, or crime – rather, I am referring to the science which allows us to produce those figures.
We need statistics because it’s often impossible to actually collect all of the relevant information: we’re unable to interview everyone in the country about their income; we cannot measure the diameter of all trees in a large forest; we cannot perpetually record how often cars pass through a particular stretch of road. Most importantly, there is no person who is going to come and tell us, “this is the true value,” as in our example.
So there is some true value that we want to know, but we are unable to measure every single unit in a given ‘population’. Statistics is thus used to estimate the number we’re interested in, despite these limitations.
Statistics also allows us to evaluate how precise that estimate of the true value is. In our earlier example, you may have noticed that the estimates from some samples were above the true value; other estimates were below the true value. This leads us to the idea of a confidence interval (CI).
A confidence interval is basically a range of possible values that we believe – with some level of confidence (hence the name) – contains the true value we are estimating. If a CI is narrow, it means our estimate is quite precise; if it is wide, it means our estimate is not very precise.
The graph below illustrates these ideas well. The dark red line shows the estimates of divorce risk based on the age at which people first got married. The shaded pink area around the line shows the confidence intervals for those estimates. As you can see, the confidence intervals around the estimates for the 20-25 age range are very narrow – so those estimates are very precise. Below the age of 15 and above the age of 35, the CIs are wider; the estimates are less precise.
The size of confidence intervals is affected by, amongst other things, the volume of data available. Estimates calculated from larger samples are more precise than those calculated from smaller ones, so larger samples lead to narrower CIs, and vice versa.
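To see the effect of sample size, here is a minimal sketch that uses one common textbook approximation for the confidence interval of a proportion (the normal approximation, with z = 1.96 for roughly 95% confidence). This is an assumption made for illustration; polling organisations may use other methods. The function name and the sample sizes are also hypothetical.

```python
import math


def proportion_confidence_interval(successes, sample_size, z=1.96):
    """Approximate CI for a proportion (normal approximation, assumed here)."""
    estimate = successes / sample_size
    # The standard error shrinks as the sample size grows
    standard_error = math.sqrt(estimate * (1 - estimate) / sample_size)
    margin = z * standard_error
    # A proportion cannot fall below 0 or rise above 1
    return max(0.0, estimate - margin), min(1.0, estimate + margin)


# Same 60% estimate, three different (hypothetical) sample sizes
for successes, sample_size in [(6, 10), (60, 100), (600, 1000)]:
    low, high = proportion_confidence_interval(successes, sample_size)
    print(f"n={sample_size}: {low:.3f} to {high:.3f}")
```

The estimate is 60% in every case, but the interval gets narrower as the sample gets larger.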
The results from political polls – which involve interviewing a sample of voters – are usually reported as an estimate with some margin of error. The margin of error is the distance from the estimate to either end of its confidence interval, so it tells us how wide that interval is.
So when Hillary Clinton is shown as having the support of 44% of voters, with a margin of error of 4 percentage points, this means that she may actually have the support of as few as 40% of voters, or as many as 48%.
The graph below, produced by the Huffington Post, shows the estimates and CIs for the proportions of people polled who support Trump and Clinton.
It’s important to note when confidence intervals overlap; this was the case during August and October last year, and it seems that this may be the case in the near future. When this happens, we cannot reliably say that the difference between the estimates is significant. Without going into the technical details of why this is, overlapping CIs indicate that even if Clinton is estimated to have 44% of the vote, with Trump at 41%, the true underlying situation may be that Trump is actually leading Clinton in terms of voter support. This is why – in my view anyway – the excitement that is sometimes generated about Trump or Clinton beating each other in the polls can be completely unwarranted.
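The overlap check itself is simple arithmetic. The sketch below uses the 44% and 41% estimates from the example and assumes, purely for illustration, that both carry the same margin of error of 4 percentage points. The function names are hypothetical.

```python
def poll_range(estimate, margin_of_error):
    """Return the (low, high) range implied by an estimate and its margin."""
    return estimate - margin_of_error, estimate + margin_of_error


def ranges_overlap(first, second):
    """True if the two (low, high) ranges share at least one value."""
    return first[0] <= second[1] and second[0] <= first[1]


clinton = poll_range(44, 4)  # (40, 48)
trump = poll_range(41, 4)    # (37, 45), assuming the same margin of error

print(clinton, trump)
print(ranges_overlap(clinton, trump))  # True
```

Both ranges include every value from 40% to 45%, so under these assumptions the poll cannot tell us with confidence which candidate is really ahead.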
Hopefully those who read this have gained additional insights into how political polls – and statistics more generally – work. In the ‘Information Age’, where there is often more than meets the eye, it’s certainly helpful to know exactly what a given piece of information means.