BARCELONA, SPAIN - FEBRUARY 19: Xavi Hernandez of FC Barcelona looks on from the bench prior to the La Liga match between FC Barcelona and Valencia CF at Camp Nou stadium on February 19, 2012 in Barcelona, Spain. FC Barcelona won 5-1. (Photo by David Ramos/Getty Images)
Have you ever looked at possession figures and wondered exactly how they're calculated? I have, and they've never really made much sense. How do you deal with the time that elapses while a pass is in flight, for example? Is a pass counted as 'possession' only if it's received at the other end? How about clearances? Do people sit with stopwatches counting the time that they think a team's in control, and, if so, what does that mean?
This is a problem that's been bothering me since I first started seeing possession statistics presented. There's no obvious solution, as far as I can tell - every possible implementation has flaws. But, in general, when people are presented with possession, they take it as meaning the amount of time one team is in control of the ball vs. the other. It's a pretty simple concept which runs into a whole host of problems once you actually probe it in any depth.
That hasn't stopped outlets reporting possession. Either they've solved the problem, and I'm too dumb to comprehend it (my first guess), or they've made some ludicrous over-simplifications and there are some shenanigans going on (guess number two). The fact that possession appears to be defined differently depending on who's reporting it is a major hint that something trippy's going on.
With possession kings Barcelona on our minds this week, I figured that this would be as good a time as any to look in a little more detail at the inner workings of the statistic itself. How does possession work?
It turns out, at least in Opta's case, that possession may not even exist. It's reported, sure, but rumour had it that the formula was simply passes attempted by one team over total passes attempted in a game, which would be more accurately defined as pass volume (Vp).
This, to me, was a silly rumour. I've designed statistics before, and the first thing you do when you plot out a serious stat is run through the internal logic. There are so many design flaws in a 'possession' stat that simply uses Vp that I found the idea essentially impossible to accept straight up.
So I ran some tests to see how closely I could predict Opta's possession numbers from pass volume. I figured that a hypothetical ideal possession statistic would have a very strong correlation with Vp, because you can't pass the ball unless you have it in the first place, but there'd be other important factors as well. In my mind, I was expecting Vp to correlate with possession in the 0.6-0.8 range.
After a morning of running checks with help from WhoScored's excellent statistics database, I ended up with a correlation that was basically 1.0. Vp is Opta's possession statistic. To the right is the result of one of those checks, a graph of Vp vs. reported possession from the Champions League knockout stages. I've added a line for y=x, in case it wasn't clear enough that they're the same statistic to within rounding error.
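For anyone who wants to repeat this kind of check, here is a minimal sketch in Python. The function names and the sample rows are my own placeholders, not the data behind the graph; swap in real pass counts and reported possession figures to run the comparison yourself.

```python
from math import sqrt

def pass_volume_share(team_passes, opponent_passes):
    """Vp as a percentage: one team's attempted passes over all attempted passes."""
    return 100.0 * team_passes / (team_passes + opponent_passes)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical rows: (passes attempted, opponent passes attempted, reported possession %).
# Placeholder values for illustration only, NOT the figures from the graph.
matches = [
    (620, 310, 67),
    (450, 450, 50),
    (380, 520, 42),
    (700, 300, 70),
]

vp = [pass_volume_share(p, o) for p, o, _ in matches]
reported = [rep for _, _, rep in matches]

r = pearson(vp, reported)
# Largest difference between Vp and the reported figure, in percentage points.
max_gap = max(abs(v - rep) for v, rep in zip(vp, reported))
print(f"r = {r:.4f}, largest gap = {max_gap:.2f} points")
```

If the reported number really is Vp, the largest gap should never exceed half a percentage point, which is what rounding to whole numbers allows.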
This is bad. This is really, really, really bad. Why? Because we're seeing something presented completely incorrectly. Sure, Vp has to inherently match up to possession to some extent - they're measuring something close to the same thing. But that doesn't mean that they are the same thing, and it doesn't mean it's acceptable to simply label Vp as possession without any apparent thought.
If the general public has a perception of possession as being a time-based statistic, the data providers have, in my view, an obligation to provide that statistic. The formula has to be some variant of time over time in order to be a valid measurement of possession as most of the world understands it.
Instead, in this framework, the basic unit of possession appears to be the pass attempt*, meaning that unless you try to play a pass, you have not possessed the ball. If true, and the correlations we're looking at sure make it seem like it, that leads to a whole catalogue of problems.
*If this isn't the intent of the construct, it does a superb job of ignoring all other terms. It's entirely possible that Opta are counting the volume of player possession endpoints, which tends to be heavily dominated by passes, and that that domination is such that the other terms are reduced to the scale of rounding errors. The same criticisms apply.
The first and most obvious is the assumption that all teams will pass equally often while on the ball. This is obviously untrue, and it stops any chance of Vp mapping to true possession. Imagine a team that is in ostensible control of the match half the time but passes at half the rate of its opponents. The resulting 'possession' figure would have the first team at 33 percent of possession and the second team at 67, just because the second team plays the ball faster. That's a bizarre error to introduce.
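The arithmetic behind that 33/67 split can be sketched in a few lines of Python. The function name and the rate units here are my own, chosen purely for illustration: give each team a share of time on the ball and a passing rate, then take the pass-count ratio.

```python
def pass_share(time_share_a, pass_rate_a, pass_rate_b):
    """Team A's share of attempted passes, given its share of time on the ball
    and each team's passes per unit of time in possession."""
    passes_a = time_share_a * pass_rate_a
    passes_b = (1.0 - time_share_a) * pass_rate_b
    return passes_a / (passes_a + passes_b)

# Equal time on the ball, but team A passes at half the rate of team B.
share_a = pass_share(0.5, 1.0, 2.0)
print(f"{share_a:.0%} vs {1 - share_a:.0%}")  # prints "33% vs 67%"
```

Nothing about who actually had the ball changed between the inputs and the output; only the tempo did.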
We're also looking at situations in which a missed pass appears to count for more than holding the ball by the corner flag for thirty seconds. A one-two between players counts for two units of possession - the first player simply retaining possession of the ball over that period of time counts for none. The issue here is that we're treating a football match as though it steps through a series of events rather than through a continuum.
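To make the event-versus-continuum point concrete, here is a rough Python sketch that scores the same invented passage of play two ways: by time on the ball and by pass attempts. The spell log and the helper name are hypothetical, and this is not a claim about how any provider stores its data.

```python
# Hypothetical log of spells: (team, seconds on the ball, pass attempts in that spell).
spells = [
    ("A", 30.0, 0),  # shields the ball by the corner flag, no pass attempted
    ("B", 4.0, 2),   # quick one-two
    ("A", 2.0, 1),   # a single misplaced pass
]

def shares(spells, index):
    """Each team's share of the quantity stored at position `index` of a spell."""
    totals = {}
    for spell in spells:
        team = spell[0]
        totals[team] = totals.get(team, 0) + spell[index]
    grand_total = sum(totals.values())
    return {team: value / grand_total for team, value in totals.items()}

by_time = shares(spells, 1)    # time over time
by_passes = shares(spells, 2)  # passes over passes, i.e. Vp

print(f"Team A by time:   {by_time['A']:.0%}")
print(f"Team A by passes: {by_passes['A']:.0%}")
```

Team A has the ball for 32 of the 36 seconds but attempts only one of the three passes, so the two measures disagree badly over the same stretch of play.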
If we treat the sport as though things only happen when we measure them happening, and then deliberately choose not to measure mundane events (such as holding possession), we're going to end up with a very, very confused situation. That's what appears to be happening here.
I'm completely fine with keeping track of passing volume - I've done it before myself. What's frustrating, from an analyst's point of view, is that we're being sold a dud. A statistic that ostensibly measures possession measures something that is not possession, and gets repeated as authoritative anyway.
And people wonder why football statistics don't get taken very seriously.