MY CONFUSEMENT IS THIS BIG How big is that, Pep? I DON'T KNOW
Earlier today (or yesterday, if you're not on the West Coast of the USA), I wrote an article about Opta's possession stat. I may have been slightly unkind to it, but I figure that's probably justified considering it doesn't measure what it purports to measure - Opta took the shortcut of defining possession as pass volume, and the two are related but different concepts. This is, at best, thoroughly misleading.
However, I have to admit that Opta do have excellent reasons to attempt to find some sort of shortcut in their possession calculations, even if I don't think very highly of the shortcut that they eventually decided on. As it stands, possession's definition is fairly murky, and if they can get a rough approximation of the numbers without doing any work... well, I'm lazy too. I get it.
Since telling people they're doing things wrong isn't particularly productive - about all I can accomplish with the pass volume argument is to warn people off using possession entirely - I figured I'd find a better shortcut. Here goes.
For my money, the most important thing to think about when you're designing a statistic is the sport itself. If you start with footballing first principles, you can work to your statistic, but you also see the different directions you can take and the problems you need to overcome.
You'll note that I'm leaving aside the question of whether possession means anything in terms of winning or losing games. Evidence suggests that it probably doesn't*, but that's beside the point. Descriptive statistics are needed before predictive ones can be properly implemented, and we don't have the full slate (or really, anything close) of the descriptive stats we'll need to really push forward on the predictive stuff. So, right now, simply trying to record what happens in a football match is a worthy goal.
*Although we're obviously not measuring possession properly as it stands, so things could change.
Anyway, assuming perfect data, what's the best way to split up a football match? I think we should use a four-state system that looks a lot like this:
- Team one in possession.
- Team two in possession.
- Possession contested.
- Ball not in play.
Those four states cover the entirety of a football match pretty easily. At this point, however, we have to make our first major design decision - what to do about dead ball situations. I'll revisit this later, but for the main body of this post I've decided to ignore the ball when it's out of play on corners, throw ins and the like.
Anyway, it's pretty obvious that a football match can be described (albeit not particularly usefully) as a shuffle between the above states, and it's also obvious that the transition is more or less instantaneous. Therefore the time spent in any given state is measurable, and you can patch together the time t of the individual state instances to get a total time spent in that state. That might sound trivial, but if there was ambiguity in the transition between one state and another, this would all become impossibly complex.
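To make the bookkeeping concrete, here's a minimal Python sketch of that patching-together step. It assumes you already have a time-ordered list of state transitions; the function name, the state names and the example timestamps are all made up for illustration and don't come from any real data feed.

```python
from collections import defaultdict

def time_per_state(transitions, match_end):
    """Sum the time spent in each state.

    transitions: list of (timestamp_seconds, state) pairs, sorted by time.
        Each pair marks the instant the match enters that state.
    match_end: timestamp at which the last state instance ends.
    """
    totals = defaultdict(float)
    # Pair each transition with the next one; the final instance ends at match_end.
    following = transitions[1:] + [(match_end, None)]
    for (start, state), (end, _) in zip(transitions, following):
        totals[state] += end - start
    return dict(totals)

# Invented 100-second stretch of play, purely for illustration.
transitions = [
    (0, "dead_ball"),
    (5, "team_one"),
    (35, "contested"),
    (40, "team_two"),
    (70, "dead_ball"),
    (80, "team_two"),
]
totals = time_per_state(transitions, match_end=100)
```

Because each transition is treated as instantaneous, every second lands in exactly one state and the four totals add up to the full 100 seconds.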
Let's assign each state a label s1-s4, to make things a little simpler.
From here, it's relatively easy to define possession P as follows:

$$P_i = \frac{\sum t(s_i)}{\sum t(s_1) + \sum t(s_2)} \qquad (1.1)$$
If equations scare you (and I admit that I find the equation editor in Word to be entirely too much fun - a personal failing), the wordy explanation is that you can define a team's possession Pi as the ratio between the sum of the time the match spent in that team's possession state si over the sum of time it spent in states one and two.
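As a quick sanity check, here's that ratio worked through with invented numbers. The shorthand $t_1$ and $t_2$ for the total time spent in s1 and s2 is mine, and the 28 and 22 minutes are made up rather than taken from a real match.

```latex
P_1 = \frac{t_1}{t_1 + t_2} = \frac{28}{28 + 22} = 0.56,
\qquad
P_2 = \frac{t_2}{t_1 + t_2} = \frac{22}{28 + 22} = 0.44
```

The two shares always sum to one, and any time spent in s3 or s4 never enters the calculation.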
That's the easy part taken care of - the more challenging aspect of this problem is adequately defining the boundaries of your state and then measuring it. The 'chess clock' approach fails at least one of these tests, and possibly fails at both. It's hard to say.
The 'standard' possession measure - before Opta took over the world, that is - was to have a stringer time both teams' possessions with what was essentially a stopwatch. I'm calling this 'chess clock', because a) typing 'chess' makes me feel smart and b) it's what Opta call it:
Some use calculations based on the data, but most use a "chess clock" approach where each team has a button which is hit when they are in possession. Some do this in the broadcast truck, others have analysts who call it out and inputters who hit the buttons.
Opta used this method originally, but the problem we found with a chess clock approach for time is that you are reliant on the person logging the data remembering to hit the button and the person doing it usually has other tasks to perform and other data to log.
That's from the article where they admit they're calculating possession wrong, by the way.
At any rate, the limitations of the chess clock method should be pretty obvious. The primary issue, as Opta point out, is the fidelity of the data. It's not at all clear that a method that relies on a human to log everything via stopwatch for ninety minutes is tenable, and Opta's stringers have other things to worry about, like getting pass location and timings right, which they're really rather good at.
A secondary set of problems is that the chess clock method doesn't define state boundaries, and doesn't actually require s3 or s4 to exist at all. We can solve these issues by laying down some tough standards, but the data quality one will be more problematic. This is why Opta decided to use the pass volume shortcut. But there's a better way, I think.
Here's what I propose we do to fix things.
- Let s1 include all instances when a player from team one is in direct control of the ball - i.e. dribbling, standing with possession, ball held by goalkeeper, etc. Let it also include all completed passes (no deflections, clearances, loose balls) between teammates on team one.
- Let s2 be defined similarly.
- Let s3 be defined as all instances of the ball being in play that are not included in s1 or s2.
- Let s4 be defined as all instances in which play has ceased*.
*And not just because Aston Villa are playing heyoooooo.
We now have a sensible, firm boundary for states one and two which I believe matches the expectation of what a possession statistic should look like. It measures all time on the ball, time in which the ball is under a team's control via direct passing, and nothing else. Since we've defined Pi as the ratio of one of the first two states to both of the first two states, we can happily ignore s3 and s4. They exist, but are of zero use to us, and don't need to be presented.
Better yet, the definition I've chosen for the first two states meshes perfectly with Opta's current system. They track (and timestamp) whenever a player receives the ball, and they also timestamp whenever they give it away, by shot, pass, tackle, or Florent Malouda-esque running it over the touchline. That means it'd be fairly straightforward to run through the database and get the inputs for (1.1) above - I'm pretty sure I could set up and test a query to do that in under a week, and at least two days of that would be spent remembering how to properly query databases.
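For the curious, here's a rough Python sketch of what that pass through the data might look like. I don't know Opta's actual schema, so the `Event` record, its three event kinds and the sample events are all hypothetical stand-ins. The logic follows the s1/s2 rules above: count the time a player holds the ball, plus the flight time of any pass that goes straight to a teammate.

```python
from dataclasses import dataclass

@dataclass
class Event:
    t: float    # seconds from kick-off
    team: int   # 1 or 2
    kind: str   # "receive", "pass", or "lose" (shot, tackle, ball out, ...)

def possession_times(events):
    """Return seconds of possession per team from time-ordered events."""
    totals = {1: 0.0, 2: 0.0}
    for prev, cur in zip(events, events[1:]):
        same_team = cur.team == prev.team
        # Player on the ball: from receiving it until passing or losing it.
        held = prev.kind == "receive" and same_team and cur.kind in ("pass", "lose")
        # Completed pass: released and next received by a teammate.
        completed = prev.kind == "pass" and same_team and cur.kind == "receive"
        if held or completed:
            totals[prev.team] += cur.t - prev.t
    return totals

def possession_share(totals):
    """Each team's share of the combined time in the two possession states."""
    both = totals[1] + totals[2]
    if both == 0:
        raise ValueError("no possession time recorded")
    return {team: secs / both for team, secs in totals.items()}

# Invented sequence: team 1 strings two touches together, then is intercepted.
events = [
    Event(0, 1, "receive"), Event(4, 1, "pass"),
    Event(6, 1, "receive"), Event(10, 1, "pass"),
    Event(12, 2, "receive"), Event(20, 2, "lose"),
]
totals = possession_times(events)
share = possession_share(totals)
```

Here team 1 is credited with ten seconds (two spells on the ball plus one completed pass) and team 2 with eight. The two seconds the intercepted pass spends in flight go to neither side, which is exactly the s3 time the definition says to ignore.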
Following the above steps would result in an accurate measure of possession. Standardise these rules and suddenly everyone is a lot less confused. If Opta - or anyone else, for that matter - is serious about providing top-notch football statistics, they should consider using these possession rules (or something fairly similar). Otherwise, that 'possession' statistic they cite on air will continue to be approximately worthless.
Statistics providers have a real chance to add value here. They should take it.
Design Decisions: Including s4
Many of you anticipated the in play/out of play possession divide I've used here, and there's a fascinating discussion going on in the comments thread of the original possession post. Obviously I've come down on the 'ignore dead balls' side of the discussion, so I wanted to explore the reasons for that without interfering with the main body of the post.
The easiest way to include s4 in possession would be to add it to the denominator in (1.1) and then assign the numbers in a similar way to that described for the standard form I've detailed above. Alternatively, you could split it up via the number of set pieces and throws awarded to each team, but I don't really know why you'd do that (or why I'm writing this sentence, thinking about it).
I am not a big fan of using dead ball time to pad out possession for a few reasons, and the biggest one is philosophical. The point of possession, so far as I can tell, is to measure how often your team has the ball during the match, because you can do damage when you have said ball. Sure, denying the other team possession is great too, but that's beside the point - the s4 term does that anyway.
You can rewrite the denominator of (1.1) as a function of s3 and s4, which means that increasing s4 will increase the impact of individual timing sequences on overall P. And since the team with the ball on a set piece can easily recycle possession, they're net winners on possession if they're the ones wasting time.
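Spelled out, the rewrite goes something like this. The symbols are my own shorthand: $t_i$ is the total time spent in state $s_i$ and $T$ is the full length of the match.

```latex
T = t_1 + t_2 + t_3 + t_4
\quad\Longrightarrow\quad
t_1 + t_2 = T - t_3 - t_4
\quad\Longrightarrow\quad
P_i = \frac{t_i}{T - t_3 - t_4}
```

With $T$ fixed, every extra second of dead ball time shrinks the denominator, so each remaining second on the ball moves $P_i$ by a little more.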
At any rate, a compromise is to give the percentages as shown in the main body of the post and report total times for both s1 and s2 on top of that. See? Easy.
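A small Python sketch of what that combined report might look like, assuming you already have the total seconds in s1 and s2. The function name, the output format and the numbers are all invented for illustration.

```python
def possession_report(t1, t2):
    """Format each team's possession share alongside its raw time on the ball."""
    both = t1 + t2
    lines = []
    for name, secs in (("Team one", t1), ("Team two", t2)):
        minutes, seconds = divmod(int(secs), 60)
        lines.append(f"{name}: {secs / both:.1%} ({minutes}:{seconds:02d})")
    return lines

# Made-up totals: 28 minutes for team one, 22 for team two.
report = possession_report(1680, 1320)
```

Showing the raw times next to the percentages lets a reader see how much of the match was actually spent in either possession state.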