One of the things I’ve been promising for a while now is a look at the confidence levels in the Pitch f/x data this year. If you don’t already know, MLB added a pitch type to the data, and a confidence in that type. So, for example, you’ll see it say FA for Fastball, and 0.86 for the confidence.
I thought this was a pretty neat thing for them to have added, so I’ve been crunching some numbers on it. Others have looked at the pitch types and not been greatly impressed with the results, while I’ve been working on the confidence values themselves.
While doing this work (and the reason it has been so delayed), I discovered, to my chagrin, that I shouldn’t concentrate so much on the Rangers pitchers (and their opponents), but rather I should be looking more league-wide when I do things like this. Sometimes it doesn’t matter, when, for example, you’re directly comparing two pitchers in a game. In that case, even if the values are off for some reason, they’re likely to be the same for both pitchers, in which case comparisons can still be made. But if you start to stretch things out, by, say, looking at all of Kevin Millwood’s starts this year, you suddenly discover how your analysis doesn’t scale.
You see, for the first part of this I had manually downloaded the xml files for the Rangers pitchers and their opponents (in the opening Seattle series), plus a scattered handful of others, principally Millwood’s starts. Once I expanded my sample, I discovered one critical change: the scale of the confidence level. In the initial files I looked at, the confidence levels ranged from about .4 up to about .98 or so, which I took to mean that they were basing their confidence in the call of a pitch type on a scale of 0 to 1, with 1 being completely confident they had it right, and lower numbers indicating how unsure they were.
When I looked at Millwood’s later starts, I suddenly discovered that he was getting confidence levels over 1, ranging up to almost 1.5. At first I thought it might be a mathematical problem on my end, and worked on that a little, but then I realized that no, the values had been multiplied by almost exactly 1.5 compared to his initial start. That was suspicious, so this past weekend, for the first time this season, I ran my automated process for downloading the files, and my program to install them into SQL Server. I then started crunching numbers.
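The check I ran amounts to dividing the later confidence values by the earlier ones, pitch by pitch, and seeing whether the ratio is constant. A minimal sketch in Python, with invented confidence values standing in for the real Gameday numbers:

```python
# Confidence values for the same pitches from two downloads of one game.
# These numbers are illustrative only, not actual Gameday data.
early = [0.62, 0.86, 0.94, 0.71]   # downloaded on the day of the game
later = [0.93, 1.29, 1.41, 1.065]  # downloaded again two weeks later

# If every per-pitch ratio is (close to) 1.5, the values were simply rescaled.
ratios = [l / e for e, l in zip(early, later)]
rescaled = all(abs(r - 1.5) < 0.01 for r in ratios)
print(rescaled)  # True for these sample values
```

A constant ratio like this points to a deliberate rescaling rather than a recalculation, which is exactly the distinction at issue here.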
Here’s a picture of a waterfall for you to enjoy:
Okay, so really it is the confidence value from all games through 4/15. Sorting it by date and then by the home team kind of separates each game into a distinct (in some places) vertical line. If you can read the numbers at the bottom, that’s simply the sequential pitch number from my spreadsheet. It doesn’t mean anything other than serving as a rough representation of time.
What we see is that, for the first half of the numbers (through about pitch 30,000, which happens to be around April 8), the vast majority of games had a maximum confidence level of about 1. This maps perfectly with what I discovered in the first few files from the Rangers starts in Seattle. Oddly, though, there are about twenty or so games in that early part of the season where they spike up to 1.5. This also coincides with what I saw happening. There is a single odd game where it spikes to 2 (which happens to be Cleveland at Anaheim on 4-8), and then right around that 4-8 date all of a sudden every game goes to a 1.5 maximum.
So what happened? Take a look at this chart, which although colorful is not nearly as pretty as the other one:
As clear as mud? This is a table showing the maximum confidence level for each game, based on the home team (horizontally) and date (vertically). The red shows days where the confidence was below 1, the yellow is around 1.5, and the sole green one is the 2. What does all this mean?
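The chart above is just a max-per-cell aggregation over home team and date. Here is a small sketch of that computation, with invented pitch records (the teams, dates, and confidence values are illustrative only) and the same red/yellow/green buckets:

```python
from collections import defaultdict

# Toy pitch records: (date, home_team, confidence). Values are made up.
pitches = [
    ("2008-04-05", "TEX", 0.96),
    ("2008-04-05", "TEX", 0.71),
    ("2008-04-08", "ANA", 1.97),
    ("2008-04-08", "ANA", 1.20),
    ("2008-04-10", "SEA", 1.48),
]

# Maximum confidence per (date, home team) cell, as in the chart.
cell_max = defaultdict(float)
for date, home, conf in pitches:
    cell_max[(date, home)] = max(cell_max[(date, home)], conf)

def bucket(m):
    """Red: old 1.0 scale; yellow: new 1.5 scale; green: the lone 2.0 game."""
    if m <= 1.05:
        return "red"
    if m <= 1.6:
        return "yellow"
    return "green"

for key, m in sorted(cell_max.items()):
    print(key, m, bucket(m))
```

The cutoffs in `bucket` are my own guesses at where to draw the lines; the point is just that each game collapses to a single maximum, which is what makes the pattern visible.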
I believe that MLB changed their algorithm, and released it publicly on 4/8 or 4/9. Everything after that date is using the 1.5 maximum, everything prior is using the 1.0. No it’s not, I hear you say, prior to 4/9 it is mixed! This is true, but why? I believe that the yellow days on the left side are days when they were testing their new algorithm – and not necessarily live, but with old data. It is interesting to me that the three days in the middle where the data is all red are a weekend (Saturday-Monday, to be precise), which is exactly when a programmer may not be working on the changes. I think perhaps some of these games were processed live with the 1.5 maximum, but some of them they went back and ran on the data later. Why do I believe that? Because I had manually downloaded some of those games, and can compare what I got on the day of the game to what I downloaded this past weekend, two weeks after the games were played. There would be no reason for them to go back and change those games – except if they were testing a new algorithm. Although why it would then go into a live directory instead of a test directory, I don’t know.
So, what changes were made? Let’s take a look at the ones I manually downloaded and see what happens:
On 3-31 I got the xml files for Millwood and Bedard, who opened the season in Seattle. My manual file has a maximum around 1, but if you look really closely in the chart above you will see the file I downloaded later has a 1.5 maximum (it’s the only yellow one in the first column).
What changed for the two pitchers? With one exception, for both pitchers the pitch type stayed exactly the same, and the pitch confidence multiplied exactly by 1.5. That sole exception was the one FS pitch (a splitter in their nomenclature) thrown by Millwood, where the pitch stayed as an FS and the confidence stayed the same.
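The comparison itself is straightforward. Below is a sketch using Python’s standard XML parser on tiny inline documents in place of the real Gameday files; the `pitch_type` and `type_confidence` attribute names reflect my reading of the Gameday schema, and the values are invented:

```python
import xml.etree.ElementTree as ET

# Two downloads of the "same" game, as tiny inline XML documents.
day_of = ET.fromstring("""
<inning>
  <pitch id="1" pitch_type="FA" type_confidence=".60"/>
  <pitch id="2" pitch_type="FS" type_confidence=".55"/>
</inning>""")
two_weeks_later = ET.fromstring("""
<inning>
  <pitch id="1" pitch_type="FA" type_confidence=".90"/>
  <pitch id="2" pitch_type="FS" type_confidence=".55"/>
</inning>""")

# For each pitch: did the type stay the same, and by what factor
# did the confidence change?
results = []
for old, new in zip(day_of.iter("pitch"), two_weeks_later.iter("pitch")):
    same_type = old.get("pitch_type") == new.get("pitch_type")
    ratio = float(new.get("type_confidence")) / float(old.get("type_confidence"))
    results.append((old.get("id"), same_type, round(ratio, 2)))
print(results)
```

For these sample values, the fastball shows the 1.5 multiplier and the splitter is untouched, mirroring the FS exception described above.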
Okay, so this is evidence that their algorithm increased the confidence of some or most pitches by a factor of 1.5. Not necessarily a big deal, right?
On 4-1 I got Felix Hernandez and Vicente Padilla’s files manually, and later automatically. They were identical.
On 4-2 I got Jason Jennings’ and Brian Burres’ files, and they showed radical changes. Many of the pitches simply showed the 1.5 increase. But – and this is the most important finding, I think – a number of the pitches (18, to be precise) changed their confidence by different values, and also changed the pitch type. CH, CU and FC pitches (changeups, curves and cut fastballs) changed to fastballs, sliders and splitters, and in no direct relationship (e.g. the changeups didn’t all change to fastballs).
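Counting that kind of reclassification is a simple diff of the two downloads. A sketch, with made-up pitch IDs and type codes:

```python
# Pitch classifications from two downloads of the same game.
# IDs and types are invented for illustration.
day_of = {"1": "CH", "2": "CU", "3": "FC", "4": "FA"}
later  = {"1": "FA", "2": "SL", "3": "FS", "4": "FA"}

# Pitches whose classification itself changed between downloads.
changed = [pid for pid in day_of if day_of[pid] != later[pid]]
print(len(changed), changed)  # 3 pitches reclassified
```

In the real Jennings/Burres files this count came out to 18, with no one-to-one mapping between old and new types.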
I would keep going, but I think you get the point: in the first few weeks of the season, MLB changed not only the confidence levels, but also the pitch types they were showing. This was probably in part influenced by articles about the lack of quality in the pitch types, such as the one I linked above, which suggested they were only getting about 70% right. I’m sure it was also influenced by their own continual desire to improve the data and the results they are giving. For all I know, they may have been intending to do this all along, as they accumulated more and more data.
I have not attempted to find which games Mike Fast analyzed in his Hardball Times article, but it is reasonable to assume that some of them might have been affected by this problem. With 18 out of 61 games in the first five days, that’s about 30%, and if Mike had downloaded the night of the game, he might not have seen these changes (it took me two weeks to discover them). Thus, it is possible that at least some of the pitches may have changed with the new algorithm. It would be interesting to see how many were affected. Given that the first Millwood game had a single pitch change, and the Jennings game had 18 pitches changed, there probably would not be a big change in Mike’s results, perhaps only a percentage point or two.
This has all been interesting for me because it underscores the real-time and transient nature of Pitch f/x. It is exciting to be looking at this stuff; it is like being on the frontier of a new science (or at least a new way of measuring things). As time goes by it will only get better and better, and we can answer so many of the questions we have now. Recognizing these changes and riding along with them will be very interesting. Not as interesting as watching the Rangers win a 14 inning thriller after leaving 20 men on base, but interesting nonetheless. I hope to introduce more study of the pitch type and confidence in the near future.