Ivan Nova has a secret. His fastball acts like a sinker – lots of groundballs, big platoon split – but the pitch does not look like a sinker. By that I mean his fastball is rather bizarre in terms of pitch f/x data; it doesn’t really move like any kind of traditional fastball at all. I began by sifting through his data to try to find out what makes his fastball disguise itself as a sinker, and I came out with an exploration of what makes any fastball result in a groundball.
I began my expedition by creating a sample from pitch f/x data. If you are not familiar, pitch f/x data comes from a system that is owned by Sportvision and is operational in all major league stadiums and some minor league ones. It uses three cameras that take images of the pitched ball to determine its position in space, and then uses physics equations to extrapolate movement, break, velocity, and final location. It is the same system used in ESPN’s “K-zone” and Gameday. I randomly selected 5000 fastballs of any type (FF, FA, FT) from right-handers, excluding side-armers. I could have focused on each type of fastball individually, but that can result in bias due to pitch f/x classification issues. I also selected only pitches that were put into play, including home runs, so every pitch in the sample was both swung at and put into play. This is necessary because we are interested in what quantifiable pitch attributes induce groundballs, not what pitches batters like to swing at. Below I have broken down the data into the most important variables.
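For anyone who wants to reproduce the sampling step, here is a minimal sketch of the filter described above. The record layout (`pitch_type`, `p_throws`, `sidearm`, `in_play`) and the fake data generator are assumptions made up for illustration; they are not the schema of the database I actually used.

```python
import random

# Fastball codes named in the text.
FASTBALL_TYPES = {"FF", "FA", "FT"}

def select_sample(pitches, n, seed=0):
    """Randomly pick up to n fastballs in play from non-sidearm right-handers."""
    eligible = [
        p for p in pitches
        if p["pitch_type"] in FASTBALL_TYPES
        and p["p_throws"] == "R"
        and not p["sidearm"]
        and p["in_play"]          # batted balls only, home runs included
    ]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))

def make_fake_pitches(count, seed=1):
    """Hypothetical stand-in for a real pitch database (illustration only)."""
    rng = random.Random(seed)
    types = ["FF", "FA", "FT", "SL", "CH", "CU"]
    return [
        {
            "pitch_type": rng.choice(types),
            "p_throws": rng.choice("RL"),
            "sidearm": rng.random() < 0.05,
            "in_play": rng.random() < 0.2,
        }
        for _ in range(count)
    ]

pitches = make_fake_pitches(20000)
sample = select_sample(pitches, 500)
print(len(sample))
```

Filtering before sampling matters here: sampling first and filtering afterwards would leave a sample of unpredictable size.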
Does Pitch Location Matter?
For those of you who like heat mappery:
This graph shows groundball rate by pitch location, where red represents a low groundball rate and blue a high groundball rate. The plot is from the catcher’s perspective, so the right side of the graph is near left-handed batters and the left side is near right-handed batters. The dotted box indicates the strikezone. Predicted groundball rates come from a generalized additive model (with only location as the variable) with cross-validated smoothing parameters and a specification for binomial data. Both smoothed terms (horizontal and vertical location) were significant at a 99.9% level, with vertical location having much higher significance.
*I used Millsy’s awesome website as a reference for the code. I also got the idea to use the mgcv package in R from him. Check out his site if you use R or want to learn.
Turns out that pitches low in the zone become grounders more often than other pitches. Duh. The graph also appears almost symmetric (horizontally), suggesting that horizontal location plays a role here too. The problem with this graph is that it’s really the product of two separate distributions – left-handed batters and right-handed batters. The following graphs demonstrate this point:
The graph shows groundball rate by horizontal location for both right-handed batters (blue) and left-handed batters (red). Gray bands indicate confidence. The dotted lines represent the horizontal borders of the strikezone.
This graph confirms my earlier suspicion that the symmetric appearance of the heat-map was due to it being the product of two different distributions. As you can see, righties hit the most grounders on pitches middle to outside and on pitches way inside. On pitches middle-in, they loft the ball in the air. Lefties hit lots of groundballs on pitches away from them, but not on pitches in. One might expect the groundball rates of lefties and righties to be mirror images (flipped across the y-axis), but that’s not the case here. This can be explained by the fact that I only used right-handed pitchers in my sample.
Graph shows groundball rate by vertical location for both right-handed batters (blue) and left-handed batters (red). Gray bands indicate confidence. The dotted lines represent the vertical borders of the strikezone. Axes are flipped for aesthetic purposes. This graph and the above graph use loess to create the trend lines, mainly for ease of plotting. n=2386 for left-handed batters and n=2614 for right-handed batters.
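If you don’t want to fit loess, a cruder way to get the same kind of picture is to bin pitches by location and compute a groundball rate per bin with a rough confidence band. To be clear, this is a swapped-in technique and not what produced the graphs above; the function name and the ten data points below are made up for illustration.

```python
import math

def binned_gb_rate(values, is_groundball, lo, hi, n_bins):
    """Groundball rate per equal-width bin, with a rough 95% half-width."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    grounders = [0] * n_bins
    for v, gb in zip(values, is_groundball):
        if lo <= v < hi:
            i = min(int((v - lo) / width), n_bins - 1)  # guard against rounding
            counts[i] += 1
            grounders[i] += 1 if gb else 0
    rows = []
    for i in range(n_bins):
        if counts[i] == 0:
            continue  # skip empty bins rather than divide by zero
        rate = grounders[i] / counts[i]
        # Normal-approximation standard error of a proportion.
        se = math.sqrt(rate * (1 - rate) / counts[i])
        rows.append((lo + (i + 0.5) * width, counts[i], rate, 1.96 * se))
    return rows

# Made-up vertical locations (feet) and outcomes (1 = groundball).
pz = [1.6, 1.8, 1.9, 2.2, 2.4, 2.6, 2.9, 3.1, 3.3, 3.4]
gb = [1, 1, 1, 1, 0, 1, 0, 0, 0, 1]

for center, n, rate, half_width in binned_gb_rate(pz, gb, 1.5, 3.5, 2):
    print(f"center={center:.1f} ft  n={n}  GB rate={rate:.2f} +/- {half_width:.2f}")
```

The half-width shrinks as the count in a bin grows, which is the same reason the gray bands in the graphs are narrow where most pitches are and wide at the extremes.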
Unsurprisingly, pitches that were down were pounded into the ground. Again of interest is the different behavior of righties and lefties. Because there are only right-handed pitchers in the sample it’s to be expected that righties are always hitting more grounders with this data. Strangely, there is a large difference between the types of batters on pitches that are in the bottom half of the zone. This can probably be explained by difference in pitch selection in this area of the strikezone to the two different types of batters.
So yeah, location matters. Vertical location turned out to be more important than horizontal location, which is not surprising. However, this does not account for a bias in pitch selection. The distribution of pitch types is not constant throughout the strikezone. This is important because a higher proportion of sinking fastballs are located in the bottom of the zone, while four-seam fastballs are located more up in the zone. I have not accounted for this bias in pitch distribution, so we need to exercise caution in concluding how important vertical location is.
Does Velocity Matter?
Graph shows groundball rate by velocity for both right-handed batters (blue) and left-handed batters (red). Gray bands indicate confidence.
Most major-league fastballs are located within the 88-95 mph range, which is confirmed by the size of the error bands (smaller when sample is larger). In this range groundball rate seems pretty constant, perhaps with a slight positive trend. The effect of velocity is also likely underestimated, because a higher proportion of slow fastballs (< 90) are going to be of the sinking variety than their faster counterparts. But that’s boring. Of interest here are the extremes. For lefties, the groundball rate on slow pitches is very high, probably because all of these pitches are misclassified cutters or changeups. When we look at the other end of the range, we see some odd behavior; elite velocity results in worm-burners for righties, but flyballs for lefties. A possible explanation is that both types of batters are late, just that being late manifests itself in different results. This is part of what makes truly elite velocity (high-nineties) coveted. It is also possible that there is a difference in pitch distribution on high-nineties fastballs to righties and lefties, but that does not seem like much of an issue; once you throw that hard, you probably don’t need to have a different approach against lefties and righties (with the fastball). We should also not be too hasty with these conclusions about pitches with extreme velocity, because the sample gets kind of small at these points.
Does Movement Matter?
Graph shows groundball rate by horizontal movement (pfx_x) for both right-handed batters (blue) and left-handed batters (red). Gray bands indicate confidence. Movement is defined as relative to a ball thrown without spin. This is similar to what scouts refer to as “tail.” Since all these pitches are thrown by right-handers, nearly all of these fastballs have negative horizontal displacement. Most of the pitches with positive horizontal displacement are probably misclassified cutters. Thankfully there aren’t too many of those.
For righties, groundball rate spikes at about -7.5 inches. This is where the line between four-seam and two-seam is blurred; naturally, two-seamers have more negative horizontal tail than four-seamers. So what we are seeing here is likely an example of “correlation, not causation.” Somewhat of a strange phenomenon is that pitches that are [likely] misclassified cutters (positive on x-axis) cause many more groundballs for lefties than righties.
Graph shows groundball rate by vertical movement (pfx_z) for both right-handed batters (blue) and left-handed batters (red). Gray bands indicate confidence. Movement is defined as relative to a ball thrown without spin. This is similar to what scouts refer to as “sink.” You may be surprised that most of these pitches have positive vertical movement. This is because even sinkers are still thrown with backspin. The average sinker has about 5 inches of vertical “rise” and the average four-seam has about 10 inches of “rise”. Again, rise here is a relative term; it doesn’t actually mean the force generated by the backspin of the pitch is stronger than gravity.
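To put a rough number on why “rise” is only relative, here is a back-of-the-envelope calculation of how far a spinless ball would fall on its way to the plate. The 55-foot flight distance, the constant velocity, and the lack of air drag are all simplifying assumptions of mine, and they are not how pitch f/x itself computes pfx_z.

```python
G_FT_PER_S2 = 32.174          # gravitational acceleration
MPH_TO_FPS = 5280 / 3600      # miles per hour -> feet per second

def gravity_drop_inches(velocity_mph, distance_ft=55.0):
    """Drop of a spinless ball over distance_ft, ignoring drag (assumption)."""
    flight_time = distance_ft / (velocity_mph * MPH_TO_FPS)
    return 0.5 * G_FT_PER_S2 * flight_time ** 2 * 12

def net_drop_inches(velocity_mph, pfx_z_inches, distance_ft=55.0):
    """Gravity drop minus the spin-induced 'rise' reported as pfx_z."""
    return gravity_drop_inches(velocity_mph, distance_ft) - pfx_z_inches

for label, rise in [("four-seam", 10.0), ("sinker", 5.0)]:
    print(label, round(net_drop_inches(92.0, rise), 1), "inches of net drop")
```

Under these assumptions gravity alone pulls a 92 mph pitch down by roughly 32 inches, so a four-seamer with 10 inches of “rise” still falls a couple of feet. It just falls less than a sinker does.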
Unsurprisingly, pitches with more sink get more groundballs. Of interest is that the behavior is very similar for both righties and lefties, which we haven’t seen so far. Again, we have to be cautious with how much importance we place on vertical movement, due to the same problem we had with vertical location; the pitches with more sink are thrown lower in the zone. The pitches with negative vertical movement probably get few groundballs because they are misclassifications.
Bringing it all together:
Using the same method described in the production of the heatmap (a GAM with smoothing and logit link) I created a model with all of these variables put together: horizontal movement and location, vertical movement and location, and velocity. I did not include anything about batter handedness. Horizontal movement had the highest p-value, being significant at a 95% level. All other smoothed terms were significant at a 99.9% level. Velocity and horizontal location had about the same significance. Far more significant than anything else were vertical location (pz) and vertical movement (pfx_z), which is unsurprising. Because of the nature of the model, there is no concrete mathematical function to share like what we would get with linear regression. Additionally, a model with only pitch location as the variable(s) did a little better than a model with only movement as the variable(s). It seems we can conclude that vertical movement and vertical location are the most important factors.
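For readers who have not seen a logit link before, here is a stripped-down sketch of the idea: a plain logistic regression on simulated pitches, fit with gradient descent. This is not the GAM described above. It uses straight-line terms instead of smooths, only two of the five variables, and the “true” coefficients in `TRUE_W` are numbers I invented so the simulation has something to recover.

```python
import math
import random

def sigmoid(z):
    """Inverse of the logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the binomial log-likelihood."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)                      # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def center(pz_ft, pfx_z_in):
    """Rescale height (feet) and vertical movement (inches) to similar ranges."""
    return [pz_ft - 2.5, (pfx_z_in - 8.0) / 4.0]

# Simulated pitches; TRUE_W is invented for illustration, not estimated from data.
rng = random.Random(42)
TRUE_W = [-0.2, -1.5, -0.8]
X, y = [], []
for _ in range(600):
    x = center(rng.uniform(1.5, 3.5), rng.uniform(0.0, 12.0))
    X.append(x)
    y.append(1 if rng.random() < predict(TRUE_W, x) else 0)

w = fit_logistic(X, y)
low_sinker = predict(w, center(1.8, 4.0))        # low pitch, little "rise"
high_four_seam = predict(w, center(3.2, 11.0))   # high pitch, lots of "rise"
print(round(low_sinker, 2), round(high_four_seam, 2))
```

Predicting a pitcher’s groundball rate then amounts to averaging the predicted probability over each of his fastballs in play, which is the spirit of the pitcher-level predictions in this post.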
Feel free to skip this part and go to the “finishing thoughts” section:
Now time to test out the model with some specific pitchers. I will exclude cutters. My database is updated through May 27th and I will use only 2011 data (except for one case):
Predicted GB% of 64% for only his sinkers (pretty much only fastball he throws)
Actual of 61.9% on only sinkers
Predicted GB% of 55.5% for all fastballs (sinkers and four-seam)
Actual of 53.1% for all fastballs
Predicted GB% of 57.7% for all fastballs (sinkers and four-seam)
Actual of 63.8% for all fastballs
I don’t have the computing power to run this model on a million pitchers, but it did a really good job with these extreme sinker-baller guys. I’m super pleased with these results so far, though obviously I chose the examples.
Known flyball pitchers:
Predicted GB% of 32.1% for all fastballs
Actual of 27.5% for all fastballs
Predicted GB% of 31.4% for all fastballs
Actual of 27.9% for all fastballs
Phil Hughes (2010):
Predicted GB% of 32.9% for all fastballs
Actual of 30.0% for all fastballs
Ok, the model also handles extreme flyball pitchers well.
Predicted GB%: 40.9%
Predicted GB% of 38.9% for all fastballs
Actual of 32.9% for all fastballs
Predicted GB% of 34.8% for all fastballs
Actual of 33.7% for all fastballs
The model did not do as well for these three pitchers that I selected.
Predicted GB% of 50.8% for all fastballs
Predicted GB% of 43.3% for all fastballs
Through admittedly unscientific methods, the model seems to perform pretty well overall. Again, these predictions were made using only velocity, movement, location, and the knowledge that the batter had put the ball into play. This suggests that a great deal of batted ball results can be explained by the actual pitch. What we don’t know is the size of the effect of sequencing (previous pitches) and deception. I initially attempted to include data about the previous pitch thrown (as suggested by our own Will Moller) but that quickly became more complicated than originally anticipated. We have looked into what makes a groundball pretty thoroughly, but it seems that this exploration also carries with it a message about both the power and limitations of pitch f/x data.
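To make “pretty well” a little less hand-wavy, the predicted and actual rates listed above can be boiled down to an average miss. This sketch uses only the pairs where both numbers were given; entries with a prediction but no actual figure are left out, and the grouping comments simply follow the order of the list.

```python
# (predicted GB%, actual GB%) pairs as listed above.
pairs = [
    (64.0, 61.9), (55.5, 53.1), (57.7, 63.8),   # sinkerballers
    (32.1, 27.5), (31.4, 27.9), (32.9, 30.0),   # flyball pitchers
    (38.9, 32.9), (34.8, 33.7),                 # remaining complete pairs
]

errors = [predicted - actual for predicted, actual in pairs]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
mean_bias = sum(errors) / len(errors)
over_predictions = sum(1 for e in errors if e > 0)

print(f"mean absolute error: {mean_abs_error:.1f} points")
print(f"mean bias: {mean_bias:+.1f} points")
print(f"over-predicted {over_predictions} of {len(errors)} pitchers")
```

On these eight pitchers the model misses by about 3.6 percentage points on average and over-predicts groundball rate for seven of them (average bias of about +2.1 points). With hand-picked examples that is suggestive, not proof.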
The implications of information like this are also relevant. Theoretically, we can use models like this to get around small sample sizes. Assuming pitch f/x data normalizes faster than traditional statistics, we can use models based on pitch f/x to get faster reads on prospects and other notable pitchers. We can also use this information to confirm common sense, which is what mainly happened here. Perhaps one day we will be able to construct a pitch f/x ERA…
*Pitch f/x data from MLBAM through Darrel Zimmerman’s pbp2 database
*http://princeofslides.blogspot.com/ – used as reference for R code/logistic regression