The 2017 MLB season saw a number of fresh faces in postseason play. In fact, 8 of the 10 teams that made the playoffs in 2017 had missed the playoffs in 2014. Our group of baseball fanatics is interested in analyzing how these eight teams (the Arizona Diamondbacks, Colorado Rockies, Minnesota Twins, Houston Astros, Boston Red Sox, Cleveland Indians, New York Yankees, and Chicago Cubs) managed to transform their clubs from teams that missed the postseason into championship contenders in just three seasons.
To assess what drove these teams’ success, we analyzed team pitching statistics in 2014 and 2017. Specifically, we looked at strikeouts per nine innings (K/9), strikeout-to-walk ratio (K/BB), walks plus hits per inning pitched (WHIP), and earned run average (ERA). We charted these statistics against team wins to determine whether there was a meaningful correlation between improved pitching statistics and overall team performance.
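For reference, the four statistics follow the standard definitions below. This is a minimal Python sketch, not the code behind this report; the function names and the example totals are made up for illustration.

```python
def k_per_9(strikeouts: float, innings: float) -> float:
    """Strikeouts per nine innings: 9 * K / IP."""
    return 9 * strikeouts / innings


def k_per_bb(strikeouts: float, walks: float) -> float:
    """Strikeout-to-walk ratio: K / BB."""
    return strikeouts / walks


def whip(walks: float, hits: float, innings: float) -> float:
    """Walks plus hits per inning pitched: (BB + H) / IP."""
    return (walks + hits) / innings


def era(earned_runs: float, innings: float) -> float:
    """Earned run average: 9 * ER / IP."""
    return 9 * earned_runs / innings


# Made-up season totals, for illustration only (not real team data).
# Innings must be a true decimal value, so one third of an inning is
# 0.333..., not the box-score shorthand ".1".
example = {"K": 1300, "BB": 500, "H": 1400, "ER": 650, "IP": 1450.0}

example_stats = {
    "K/9": k_per_9(example["K"], example["IP"]),
    "K/BB": k_per_bb(example["K"], example["BB"]),
    "WHIP": whip(example["BB"], example["H"], example["IP"]),
    "ERA": era(example["ER"], example["IP"]),
}
```

Because K/9 and ERA are scaled to nine innings while WHIP is per inning, their raw values sit on different numeric scales.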
We believe that team WHIP and ERA are the pitching statistics most closely tied to overall team success, which we measure by wins. Furthermore, we believe that teams that sizeably improved these statistics between 2014 and 2017 will also have gained wins over this span. If so, these are statistics that baseball executives may want to pay particular attention to when acquiring and releasing players.
## [1] "Correlation between Wins and K/9 2014"
## [1] 0.3276595
## [1] "Correlation between Wins and K/BB 2014"
## [1] 0.5470279
## [1] "Correlation between Wins and WHIP 2014"
## [1] -0.7253512
## [1] "Correlation between Wins and ERA 2014"
## [1] -0.7298654
This visualization shows our selected statistics in 2014 plotted against each other. The scatter plots show that ERA and WHIP have the strongest correlations with wins. These correlation values are both negative because it is better to have a lower ERA or WHIP. Thus, a lower ERA or WHIP correlates strongly with more wins.
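The coefficients printed above are Pearson correlations. As a rough illustration of what that number measures, here is a small Python sketch. The helper name `pearson_r` and the toy values are assumptions for the example; they are not the report's code or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values only: wins fall by exactly the same amount for each step in ERA,
# so the correlation is as negative as it can be.
toy_era = [3.2, 3.6, 4.0, 4.4, 4.8]
toy_wins = [98, 90, 82, 74, 66]
toy_r = pearson_r(toy_era, toy_wins)
```

A coefficient near -1 means that wins fall almost linearly as the statistic rises, which is the pattern expected for ERA and WHIP.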
These linear regression models depict our four selected stats plotted against wins in 2014. The four plots show that K/9 has a weak positive correlation with wins (0.33) and K/BB a moderate one (0.55), while WHIP and ERA have strong negative correlations with wins (about -0.73). We have also plotted the fitted values against the residuals for each regression model. In all of these plots, there is no clear pattern and the points generally form a random cloud, which suggests that a linear model is a reasonable fit. Note that the residual plots speak to model fit only; the strength of the relationship between wins and ERA or WHIP is reflected in the correlation coefficients above.
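To make the fitted-versus-residual check concrete, the following Python sketch fits a one-variable least-squares line and computes its residuals. It is only an illustration under assumed names (`fit_line`, `residuals`) and toy numbers, not the regression code used for the plots.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for every point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Toy WHIP and win totals, invented for the example.
toy_whip = [1.15, 1.20, 1.25, 1.30, 1.35, 1.40]
toy_wins = [97, 88, 90, 79, 77, 68]

toy_slope, toy_intercept = fit_line(toy_whip, toy_wins)
toy_residuals = residuals(toy_whip, toy_wins, toy_slope, toy_intercept)
toy_fitted = [toy_intercept + toy_slope * x for x in toy_whip]
```

Plotting `toy_fitted` against `toy_residuals` gives the diagnostic described above: a patternless cloud around zero suggests the straight-line form is adequate, while a curve or funnel shape would suggest otherwise.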
## [1] "Correlation between Wins and K/9 2017"
## [1] 0.7554183
## [1] "Correlation between Wins and K/BB 2017"
## [1] 0.7851053
## [1] "Correlation between Wins and WHIP 2017"
## [1] -0.8444133
## [1] "Correlation between Wins and ERA 2017"
## [1] -0.8394133
This visualization shows our selected statistics in 2017 plotted against each other. Once again, ERA and WHIP are most strongly correlated with wins. However, K/9 and K/BB are also much more closely correlated with wins than in 2014: the K/9 correlation rose from 0.33 to 0.76 and the K/BB correlation from 0.55 to 0.79. In 2017, there were almost 3000 more total strikeouts than in 2014. In general, baseball has become more strikeout-focused and pitchers are intentionally trying to throw harder. Thus, a higher percentage of total outs came via the strikeout, so K/9 increased drastically. Because pitchers have to get outs to win games and more of these outs came via strikeout, it is plausible that the correlation between K/9 and wins increased too. The higher K/BB correlation means that teams that won more tended to have a higher K/BB value. We are not able to determine why this occurred without additional datasets, but we hypothesize that good teams placed an added emphasis on strikeouts, which resulted in a higher correlation coefficient.
These linear regression models depict our four selected stats plotted against wins in 2017. The four plots show that K/9 and K/BB have strong positive correlations with wins, while WHIP and ERA have even stronger negative correlations with wins. As discussed above, the negative sign makes sense because it is better to have a lower ERA and WHIP. We have also plotted the fitted values against the residuals for each regression model. In all of these plots, there is no clear pattern and the data does not really form clusters, which suggests that a linear regression model is suitable for describing these relationships. For ERA and WHIP, the correlation with wins remained high in both 2014 and 2017, which supports our hypothesis.
## [1] 0.4
This decision tree shows that teams with an ERA above 4 make up the bottom 27% of the league. It then splits the teams with ERAs lower than 4 into two groups: those with a WHIP higher than 1.3 and those with a WHIP lower than 1.3. The teams with a WHIP lower than 1.3 make up the top 50% of the league, and the teams with a WHIP higher than 1.3 comprise the remaining 23% in the middle.
This decision tree has a prediction accuracy of 0.4, which is quite low. We are training on the 2014 data, which has only 30 data points because there are 30 baseball teams, and then testing on the 2017 data, which also has 30 data points. We expect the accuracy would be higher if we had more data points to train on.
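The splits described above can be written out as plain rules, and the accuracy figure is simply the share of test teams whose predicted group matches their actual group. The Python sketch below is illustrative only: the tier labels, function names, and test rows are assumptions, and only the two thresholds (ERA 4, WHIP 1.3) come from the tree described in the text.

```python
def classify_tier(era: float, whip: float) -> str:
    """Apply the two splits described in the text (thresholds from the tree)."""
    if era > 4.0:
        return "bottom"
    if whip < 1.3:
        return "top"
    return "middle"


def accuracy(predicted, actual) -> float:
    """Fraction of predictions that match the true labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical test rows: (ERA, WHIP, true tier). Not real 2017 data.
toy_test = [
    (3.40, 1.18, "top"),
    (3.80, 1.35, "middle"),
    (4.50, 1.42, "bottom"),
    (3.90, 1.25, "middle"),  # the rules say "top", so this one is missed
    (4.10, 1.28, "middle"),  # the rules say "bottom", so this one is missed
]

toy_predictions = [classify_tier(e, w) for e, w, _ in toy_test]
toy_accuracy = accuracy(toy_predictions, [tier for _, _, tier in toy_test])
```

With 30 test teams, each single misclassification moves the accuracy by about 0.033, so the score is sensitive to a handful of borderline teams.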
## [1] 0.3
Naive Bayes has a prediction accuracy of 0.3. Like the decision tree, this is also very low because we’re training and testing on 30 data points.
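For readers unfamiliar with the method, here is a compact Gaussian Naive Bayes sketch in Python. It assumes numeric features that are treated as independent and normally distributed within each class. The function names, class labels, and toy rows are hypothetical, and the report's own classifier may have been configured differently.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(rows, labels):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A tiny constant keeps the variance from being exactly zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (log(n / len(rows)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * log(2 * pi * var) - (x - mean) ** 2 / (2 * var)
            for x, mean, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=log_posterior)


# Toy (ERA, WHIP) rows with made-up labels, for illustration only.
toy_rows = [(3.3, 1.15), (3.5, 1.20), (3.6, 1.22),
            (4.4, 1.38), (4.6, 1.42), (4.8, 1.45)]
toy_labels = ["high_wins"] * 3 + ["low_wins"] * 3

toy_model = fit_gaussian_nb(toy_rows, toy_labels)
toy_prediction = predict_gaussian_nb(toy_model, (3.4, 1.18))
```

With only a few teams per class, the estimated means and variances are noisy, which is one plausible reason for the low accuracy reported above.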
## [1] 0.3666667
KNN has a prediction accuracy of around 0.37. Like the other two classifiers, this value is low because we only have one 30-team dataset to train on.
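A k-nearest-neighbors classifier labels a team by majority vote among the most similar training teams. The Python sketch below assumes Euclidean distance, k = 3, and made-up (ERA, WHIP) rows; none of these choices are confirmed by the report.

```python
from collections import Counter
from math import dist


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy (ERA, WHIP) rows with hypothetical labels; not the report's data.
toy_rows = [(3.3, 1.15), (3.5, 1.20), (3.6, 1.22),
            (4.4, 1.38), (4.6, 1.42), (4.8, 1.45)]
toy_labels = ["high_wins"] * 3 + ["low_wins"] * 3

toy_prediction = knn_predict(toy_rows, toy_labels, (3.45, 1.19))
```

Because ERA varies over a wider numeric range than WHIP, unscaled Euclidean distance is dominated by ERA. Standardizing the features first is a common remedy.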
## [1] 0.4
This decision tree shows that teams with a WHIP above 1.3 make up the 30% of teams with the highest ERA. It then splits the teams with WHIPs lower than 1.3 into two groups: those with a WHIP higher than 1.2 and those with a WHIP lower than 1.2. The teams with a WHIP lower than 1.2 make up the 27% of the league with the lowest ERA, and the teams with a WHIP between 1.2 and 1.3 comprise the remaining 43% in the middle.
The prediction accuracy for this classifier is 0.4. This tree, like the previous decision tree, was trained on the 2014 data and tested on the 2017 data. Thus, its accuracy was low because we did not have enough training data.
## [1] 0.4
Naive Bayes has a prediction accuracy of 0.4. Like the decision tree, this is also low because we’re training and testing on only 30 data points each.
## [1] 0.5
KNN has a prediction accuracy of around 0.5. This value still isn’t great but is higher than the other classifiers.
Our geom area visualization shows the range of each statistic. We found that K/9 can range anywhere from 5 to 8.75. This range is quite sizeable compared to WHIP, for which the range is between 1.25 and 2.5. The difference in range reflects how the variables are defined: WHIP is measured per inning, while K/9 is measured per nine innings.
Again, we have looked at the distribution for 2017. The distribution looks similar to the 2014 data except there is a larger range for K/9.
This bar graph compares the number of wins between 2014 and 2017 for our eight selected teams: Astros, Cubs, Diamondbacks, Indians, Red Sox, Rockies, Twins, and Yankees. Each team won more games in 2017 than in 2014.
This bar graph shows the eight teams’ ERA in 2014 and 2017. For each team, ERA either decreases or stays about the same. The Astros, who had the greatest difference in team wins, saw their team ERA change only very slightly. However, the Diamondbacks, who had the second-largest difference in wins between these two years, saw their ERA decrease by 0.59 runs, a large drop. Apart from the Astros, these results are consistent with ERA being a strong indicator of team performance.
This bar graph shows the eight teams’ WHIP in 2014 and 2017. For every team, WHIP decreased between our two selected years. This visualization supports the idea that WHIP is a strong indicator of total wins.
This bar graph shows the eight teams’ K/9 in 2014 and 2017. Each team’s K/9 increased between 2014 and 2017. While our model shows that WHIP and ERA have the strongest correlation with team wins, it seems that for these eight teams, K/9 is also a strong predictor, although part of the increase reflects the league-wide rise in strikeouts noted earlier.
This bar graph shows the eight teams’ K/BB in 2014 and 2017. As expected, this stat does not track wins as well. For some teams, such as the Red Sox and Indians, K/BB increased substantially. However, it decreased noticeably for the Yankees and Twins, showing that it is an inconsistent predictor.
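The year-over-year comparisons in these bar graphs amount to checking, for each team, whether a statistic moved in the "better" direction. The Python sketch below is one way to tabulate that. The team names and values are placeholders rather than the actual 2014 and 2017 figures, and the helper names are hypothetical.

```python
# For ERA and WHIP a lower value is better; for K/9 and K/BB a higher one is.
LOWER_IS_BETTER = {"ERA": True, "WHIP": True, "K/9": False, "K/BB": False}


def improved(stat: str, before: float, after: float) -> bool:
    """True when the statistic moved in the favorable direction."""
    return after < before if LOWER_IS_BETTER[stat] else after > before


def count_improved(stat: str, teams: dict) -> int:
    """Number of teams whose 2017 value beats their 2014 value."""
    return sum(
        improved(stat, seasons["2014"][stat], seasons["2017"][stat])
        for seasons in teams.values()
    )


# Placeholder teams and values, invented for the example.
toy_teams = {
    "Team A": {
        "2014": {"ERA": 4.20, "WHIP": 1.34, "K/9": 7.4, "K/BB": 2.5},
        "2017": {"ERA": 3.70, "WHIP": 1.25, "K/9": 8.9, "K/BB": 3.1},
    },
    "Team B": {
        "2014": {"ERA": 4.05, "WHIP": 1.31, "K/9": 7.8, "K/BB": 2.9},
        "2017": {"ERA": 4.06, "WHIP": 1.28, "K/9": 9.2, "K/BB": 2.6},
    },
}

toy_summary = {stat: count_improved(stat, toy_teams) for stat in LOWER_IS_BETTER}
```

A statistic that improves for every team while wins also rise is consistent with it being a useful indicator, whereas one that improves for only some teams (as K/BB does here) is a less consistent one.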
This scatter plot shows WHIP against wins for both years on the same visualization. While there is a strong correlation in both years, it is stronger in 2017 (about -0.84) than in 2014 (about -0.73).
This scatter plot shows ERA against wins for both years on the same visualization. Once again, while there is a strong correlation in both years, it is stronger in 2017 (about -0.84) than in 2014 (about -0.73).
After evaluating the correlations using ggpairs, we have determined that WHIP and ERA have the strongest correlation with team success. As our final two scatter plots show, teams with the lowest ERA and WHIP tend to be the teams with the most wins. It is also important to remember that a lower WHIP and ERA indicate better pitching performance, similar to golf scoring where lower scores show better play, which is why both correlations are negative (about -0.73 in 2014 and -0.84 in 2017).
Furthermore, we found that the eight teams that missed the playoffs in 2014 and were playoff contenders in 2017 generally improved in these statistics, with WHIP falling for every team and ERA falling or holding steady. In conclusion, pitching statistics, specifically the two we have highlighted, ERA and WHIP, are helpful in determining overall team success. Our results support our hypothesis.