MLB 2023 Season Player Value: An EDA

Final Project
Data Science 1 with R (STAT 301-1)

Author

Annabelle Sole

Published

December 8, 2023

Introduction

This analysis explores MLB player value data for the 2023 season from Baseball-Reference.com. I’ve always been interested in baseball, and I wanted to use my new data science knowledge to learn more about a sport I am passionate about. Specifically, I wanted to explore the different metrics used to value players, and which of those metrics are most useful for building a successful team. I researched these questions using data on both batters and pitchers.

Data overview & quality

Variable overview

The raw batter value dataset contains 769 observations, and the raw pitcher value dataset contains 863 observations. The two datasets are different, but they share some of the same variables. The raw batter dataset had 25 variables, 19 of which were numerical and 6 of which were categorical. I converted the salary variable to numeric, and also added a categorical variable, hand, and a numerical variable, pos_number. The raw pitcher value dataset initially had 21 numerical and 5 categorical variables. I also converted the salary variable to numeric here, and added the hand categorical variable. Additionally, I cleaned the pos_summary variable so that it shows only the position the player played most. Both datasets required significant cleaning before I could perform this EDA.
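As an illustration of the kind of cleaning described above, here is a minimal base R sketch. The toy rows, the column names, the "$1,250,000" salary format, and the convention that a trailing * marks a left-handed player and a trailing # marks a switch hitter are all assumptions made for this example, not the exact steps or data used in this project.

```r
# Hypothetical raw rows; column names and formats are assumptions.
raw <- data.frame(
  name   = c("Player A*", "Player B#", "Player C"),
  salary = c("$1,250,000", "", "$720,000"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," and convert to numeric; empty strings become NA.
clean_salary <- function(x) {
  x <- gsub("[$,]", "", x)
  x[x %in% ""] <- NA_character_
  as.numeric(x)
}

# Derive handedness from the name suffix (assumed: * = left, # = switch).
get_hand <- function(name) {
  ifelse(grepl("\\*$", name), "left",
         ifelse(grepl("#$", name), "switch", "right"))
}

# transform() evaluates every expression against the original columns,
# so hand is computed before the suffix is removed from name.
batters <- transform(
  raw,
  salary = clean_salary(salary),
  hand   = get_hand(name),
  name   = sub("[*#]$", "", name)
)

print(batters)
```

Keeping each cleaning step in a small function makes it easy to reuse the same steps on the pitchers dataset.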

Rather than actual batting/pitching statistics such as batting average or earned run average, the variables in the dataset are mostly made up of value statistics. It has always been the goal of baseball analysts to come up with the perfect statistic to determine a player’s value. The variables in the data are essentially made up of several attempts to do just that, based on different calculations of batting and pitching statistics.

Missingness overview

In both datasets, several of the salary values are missing. As stated in Progress Memo 1, according to Baseball Reference, salary is often missing for players who were called up from the minor leagues or acquired by their team during the season. Additionally, in the pitcher value dataset, several of the gmLI (game entering leverage index) values are missing because this statistic only applies to relief pitchers. The other missing values in both datasets may occur because a player has not appeared in enough games for the statistic to be calculated accurately.
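A missingness overview like this one can be produced by counting NA values per variable. The sketch below uses a small made-up pitchers table; the values are hypothetical and only the column names salary, gmLI, and war come from the text above.

```r
# Hypothetical pitcher rows with missing salary and gmLI values.
pitchers <- data.frame(
  salary = c(5e6, NA, 7.2e5, NA),
  gmLI   = c(NA, 1.3, NA, 0.8),
  war    = c(2.1, 0.4, -0.2, 1.0)
)

# Count and percentage of missing values in each column.
n_missing   <- colSums(is.na(pitchers))
pct_missing <- round(100 * colMeans(is.na(pitchers)), 1)

missing_summary <- data.frame(
  variable    = names(n_missing),
  n_missing   = unname(n_missing),
  pct_missing = unname(pct_missing)
)

print(missing_summary)
```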

Additionally, for the batters dataset, I limited observations to batters with over 100 plate appearances, in order to remove small-sample outliers that could skew the data. This reduced the clean batters dataset to 459 observations. I did not filter the pitchers dataset in a similar way because relief pitchers, those who pitch towards the end of a game, automatically have far fewer innings pitched than starting pitchers.
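The plate appearance filter can be sketched in one line of base R. The column name pa and the toy rows are assumptions for illustration; "over 100" is read here as strictly greater than 100.

```r
# Hypothetical batter rows; pa is an assumed column name for plate appearances.
batters <- data.frame(
  name = c("A", "B", "C", "D"),
  pa   = c(650, 100, 101, 12),
  war  = c(4.2, 0.1, 0.3, -0.1)
)

# Keep only batters with more than 100 plate appearances.
batters_clean <- subset(batters, pa > 100)

print(nrow(batters_clean))
```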

Explorations

Initial Observations

My initial goal in exploring this data was to compare players’ salaries to the actual value they’ve added to their team. Therefore, I spent significant time analyzing the salary variable and its relationship to other variables in the dataset.

I first wanted to compare salary to war (Wins above replacement), a statistic that, according to Baseball Reference, represents “the number of wins the player added to the team above a replacement player.”

It makes sense that these variables are positively related: the more value a player adds to their team, the more they should get paid. However, not all teams are built the same. Teams in bigger cities have bigger audiences, resulting in more revenue and therefore more money to spend. Likewise, teams with smaller audiences have less to spend, and thus have to find value in players at a lower price. Using published market-size rankings (see References), I separated some of the observations in the dataset into two groups: five small market teams and five big market teams. For the small market teams, I selected the A’s, Royals, Rays, Brewers, and Padres, and for the big market teams, I selected the Yankees, Red Sox, Dodgers, Cubs, and Phillies. After making this distinction, I calculated summary statistics for each group’s salary and war.
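The grouping and summary step can be sketched as follows. The team abbreviations, the toy rows, and the helper summarize_var are assumptions for illustration. In particular, rsd is computed here as the standard deviation divided by the mean, which is one common definition of relative standard deviation and may not match how the tables below were generated.

```r
# Assumed team abbreviations for the two market groups.
big_market   <- c("NYY", "BOS", "LAD", "CHC", "PHI")
small_market <- c("OAK", "KCR", "TBR", "MIL", "SDP")

# Hypothetical batter rows.
batters <- data.frame(
  team   = c("NYY", "LAD", "TBR", "OAK", "SEA", "MIL"),
  salary = c(20e6, 10e6, 2e6, NA, 5e6, 4e6),
  war    = c(3.0, 5.0, 4.0, 0.5, 2.0, 1.5)
)

# Label each row by market group.
batters$market <- ifelse(batters$team %in% big_market, "big",
                  ifelse(batters$team %in% small_market, "small", "other"))

# Mean, sd, non-missing count, and relative sd (sd / mean) of a vector.
summarize_var <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), n = length(x), rsd = sd(x) / mean(x))
}

# One row of summary statistics per market group.
salary_by_market <- t(sapply(split(batters$salary, batters$market), summarize_var))
war_by_market    <- t(sapply(split(batters$war, batters$market), summarize_var))

print(salary_by_market)
print(war_by_market)
```

A group with a single observation, like "other" in this toy data, has an undefined standard deviation, so its sd and rsd are NA.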

Overall batter salary:

mean sd n rsd
6640710 8266274 459 8266274

Salary of batters on big market teams:

mean sd n rsd
8801561 9285984 56 9285984

Salary of batters on small market teams:

mean sd n rsd
4418729 6371075 74 6371075

Overall batter WAR:

mean sd n rsd
1.325055 1.833639 459 1.833639

WAR of batters on big market teams:

mean sd n rsd
1.489286 2.010217 56 2.010217

WAR of batters on small market teams:

mean sd n rsd
1.256757 1.814484 74 1.814484

These results were somewhat surprising. While mean salary differs substantially between the big market and small market teams (about $8.8 million versus $4.4 million for batters), mean WAR differs much less (about 1.49 versus 1.26). I want to continue exploring the data through distinctions like these: what are small market teams looking for in players that enables them to compete with teams that have a higher payroll? For example, the Tampa Bay Rays, one of the smallest market teams, had the fourth-best record in MLB this year. How were they able to acquire so much talent on a small budget? I calculated similar summary statistics for the pitchers dataset as well:

Overall pitcher salary:

mean sd n rsd
4325165 6177438 863 6177438

Salary of pitchers on big market teams:

mean sd n rsd
5497065 7765816 104 7765816

Salary of pitchers on small market teams:

mean sd n rsd
3709624 4804720 144 4804720

Overall pitcher WAR:

mean sd n rsd
0.4763615 1.103789 863 1.103789

WAR of pitchers on big market teams:

mean sd n rsd
0.6163462 1.233915 104 1.233915

WAR of pitchers on small market teams:

mean sd n rsd
0.3576389 1.034819 144 1.034819

These results were similar to the ones for batters. With this information in mind, I began to focus my research more clearly. What do bigger and smaller market teams value, and how are the latter able to succeed?

Exploring Player Salary

To explore the effects of different variables on player salaries, I created correlation matrices for both batters and pitchers.
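A correlation matrix of this kind can be computed in base R with cor(). The sketch below uses made-up rows. Using pairwise complete observations is an assumption about how missing salaries might be handled, not necessarily the approach used for the matrices in this report.

```r
# Hypothetical batter rows with one missing salary.
batters <- data.frame(
  team   = c("NYY", "TBR", "LAD", "OAK", "PHI"),
  age    = c(24, 27, 30, 33, 36),
  salary = c(0.7e6, 3e6, NA, 15e6, 20e6),
  war    = c(2.5, 4.0, 3.0, 1.0, 0.5)
)

# Keep numeric columns only, then correlate every pair of variables,
# using all rows where both variables in a pair are present.
num_cols <- sapply(batters, is.numeric)
cor_mat  <- cor(batters[, num_cols], use = "pairwise.complete.obs")

# Correlations with salary, strongest positive first.
salary_cors <- sort(cor_mat["salary", colnames(cor_mat) != "salary"],
                    decreasing = TRUE)

print(round(cor_mat, 2))
print(round(salary_cors, 2))
```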

Batter correlation matrix:

Pitcher correlation matrix:

Surprisingly, for both the batters and pitchers datasets, age has the strongest positive correlation with salary, rather than war or any other value-based statistic. This could make sense given MLB’s free agency system: players generally cannot sign on the open market until they have accrued six years of major league service time, and contracts tend to be much larger after that point. Still, it is interesting that war has a relatively weak correlation with salary compared to some of the other variables in the dataset.

It is also interesting to note that the correlation between war and age ranges from roughly zero to negative for both batters and pitchers. I was curious about this relationship.

I was surprised that the age vs. war relationship differed this much between pitchers and batters.

Now I want to filter the correlation matrices by big and small market teams and see how the values change.
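One way to filter a correlation matrix by group is to split the data by market and apply cor() to each piece. The market column and all values below are hypothetical, chosen only to show the mechanics.

```r
# Hypothetical pitcher rows already labeled by market group.
pitchers <- data.frame(
  market = rep(c("big", "small"), each = 4),
  age    = c(34, 31, 25, 27, 24, 26, 29, 32),
  salary = c(30e6, 20e6, 1e6, 0.8e6, 0.7e6, 2e6, 6e6, 9e6),
  war    = c(0.5, 1.0, 3.0, 2.0, 0.2, 1.0, 2.5, 3.5)
)

# One correlation matrix per market group.
vars <- c("age", "salary", "war")
cor_by_market <- lapply(
  split(pitchers[, vars], pitchers$market),
  cor,
  use = "pairwise.complete.obs"
)

# Pull out the war vs. salary correlation for each group.
war_salary <- sapply(cor_by_market, function(m) m["war", "salary"])

print(round(war_salary, 2))
```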

Small market batter correlation matrix:

Big market batter correlation matrix:

From these tables, we can see that the correlation between war and salary is approximately 0.37 for small market batters, while for big market batters it’s approximately 0.43. This is consistent with the earlier observation that both war and salary were higher on average for big market batters. Now to observe the pitchers:

Small market pitcher correlation matrix:

Big market pitcher correlation matrix:

This was perhaps my most surprising discovery. For the big market pitchers, I found a slightly negative correlation between war and salary. This was shocking because, out of all the value statistics in the data, war has arguably become the most universally accepted one-number value statistic. So how are teams with all this money spending it on low-value players?

Our earlier summary statistics showed that the distribution of war for big market pitchers is relatively similar to that of all MLB pitchers. Still, I wanted to visualize this relationship to see whether certain outliers were affecting this conclusion.

The highest value of war is included in the big-market sample, which I found interesting. If anything, that means the big-market data could skew higher than average.

Pitcher Salary beyond WAR

Having discovered this inverse relationship, I wanted to study other value statistics to see whether this discrepancy in big-market pitcher salaries was limited to war. I selected both rar, which is similar to war but measures Runs Above Replacement rather than Wins Above Replacement, and raa, which sets a league-average baseline of runs and then calculates the Runs Above Average for each player. What I found was surprising.
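To compare several value statistics at once, the salary correlation for each metric can be computed within each market group. This is only a sketch of one possible comparison, using made-up rows and a hypothetical market column. It is not the analysis behind the conclusions in this section.

```r
# Hypothetical pitcher rows with three value metrics.
pitchers <- data.frame(
  market = rep(c("big", "small"), each = 4),
  salary = c(30e6, 20e6, 1e6, 0.8e6, 0.7e6, 2e6, 6e6, 9e6),
  war    = c(0.5, 1.0, 3.0, 2.0, 0.2, 1.0, 2.5, 3.5),
  rar    = c(8, 14, 33, 22, 5, 13, 27, 38),
  raa    = c(-5, 0, 18, 9, -6, 1, 12, 22)
)

metrics <- c("war", "rar", "raa")

# Rows are metrics, columns are market groups; each cell is the
# correlation between salary and that metric within the group.
salary_cor <- sapply(split(pitchers, pitchers$market), function(d) {
  sapply(metrics, function(m) cor(d$salary, d[[m]], use = "complete.obs"))
})

print(round(salary_cor, 2))
```

Laying the results out as a metric-by-group table makes it easy to see whether a pattern holds for one statistic only or for all of them.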

Compared to the overall league distributions, it seems like small market teams are allocating their salary more efficiently to better pitchers, at least by the metrics of war, rar, and raa. This is where I ended my EDA, having found a surprising conclusion.

Conclusion

I was indeed surprised to find that small market teams seem to be allocating their pitcher payroll better than the average team. However, my results also suggest that big market teams are able to pay more for better batters. In the future, I’d love to expand this analysis to actual player statistics instead of estimated value statistics. I’d also love to have the actual payroll of each team as a variable in the dataset, to determine which team gets the most value relative to what it spends.

References

O’Shea, T. (2022, May 9). Breaking Down The Smallest Market Teams In Major League Baseball. Joker Mag. https://jokermag.com/smallest-market-teams-mlb/

Sports Reference LLC. (2023, December 8). 2023 Major League Baseball Value. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/leagues/majors/2023-value-batting.shtml

Sports Reference LLC. (2023, December 8). 2023 Major League Baseball Value. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/leagues/majors/2023-value-pitching.shtml

Trueblood, M. (2012, January 13). Power Ranking All 30 MLB Teams By Market Size. Bleacher Report. https://bleacherreport.com/articles/961412-mlb-power-rankings-all-30-mlb-teams-by-market-size