MLB 2023 Season Player Value: An EDA
Final Project
Data Science 1 with R (STAT 301-1)
Introduction
This analysis explores MLB player value data from Baseball-Reference.com for the 2023 season. I’ve always been interested in baseball, and I wanted to use my new data science knowledge to learn more about a sport I am passionate about. Specifically, I wanted to explore the different metrics used to value players and determine which are most useful for building a successful team. I researched these questions using data from both batters and pitchers.
Data overview & quality
Variable overview
The raw batter value dataset contains 769 observations, and the raw pitcher value dataset contains 863 observations. The two datasets are different but share some of the same variables. The raw batter dataset had 25 variables, 19 numerical and 6 categorical. I converted the `salary` variable to numeric and added a categorical variable, `hand`, and a numerical variable, `pos_number`. The raw pitcher value dataset initially had 21 numerical and 5 categorical variables. Here I also converted the `salary` variable to numeric and added the `hand` categorical variable. Additionally, I cleaned the `pos_summary` variable so that it shows only the position each player played most. Both datasets required significant cleaning before I could perform this EDA.
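The cleaning steps described above can be sketched in base R. This is a minimal illustration on made-up rows, not the project's actual code: the raw string formats, the helper column `primary_pos`, and the assumption that `pos_summary` lists the most-played position first are all assumptions.

```r
# Toy rows standing in for the raw export (all values are made up).
raw <- data.frame(
  name = c("Player A", "Player B", "Player C"),
  salary = c("$1,500,000", "", "$720,000"),
  pos_summary = c("*8/79H", "D/3", "1"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," so the salary string can be read as a number.
# Empty strings become NA rather than 0.
raw$salary <- as.numeric(gsub("[$,]", "", raw$salary))

# Assumption: the first position code (after any leading "*" or "/")
# is the position the player played most.
raw$primary_pos <- sub("^[*/]*([0-9A-Z]).*$", "\\1", raw$pos_summary)

raw
```

Keeping missing salaries as `NA` rather than coercing them to zero matters later, because a zero would pull the salary mean and correlations downward.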
Rather than actual batting or pitching statistics such as batting average or earned run average, the variables in the dataset are mostly value statistics. Baseball analysts have long tried to come up with a single statistic that captures a player’s value. The variables in the data represent several attempts to do just that, each based on a different calculation from batting and pitching statistics.
Missingness overview
In both datasets, several of the `salary` values are missing. As stated in Progress Memo 1, according to Baseball Reference, this is because salary is often missing for players who were called up from the minor leagues or acquired during the season. Additionally, in the pitcher value dataset, several of the `gmLI` (game entering leverage index) values are missing. This is because this statistic only applies to relief pitchers. The other missing values in both datasets may occur because the player has not appeared in enough games for the statistic to be calculated accurately.
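A quick way to quantify missingness like this is to count `NA` values per column. The sketch below uses a made-up stand-in for the pitcher data; only the column names `salary`, `gmLI`, and `war` come from the text.

```r
# Toy pitcher rows (values are made up for illustration).
pitchers <- data.frame(
  salary = c(720000, NA, 4000000, NA),
  gmLI = c(1.2, NA, NA, 0.8),
  war = c(0.5, 1.1, 3.2, -0.2)
)

# Number and share of missing values in each column.
missing_n <- colSums(is.na(pitchers))
missing_share <- colMeans(is.na(pitchers))

data.frame(missing_n, missing_share)
```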
Additionally, for the batters dataset, I limited observations to batters with over 100 plate appearances in order to remove small-sample outliers that could skew the data. This reduced the clean batters dataset to 459 observations. I did not filter the pitchers dataset in a similar way because relief pitchers, those who pitch towards the end of a game, typically have far fewer innings pitched than starting pitchers.
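The plate-appearance filter amounts to a single row subset. In this sketch the column name `pa` is an assumption, and the rows are made up.

```r
# Toy batter rows; `pa` (plate appearances) is an assumed column name.
batters <- data.frame(
  name = c("Player A", "Player B", "Player C", "Player D"),
  pa = c(12, 100, 101, 650)
)

# Keep only batters with more than 100 plate appearances.
batters_clean <- subset(batters, pa > 100)

nrow(batters_clean)
```

Note that "over 100" is read here as strictly greater than 100, so a batter with exactly 100 plate appearances is dropped.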
Explorations
Initial Observations
My initial goal in exploring this data was to compare players’ salaries to the actual value they’ve added to their team. Therefore, I spent significant time analyzing the `salary` variable and its relationship to other variables in the dataset.
I first wanted to compare `salary` to `war` (Wins Above Replacement), a statistic that, according to Baseball Reference, represents “the number of wins the player added to the team above a replacement player.”
It makes sense that there would be a positive relationship between these variables: the more value a player adds to their team, the more they should get paid. However, not all teams are built the same. Teams in bigger cities have bigger audiences, resulting in more revenue and therefore more money to spend. Likewise, teams with smaller audiences have less to spend, and thus have to find value in players at a lower price. Using a list online, I separated some of the observations in the dataset into two groups: five small market teams and five big market teams. For small market teams, I selected the A’s, Royals, Rays, Brewers, and Padres, and for the big market teams, I selected the Yankees, Red Sox, Dodgers, Cubs, and Phillies. With these groups defined, I calculated summary statistics for each group’s `salary` and `war`.
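A grouping and summary step like this could look roughly as follows. This is a hedged sketch on made-up rows: the `team` column, the three-letter team codes, the `market` column, and the helper `summarise_var` are assumptions. Here `rsd` is computed as the relative standard deviation, `sd / mean`; in the tables that follow, the `rsd` column repeats `sd`, so it may have been computed differently.

```r
# Toy batter rows (salaries and WAR are made up).
batters <- data.frame(
  team = c("NYY", "NYY", "LAD", "TBR", "TBR", "KCR", "SEA"),
  salary = c(10e6, 20e6, 30e6, 1e6, 3e6, 2e6, 5e6),
  war = c(1.0, 2.5, 4.0, 0.5, 2.0, 1.5, 1.2)
)

# Assumed team codes for the ten teams named in the text.
big_market <- c("NYY", "BOS", "LAD", "CHC", "PHI")
small_market <- c("OAK", "KCR", "TBR", "MIL", "SDP")

batters$market <- ifelse(
  batters$team %in% big_market, "big",
  ifelse(batters$team %in% small_market, "small", "other")
)

# Mean, standard deviation, count of non-missing values, and sd / mean.
summarise_var <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), n = length(x), rsd = sd(x) / mean(x))
}

overall_salary <- summarise_var(batters$salary)
salary_by_market <- t(sapply(split(batters$salary, batters$market), summarise_var))

salary_by_market
```

The same helper can be applied to `batters$war` to get the WAR tables. Note that `n` here counts non-missing values only, which may differ from a plain row count when salaries are missing.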
Overall batter salary:
mean | sd | n | rsd |
---|---|---|---|
6640710 | 8266274 | 459 | 8266274 |
Salary of batters on big market teams:
mean | sd | n | rsd |
---|---|---|---|
8801561 | 9285984 | 56 | 9285984 |
Salary of batters on small market teams:
mean | sd | n | rsd |
---|---|---|---|
4418729 | 6371075 | 74 | 6371075 |
Overall batter WAR:
mean | sd | n | rsd |
---|---|---|---|
1.325055 | 1.833639 | 459 | 1.833639 |
WAR of batters on big market teams:
mean | sd | n | rsd |
---|---|---|---|
1.489286 | 2.010217 | 56 | 2.010217 |
WAR of batters on small market teams:
mean | sd | n | rsd |
---|---|---|---|
1.256757 | 1.814484 | 74 | 1.814484 |
These results were somewhat surprising. While the gap in average salary between the big market and small market teams is large, average WAR differs much less. I want to continue exploring the data through distinctions like these: what are small market teams looking for in players that enables them to compete with teams that have a higher payroll? For example, the Tampa Bay Rays, one of the smallest market teams, had the fourth-best record in MLB in 2023. How were they able to acquire so much talent with a small budget? I calculated similar summary statistics for the pitchers dataset as well:
Overall pitcher salary:
mean | sd | n | rsd |
---|---|---|---|
4325165 | 6177438 | 863 | 6177438 |
Salary of pitchers on big market teams:
mean | sd | n | rsd |
---|---|---|---|
5497065 | 7765816 | 104 | 7765816 |
Salary of pitchers on small market teams:
mean | sd | n | rsd |
---|---|---|---|
3709624 | 4804720 | 144 | 4804720 |
Overall pitcher WAR:
mean | sd | n | rsd |
---|---|---|---|
0.4763615 | 1.103789 | 863 | 1.103789 |
WAR of pitchers on big market teams:
mean | sd | n | rsd |
---|---|---|---|
0.6163462 | 1.233915 | 104 | 1.233915 |
WAR of pitchers on small market teams:
mean | sd | n | rsd |
---|---|---|---|
0.3576389 | 1.034819 | 144 | 1.034819 |
These results were similar to the ones for batters. With this information in mind, I began to focus my research more clearly. What do bigger and smaller market teams value, and how are the latter able to succeed?
Exploring Player Salary
To explore the effects of different variables on player salaries, I created correlation matrices for both batters and pitchers.
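A correlation matrix of this kind can be built with `cor()` on the numeric columns. The sketch below runs on made-up rows and assumes missing salaries should be handled pairwise rather than by dropping whole rows. That is one reasonable choice, not necessarily the one used in this project.

```r
# Toy batter rows (values are made up for illustration).
batters <- data.frame(
  team = c("NYY", "TBR", "LAD", "KCR", "SEA", "BOS"),
  age = c(31, 24, 29, 26, 33, 27),
  salary = c(21e6, 0.72e6, 15e6, NA, 12e6, 4e6),
  war = c(3.1, 2.4, 4.0, 0.3, 0.9, 1.7)
)

# Keep numeric columns only, then correlate each pair of variables
# using every row where both values are present.
num_cols <- vapply(batters, is.numeric, logical(1))
cor_mat <- cor(batters[, num_cols], use = "pairwise.complete.obs")

# Rank the other variables by their correlation with salary.
salary_cors <- sort(
  cor_mat["salary", setdiff(colnames(cor_mat), "salary")],
  decreasing = TRUE
)

round(cor_mat, 2)
```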
Batter correlation matrix:
Pitcher correlation matrix:
Surprisingly, for both the batters and pitchers datasets, `age` has the strongest positive correlation with salary, not `war` or any other value-based statistic. This could make sense given MLB's free agency system, since players tend to sign much larger contracts once they reach free agency later in their careers. It is still interesting that `war` has a relatively weak correlation with `salary` compared to some of the other variables in the dataset.
It is also interesting to note that `war` and `age` have a zero to negative correlation for both batters and pitchers. I was curious about this relationship.
I was surprised that the `age` vs. `war` relationship differed this much between pitchers and batters.
Next, I filtered the correlation matrices by big and small market teams to see how the values change.
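Splitting the data by market group before calling `cor()` gives one matrix per group. As before, this is a sketch on made-up rows, and the `market` column is an assumed grouping variable rather than one from the raw data.

```r
# Toy batter rows with an assumed `market` label (values are made up).
batters <- data.frame(
  market = c("big", "big", "big", "small", "small", "small", "other"),
  war = c(0.5, 2.0, 4.0, 0.2, 1.5, 3.0, 1.0),
  salary = c(2e6, 9e6, 25e6, 0.8e6, 1.2e6, NA, 5e6)
)

# Keep only the two groups being compared.
market_only <- subset(batters, market %in% c("big", "small"))

# One correlation matrix per market group.
cor_by_market <- lapply(
  split(market_only[, c("war", "salary")], market_only$market),
  cor,
  use = "pairwise.complete.obs"
)

# Pull out the war-salary correlation for each group.
war_salary_cor <- sapply(cor_by_market, function(m) m["war", "salary"])
war_salary_cor
```

With only a handful of players per group, a correlation like this is very sensitive to individual observations, so group sizes are worth reporting alongside it.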
Small market batter correlation matrix:
Big market batter correlation matrix:
From these tables, we can see that the small market batter correlation between `war` and `salary` is approximately 0.37, while for big market batters it’s approximately 0.43. This tracks with our observations that both `war` and `salary` were higher on average for big market batters. Now to observe the pitchers:
Small market pitcher correlation matrix:
Big market pitcher correlation matrix:
This was perhaps my most surprising discovery. For the big market pitchers, I found a slightly negative correlation between `war` and `salary`. This was shocking because, of all the value statistics in the data, `war` seems to have become the most widely accepted one-number value statistic. So why are teams with this much money spending it on pitchers who add little value?
From our previous analysis of `war`, we found that its distribution among big market pitchers is relatively similar to that of all MLB pitchers. Still, I wanted to visualize this relationship to see if certain outliers were affecting this conclusion.
The highest value of `war` is included in the big-market sample, which I found interesting. If anything, that means the big-market data could skew higher than average.
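One simple way to check whether a single extreme observation drives a correlation is to recompute it with that observation removed. The sketch below uses made-up numbers chosen to exaggerate the effect. It is an illustration of the check, not a result from this dataset.

```r
# Toy big-market pitcher rows (made up; the last row is a deliberate outlier).
big_pitchers <- data.frame(
  war = c(0, 1, 2, 3, 8),
  salary = c(4e6, 3e6, 2e6, 1e6, 9e6)
)

# Correlation using every pitcher.
r_all <- cor(big_pitchers$war, big_pitchers$salary)

# Correlation after dropping the pitcher with the highest WAR.
trimmed <- big_pitchers[-which.max(big_pitchers$war), ]
r_trimmed <- cor(trimmed$war, trimmed$salary)

c(all = r_all, without_top_war = r_trimmed)
```

In this toy example the sign of the correlation flips once the top pitcher is removed, which shows why it is worth looking at the scatterplot as well as the coefficient.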
Pitcher Salary beyond WAR
Having discovered this inverse relationship, I wanted to study other value statistics to see whether this discrepancy in big-market pitchers’ salaries was limited to `war`. I selected both `rar`, which is similar to `war` but measures runs above replacement rather than wins above replacement, and `raa`, which sets a league-average baseline of runs and then calculates the runs above average for each player. What I found was surprising.
Compared to the overall league distributions, it seems that small market teams are allocating their salary more efficiently to better pitchers, at least by the metrics of `war`, `rar`, and `raa`. This is where I ended my EDA, having reached a surprising conclusion.
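To put the three metrics side by side, the salary correlation can be computed for each metric within each market group. This is a sketch on made-up rows: the `market` column is an assumed label, and the toy values were chosen so the two groups differ, not to reproduce the findings above.

```r
# Toy pitcher rows (all values are made up for illustration).
pitchers <- data.frame(
  market = rep(c("big", "small"), each = 4),
  salary = c(12e6, 20e6, 3e6, 8e6, 1e6, 2e6, 4e6, 6e6),
  war = c(0.4, 0.1, 1.8, 1.0, 0.2, 0.9, 1.5, 2.8),
  rar = c(5, 2, 19, 11, 3, 10, 16, 29),
  raa = c(-3, -6, 10, 2, -4, 1, 7, 18)
)

metrics <- c("war", "rar", "raa")

# Rows are market groups, columns are value metrics; each cell is the
# correlation between that metric and salary within that group.
cor_table <- sapply(metrics, function(m) {
  sapply(split(pitchers, pitchers$market), function(d) {
    cor(d[[m]], d$salary, use = "complete.obs")
  })
})

round(cor_table, 2)
```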
Conclusion
I was surprised to find that small market teams seem to be allocating their pitcher payroll better than the average team. However, my results suggest that big market teams are able to pay more for better batters. In the future, I’d love to expand this analysis to actual player statistics instead of estimated value statistics. I’d also love to have each team's actual payroll as a variable in the dataset, to determine which team gets the most value relative to what it spends.
References
O’Shea, T. (2022, May 9). Breaking Down The Smallest Market Teams In Major League Baseball. Joker Mag. https://jokermag.com/smallest-market-teams-mlb/
Sports Reference LLC. (2023, December 8). 2023 Major League Baseball Value. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/leagues/majors/2023-value-batting.shtml
Sports Reference LLC. (2023, December 8). 2023 Major League Baseball Value. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/leagues/majors/2023-value-pitching.shtml
Trueblood, M. (2012, January 13). Power Ranking All 30 MLB Teams By Market Size. Bleacher Report. https://bleacherreport.com/articles/961412-mlb-power-rankings-all-30-mlb-teams-by-market-size