TY - JOUR
T1 - Machine Learning in Baseball Analytics: Sabermetrics and Beyond
AU - Zhao, Wenbing
AU - Akella, Vyaghri Seetharamayya
AU - Yang, Shunkun
AU - Luo, Xiong
PY - 2025/5/1
Y1 - 2025/5/1
N2 - In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) Which machine learning techniques have been employed in these studies, and how? The answers to these questions lead to several research contributions. First, we provide a taxonomy for baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular machine learning models are random forest and support vector machine. Furthermore, only a small fraction of studies have rigorously followed best practices in data preprocessing, machine learning model training, testing, and prediction outcome interpretation.
KW - Shapley additive explanations
KW - cross-validation
KW - feature importance
KW - machine learning
KW - major league baseball
KW - sabermetrics
KW - sports analytics
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105006623345&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=105006623345&origin=inward
U2 - 10.3390/info16050361
DO - 10.3390/info16050361
M3 - Review article
SN - 2078-2489
VL - 16
JO - Information (Switzerland)
JF - Information (Switzerland)
IS - 5
M1 - 361
ER -