Project Description

MLB Pitchers: Winning Percentage Across Other Metrics

Brianna Griffin

For the final project, I am using a data set on MLB starting pitchers to analyze relationships between several variables. Specifically, I am interested in how these variables interact with "winning percentage". I recently completed a Machine Learning final project in which I built 5 regression models that predicted "winning percentage" as the outcome variable. I found that 2 of the most important predictors of "winning percentage" were "best" and "team_win_loss_percentage", and I am curious to see whether the same results appear in my 3D visualization. Going into the project, I also wanted to use curves in my visualization, with cubes connecting their endpoints. My final idea includes 5 cubes whose x and y axes correspond to different metrics of the pitchers in the data set. The z-axis is always the winning percentage of each observation. A curve is created for each observation and passes through every cube, so its shape is determined by the 5 different pairs of x and y metrics. The color of each curve is based on the age of the pitcher.

The data that I am using for the visualization contains statistics and metrics on MLB starting pitchers in the National League during the years 2000-2022. The data was collected from the website Baseball Reference. Below is a list of the variables used in the visualization, along with a brief description of their meaning within the game of baseball.

Winning Percentage = number of games won ÷ total number of games pitched
Innings Pitched = number of innings pitched during the season
Year = year of the given season
Best = best game score
Worst = worst game score
- "Game Score" measures a pitcher's performance in any given game started
WLST = number of wins lost
- At the time the pitcher faced his final batter, he was in position for a win, but the game was blown by the bullpen.
LSV = number of losses saved
- At the time of his last batter, the pitcher was in position for a loss, but the team came back to tie or take the lead.
Team Win-Loss % = team win loss percentage
- The win-loss percentage of the team in games started by this pitcher.
Quality Start % = quality start percentage
- Percentage of starts that were quality starts: pitcher pitched at least 6 innings and allowed 3 or fewer earned runs in a start.
# Short Days Rest = number of starts made on short rest
- fewer than 4 days of rest
# Long Days Rest = number of starts made on long rest
- more than 4 days of rest
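
As a minimal sketch of how the two percentage definitions above can be computed, the plain-Java snippet below applies them to season totals and to a list of starts. The class and method names (`PitcherMetrics`, `winningPercentage`, `isQualityStart`, `qualityStartPercentage`) are hypothetical and are not taken from the project's source code; the data set from Baseball Reference already contains these values, so this is only meant to make the definitions concrete.

```java
// Hypothetical helper class; names are illustrative only and not from the project.
class PitcherMetrics {

    // Winning percentage as defined above: games won / total games pitched.
    static double winningPercentage(int gamesWon, int gamesPitched) {
        if (gamesPitched == 0) return 0.0; // avoid dividing by zero
        return (double) gamesWon / gamesPitched;
    }

    // A quality start: at least 6 innings pitched and 3 or fewer earned runs.
    static boolean isQualityStart(double innings, int earnedRuns) {
        return innings >= 6.0 && earnedRuns <= 3;
    }

    // Share of starts that were quality starts, returned as a fraction in [0, 1].
    // innings[i] and earnedRuns[i] describe the same start.
    static double qualityStartPercentage(double[] innings, int[] earnedRuns) {
        if (innings.length == 0) return 0.0;
        int quality = 0;
        for (int i = 0; i < innings.length; i++) {
            if (isQualityStart(innings[i], earnedRuns[i])) quality++;
        }
        return (double) quality / innings.length;
    }
}
```

For example, a pitcher with 12 wins in 30 games pitched would have a winning percentage of 12 ÷ 30 = 0.4 under this definition.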

Below are some of the sketches I had before creating the final version of my 3D visualization.

In the beginning, I planned to use a single cube and partition the data into multiple groups. Within those distinct groups, I wanted to draw curves between each set of nodes. However, I think this would have become chaotic to visualize, given the amount of data confined to the space of one cube.

Below is another early sketch of the idea of a cubic space. I was brainstorming how to decide which points would go from one node to another. As the messiness of the sketch shows, this first idea is still a bit chaotic.

Here is my first sketch in Processing. It illustrates the idea from the 2 handwritten sketches above. As you can see, the curves intersect each other so heavily that the results are uninterpretable. This led me to another idea.

Here is my second draft in Processing. I have colored the curves by the age of the pitcher. I have also decided to use 5 cubes instead of 1 to clear up the crowding issues. This already looks better than the previous sketches, but my final version creates continuous curves running through all 5 cubes. It looks similar to this sketch, but with the curves passing through all 5 cubes rather than starting and stopping within a single cube.

Below is the process I followed in order to create my final 3D visualization:

I began by importing the .csv data file into the Processing environment.
Then, I set up the PeasyCam with a minimum and maximum distance so that the user can easily navigate the visualization within the 3D axes. I also imported Gui into my Processing environment, which I use later to create controls for the user.
I created a function that determines the color of a curve based on the age of a player. I grouped the pitchers' ages into 6 categories, each spanning about 5 years.
I created the cubes and labeled the x, y, and z axes for each of the 5 cubes.

Cube #1: (x, y, z) = (Innings Pitched, Year, Winning %)
Cube #2: (x, y, z) = (Best, Worst, Winning %)
Cube #3: (x, y, z) = (WLST, LSV, Winning %)
Cube #4: (x, y, z) = (Team Win-Loss %, Quality Start %, Winning %)
Cube #5: (x, y, z) = (Short Days Rest, Long Days Rest, Winning %)

Note z = Winning % for all 5 cubes
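
One way to picture this layout in code: each observation contributes one 3D point per cube, with x and y taken from that cube's two metrics and z always taken from the winning percentage. The sketch below is plain Java rather than the project's Processing code. The cube size, the gap between cubes, the value ranges, and all names (`CubeLayout`, `rescale`, `controlPoints`) are assumptions made for illustration. `rescale` is a linear rescaling of the same kind that Processing's `map()` performs.

```java
// Hypothetical layout helper; sizes, ranges, and names are assumptions.
class CubeLayout {
    static final float CUBE_SIZE = 100f; // edge length of each cube
    static final float CUBE_GAP = 50f;   // empty space between neighboring cubes

    // Linearly rescale value from [inMin, inMax] to [outMin, outMax].
    static float rescale(float value, float inMin, float inMax,
                         float outMin, float outMax) {
        return outMin + (outMax - outMin) * (value - inMin) / (inMax - inMin);
    }

    // Build one (x, y, z) point per cube for a single pitcher-season.
    // xs[i], ys[i]: the two metrics shown in cube i.
    // ranges[i] = {xMin, xMax, yMin, yMax} for cube i.
    // winPct: winning percentage in [0, 1], used as z in every cube.
    static float[][] controlPoints(float[] xs, float[] ys,
                                   float[][] ranges, float winPct) {
        float[][] points = new float[xs.length][3];
        for (int i = 0; i < xs.length; i++) {
            // Cubes are assumed to sit side by side along the x direction.
            float offset = i * (CUBE_SIZE + CUBE_GAP);
            points[i][0] = offset + rescale(xs[i], ranges[i][0], ranges[i][1], 0, CUBE_SIZE);
            points[i][1] = rescale(ys[i], ranges[i][2], ranges[i][3], 0, CUBE_SIZE);
            points[i][2] = rescale(winPct, 0, 1, 0, CUBE_SIZE);
        }
        return points;
    }
}
```

In a Processing sketch, points like these could be passed to `curveVertex(x, y, z)` between `beginShape()` and `endShape()`. Because the first and last `curveVertex()` points only guide the ends of the curve, they are usually repeated so that the drawn curve reaches the first and last cubes.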

I created the curves for each observation. The curves start at the leftmost cube and end at the rightmost cube. Within each cube, a curve's x, y, and z coordinates are determined by the axes named above for that cube. I used the "curveVertex()" function in Processing to make the curves.
I added user interaction to the visualization. The commands were all made with Gui and are listed below:
1. DIRECT - plots points on the corresponding x/y/z axes, showing their respective direct relationships
  1. This will remove the curves from the graph
2. LABELS - removes the labels on the axes
3. CUBES - removes the cubes
4. AGE_19_24 - removes observations where the pitcher is 19 to 24 years old
5. AGE_25_29 - removes observations where the pitcher is 25 to 29 years old
6. AGE_30_34 - removes observations where the pitcher is 30 to 34 years old
7. AGE_35_39 - removes observations where the pitcher is 35 to 39 years old
8. AGE_40_44 - removes observations where the pitcher is 40 to 44 years old
9. AGE_45_49 - removes observations where the pitcher is 45 to 49 years old
10. ROTATE_CAM - rotates the camera view along the x-axis
11. INITIAL - positions the camera at the initial point of view
Once I had added all of these commands, I created a menu in the Processing window so that the user can easily see the options for interaction. It is located on the left-hand side of the window. Users simply press the button for the option they want, and it is applied to the visualization. Here is what it looks like on the screen.
I finalized my visualization by adding aesthetic finishes, such as making the text clearer on the screen and changing the stroke weight of the curves.
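
The age grouping behind the curve colors and the AGE_* buttons can be sketched as a small lookup. The snippet below is plain Java under stated assumptions: the group boundaries follow the button names listed above, while the class name, the method names, and the `hidden` flag array are hypothetical stand-ins for whatever the actual sketch uses. The color assigned to each group is left out because this write-up does not specify it.

```java
// Hypothetical age-group helper; names are illustrative, not from the project.
class AgeGroups {
    // Lower bound of each group: 19-24, 25-29, 30-34, 35-39, 40-44, 45-49.
    static final int[] LOWER_BOUNDS = {19, 25, 30, 35, 40, 45};
    static final int MAX_AGE = 49;

    // Return the group index (0 to 5) for an age, or -1 if it is outside 19-49.
    static int groupIndex(int age) {
        if (age < LOWER_BOUNDS[0] || age > MAX_AGE) return -1;
        int index = 0;
        for (int i = 0; i < LOWER_BOUNDS.length; i++) {
            if (age >= LOWER_BOUNDS[i]) index = i; // keep the last bound not above age
        }
        return index;
    }

    // Button label matching the names in the list above, e.g. "AGE_25_29".
    static String label(int index) {
        int upper = (index + 1 < LOWER_BOUNDS.length)
                ? LOWER_BOUNDS[index + 1] - 1
                : MAX_AGE;
        return "AGE_" + LOWER_BOUNDS[index] + "_" + upper;
    }

    // A curve is drawn only if its age group has not been toggled off.
    // hidden[i] is true when the button for group i has removed that group.
    static boolean isVisible(int age, boolean[] hidden) {
        int index = groupIndex(age);
        return index >= 0 && !hidden[index];
    }
}
```

The same group index could select both the stroke color of a curve and the button that hides it, which keeps the coloring and the filtering consistent with each other.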

Below are some screenshots of the final visualization.

The first image shows the default view once you run the Processing code. The curves clearly change shape from cube to cube, so each metric varies by observation. Overall, this shows that each metric has a different effect on the winning percentage of the pitchers.

Now, this image shows the visualization once the "DIRECT" button is toggled. This screenshot produces interesting results because trends can be seen in the data in the 2nd and 4th cubes. In other words, the data points don't remain constant. The 2nd and 4th cubes contain the variables that were most important for predicting winning percentage ("best" and "team_win_loss_percentage"). I found these variables while building machine learning models on the same data set. It is interesting that a 3D visual environment can display results similar to those of a Boosted Tree Machine Learning Model with significantly less computation time!

Now, here is a screenshot of the visualization from a different camera angle. You can see here that the winning percentage in this data set mostly ranges from 0 to 60%, but there are a few observations with a winning percentage of 100%. These are the rightmost curves.

In conclusion, I am proud of my final results from this project. The data set presented difficulties with its density and multitude of variables. I was able to take this complicated data set and present it in a 3D environment, showing the relationships between the winning percentage and other statistics of pitchers in the MLB. Furthermore, this project helped me improve my Processing coding skills. It showed me the possibilities of data visualization and how my abstract sketches can turn into useful visualizations.

All work is developed within Processing
Source Code + Data