Visualizing MLB pitch location data with Alteryx + Tableau

This project started with some really interesting reading on the work published on the Brooks Baseball (Dan Brooks and several others) and fastballs (Mike Fast) sites. There are references to these sites throughout this post.

Data gathering & preparation work

I used the Perl script from the fastballs “build a pitch db” page to download the data from this MLBAM site. Then I leveraged Alteryx to parse the 2.47 million XML files (no, that is not a typo) covering the 8 years of data I pulled. Here is a summary of the files and their combined size by year.

Thankfully, tools like Alteryx and Tableau take work that would have taken me days and reduce it to hours. They also allowed me to do a lot of this work while holding the newest Blicker, my newborn baby girl…

The best part is that it is a pretty simple Alteryx workflow, and building it gave me a new admiration for the Directory and Dynamic Input tools. I know this has been done many times before, but here is my workflow, which parses the XML files into a format more suitable for Tableau.

The workflow is:

  • Query the directory with the Directory Tool

  • Parse the path to obtain the different folders and files

  • Use the Dynamic Input tool to import all of the files in each subfolder

  • Depending on the type of file, apply a differently configured XML Parse tool, running them in a synchronous manner

  • Perform final data manipulations and output to text files for use in Tableau
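For readers without Alteryx, the same steps can be roughed out in plain Python. This is only a sketch of the idea, not the workflow itself: the function names are hypothetical, and it assumes each pitch is stored as a `pitch` element whose attributes include `px`, `pz`, `start_speed` and `pitch_type`. Check those names against your own downloaded files before relying on them.

```python
import csv
import os
import xml.etree.ElementTree as ET

# Assumed attribute names on each <pitch> element; adjust to match your files.
PITCH_FIELDS = ["px", "pz", "start_speed", "pitch_type"]


def find_xml_files(root_dir):
    """Walk root_dir and yield (relative path parts, full path) per XML file."""
    for folder, _subdirs, files in os.walk(root_dir):
        for name in sorted(files):
            if name.lower().endswith(".xml"):
                full_path = os.path.join(folder, name)
                rel_parts = os.path.relpath(full_path, root_dir).split(os.sep)
                yield rel_parts, full_path


def parse_pitches(path):
    """Return one dict per <pitch> element found anywhere in the file."""
    tree = ET.parse(path)
    return [
        {field: pitch.get(field, "") for field in PITCH_FIELDS}
        for pitch in tree.iter("pitch")
    ]


def export_pitches(root_dir, out_path):
    """Parse every XML file under root_dir into one tab-delimited text file."""
    count = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["source_file"] + PITCH_FIELDS, delimiter="\t"
        )
        writer.writeheader()
        for rel_parts, full_path in find_xml_files(root_dir):
            for row in parse_pitches(full_path):
                # Keep the folder path so year/game can be split out later.
                row["source_file"] = "/".join(rel_parts)
                writer.writerow(row)
                count += 1
    return count
```

Calling `export_pitches("path/to/downloads", "pitches.txt")` returns the number of pitch rows written. At millions of files a single-threaded loop like this will be slow, which is part of why a tool built for batch input is attractive here.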

Visualization work

My goal was to visually investigate whether each pitcher’s pitch locations have a defined “shape”, or whether they all look the same. Thus, I decided to go with a binned heat map to see if any shape came out of the location data. To create the binned heat map in Tableau, I binned both the Px [x] and Pz [y] coordinate fields to my desired aggregation level. I then manually adjusted (can you see me cringe as I type that?!?) the axes and the size of the squares to achieve my desired look and feel. Lastly, I added a small multiple layout and sorted/limited the data to the top 25 pitchers based on earned run average (ERA) for the year. You can reverse engineer all of this from the workbook embedded at the bottom of the post.
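To make the binning step concrete, here is a small Python sketch of fixed-width binning on both coordinates, followed by a count per cell. It illustrates the idea rather than reproducing the workbook: the function names are hypothetical and the bin size of 0.25 is a placeholder, not the aggregation level used in the viz.

```python
import math
from collections import Counter


def bin_value(value, bin_size):
    """Return the lower edge of the fixed-width bin that contains value."""
    return math.floor(value / bin_size) * bin_size


def binned_counts(pitches, bin_size=0.25):
    """Count pitches per (px bin, pz bin) cell of the heat map grid."""
    counts = Counter()
    for px, pz in pitches:
        cell = (bin_value(px, bin_size), bin_value(pz, bin_size))
        counts[cell] += 1
    return counts


# Three made-up pitch locations as (px, pz) pairs.
sample = [(0.1, 2.1), (0.2, 2.2), (-0.1, 2.1)]
heat = binned_counts(sample)
```

Each key in `heat` is one square of the heat map and each value is the number of pitches that landed in it. A smaller `bin_size` gives finer squares but sparser counts, which is the same trade-off as choosing the aggregation level in Tableau.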

What I found is that most pitchers seem to have a fairly consistent angled shape to their pitch location, depending on whether they throw left- or right-handed. This of course can differ slightly from one pitcher to the next. There are also some pitchers, like Clayton Kershaw and Tim Lincecum (early 2010s), whose shape was much less angled and thinner than that of their same-handed counterparts. Here are some examples of this.

Once the pitch location heat map was complete, I decided to supplement it with additional pitch data like speed and movement information (shown in the image above). I utilized a labeling trick that you might have seen before to add the pitchers’ names, numbers and various additional stats that you see above. It is not perfect, but it worked well enough for this project. First we place the mark (a dual axis with circles that are too small and transparent to see), and then we add a separate field with the text to be shown on the label.

I was not thrilled with using only text for these additional data points, because it limited my ability to easily place a pitcher within the population by scanning the detailed numbers across the small multiples. Thus, I set out to add a secondary layer of visuals, settling on an overlay of graphs that appears when the end user selects a specific pitcher. Here is how the overlay looks…

With this approach, I had to cope with the fact that I am covering up two of the five pitchers in each row upon selection. For that I came up with a little trick. Before I get into it, I will say that this method has some issues, including:

  • You are overlaying visuals on top of your page, so you will not be able to leverage tooltips on the underlying viz.

  • I floated the visuals on top, so sizing had to be somewhat fixed to keep all the various dashboard objects lined up well.

  • You have to create two copies of each overlay chart, which could cause additional maintenance.

That being said, I was reasonably happy with how the effect turned out, so here is how I did it. First things first: create the base dashboard, then create an additional sheet for the first overlay. Make sure you are happy with it and that it is complete, then duplicate the overlay sheet, naming one sheet … L and the other … R. Drag, place and size the sheets on top of your dashboard. Your view should look something like this.

Next we are going to create three simple calculated fields to trigger the effect. “Show L” and “Show R” are simply hardcoded values; they could be set to any value as long as they remain in sync with “Show Value”. “Show Value” determines whether the right or left overlay is displayed, depending on which column of the small multiple grid the selected pitcher is in: if the pitcher is in the first 3 columns, show the overlay on the right; otherwise, show the overlay on the left.
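The logic of the three fields can be written out in a few lines of Python. This is a sketch of the decision only, not Tableau calculated-field syntax; the constant values, the function name and the 1-based column index are assumptions made for illustration.

```python
# Stand-ins for the "Show L" and "Show R" fields. The actual values are
# arbitrary; they only need to match what show_value() returns.
SHOW_L = "L"
SHOW_R = "R"


def show_value(column_index, right_overlay_columns=3):
    """Pick the overlay side for a pitcher in the small multiple grid.

    column_index is assumed to be 1-based. Pitchers in the first
    right_overlay_columns columns get the overlay on the right, so it
    does not cover them; everyone else gets it on the left.
    """
    if column_index <= right_overlay_columns:
        return SHOW_R
    return SHOW_L


# Side chosen for each column of a five-column row.
sides = [show_value(col) for col in range(1, 6)]
```

With five pitchers per row, columns 1 to 3 map to the right overlay and columns 4 and 5 map to the left one, so the selected pitcher is never one of the two that get covered.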

Lastly, we are going to add three dashboard actions to implement the effect on our dashboard. Then trigger each one and test that it works correctly.

  • We add a highlight action on the selected pitcher, so the overlaid chart is focused on that pitcher.

  • We add two action filters, one for left and the other for right. They are pretty much identical, except that one filters the left overlay using Show Value -> Show L and the other filters the right overlay using Show Value -> Show R.
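One way to reason about the two action filters is as a simple match between the selected pitcher's Show Value and each overlay's hardcoded key. The Python below is a rough model of that matching, not how Tableau implements actions; the row layout, the pitcher name and the function name are all hypothetical.

```python
def overlay_rows(selected_rows, overlay_key):
    """Mimic one action filter: keep rows whose show_value equals the key."""
    return [row for row in selected_rows if row["show_value"] == overlay_key]


# A made-up selection: one pitcher sitting in one of the first three columns.
selection = [{"pitcher": "Pitcher A", "show_value": "R"}]

left_overlay = overlay_rows(selection, "L")   # no matching rows
right_overlay = overlay_rows(selection, "R")  # receives the selected pitcher
```

In this model only one overlay ever receives rows for a given selection, and the other is left with nothing to draw. That is presumably what makes the unused copy drop out of view on the dashboard.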

Here is the result. I hope you like the viz and find some useful tricks within it as well. Since I was able to scale this with Alteryx, there are several other years on my public page too.

Note: I am just scratching the surface of this dataset and I look forward to digging in more and seeing what others do with it. Please note the license of the dataset if you do decide to download my workbook(s) and do some analysis of your own.