Here we take a high-level look at both analysis and data, and talk about how everything you do in data science falls into one of five main activities.
This is the first in a series of guides that teach Python and data science using NHL data. So far parts 2 and 3 are also out, but I'll probably do more. Enter your email if you want to be notified when these are ready.
Let's get started!
1. Introduction
The Purpose of Data Analysis
The purpose of data analysis is to get interesting or useful insights.
- I'm a sports bettor, what are the odds the St. Louis Blues win the Stanley Cup?
- I'm an NHL GM, which player should I draft?
- I'm a mad scientist, how many more goals does Jaromir Jagr score if he stays in the NHL instead of leaving for the KHL for three seasons?
Data analysis is one (hopefully) accurate and consistent way to get these insights.
Of course, that requires data.
What is Data?
At a very high level, data is a collection of structured information.
You might have data about anything, but let's take a hockey game, say Buffalo vs Pittsburgh, October 3, 2019. What would a collection of structured information about it look like?
Let's start with collection, or "a bunch of stuff." What is a hockey game a collection of? How about shots? This isn't the only acceptable answer — a collection of players, teams, possessions, or periods would fit — but it'll work. A hockey game is a collection of shots. OK.
Now information — what information might we have about each shot in this collection? Maybe: player shooting, distance, type of shot, time in the game, and whether it went in.
Finally, it's structured as a big rectangle with columns and rows. A row is a single item in our collection (a shot in this case). A column is one piece of information (shooter, distance, etc).
This is an efficient, organized way of presenting information. When we want to know, "who took the last shot in the first period, how far was it, and did it go in?", we can find the right row and columns, and say "Oh, Rasmus Ristolainen from 72 feet and no".
shot  name            pos  period  min_left  sec_left    dist  goal
32    Z. Aston-Reese  C    1       3         16         41.77  False
33    Z. Aston-Reese  C    1       2         54         56.60  False
34    D. Simon        C    1       2         45         32.01  False
35    J. Johnson      D    1       2         18         52.34  False
36    B. Tanev        LW   1       0         57         12.72  False
37    R. Ristolainen  D    1       0         23         72.71  False
38    J. McCann       C    2       17        51         20.00  False
39    Z. Girgensons   C    2       17        2          43.04  False
40    C. Miller       D    2       15        59         53.75  False
41    B. Dumoulin     D    2       14        38         37.53  False
42    E. Malkin       C    2       14        10         35.35  True
44    S. Reinhart     C    2       9         43          8.54  False
45    R. Dahlin       D    2       9         13         47.51  False
46    C. Mittelstadt  C    2       9         1          17.02  False
47    B. Tanev        LW   2       8         54        159.05  False
It's common to refer to rows as observations and columns as variables, particularly when using the data for more advanced forms of analysis, like modeling. Other names for this rectangle-like format include tabular data or flat file (because all this info about Sabres-Penguins is flattened out into one big table).
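To make the rows-and-columns idea concrete, here is a minimal sketch in Python using the pandas library. It assumes pandas is installed, copies four rows from the shot table above, and uses variable names (`shots`, `first_period`, `last_shot`) that are just illustrative choices, not anything from the original dataset code.

```python
import pandas as pd

# Four rows copied from the shot table above: each row is one shot
# (an observation), each column is one piece of information (a variable).
shots = pd.DataFrame(
    {
        "name": ["Z. Aston-Reese", "B. Tanev", "R. Ristolainen", "J. McCann"],
        "pos": ["C", "LW", "D", "C"],
        "period": [1, 1, 1, 2],
        "min_left": [3, 0, 0, 17],
        "sec_left": [16, 57, 23, 51],
        "dist": [41.77, 12.72, 72.71, 20.00],
        "goal": [False, False, False, False],
    },
    index=pd.Index([32, 36, 37, 38], name="shot"),
)

# Keep only the rows (shots) from the first period.
first_period = shots[shots["period"] == 1]

# The last shot of a period is the one with the least time left on the clock.
time_left = first_period["min_left"] * 60 + first_period["sec_left"]
last_shot = first_period.loc[time_left.idxmin()]

# Pick out the columns we care about: who, how far, and did it go in.
print(last_shot["name"], last_shot["dist"], last_shot["goal"])
```

This answers the "who took the last shot in the first period" question the same way you would by eye: find the right row, then read off the right columns.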
What is Analysis?
How many hockey pucks are in the following picture?
Pretty easy question right? What about this one?
Researchers have found that humans automatically know how many objects they're seeing, as long as there are no more than three or four. Any more than that, and counting is required.
If you open up the table of shot data the above comes from, you'd see it's 9347 rows and 26 columns.
From that, do you think you would be able to glance at it and tell me who the "best" player was? Worst? Most consistent or unluckiest? Of course not.
Raw data is the numerical equivalent of a pile of pucks. It's a collection of facts, far more than the human brain can reliably and accurately make sense of, and meaningless without some work.
Data analysis is the process of transforming this raw data to something smaller and more useful you can fit in your head.
Types of Data Analysis
Broadly, it is useful to think of two types of analysis, both of which reduce a pile of data to a smaller, more manageable number of insights.
- Single-number summary statistics.
- Models that help us understand relationships between data.
Summary Statistics
Summary statistics can be complex (Goalie Point Shares, Strength of Schedule) or more basic (goals scored or games missed due to injury), but all of them involve going from raw data to some more useful number.
Stats don't necessarily need fancy acronyms to be useful. Take this player data:
name         pos  team  ht_cm    wt_kg
Z. Chara     D    BOS   205.74  113.39
J. Thornton  C    SJS   193.04   99.79
P. Marleau   C    SJS   187.96   97.52
R. Hainsey   D    OTT   190.50   95.25
J. Williams  RW   CAR   185.42   85.27
...          ..   ...   ...        ...
I. Mikheyev  RW   TOR   190.50   88.45
A. Wedin     LW   CHI   180.34   88.45
J. Nygard    LW   EDM   182.88   81.19
J. Lilja     LW   CBJ   185.42   88.90
G. Haas      C    EDM   182.88   82.10
What "statistic" might we use to understand the physical characteristics of NHL players?
How about the average (or the median, or mode, or a series of percentiles)?
ht_cm 185.56
wt_kg 90.13
(That is, about 6'1", 199 lbs.)
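As a rough sketch of how an average like this could be computed, the snippet below uses pandas (assumed to be installed) on only the ten players visible in the table above. Because the full dataset has many more players, the averages it prints will differ from the league-wide 185.56 and 90.13. The helper `to_imperial` is a hypothetical function written for this example; it shows the arithmetic behind the feet, inches, and pounds conversion.

```python
import pandas as pd

# Only the ten rows shown in the table above, not the full player data.
players = pd.DataFrame(
    {
        "name": ["Z. Chara", "J. Thornton", "P. Marleau", "R. Hainsey",
                 "J. Williams", "I. Mikheyev", "A. Wedin", "J. Nygard",
                 "J. Lilja", "G. Haas"],
        "ht_cm": [205.74, 193.04, 187.96, 190.50, 185.42,
                  190.50, 180.34, 182.88, 185.42, 182.88],
        "wt_kg": [113.39, 99.79, 97.52, 95.25, 85.27,
                  88.45, 88.45, 81.19, 88.90, 82.10],
    }
)

# One summary statistic per column: the mean height and the mean weight.
averages = players[["ht_cm", "wt_kg"]].mean()
print(averages.round(2))


def to_imperial(ht_cm, wt_kg):
    """Convert centimetres and kilograms to (feet, inches, pounds), rounded."""
    total_inches = round(ht_cm / 2.54)        # 1 inch = 2.54 cm
    feet, inches = divmod(total_inches, 12)   # 12 inches per foot
    pounds = round(wt_kg * 2.20462)           # 1 kg is about 2.20462 lb
    return feet, inches, pounds


# The league-wide averages quoted in the text.
print(to_imperial(185.56, 90.13))  # (6, 1, 199)
```

Swapping `.mean()` for `.median()` or `.quantile(0.9)` gives the other summary statistics mentioned above.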
The main goal of these single number summary statistics is usually to summarize and make sense of some data, e.g. when deciding who should win the Hart Trophy or arguing online about who contributed the most to the Avalanche's Stanley Cup win.
Stats also vary in scope. Some, like points or plus-minus, are more all-encompassing, while others get at a particular facet of a player's performance. For example, we might measure "power" by looking at a player's longest goal or shot speed, or "clutchness" by looking at number of game-winning goals.
A key skill in data analysis is knowing how to look at data multiple ways via different summary statistics, keeping in mind their strengths and weaknesses.
Modeling
The other broad type of analysis is modeling. A model describes a mathematical relationship between variables in your data, specifically the relationship between one or more input variables and one output variable.
output variable = model(input variables)
This is called "modeling output variable as a function of input variables".
How do we find these relationships and actually "do" modeling in practice?
When working with flat, rectangular data, variable is just another word for column. In practice, modeling is making a dataset where the columns are your input variables and one output variable, then passing this data (with information about which columns are which) to your modeling program.
This is important, so let me emphasize it. In practice:
Modeling is making a dataset where the columns are your input variables and one output variable, then passing this data (with information about which columns are which) to your modeling program.
That's why most data scientists and modelers spend most of their time collecting and manipulating data. Getting your inputs and output together in a dataset that your modeling program can accept is most of the work.
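Here is a minimal sketch of what "a dataset where the columns are your input variables and one output variable" can look like in pandas (assumed installed). The numbers are made up purely for illustration and are not real NHL data, and the column names (`preseason_goals`, `preseason_shots`, `regular_season_goals`) are hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration only (NOT real NHL data).
# One row per player; two input columns and one output column.
data = pd.DataFrame(
    {
        "preseason_goals": [0, 1, 2, 3, 4],
        "preseason_shots": [5, 9, 12, 15, 20],
        "regular_season_goals": [4, 11, 18, 22, 31],
    }
)

# "Information about which columns are which":
input_cols = ["preseason_goals", "preseason_shots"]
output_col = "regular_season_goals"

# Most modeling programs want the inputs and the output handed over separately.
X = data[input_cols]   # input variables, one column each
y = data[output_col]   # the single output variable

print(X.shape, y.shape)  # (5, 2) (5,)
```

Everything before the last few lines is data preparation, which is the point of the paragraph above: once `X` and `y` exist, handing them to a modeling program is usually the short part.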
Why Model?
Though all models describe some relationship, why you might want to analyze a relationship depends on the situation. Sometimes, it's because models help make sense of things that have already happened.
For example, modeling a goalie's saves as a function of the player, defenders and quality of opponent would be one way to figure out which goalie is the most "skilled", by separating talent from other factors outside of their control.
Then as an analyst for an NHL team assigned to scout free agent goalies, I can report back to my GM on who my model thinks is truly the best, the one who will make the most saves with our defenders and the opponents we play.
Models that Predict the Future
Often, we want to use a model to predict the future.
For example, say I'm writing this on the eve of the 2022 season opener. I have everyone's preseason stats, and I want to use them to predict goals scored in 2022.
Modeling is about relationships. In this case the relationship is between data I have now (preseason stats for this year) and events that will happen in the future (regular season goals).
But if something is in the future, how can we relate it to the present?
The answer: by starting with the past. If I'm writing this in October 2022, I have data on 2021. And I could build a model:
player 2021 regular season goals scored = model(player 2021 preseason goals scored)
Training (or fitting) this model is the process of using that known/existing/already happened data to find a relationship between the input variables (2021 preseason goals scored) and the output variable (2021 regular season goals scored).
Once I establish that relationship, I can feed it new inputs (say, 2 preseason goals scored) and transform them using my relationship to get back a prediction for regular season goals.
The inputs I feed my model might be from events that have already happened. Often this is done to evaluate model performance. For example, I could put in Sidney Crosby's 2021 preseason stats to see what the model would have predicted for 2021, even though I already know how he did (hopefully it's close).
Alternatively, I can feed it data from right now in order to predict things that haven't happened yet. For example — again, say I'm writing this on the eve of the season opener in 2022 — I can put in Connor McDavid's pre-season goals scored and get back a projection for the regular season.
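The train-then-predict idea can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the training numbers are made up and deliberately tidy (not real NHL data), the "model" is a simple straight-line least-squares fit chosen for brevity, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input value."""
    slope, intercept = model
    return intercept + slope * x


# Made-up "past" data (NOT real NHL data): one entry per player.
preseason_goals_2021 = [0, 1, 2, 3, 4]        # input variable
regular_goals_2021 = [5, 10, 15, 20, 25]      # output variable

# Training: use data that has already happened to find the relationship.
model = fit_line(preseason_goals_2021, regular_goals_2021)

# Prediction: feed in a new input, e.g. a player with 2 preseason goals.
print(predict(model, 2))  # 15.0
```

Feeding the fitted model a 2021 input checks it against a known outcome, while feeding it a 2022 preseason number gives a projection for games that have not been played yet.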
High Level Data Analysis Process
Now that we've covered both the inputs (data) and final outputs (analytical insights), let's take a very high level look at what's in between.
Everything in data science falls somewhere in one of the following steps:
1. Collecting Data
Whether you scrape a website, connect to a public API, download some spreadsheets, or enter it yourself, you can't do data analysis without data. The first step is getting ahold of some.
This book covers how to scrape a website and get data by connecting to an API. It also suggests a few ready-made datasets.
2. Storing Data
Once you have data, you have to put it somewhere. This could be in several spreadsheet or text files in a folder on your desktop, sheets in an Excel file, or a database.
This book covers the basics and benefits of storing data in a SQL database.
3. Loading Data
Once you have your data stored, you need to be able to retrieve the parts you want. This can be easy if it's in a spreadsheet, but if it's in a database then you need to know some SQL — pronounced "sequel" and short for Structured Query Language — to get it out.
This book covers basic SQL and loading data with Python.
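As a small taste of steps 2 and 3 together, the sketch below stores a few shots in a SQL database and then loads back only the part we want. It uses Python's built-in `sqlite3` module with an in-memory database. The table name `shots` and its schema are assumptions made for this example, and the three rows are copied from the shot table earlier in this post.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Storing: create a table and insert a few rows.
# SQLite has no boolean type, so goal is stored as 0 or 1.
conn.execute(
    "CREATE TABLE shots (name TEXT, period INTEGER, dist REAL, goal INTEGER)"
)
conn.executemany(
    "INSERT INTO shots VALUES (?, ?, ?, ?)",
    [
        ("R. Ristolainen", 1, 72.71, 0),
        ("J. McCann", 2, 20.00, 0),
        ("E. Malkin", 2, 35.35, 1),
    ],
)
conn.commit()

# Loading: use SQL to pull out only the rows and columns we care about.
goals = conn.execute(
    "SELECT name, dist FROM shots WHERE goal = 1"
).fetchall()
conn.close()

print(goals)  # [('E. Malkin', 35.35)]
```

The `SELECT ... WHERE` line is the "retrieve the parts you want" step: the database holds every shot, but only the goals come back into Python.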
4. Manipulating Data
Talk to any data scientist, and they'll tell you they spend most of their time preparing and manipulating their data. Hockey data is no exception. Sometimes called munging, this means getting your raw data in the right format for analysis.
There are many tools available for this step. Examples include Excel, R, Python, Stata, SPSS, Tableau, SQL, and Hadoop. In this book you'll learn how to do it in Python, particularly using the library Pandas.
The boundaries between this step and the ones before and after it can be a little fuzzy. For example, though we won't do it in this book, it is possible to do some basic manipulation in SQL. In other words, loading (3) and manipulating (4) data can be done with the same tools. Similarly Pandas — the primary tool we'll use for data manipulation (4) — also includes basic functionality for analysis (5) and input-output capabilities (3).
Don't get too hung up on this. The point isn't to say, "this technology is always associated with this part of the analysis process". Instead, it's a way to keep the big picture in mind as you are working through the book and your own analysis.
5. Analyzing Data for Insights
This step is the model, summary stat or plot that takes you from formatted data to insight.
This book covers a few different analysis methods, including summary stats, a few modeling techniques, and data visualization.
We will do these in Python using the scikit-learn, statsmodels, and matplotlib libraries, which cover machine learning, statistical modeling and data visualization respectively.
Connecting the High Level Analysis Process to Everything Else
Again, all of data analysis falls into one of the five sections above. Throughout these posts, I'll tie back what you are learning to this section so you can keep sight of the big picture.
This is the forest. If you ever find yourself banging your head against a tree — either confused or wondering why we're talking about something — refer back here and think about where it fits in.
Thanks for reading — next up in Part 2 we'll get into some actual coding with Python.