Learn Python with Hockey - Part 2

This is the second in a series of posts that teach Python and data science using NHL data.

This post is where we get into some actual coding. It's really meant to be followed along with, but you're welcome to read through it too.

All the Python code we'll write later builds on the concepts covered here. Since we'll be using Python for nearly everything, this section touches every part of the high-level, five-step data analysis process.

Note: there will likely be more posts like this coming so if you want to be notified when new guides like this come out you can enter your email here:

Let's get started!

Python

Much of the functionality in Python comes from third-party libraries (or packages) designed for specific tasks.

For example, the Pandas library lets us manipulate data, and the BeautifulSoup library is a popular choice for parsing data scraped from websites.

But even when using third-party packages, you will also rely on a core set of Python features and functionality. These features, the core language and its standard library, are built into Python.

Following Along

All these posts are heavy on examples and meant to be followed along with. If you haven't already, install Spyder (a free program to write and run Python code) and open it up.

Then go to:

https://raw.githubusercontent.com/nathanbraun/learn-python-hockey/main/code/part2.py

And copy and paste everything that's there into Spyder (either temp.py or a new file). Then make sure the bottom right tab is set to 'IPython Console'. You should see something like this:

[Screenshot: Ready to go in Spyder]

Ideally, you have this guide up on one monitor and Spyder (with this code) on another. Then you can follow along and run the code (highlight the line or lines you want and press F9 to send them to the REPL/console) as we work through it.

If you do that, I've included what you'll see in the REPL here. That is:

In [1]: 1 + 1
Out[1]: 2

Here, the line starting with In [1] is the code you run, and Out[1] is what the REPL prints out. These are lines [1] for me because this was the first thing I entered in a new REPL session.

Don't worry if the numbers you see in In[ ] and Out[ ] don't match exactly what I'm showing here. In fact, they probably won't, because as you run the examples you should be exploring and experimenting. That's what the REPL is for.

Nor should you worry about messing anything up: if you need a fresh start, you can type reset into the REPL and it will clear out everything you've run previously. You can also type clear to clear all the printed output.

Sometimes, examples build on each other (remember, the REPL keeps track of what you've run previously), so if something isn't working, it might be relying on code you haven't run yet.

Important Parts of the Python Standard Library

Comments

As you look at part2.py you might notice a lot of lines beginning with #. These are comments. When reading your code, the computer will ignore everything from # to the end of the line.

Comments exist in all programming languages. They are a way to explain to anyone reading your code (including your future self) more about what's going on and what you were trying to do when you wrote it.

The problem with comments is it's easy for them to become out of date. This happens when you change your code and forget to update the comment.

An incorrect or misleading comment is worse than no comment. For that reason, most beginning programmers probably comment too often, especially because Python's syntax (the language related rules for writing programs) is usually pretty clear.

For example, this would be an unnecessary comment:

# print the result of 1 + 1
print(1 + 1)

That comment doesn't add anything that isn't obvious from the code itself. It's better to use descriptive names, let your code speak for itself, and save comments for particularly tricky portions of code.
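To make that concrete, here is a small sketch contrasting a redundant comment with one that earns its place. The variable names and numbers are made up for illustration; they aren't from part2.py.

```python
# Redundant: the comment just repeats what the code already says
# add 1 and 2
x = 1 + 2

# Better: descriptive names make a comment unnecessary
power_play_goals = 1
even_strength_goals = 2
total_goals = power_play_goals + even_strength_goals

# Worth a comment: the "why" isn't obvious from the code alone
# (assumption for this example: a shootout win adds one goal to the final score)
shootout_bonus = 1
final_score = total_goals + shootout_bonus

print(final_score)  # 4
```

The last comment explains a rule a reader couldn't guess from the arithmetic, which is the kind of thing comments are good for.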

Variables

Variables are a fundamental concept in any programming language.

At their core, variables are just named pieces of information. This information can be anything from a single number to an entire dataset — the point is that they let you store and recall things easily.

The rules for naming variables differ by programming language. In Python, they can be any upper or lowercase letter, number or _ (underscore), but they can't start with a number.
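If you're ever unsure whether a name follows these rules, one way to check is the string method isidentifier. This is just a quick sketch; the candidate names below are invented for the example.

```python
# candidate variable names (made up for this example)
candidates = [
    'goals_scored',   # letters and underscore: fine
    'period2_shots',  # numbers are fine as long as they aren't first
    '_private',       # a leading underscore is allowed
    '2nd_period',     # starts with a number: not allowed
    'goals-scored',   # a hyphen isn't a letter, number or underscore
]

# map each name to True/False depending on whether Python accepts it
valid = {name: name.isidentifier() for name in candidates}

for name, ok in valid.items():
    print(name, ok)
```

Note that isidentifier only checks the characters. Reserved words like class or if pass this check but still can't be used as variable names.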

Assigning data to variables

You assign a piece of data to a variable with an equals sign, like this:

In [1]: goals_scored = 2

Now, whenever you use goals_scored in your code, the program automatically substitutes it with 2 instead.

In [2]: goals_scored
Out[2]: 2

In [3]: 3*goals_scored
Out[3]: 6

One of the benefits of developing with a REPL is that you can type in a variable, and the REPL will evaluate (i.e. determine what it is) and print it. That's what the code above is doing. But note while goals_scored is 2, the assignment statement itself, goals_scored = 2, doesn't evaluate to anything, so the REPL doesn't print anything out.

You can update and override variables too. Going into the code below, goals_scored has a value of 2 (from the code we just ran above). So the right hand side, goals_scored + 1 is evaluated first (2 + 1 = 3), and then the result gets (re)assigned to goals_scored, overwriting the 2 it held previously.

In [4]: goals_scored = goals_scored + 1

In [5]: goals_scored
Out[5]: 3
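Updating a variable based on its current value is common enough that Python has a shorthand for it, the += operator. It isn't used in the examples above, so treat this as an optional aside:

```python
goals_scored = 2

# long form: evaluate the right hand side first, then reassign
goals_scored = goals_scored + 1

# shorthand that does the same kind of update
goals_scored += 1

print(goals_scored)  # 4
```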

Types

Like Excel, Python includes concepts for both numbers and text. Technically, Python distinguishes between two types of numbers: integers (whole numbers) and floats (numbers that may have decimal points), but the difference isn't important for us right now.

In [6]: penalty_minutes = 15  # int
In [7]: puck_speed = 82.5  # float
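Even though the int vs float distinction rarely matters in practice, it can help to see how the two interact. A short sketch using the same two variables:

```python
penalty_minutes = 15  # int
puck_speed = 82.5     # float

# mixing an int and a float gives a float
total = penalty_minutes + puck_speed

# regular division always returns a float, even when it divides evenly
per_period = penalty_minutes / 3

# floor division between two ints returns an int
whole_periods = penalty_minutes // 3

print(total, type(total))                  # 97.5 <class 'float'>
print(per_period, type(per_period))        # 5.0 <class 'float'>
print(whole_periods, type(whole_periods))  # 5 <class 'int'>
```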

Text, called a string in Python, is wrapped in either single (') or double (") quotes. I usually just use single quotes, unless the text itself contains a single quote (like O'Reilly). In that case 'Ryan O'Reilly' would give an error, so I wrap it in double quotes instead.

In [8]: starting_lw = 'David Perron'
In [9]: starting_c = "Ryan O'Reilly"  # this works
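Switching to double quotes is the approach used here. Another option, shown only as a sketch, is to keep the single quotes and escape the apostrophe with a backslash. Both produce the same string:

```python
double_quoted = "Ryan O'Reilly"   # double quotes on the outside
escaped = 'Ryan O\'Reilly'        # single quotes, apostrophe escaped with \

print(double_quoted == escaped)   # True
print(escaped)                    # Ryan O'Reilly
```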

You can check the type of any variable with the type function.

In [10]: type(starting_lw)
Out[10]: str

In [11]: type(penalty_minutes)
Out[11]: int

Keep in mind the difference between strings (quotes) and variables (no quotes). A variable is a name for a piece of information. A string (or a number) is the information itself.

One common thing to do with strings is to insert variables inside of them. The easiest way to do that is via f-strings.

In [12]: starters = f'{starting_c}, {starting_lw}, etc.'

In [13]: starters
Out[13]: "Ryan O'Reilly, David Perron, etc."

Note the f immediately preceding the quotation mark. Adding that tells Python you want to use variables inside your string, which you wrap in curly brackets.
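The curly brackets aren't limited to plain variable names. They can hold any expression, and you can add a format after a colon. Here's a sketch with made-up goal and shot totals:

```python
starting_c = "Ryan O'Reilly"
goals = 2   # made-up numbers for the example
shots = 7

# {goals / shots:.1%} computes the ratio and formats it as a percentage
# with one decimal place
summary = f'{starting_c}: {goals} goals on {shots} shots ({goals / shots:.1%})'

print(summary)  # Ryan O'Reilly: 2 goals on 7 shots (28.6%)
```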

f-strings were added in Python 3.6, so if they're not working for you make sure you're using at least that version.

Strings also have useful methods you can use to do things to them. You invoke methods with a . and parentheses. For example, to make a string uppercase you can do:

In [14]: 'do you believe in miracles!?'.upper()
Out[14]: 'DO YOU BELIEVE IN MIRACLES!?'

Note the parentheses, e.g. upper(). They're there because methods sometimes take additional data. For example, the replace method takes two strings: the one you want to replace, and what you want to replace it with:

In [15]: 'Bernie Geoffrion'.replace('Bernie', 'Boom Boom')
Out[15]: 'Boom Boom Geoffrion'
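Because each of these methods returns a new string, you can call another method on the result right away. A minimal sketch (the lowercase starting value is invented for the example):

```python
raw_name = 'bernie geoffrion'

# title() capitalizes each word, then replace() swaps in the nickname
nickname = raw_name.title().replace('Bernie', 'Boom Boom')

print(nickname)  # Boom Boom Geoffrion
print(raw_name)  # bernie geoffrion (the original string is unchanged)
```

Note the original string is left alone. String methods give you back a new string rather than modifying the one you called them on.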

There are a bunch of these string methods, most of which you won't use that often. Going through them all right now would bog down progress on more important things. But occasionally you will need one of these string methods. How should we handle this?

The problem is we're dealing with a comprehensiveness-clarity trade off. And, since anything short of Python in a Nutshell: A Desktop Quick Reference (which is 772 pages) is going to necessarily fall short on comprehensiveness, we'll do something better.

Rather than teaching you all of Python's 40-plus string methods (the exact count depends on your Python version), I am going to teach you how to quickly see which are available, what they do, and how to use them.

Though we're nominally talking about string methods here, this advice applies to any of the programming topics we'll cover in this series.

Interlude: How to Figure Things Out in Python

"A simple rule I taught my nine year-old today: if you can't figure something out, figure out how to figure it out." — Paul Graham

The first tool you can use to figure out your options is the REPL. In particular, the REPL's tab completion functionality. Type in a string like 'sidney crosby' then . and hit tab. You'll see all the options available to you (this is only the first page, you'll see more if you keep pressing tab).

'sidney crosby'.
    capitalize()   encode()       format()       
    isalpha()      isidentifier() isspace()      
    ljust()        casefold()     endswith()    
    format_map()   isascii()      islower()

Note: tab completion on a string directly like this doesn't always work in Spyder. If it's not working for you, assign 'sidney crosby' to a variable and tab complete on that. Like this:

In [16]: foo = 'sidney crosby'

foo.
    capitalize()   encode()       format()
    isalpha()      isidentifier() isspace()
    ljust()        casefold()     endswith()
    format_map()   isascii()      islower()

Then, when you find something you're interested in, enter it in the REPL with a question mark after it, like 'sidney crosby'.capitalize? (or foo.capitalize? if you're doing it that way).

You'll see:

Signature: str.capitalize(self, /)
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and
the rest lower case.

So, in this case, it sounds like capitalize will make the first letter uppercase and the rest of the string lowercase. Let's try it:

In [17]: 'sidney crosby'.capitalize()
Out[17]: 'Sidney crosby'

Great. Many of the items you'll be working with in the REPL have methods, and tab completion is a great way to explore what's available.
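Tab completion and ? are features of the IPython console. If you ever find yourself in a plain Python session without them, the built-in dir function and the __doc__ attribute give you roughly the same information. A sketch:

```python
# every public string method: names from dir() that don't start with _
string_methods = [name for name in dir(str) if not name.startswith('_')]

# the exact count depends on your Python version
print(len(string_methods))
print(string_methods[:5])

# the same documentation text that ? shows in the REPL
print(str.capitalize.__doc__)
```

You can also run help(str.capitalize) for a slightly more formatted version of the same documentation.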

The second strategy is more general. Maybe you want to do something that you know is string related but aren't necessarily sure where to begin or what it'd be called.

For example, maybe you've scraped some data that looks like:

In [18]: '  sidney crosby'

But you want it to be like this, i.e. without the spaces before "sidney":

In [19]: 'sidney crosby'

Here's what you should do — and I'm not trying to be glib here — Google: "python string get rid of leading white space".

When you do that, you'll see the first result is from stackoverflow and says:

"The lstrip() method will remove leading whitespaces, newline and tab characters on a string beginning."

A quick test confirms that's what we want.

In [20]: ('  sidney crosby'
  .lstrip())
Out[20]: 'sidney crosby'
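Once you've found lstrip, tab completing or searching a bit further turns up its siblings rstrip and strip. A quick sketch of all three on a made-up scraped value with whitespace on both ends:

```python
scraped = '  sidney crosby \n'  # invented example with junk on both sides

left_cleaned = scraped.lstrip()   # removes leading whitespace only
right_cleaned = scraped.rstrip()  # removes trailing whitespace only
cleaned = scraped.strip()         # removes both

print(repr(left_cleaned))   # 'sidney crosby \n'
print(repr(right_cleaned))  # '  sidney crosby'
print(repr(cleaned))        # 'sidney crosby'
```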

Stackoverflow

Python — particularly the data libraries we'll be using — became popular during the golden age of stackoverflow.com, a programming question and answer site that specializes in answers to small, self-contained technical problems.

How it works: people ask questions related to programming, and other, more experienced programmers answer. The rest of the community votes, both on questions ("that's a very good question, I was wondering how to do that too") as well as answers ("this solved my problem perfectly"). In that way, common problems and the best solutions rise to the top over time. Add in Google's search algorithm, and you usually have a way to figure out exactly how to do most anything you'll want to do in a few minutes.

You don't have to ask questions yourself or vote or even make a stackoverflow account to get the benefits. Most people probably don't. But enough people do, especially when it comes to Python, that it's a great resource.

If you're used to working like this, this advice may seem obvious. Like I said, I don't mean to be glib. Instead, it's intended for anyone who might mistakenly believe "real" coders don't Google things.

As programmer-blogger Umer Mansoor writes,

Software developers, especially those who are new to the field, often ask this question... Do experienced programmers use Google frequently?

The resounding answer is YES, experienced (and good) programmers use Google... a lot. In fact, one might argue they use it more than the beginners. [that] doesn't make them bad programmers or imply that they cannot code without Google. In fact, truth is quite the opposite: Google is an essential part of their software development toolkit and they know when and how to use it.

A big reason to use Google is that it is hard to remember all those minor details and nuances especially when you are programming in multiple languages... As Einstein said: 'Never memorize something that you can look up.'

Now you know how to figure things out in Python. Back to the basics.

Bools

There are other data types besides strings and numbers. One of the most important ones is bool (for boolean).

Booleans, which exist in nearly every language, are for binary, yes-or-no, true-or-false data. While a string can have an almost unlimited number of different values, and an integer can be any whole number, bools in Python have only two possible values: True or False.

Similar to variable names, bool values lack quotes. So "True" is a string, not a bool.
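You can confirm the difference with the type function. The variable names here are made up for the example:

```python
is_home_game = True       # no quotes: a bool
looks_like_bool = 'True'  # quotes: a string that happens to spell True

print(type(is_home_game))     # <class 'bool'>
print(type(looks_like_bool))  # <class 'str'>

# a bool and a string are never equal, even if they look alike when printed
same = is_home_game == looks_like_bool
print(same)  # False
```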

An expression evaluates to a bool when it produces yes-or-no data, such as the result of a comparison. For example:

# some numbers to use in our examples
In [21]: team1_goals = 2
In [22]: team2_goals = 1

# these are all bools
In [23]: team1_won = team1_goals > team2_goals

In [24]: team2_won = team2_goals > team1_goals

In [25]: teams_tied = team1_goals == team2_goals

In [26]: teams_did_not_tie = team1_goals != team2_goals

In [27]: type(team1_won)
Out[27]: bool

In [28]: teams_did_not_tie
Out[28]: True

Notice the == by teams_tied. That tests for equality. It's the double equals sign because — as we learned above — Python uses the single = to assign to a variable. This would give an error:

In [29]: teams_tied = (team1_goals = team2_goals)  
...
SyntaxError: invalid syntax

So team1_goals == team2_goals will be True if those numbers are the same, False if not.

The reverse is !=, which means not equal. The expression team1_goals != team2_goals is True if the values are different, False if they're the same.
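These comparisons aren't limited to integers. Here's a short sketch showing == and != on numbers and on strings (the team name is just an example value):

```python
team1_goals = 2
team2_goals = 1

tied = team1_goals == team2_goals      # False, 2 and 1 are different
not_tied = team1_goals != team2_goals  # True

# == works on strings too, and it is case sensitive
exact_match = 'Blues' == 'blues'                    # False
fixed_match = 'Blues' == 'blues'.capitalize()       # True

# an int and a float with the same value compare as equal
int_equals_float = 2 == 2.0                         # True

print(tied, not_tied, exact_match, fixed_match, int_equals_float)
```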

You can manipulate bools (i.e. chain them together or negate them) using the keywords and, or, and not, along with parentheses.

In [30]: shootout = (team1_goals > 4) and (team2_goals > 4)

In [31]: at_least_one_good_team = ((team1_goals > 4) or
                                   (team2_goals > 3))

In [32]: you_guys_are_bad = not ((team1_goals > 1) or
                                 (team2_goals > 1))

In [33]: meh = not (shootout or
                    at_least_one_good_team or
                    you_guys_are_bad)
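It's worth working out what each of these evaluates to. With the goal totals from earlier (team 1 scored 2, team 2 scored 1), a self-contained version looks like this:

```python
team1_goals = 2
team2_goals = 1

# and: True only if BOTH sides are True
shootout = (team1_goals > 4) and (team2_goals > 4)

# or: True if AT LEAST ONE side is True
at_least_one_good_team = (team1_goals > 4) or (team2_goals > 3)

# not: flips True to False and vice versa
# here the inner or is True (team 1 scored more than 1), so not makes it False
you_guys_are_bad = not ((team1_goals > 1) or (team2_goals > 1))

# all three above are False, so the or chain is False and not flips it to True
meh = not (shootout or at_least_one_good_team or you_guys_are_bad)

print(shootout, at_least_one_good_team, you_guys_are_bad, meh)
# False False False True
```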

if statements

Bools are used frequently; one place is with if statements. The following code assigns a string to a variable message depending on what happened.

In [34]: 
if team1_won:
  message = "Nice job team 1!"
elif team2_won:
  message = "Way to go team 2!!"
else:
  message = "must have tied!"

In [35]: message
Out[35]: 'Nice job team 1!'

Notice how in the code I'm saying if team1_won, not if team1_won == True. While the latter would technically work, it's a good way to show anyone looking at your code that you don't really understand bools. team1_won is True, it's a bool. team1_won == True is also True, and it's still a bool. Similarly, don't write team1_won == False, write not team1_won.
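To see the other branches in action, here is the same if statement run on a tie. The 3-3 score is made up for the example. Python checks the conditions from top to bottom and runs only the first branch that matches, falling through to else if none do.

```python
team1_goals = 3  # made-up score for this example
team2_goals = 3

team1_won = team1_goals > team2_goals
team2_won = team2_goals > team1_goals

if team1_won:
    message = "Nice job team 1!"
elif team2_won:
    message = "Way to go team 2!!"
else:
    # neither condition above was True
    message = "must have tied!"

print(message)  # must have tied!

# use not rather than comparing to False
if not team1_won:
    print("team 1 did not win")
```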

A boolean, like a string or number, is a basic building-block type. In part 3 we'll expand on this with some container types.

Thanks for reading