Learn Python with Hockey - Part 3

This is the third in a series of posts that teach Python and data science using NHL data.

Let's get started!

Following Along

All these posts are heavy on examples and meant to be followed along with. Like last time (assuming you've got everything installed) go to:

https://raw.githubusercontent.com/nathanbraun/learn-python-hockey/main/code/part3.py

And copy and paste everything that's there into Spyder (either temp.py or a new file).

Container Types

Last time we talked about basic Python types like strings, numbers and bools. These are called primitives; they're the basic building block types.

There are other container types that can hold other values. Two important container types are lists and dicts. Sometimes containers are also called collections.

Lists

Lists are built with square brackets and are basically a simple way to hold other, ordered pieces of data.

In [1]: first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

Every spot in a list has a number (its index) associated with it. The first spot is 0. You can get sections (called slices) of your list by separating numbers with a colon. Both single numbers and slices go inside square brackets, i.e. [].

A single integer inside a bracket returns one element of your list, while a slice returns a smaller list. Note a slice returns up to the last number, so [0:2] returns the 0 and 1 items, but not item 2.

In [2]: first_line[0]
Out[2]: 'alex ovechkin'

In [3]: first_line[0:2]
Out[3]: ['alex ovechkin', 'nicklas backstrom']

Negative numbers count from the end of the list, so -1 is the last item. To get the last two items you could do:

In [4]: first_line[-2:]
Out[4]: ['nicklas backstrom', 'anthony mantha']

Also note how when you leave off the number after the colon the slice will automatically use the end of the list.

Lists can hold anything, including other lists. Lists that hold other lists are often called nested lists.
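
As a minimal sketch of a nested list (the lines and second_line variables and the placeholder player names are made up for this example and aren't part of the original code), indexing twice picks an item out of an inner list:

```python
# hypothetical example: a list of lines, where each line is itself a list
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
second_line = ['player a', 'player b', 'player c']  # placeholder names

lines = [first_line, second_line]

# the first index picks the inner list, the second picks an item inside it
center_on_first_line = lines[0][1]
last_player_on_second_line = lines[1][-1]

print(center_on_first_line)        # nicklas backstrom
print(last_player_on_second_line)  # player c
```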

Dicts

A dict is short for dictionary. You can think about it like an actual dictionary if you want. Real dictionaries have words and definitions, Python dicts have keys and values.

Dicts are basically a way to hold data and give each piece a name. They're written with curly brackets, like this:

In [5]: 
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}

You can access items in a dict like this:

In [6]: first_line_dict['lw']
Out[6]: 'alex ovechkin'

And add new things to dicts like this:

In [7]: first_line_dict['rd'] = 'john carlson'

In [8]: first_line_dict
Out[8]:
{'lw': 'alex ovechkin',
 'c': 'nicklas backstrom',
 'rw': 'anthony mantha',
 'rd': 'john carlson'}

Notice how keys are strings (they're surrounded in quotes). They can also be numbers or even bools. They cannot be a variable that has not already been created. You could do this:

In [9]: pos = 'lw'

In [10]: first_line_dict[pos]
Out[10]: 'alex ovechkin'

Because when you run it Python is just replacing pos with 'lw'.

But you will get an error if pos is undefined. You also get an error if you try to use a key that's not present in the dict (note: assigning something to a key that isn't there yet — like we did with 'john carlson' above — is OK).

While dictionary keys are usually strings, dictionary values can be anything, including lists or other dicts.
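
Here is a short sketch of both points: what happens with a missing key, and a dict whose values are themselves containers. The team dict and the 'g' key are made up for this example.

```python
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}

# `in` checks whether a key is present without raising an error
has_goalie = 'g' in first_line_dict

# looking up a key that isn't there raises a KeyError
try:
    first_line_dict['g']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# values can be containers too: here one value is a list and one is a dict
# (hypothetical structure, made up for this example)
team = {'forwards': ['alex ovechkin', 'nicklas backstrom', 'anthony mantha'],
        'first_line': first_line_dict}

print(has_goalie)               # False
print(lookup_failed)            # True
print(team['forwards'][0])      # alex ovechkin
print(team['first_line']['c'])  # nicklas backstrom
```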

Unpacking

Now that we've seen an example of container types, we can mention unpacking. Unpacking is a way to assign multiple variables at once, like this:

In [11]: lw, c = ['alex ovechkin', 'nicklas backstrom']

That does the exact same thing as assigning these separately on their own lines.

In [12]: lw = 'alex ovechkin'

In [13]: c = 'nicklas backstrom'

One pitfall when unpacking values is that the number of whatever you're assigning to has to match the number of values available in your container. This would give you an error:

In [14]: lw, c = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
...
ValueError: too many values to unpack (expected 2)

Unpacking isn't used that frequently. Shorter code isn't always necessarily better, and it's probably clearer to someone reading your code if you assign lw and c on separate lines.

However, some built-in parts of Python (including material below) use unpacking, so we needed to touch on it briefly.

Loops

Loops are a way to "do something" for every item in a collection.

For example, maybe I have a list of lowercase player names and I want to go through them and change them all to proper name formatting using the title string method, which capitalizes the first letter of every word in a string.

One way to do that is with a for loop:

first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

first_line_upper = ['', '', '']
i = 0
for player in first_line:
    first_line_upper[i] = player.title()
    i = i + 1

What's happening here is the last two lines are run multiple times, once for every item in the list. The first time player has the value 'alex ovechkin', the second 'nicklas backstrom', etc. We're also using a variable i to keep track of our position in our list. The last line in the body of each loop is to increment i by one, so that we'll be working with the correct spot the next time we go through it.

In [15]: first_line_upper
Out[15]: ['Alex Ovechkin', 'Nicklas Backstrom', 'Anthony Mantha']
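
The counter variable i works, but a common alternative is to start with an empty list and add to it with the list method append. A minimal sketch (first_line_title is a name made up for this example):

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# start empty, then add one properly formatted name per pass through the loop
first_line_title = []
for player in first_line:
    first_line_title.append(player.title())

print(first_line_title)
# ['Alex Ovechkin', 'Nicklas Backstrom', 'Anthony Mantha']
```

With this version there's no position to keep track of, so there's no counter to forget to increment.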

The programming term for "going over each element in some collection" is iterating. Collections that allow you to iterate over them are called iterables.

Dicts are also iterables. The default behavior when iterating over dicts is you get access to the keys only. So:

In [16]: 
for x in first_line_dict:
    print(f"position: {x}")
--
position: lw
position: c
position: rw
position: rd

But what if we want access to the values too? One thing we could do is write first_line_dict[x], like this:

In [17]:
for x in first_line_dict:
    print(f"position: {x}")
    print(f"player: {first_line_dict[x]}")

--
position: lw
player: alex ovechkin
position: c
player: nicklas backstrom
position: rw
player: anthony mantha
position: rd
player: john carlson

But Python has a shortcut that makes things easier: we can add .items() to our dict to get access to the value.

In [18]: 
for x, y in first_line_dict.items():
    print(f"position: {x}")
    print(f"player: {y}")

position: lw
player: alex ovechkin
position: c
player: nicklas backstrom
position: rw
player: anthony mantha
position: rd
player: john carlson

Notice the for x, y... part of the loop. Adding .items() gives us each key and value together as a pair, and the loop unpacks that pair into our two loop variables (we chose x and y).
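
To see what .items() is handing to the loop, it can help to look at the pairs directly. A minimal sketch (the pairs, position and player names are made up for this example; a pair written in parentheses is a type called a tuple, which unpacks the same way a list does):

```python
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}

# wrapping .items() in list() lets us look at the (key, value) pairs
pairs = list(first_line_dict.items())
print(pairs[0])  # ('lw', 'alex ovechkin')

# each pair can be unpacked into two variables, which is what
# `for x, y in first_line_dict.items()` does on every pass through the loop
position, player = pairs[0]
print(position)  # lw
print(player)    # alex ovechkin
```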

Loops are occasionally useful. And they're definitely better than copying and pasting a bunch of code over and over and making some minor change.

But in many instances, there's a better option: comprehensions.

Comprehensions

Comprehensions are a way to modify lists or dicts with not a lot of code. They're like loops condensed onto one line.

Mark Pilgrim, author of Dive into Python, says that every programming language has some complicated, powerful concept that it makes intentionally simple and easy to do. Not every language can make everything easy, because all language decisions involve tradeoffs. Pilgrim says comprehensions are that feature for Python.

List Comprehensions

When you want to go from one list to another, different list you should be thinking comprehension. Our first for loop example, where we wanted to take our list of lowercase players and make a list where they're all properly formatted, is a great candidate.

The list comprehension way of doing that would be:

In [19]: first_line
Out[19]: ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

In [20]: first_line_proper = [x.title() for x in first_line]

In [21]: first_line_proper
Out[21]: ['Alex Ovechkin', 'Nicklas Backstrom', 'Anthony Mantha']

All list comprehensions take the form [a for b in c] where c is the list you're iterating over (starting with), and b is the variable you're using in a to specify exactly what you want to do to each item.

In the above example a is x.title(), b is x, and c is first_line.

Note, it's common to use x for your comprehension variable, but — like loops — you can use whatever you want. So this:

In [22]: first_line_proper_alt = [y.title() for y in first_line]

does exactly the same thing as the version using x did.

Comprehensions can be tricky at first, but they're not that bad once you get the hang of them. They're useful and we'll see them again though, so if the explanation above is fuzzy, read it again and look at the example until it makes sense.

A List Comprehension is a List

A comprehension evaluates to a regular Python list. That's a fancy way of saying the result of a comprehension is a list.

In [23]: type([x.title() for x in first_line])
Out[23]: list

And we can slice it and do everything else we could do to a normal list:

In [24]: [x.title() for x in first_line][:2]
Out[24]: ['Alex Ovechkin', 'Nicklas Backstrom']

There is literally no difference.

More Comprehensions

Let's do another, more complicated, comprehension:

In [25]: first_line_last_names = [full_name.split(' ')[1]
                                  for full_name in first_line]

In [26]: first_line_last_names
Out[26]: ['ovechkin', 'backstrom', 'mantha']

Remember, all list comprehensions take the form [a for b in c]. The last two are easy: c is just first_line and b is full_name.

That leaves a, which is full_name.split(' ')[1].

Sometimes it's helpful to prototype this part in the REPL with an actual item from your list.

In [27]: full_name = 'alex ovechkin'

In [28]: full_name.split(' ')
Out[28]: ['alex', 'ovechkin']

In [29]: full_name.split(' ')[1]
Out[29]: 'ovechkin'

We can see split is a string method that returns a list of substrings. After calling it we can pick out each player's last name in spot 1 of our new list.

The programming term for how we've been using comprehensions so far — "doing something" to each item in a collection — is mapping. As in, I mapped title to each element of first_line.

We can also use comprehensions to filter a collection to include only certain items. To do this we add if some criteria that evaluates to a boolean at the end.

In [30]:
first_line_a_only = [
    x for x in first_line if x.startswith('a')]

In [31]: first_line_a_only
Out[31]: ['alex ovechkin', 'anthony mantha']

Updating our notation, a comprehension technically has the form [a for b in c if d], where if d is optional.

Above, d is x.startswith('a'). The startswith string method takes a string and returns a bool indicating whether the original string starts with it or not. Again, it's helpful to test it out with actual items from our list.

In [32]: 'alex ovechkin'.startswith('a')
Out[32]: True

In [33]: 'nicklas backstrom'.startswith('a')
Out[33]: False

In [34]: 'anthony mantha'.startswith('a')
Out[34]: True

Interestingly, in this comprehension the a in our [a for b in c if d] notation is just x. That means we're doing nothing to the value itself (we're taking x and returning x); the whole purpose of this comprehension is to filter first_line to only include items that start with 'a'.

You can easily extend this to map and filter in the same comprehension:

In [35]: 
first_line_a_only_title = [
    x.title() for x in first_line if x.startswith('a')]
--

In [36]: first_line_a_only_title
Out[36]: ['Alex Ovechkin', 'Anthony Mantha']
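
If the [a for b in c if d] form still feels abstract, it can help to line it up against the equivalent for loop. A sketch (the a_only_title_loop and a_only_title_comp names are made up for this example):

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# loop version
a_only_title_loop = []
for x in first_line:                         # for b in c
    if x.startswith('a'):                    # if d (the filter)
        a_only_title_loop.append(x.title())  # a (what we do to each item)

# comprehension version: same pieces, one line
a_only_title_comp = [x.title() for x in first_line if x.startswith('a')]

print(a_only_title_loop == a_only_title_comp)  # True
```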

Dict Comprehensions

Dict comprehensions work similarly to list comprehensions. Except now, the whole thing is wrapped in {} instead of [].

And, like with our for loop over a dict, we can use .items() to get access to the key and value.

In [37]:
salary_per_player = {
    'alex ovechkin': 10000000, 'nicklas backstrom': 12000000,
    'anthony mantha': 5700000}

In [38]:
salary_m_per_upper_player = {
    name.upper(): salary/1000000 for name, salary in salary_per_player.items()}
--

In [39]: salary_m_per_upper_player
Out[39]: {'ALEX OVECHKIN': 10.0, 'NICKLAS BACKSTROM': 12.0, 'ANTHONY MANTHA': 5.7}

Comprehensions make it easy to go from a list to a dict or vice versa. For example, say we want to total up all the money in our dict salary_per_player.

Well, one way to add up numbers in Python is to pass a list of them to the sum() function.

In [40]: sum([1, 2, 3])
Out[40]: 6

If we want to get the total salary in our salary_per_player dict, we make a list of just the salaries using a list comprehension, then pass it to sum like:

In [41]: sum([salary for _, salary in salary_per_player.items()])
Out[41]: 27700000

This is still a list comprehension even though we're starting with a dict (salary_per_player). When in doubt, check the surrounding punctuation. It's brackets here, which means list.

Also note the for _, salary in ... part of the code. Using .items() gives us access to both the key (the player name in this case) and the value (the salary). But since we don't actually need the key for summing salary, the Python convention is to name that variable _. This lets people reading our code know we're not using it. (Dicts also have a .values() method that returns just the values, so sum(salary_per_player.values()) gives the same total.)
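
Going the other direction, from a list to a dict, works too. A minimal sketch (last_name_per_player is a name made up for this example) that starts with first_line and builds a dict, using curly brackets and a key: value pair:

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# curly brackets plus `key: value` means the result is a dict,
# even though we're iterating over a list
last_name_per_player = {name: name.split(' ')[1] for name in first_line}

print(last_name_per_player['alex ovechkin'])  # ovechkin
print(type(last_name_per_player))             # <class 'dict'>
```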

Functions

In the last section we saw sum(), which is a Python built-in that takes in a list of numbers and totals them up.

sum() is an example of a function. Functions are code that take inputs (the function's arguments) and return outputs. Python includes several built-in functions. Another common one is len, which finds the length of a list.

In [42]: len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])
Out[42]: 3

Using the function — i.e. giving it some inputs and having it return its output — is also known as calling or applying the function.

Once we've called a function, we can use the result just like any other value. There's no difference between len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha']) and 3. We could define variables with it:

In [43]: n_players = len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])

In [44]: n_players
Out[44]: 3

Or use it in math.

In [45]: 4 + len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])
Out[45]: 7

Or whatever. Once it's called, it's the value the function returned, that's it.

Defining Your Own Functions

It is very common in all programming languages to define your own functions.

def team_pts(wins, losses, ot_losses):
    """
    multi line strings in python are between three double quotes

    it's not required, but the convention is to put what the fn does in
    one of these multi line strings (called "docstring") right away in
    a function

    this function takes number of wins, overtime losses, and regular
    season losses and returns team points
    """
    return 2*wins + 1*ot_losses

After defining a function (making sure to highlight it and send it to the REPL) you can call it like this:

In [46]: team_pts(62, 16, 4)
Out[46]: 128

Note the arguments wins, ot_losses and losses. These work just like normal variables, except they're only available inside your function (the function's body).

So, even after defining and running this function, if you try to type:

In [47]: print(wins)
...
NameError: name 'wins' is not defined

You'll get an error - wins only exists inside the team_pts function.

The programming term for where you have access to a variable (inside the function for arguments) is scope.
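
One consequence of scope, sketched below: a variable named wins outside the function and the argument named wins inside it don't interfere with each other. The outer wins variable and its value of 10 are made up for this example.

```python
wins = 10  # a variable defined outside the function (hypothetical value)

def team_pts(wins, losses, ot_losses):
    """
    same function as above; inside the body, `wins` refers to the
    argument that was passed in, not the outer variable
    """
    return 2*wins + 1*ot_losses

pts = team_pts(62, 16, 4)

print(pts)   # 128, calculated with the argument value 62
print(wins)  # 10, the outer variable is unchanged
```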

You could put the print statement inside the function:

def team_pts_noisy(wins, losses, ot_losses):
    """
    this function takes number of wins, overtime losses, and regular
    season losses and returns team points

    it also prints out wins
    """
    print(wins)  # works here since we're inside fn
    return 2*wins + 1*ot_losses

And then when we call it:

In [48]: team_pts_noisy(62, 16, 4)
62
Out[48]: 128

Note the 62 in the REPL. Along with returning the team's points, team_pts_noisy prints the value of wins. This is a side effect of calling the function. A side effect is anything your function does besides returning a value.

Printing variable values isn't a big deal (it can be helpful if your function isn't working like you expect), but apart from that you should avoid side effects in your functions.

Default Values in Functions

Here's a question: what happens if we leave out any of the arguments when calling our function?

Let's try it:

In [49]: team_pts(62, 16)
...
TypeError: team_pts() missing 1 required positional argument: 'ot_losses'

We get an error. We gave it 62 and 16, which got assigned to wins and losses but ot_losses didn't get a value.

We can avoid this error by including default values. Let's make ot_losses default to 0.

def team_pts_wdefault(wins, losses, ot_losses=0):
    """
    this function takes number of wins, overtime losses, and regular
    season losses and returns team points
    """
    return 2*wins + 1*ot_losses

Now ot_losses is optional because we gave it a default value. Note wins and losses are still required (no default values). Also note this mix of required and optional arguments — this is fine. Python's only rule is any optional arguments have to come after required arguments.

Now this function call works:

In [50]: team_pts_wdefault(62, 16)
Out[50]: 124

But if we run it without wins or losses we still get an error:

In [51]: team_pts_wdefault(62)
...
TypeError: team_pts_wdefault() missing 1 required positional argument: 'losses'

Positional vs Keyword Arguments

Up to this point we've just passed the arguments in order, or by position.

So when we call:

In [52]: team_pts(62, 16, 4)
Out[52]: 128

The function assigns 62 to wins, 16 to losses and 4 to ot_losses. It's in that order (wins, losses, ot_losses) because that's the order we wrote them when we defined team_pts above.

These are called positional arguments.

We wrote this function, so we know the order the arguments go, but often we'll use third party code with functions we didn't write.

In that case we'll want to know the function's signature: the arguments it takes, the order they go in, and what's required vs optional.

It's easy to check in the REPL, just type the name of the function and a question mark:

In [53]: team_pts?
Signature: team_pts(wins, losses, ot_losses)
...
Type:      function

The alternative to passing all the arguments in the correct positions is to use keyword arguments, like this:

In [54]: team_pts(wins=62, ot_losses=4, losses=16)
Out[54]: 128

Keyword arguments are useful because you no longer have to remember the exact argument order. In practice, they're also how you take advantage of default values when a function has more than one optional argument.

Think about it: presumably your function includes defaults so that you don't have to type in a value for every argument, every time. But if you're passing some values and not others, how's Python supposed to know which is which?

The answer is keyword arguments.
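
A minimal sketch of that idea, using a hypothetical variation on team_pts with two optional arguments (team_pts_flex and pts_per_win are made up for this example and aren't part of the original code):

```python
def team_pts_flex(wins, losses, ot_losses=0, pts_per_win=2):
    """
    hypothetical variation on team_pts with two optional arguments
    """
    return pts_per_win*wins + 1*ot_losses

# both defaults used
pts_defaults = team_pts_flex(62, 16)

# positional: the third value goes to ot_losses, the next argument in order
pts_with_ot = team_pts_flex(62, 16, 4)

# keyword: skip ot_losses (it stays at 0) and change only pts_per_win
pts_three_per_win = team_pts_flex(62, 16, pts_per_win=3)

print(pts_defaults)       # 124
print(pts_with_ot)        # 128
print(pts_three_per_win)  # 186
```

Without the keyword, there'd be no way to set pts_per_win while leaving ot_losses at its default.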

You're allowed to mix positional and keyword arguments:

In [55]: team_pts(62, losses=16, ot_losses=4)
Out[55]: 128

But Python's rule is that positional arguments have to come first.

One thing this implies is it's a good idea to put your most "important" arguments first, leaving your optional arguments for the end of the function definition.

For example, later we'll learn about the read_csv function in Pandas, whose job is to load your csv data into Python. The first argument to read_csv is a string with the path to your file, and that's the only argument you'll use 95% of the time. But it also has more than 40 optional arguments, everything from skip_blank_lines (defaults to True) to parse_dates (defaults to False).

What this means is usually you can just use the function like this:

data = read_csv('my_data_file.csv')

And on the rare occasions when you do need to tweak some option, change the specific settings you want using keyword arguments:

data = read_csv('my_data_file.csv', skip_blank_lines=False,
                parse_dates=True)

Python's argument rules are precise, but pretty intuitive when you get used to them. See the end of chapter exercises for more practice.

Functions That Take Other Functions

A cool feature of Python is that functions can take other functions as arguments.

def do_to_list(working_list, working_fn, desc):
    """
    this function takes a list, a function that works on a list, and a
    description

    it applies the function to the list, then returns the result along
    with description as a string
    """

    value = working_fn(working_list)

    return f'{desc} {value}'

Now let's also make a function to use this on.

def last_elem_in_list(working_list):
    """
    returns the last element of a list.
    """
    return working_list[-1]

And try it out:

In [56]: positions = ['LW', 'C', 'RW', 'LD', 'RD', 'G']

In [57]: do_to_list(positions, last_elem_in_list,
                    "last element in your list:")
Out[57]: 'last element in your list: G'

In [58]: do_to_list([1, 2, 4, 8], last_elem_in_list,
                    "last element in your list:")
Out[58]: 'last element in your list: 8'

The function do_to_list can work on built in functions too.

In [59]: do_to_list(positions, len, "length of your list:")
Out[59]: 'length of your list: 6'

You can also create functions on the fly without names, usually for purposes of passing to other, flexible functions.

In [60]: do_to_list([2, 3, 7, 1.3, 5], lambda x: 3*x[0],
                    "first element in your list times 3 is:")
Out[60]: 'first element in your list times 3 is: 6'

These are called anonymous or lambda functions.
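
Lambda functions also get passed to built-in functions. As one sketch, the built-in sorted takes an optional key argument, a function it applies to each item to decide the ordering. The by_salary names and the salary_of helper are made up for this example; the salary numbers are the ones from the dict comprehension section.

```python
salary_per_player = {
    'alex ovechkin': 10000000, 'nicklas backstrom': 12000000,
    'anthony mantha': 5700000}

players = list(salary_per_player)  # a list of just the keys (names)

# sort names from lowest to highest salary using a lambda as the key
by_salary = sorted(players, key=lambda name: salary_per_player[name])

# the same thing with a regular, named function
def salary_of(name):
    return salary_per_player[name]

by_salary_named = sorted(players, key=salary_of)

print(by_salary)
# ['anthony mantha', 'alex ovechkin', 'nicklas backstrom']
print(by_salary == by_salary_named)  # True
```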

Libraries are Functions and Types

There is much more to basic Python than this, but this is enough of a foundation to learn the other libraries we'll be using.

Libraries are just a collection of user defined functions and types that other people have written using Python and other libraries. That's why it's critical to understand the concepts in this section. Libraries are Python, with lists, dicts, bools, functions and all the rest.

A note on types: while we covered defining your own functions, we did not cover defining your own types (sometimes called classes) in Python. Working with classes is sometimes called object-oriented programming. While object-oriented programming and being able to write your own classes is sometimes helpful, it's definitely not required for everyday data analysis. I hardly ever use it myself.

A note on "written using Python": sometimes libraries use other programming languages too. Parts of the data analysis library Pandas, for example, are written in the programming language C. But we don't have to worry about that.

os Library and path

Some libraries come built-in to Python. One example we'll use is the os (for operating system) library. To use it, we have to import it, like this:

In [61]: import os

That lets us use all the functions written in the os library. For example, we can call cpu_count to see the number of computer cores we currently have available.

In [62]: os.cpu_count()
Out[62]: 12

Libraries like os can contain sub-libraries too. The sub-library we'll use from os is path, which is useful for working with filenames. One of its main functions is join, which takes a directory (or multiple directories) and a filename and puts them together in a string. Like this:

In [63]: from os import path

In [64]: DATA_DIR = '/Users/nathan/hockey-book/data'

In [65]: path.join(DATA_DIR, 'shots.csv')
Out[65]: '/Users/nathan/hockey-book/data/shots.csv'

In [66]: os.path.join(DATA_DIR, 'shots.csv')  # alt way of calling
Out[66]: '/Users/nathan/hockey-book/data/shots.csv'

With join, you don't have to worry about trailing slashes or operating system differences or anything like that. You can just replace DATA_DIR with the directory that holds the csv files that came with this book and you'll be set.
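
join also accepts more than two pieces, which covers the "multiple directories" case mentioned above. A minimal sketch, assuming a hypothetical relative directory layout (hockey-book/data) rather than a real path on your computer:

```python
from os import path

# hypothetical relative directory, built from two pieces
DATA_DIR = path.join('hockey-book', 'data')

# join adds the separator your operating system uses between each piece
shots_file = path.join(DATA_DIR, 'shots.csv')

print(shots_file)  # exact separators depend on your operating system

# basename goes the other way and pulls the filename back out
print(path.basename(shots_file))  # shots.csv
```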