This is the third in a series of posts that teach Python and data science using NHL data.
Let's get started!
Following Along
All these posts are heavy on examples and meant to be followed along with. Like last time (assuming you've got everything installed) go to:
And copy and paste everything that's there into Spyder (either temp.py or a new file).
Container Types
Last time we talked about basic Python types like strings, numbers and bools. These are called primitives; they're the basic building block types.
There are also container types, which can hold other values. Two important container types are lists and dicts. Sometimes containers are also called collections.
Lists
Lists are built with square brackets and are basically a simple way to hold other, ordered pieces of data.
In [1]: first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
Every spot in a list has a number associated with it. The first spot is 0. You can get sections (called slices) of your list by separating numbers with a colon. Both single numbers and slices go inside square brackets, i.e. [].
A single integer inside a bracket returns one element of your list, while a slice returns a smaller list. Note a slice returns up to, but not including, the last number, so [0:2] returns the 0 and 1 items, but not item 2.
In [2]: first_line[0]
Out[2]: 'alex ovechkin'
In [3]: first_line[0:2]
Out[3]: ['alex ovechkin', 'nicklas backstrom']
Passing a negative number counts from the end of the list. To get the last two items you could do:
In [4]: first_line[-2:]
Out[4]: ['nicklas backstrom', 'anthony mantha']
Also note how when you leave off the number after the colon the slice will automatically use the end of the list.
Lists can hold anything, including other lists. Lists that hold other lists are often called nested lists.
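Here's a rough sketch of what a nested list might look like and how you'd index into one. The second_line list and its player names are made up purely for illustration.

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
# Hypothetical second line, invented for this example.
second_line = ['player a', 'player b', 'player c']

# A nested list: each element is itself a list.
lines = [first_line, second_line]

# Index the outer list first, then the inner one.
first_center = lines[0][1]

# Negative numbers count from the end, so -1 is the last spot.
last_of_each = [lines[0][-1], lines[1][-1]]

print(first_center)
print(last_of_each)
```

The first set of brackets picks which inner list you want, the second picks the spot inside it.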
Dicts
A dict is short for dictionary. You can think about it like an actual dictionary if you want. Real dictionaries have words and definitions, Python dicts have keys and values.
Dicts are basically a way to hold data and give each piece a name. They're written with curly brackets, like this:
In [5]:
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}
You can access items in a dict like this:
In [6]: first_line_dict['lw']
Out[6]: 'alex ovechkin'
And add new things to dicts like this:
In [7]: first_line_dict['rd'] = 'john carlson'
In [8]: first_line_dict
Out[8]:
{'lw': 'alex ovechkin',
'c': 'nicklas backstrom',
'rw': 'anthony mantha',
'rd': 'john carlson'}
Notice how keys are strings (they're surrounded in quotes). They can also be numbers or even bools. They cannot be a variable that has not already been created. You could do this:
In [9]: pos = 'lw'
In [10]: first_line_dict[pos]
Out[10]: 'alex ovechkin'
Because when you run it Python is just replacing pos with 'lw'.
But you will get an error if pos is undefined. You also get an error if you try to use a key that's not present in the dict (note: assigning something to a key that isn't there yet, like we did with 'john carlson' above, is OK).
While dictionary keys are usually strings, dictionary values can be anything, including lists or other dicts.
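To make the rules about keys concrete, here's a small sketch. The 'g' and 'wingers' keys are assumptions made up for the example; they aren't part of the dict used elsewhere in this post.

```python
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}

# Reading a key that isn't in the dict raises a KeyError.
try:
    first_line_dict['g']
    missing_key_error = False
except KeyError:
    missing_key_error = True

# Assigning to a key that isn't there yet is fine; it adds the pair.
first_line_dict['rd'] = 'john carlson'

# Values can be containers too. Here one value is a list.
lines = {'wingers': ['alex ovechkin', 'anthony mantha'],
         'c': 'nicklas backstrom'}
first_winger = lines['wingers'][0]

print(missing_key_error, first_winger)
```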
Unpacking
Now that we've seen an example of container types, we can mention unpacking. Unpacking is a way to assign multiple variables at once, like this:
In [11]: lw, c = ['alex ovechkin', 'nicklas backstrom']
That does the exact same thing as assigning these separately on their own line.
In [12]: lw = 'alex ovechkin'
In [13]: c = 'nicklas backstrom'
One pitfall when unpacking values is that the number of whatever you're assigning to has to match the number of values available in your container. This would give you an error:
In [14]: lw, c = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
...
ValueError: too many values to unpack (expected 2)
Unpacking isn't used that frequently. Shorter code isn't necessarily better, and it's probably clearer to someone reading your code if you assign lw and c on separate lines.
However, some built-in parts of Python (including material below) use unpacking, so we needed to touch on it briefly.
Loops
Loops are a way to "do something" for every item in a collection.
For example, maybe I have a list of lowercase player names and I want to go through them and change them all to proper name formatting using the title string method, which capitalizes the first letter of every word in a string.
One way to do that is with a for loop:
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
first_line_upper = ['', '', '']
i = 0
for player in first_line:
    first_line_upper[i] = player.title()
    i = i + 1
What's happening here is the last two lines are run multiple times, once for every item in the list. The first time player has the value 'alex ovechkin', the second 'nicklas backstrom', etc. We're also using a variable i to keep track of our position in our list. The last line in the body of the loop increments i by one, so that we'll be working with the correct spot the next time we go through it.
In [15]: first_line_upper
Out[15]: ['Alex Ovechkin', 'Nicklas Backstrom', 'Anthony Mantha']
The programming term for "going over each element in some collection" is iterating. Collections that allow you to iterate over them are called iterables.
Dicts are also iterables. The default behavior when iterating over dicts is you get access to the keys only. So:
In [16]:
for x in first_line_dict:
    print(f"position: {x}")
--
position: lw
position: c
position: rw
position: rd
But what if we want access to the values too? One thing we could do is write first_line_dict[x], like this:
In [17]:
for x in first_line_dict:
    print(f"position: {x}")
    print(f"player: {first_line_dict[x]}")
--
position: lw
player: alex ovechkin
position: c
player: nicklas backstrom
position: rw
player: anthony mantha
position: rd
player: john carlson
But Python has a shortcut that makes things easier: we can add .items() to our dict to get access to the value.
In [18]:
for x, y in first_line_dict.items():
    print(f"position: {x}")
    print(f"player: {y}")
position: lw
player: alex ovechkin
position: c
player: nicklas backstrom
position: rw
player: anthony mantha
position: rd
player: john carlson
Notice the for x, y ... part of the loop. Adding .items() unpacks the key and value into our two loop variables (we chose x and y).
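If it helps to see what .items() is handing to the loop, here's a quick sketch. Wrapping it in list() is only there so we can look at the pairs; the names position and player are our own choice.

```python
first_line_dict = {'lw': 'alex ovechkin',
                   'c': 'nicklas backstrom',
                   'rw': 'anthony mantha'}

# .items() produces (key, value) pairs; list() lets us look at them.
pairs = list(first_line_dict.items())
print(pairs[0])

# Unpacking one pair by hand is what `for x, y in ...` does on each pass.
position, player = pairs[0]
print(position, player)
```

So the for loop is really just the unpacking from earlier, done once per pair.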
Loops are occasionally useful. And they're definitely better than copying and pasting a bunch of code over and over and making some minor change.
But in many instances, there's a better option: comprehensions.
Comprehensions
Comprehensions are a way to modify lists or dicts with not a lot of code. They're like loops condensed onto one line.
Mark Pilgrim, author of Dive into Python, says that every programming language has some complicated, powerful concept that it makes intentionally simple and easy to do. Not every language can make everything easy, because all language decisions involve tradeoffs. Pilgrim says comprehensions are that feature for Python.
List Comprehensions
When you want to go from one list to another, different list you should be thinking comprehension. Our first for loop example, where we wanted to take our list of lowercase players and make a list where they're all properly formatted, is a great candidate.
The list comprehension way of doing that would be:
In [19]: first_line
Out[19]: ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']
In [20]: first_line_proper = [x.title() for x in first_line]
In [21]: first_line_proper
Out[21]: ['Alex Ovechkin', 'Nicklas Backstrom', 'Anthony Mantha']
All list comprehensions take the form [a for b in c] where c is the list you're iterating over (starting with), and b is the variable you're using in a to specify exactly what you want to do to each item.
In the above example a is x.title(), b is x, and c is first_line.
Note, it's common to use x for your comprehension variable, but (like loops) you can use whatever you want. So this:
In [22]: first_line_proper_alt = [y.title() for y in first_line]
does exactly the same thing as the version using x did.
Comprehensions can be tricky at first, but they're not that bad once you get the hang of them. They're useful and we'll see them again, so if the explanation above is fuzzy, read it again and look at the example until it makes sense.
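One thing that might help: here's a sketch that puts the [a for b in c] form next to a loop that builds the same list. The loop uses the list append method, which this post hasn't covered, so treat it as an illustration rather than the recommended way.

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# [a for b in c] with a = x.title(), b = x, c = first_line
via_comprehension = [x.title() for x in first_line]

# Roughly the same thing written as a loop that appends to an empty list.
via_loop = []
for x in first_line:
    via_loop.append(x.title())

print(via_comprehension == via_loop)
```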
A List Comprehension is a List
A comprehension evaluates to a regular Python list. That's a fancy way of saying the result of a comprehension is a list.
In [23]: type([x.title() for x in first_line])
Out[23]: list
And we can slice it and do everything else we could do to a normal list:
In [24]: [x.title() for x in first_line][:2]
Out[24]: ['Alex Ovechkin', 'Nicklas Backstrom']
There is literally no difference.
More Comprehensions
Let's do another, more complicated, comprehension:
In [25]: first_line_last_names = [full_name.split(' ')[1]
                                  for full_name in first_line]
In [26]: first_line_last_names
Out[26]: ['ovechkin', 'backstrom', 'mantha']
Remember, all list comprehensions take the form [a for b in c]. The last two are easy: c is just first_line and b is full_name.
That leaves a, which is full_name.split(' ')[1].
Sometimes it's helpful to prototype this part in the REPL with an actual item from your list.
In [27]: full_name = 'alex ovechkin'
In [28]: full_name.split(' ')
Out[28]: ['alex', 'ovechkin']
In [29]: full_name.split(' ')[1]
Out[29]: 'ovechkin'
We can see split is a string method that returns a list of substrings. After calling it we can pick out each player's last name in spot 1 of our new list.
The programming term for how we've been using comprehensions so far ("doing something" to each item in a collection) is mapping. As in, I mapped title to each element of first_line.
We can also use comprehensions to filter a collection to include only certain items. To do this we add if, followed by some criterion that evaluates to a boolean, at the end.
In [30]:
first_line_a_only = [
    x for x in first_line if x.startswith('a')]
In [31]: first_line_a_only
Out[31]: ['alex ovechkin', 'anthony mantha']
Updating our notation, a comprehension technically has the form [a for b in c if d], where if d is optional.
Above, d is x.startswith('a'). The startswith string method takes a string and returns a bool indicating whether the original string starts with it or not. Again, it's helpful to test it out with actual items from our list.
In [32]: 'alex ovechkin'.startswith('a')
Out[32]: True
In [33]: 'nicklas backstrom'.startswith('a')
Out[33]: False
In [34]: 'anthony mantha'.startswith('a')
Out[34]: True
Interestingly, in this comprehension the a in our [a for b in c if d] notation is just x. That means we're doing nothing to the value itself (we're taking x and returning x); the whole purpose of this comprehension is to filter first_line to only include items that start with 'a'.
You can easily extend this to map and filter in the same comprehension:
In [35]:
first_line_a_only_title = [
    x.title() for x in first_line if x.startswith('a')]
--
In [36]: first_line_a_only_title
Out[36]: ['Alex Ovechkin', 'Anthony Mantha']
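If the [a for b in c if d] form is hard to read, it may help to see the same logic spelled out as a loop. This is only a sketch; the variable names are ours and the append method isn't covered in this post.

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# [a for b in c if d]: d filters, a maps.
a_only_title = [x.title() for x in first_line if x.startswith('a')]

# The same logic as a loop: the if decides whether an item is kept,
# and .title() is applied only to the items that pass.
a_only_title_loop = []
for x in first_line:
    if x.startswith('a'):
        a_only_title_loop.append(x.title())

print(a_only_title == a_only_title_loop)
```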
Dict Comprehensions
Dict comprehensions work similarly to list comprehensions. Except now, the whole thing is wrapped in {} instead of [].
And, like with our for loop over a dict, we can use .items() to get access to the key and value.
In [37]:
salary_per_player = {
    'alex ovechkin': 10000000, 'nicklas backstrom': 12000000,
    'anthony mantha': 5700000}
In [38]:
salary_m_per_upper_player = {
    name.upper(): salary/1000000 for name, salary in salary_per_player.items()}
--
In [39]: salary_m_per_upper_player
Out[39]: {'ALEX OVECHKIN': 10.0, 'NICKLAS BACKSTROM': 12.0, 'ANTHONY MANTHA': 5.7}
Comprehensions make it easy to go from a list to a dict or vice versa. For example, say we want to total up all the money in our dict salary_per_player.
Well, one way to add up numbers in Python is to pass a list of them to the sum() function.
In [40]: sum([1, 2, 3])
Out[40]: 6
If we want to get the total salary in our salary_per_player dict, we make a list of just the salaries using a list comprehension, then pass it to sum like:
In [41]: sum([salary for _, salary in salary_per_player.items()])
Out[41]: 27700000
This is still a list comprehension even though we're starting with a dict (salary_per_player). When in doubt, check the surrounding punctuation. It's brackets here, which means list.
Also note the for _, salary in ... part of the code. Using .items() gives us access to each value of the dict (the salary here), but it also gives us the key (the player name in this case). Since we don't actually need the key for summing salary, the Python convention is to name that variable _. This lets people reading our code know we're not using it. (If all you need is the values, dicts also have a .values() method.)
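The example above went from a dict to a list. Here's a sketch of the other direction, list to dict, plus the same salary total using the dict's .values() method. Mapping each name to its length is an arbitrary choice made just for illustration.

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# List -> dict: the key: value pair before `for`, inside {}, makes this
# a dict comprehension.
name_lengths = {name: len(name) for name in first_line}

salary_per_player = {
    'alex ovechkin': 10000000, 'nicklas backstrom': 12000000,
    'anthony mantha': 5700000}

# Dict -> list, then sum. .values() skips the keys entirely,
# so there is no unused _ variable.
total_salary = sum([salary for salary in salary_per_player.values()])

print(name_lengths)
print(total_salary)
```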
Functions
In the last section we saw sum(), which is a Python built-in that takes in a list of numbers and totals them up.
sum() is an example of a function. Functions are code that take inputs (the function's arguments) and return outputs. Python includes several built-in functions. Another common one is len, which finds the length of a list.
In [42]: len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])
Out[42]: 3
Using the function — i.e. giving it some inputs and having it return its output — is also known as calling or applying the function.
Once we've called a function, we can use the result just like any other value. There's no difference between len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha']) and 3. We could define variables with it:
In [43]: n_players = len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])
In [44]: n_players
Out[44]: 3
Or use it in math.
In [45]: 4 + len(['alex ovechkin', 'nicklas backstrom', 'anthony mantha'])
Out[45]: 7
Or whatever. Once it's called, it's the value the function returned, that's it.
Defining Your Own Functions
It is very common in all programming languages to define your own functions.
def team_pts(wins, losses, ot_losses):
    """
    multi line strings in python are between three double quotes
    it's not required, but the convention is to put what the fn does in
    one of these multi line strings (called "docstring") right away in
    a function
    this function takes number of wins, overtime losses, and regular
    season losses and returns team points
    """
    return 2*wins + 1*ot_losses
After defining a function (making sure to highlight it and send it to the REPL) you can call it like this:
In [46]: team_pts(62, 16, 4)
Out[46]: 128
Note the arguments wins, losses and ot_losses. These work just like normal variables, except they're only available inside your function (the function's body).
So, even after defining and running this function, if you try to type:
In [47]: print(wins)
...
NameError: name 'wins' is not defined
You'll get an error: wins only exists inside the team_pts function.
The programming term for where you have access to a variable (inside the function for arguments) is scope.
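Here's a small sketch of scope in action. It assumes a fresh session where nothing named wins has been defined outside the function; the try/except is only there so the example runs to the end instead of stopping on the error.

```python
def team_pts(wins, losses, ot_losses):
    # wins, losses and ot_losses only exist while this body is running
    return 2*wins + 1*ot_losses

pts = team_pts(62, 16, 4)

# Outside the function the argument names are not defined.
try:
    wins
    wins_visible = True
except NameError:
    wins_visible = False

print(pts, wins_visible)
```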
You could put the print statement inside the function:
def team_pts_noisy(wins, losses, ot_losses):
    """
    this function takes number of wins, overtime losses, and regular
    season losses and returns team points
    it also prints out wins
    """
    print(wins) # works here since we're inside fn
    return 2*wins + 1*ot_losses
And then when we call it:
In [48]: team_pts_noisy(62, 16, 4)
62
Out[48]: 128
Note the 62 in the REPL. Along with returning the team's points, team_pts_noisy prints the value of wins. This is a side effect of calling the function. A side effect is anything your function does besides returning a value.
Printing variable values isn't a big deal (it can be helpful if your function isn't working like you expect), but apart from that you should avoid side effects in your functions.
Default Values in Functions
Here's a question: what happens if we leave out any of the arguments when calling our function?
Let's try it:
In [49]: team_pts(62, 16)
...
TypeError: team_pts() missing 1 required positional argument: 'ot_losses'
We get an error. We gave it 62 and 16, which got assigned to wins and losses, but ot_losses didn't get a value.
We can avoid this error by including default values. Let's make ot_losses default to 0.
def team_pts_wdefault(wins, losses, ot_losses=0):
    """
    this function takes number of wins, overtime losses, and regular
    season losses and returns team points
    """
    return 2*wins + 1*ot_losses
Now ot_losses is optional because we gave it a default value. Note wins and losses are still required (no default values). Also note this mix of required and optional arguments; this is fine. Python's only rule is any optional arguments have to come after required arguments.
Now this function call works:
In [50]: team_pts_wdefault(62, 16)
Out[50]: 124
But if we run it without wins or losses we still get an error:
In [51]: team_pts_wdefault(62)
...
TypeError: team_pts_wdefault() missing 1 required positional argument: 'losses'
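As for the rule that optional arguments have to come after required ones, here's one way you might see Python enforce it. The function team_pts_bad is hypothetical, and we hand its source to the built-in compile as a string only so the rest of the example can keep running after the error.

```python
# Source text for a definition that breaks the rule: the optional
# argument (ot_losses=0) comes before a required one (losses).
bad_def = "def team_pts_bad(wins, ot_losses=0, losses): return 2*wins + ot_losses"

try:
    compile(bad_def, "<example>", "exec")
    rule_enforced = False
except SyntaxError:
    # Python rejects the definition before it ever runs.
    rule_enforced = True

print(rule_enforced)
```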
Positional vs Keyword Arguments
Up to this point we've just passed the arguments in order, or by position.
So when we call:
In [52]: team_pts(62, 16, 4)
Out[52]: 128
The function assigns 62 to wins, 16 to losses and 4 to ot_losses. It's in that order (wins, losses, ot_losses) because that's the order we wrote them when we defined team_pts above.
These are called positional arguments.
We wrote this function, so we know the order the arguments go, but often we'll use third party code with functions we didn't write.
In that case we'll want to know the function's signature: the arguments it takes, the order they go, and what's required vs optional.
It's easy to check in the REPL, just type the name of the function and a question mark:
In [53]: team_pts?
Signature: team_pts(wins, losses, ot_losses)
...
Type: function
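The question mark is a feature of the IPython console (which is what Spyder uses). If you're ever in plain Python instead, one alternative is the standard library's inspect module. A minimal sketch, assuming the version of team_pts with a default for ot_losses:

```python
import inspect

def team_pts(wins, losses, ot_losses=0):
    """returns team points"""
    return 2*wins + 1*ot_losses

# inspect.signature returns an object describing the arguments.
sig = inspect.signature(team_pts)
print(sig)

# Arguments with no default are the required ones.
required = [name for name, p in sig.parameters.items()
            if p.default is inspect.Parameter.empty]
print(required)
```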
The alternative to passing all the arguments in the correct positions is to use keyword arguments, like this:
In [54]: team_pts(wins=62, ot_losses=4, losses=16)
Out[54]: 128
Keyword arguments are useful because you no longer have to remember the exact argument order. In practice, they're also required to take advantage of default values.
Think about it: presumably your function includes defaults so that you don't have to type in a value for every argument, every time. But if you're passing some values and not others, how's Python supposed to know which is which?
The answer is keyword arguments.
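To see why, here's a sketch with a hypothetical variant of our function that has two optional arguments. The name team_pts_flex and the win_pts argument are invented for this example.

```python
# Hypothetical variant of team_pts with two optional arguments.
def team_pts_flex(wins, losses, ot_losses=0, win_pts=2):
    return win_pts*wins + 1*ot_losses

# Skip ot_losses (it keeps its default of 0) but override win_pts by name.
three_pt_wins = team_pts_flex(62, 16, win_pts=3)

# Passed by position instead, a third value lands in ot_losses.
third_positional = team_pts_flex(62, 16, 3)

print(three_pt_wins, third_positional)
```

Without the keyword there would be no way to set win_pts while leaving ot_losses at its default.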
You're allowed to mix positional and keyword arguments:
In [55]: team_pts(62, losses=16, ot_losses=4)
Out[55]: 128
But Python's rule is that positional arguments have to come first.
One thing this implies is it's a good idea to put your most "important" arguments first, leaving your optional arguments for the end of the function definition.
For example, later we'll learn about the read_csv function in Pandas, whose job is to load your csv data into Python. The first argument to read_csv is a string with the path to your file, and that's the only argument you'll use 95% of the time. But it also has more than 40 optional arguments, everything from skip_blank_lines (defaults to True) to parse_dates (defaults to False).
What this means is usually you can just use the function like this:
data = read_csv('my_data_file.csv')
And on the rare occasions when you do need to tweak some option, change the specific settings you want using keyword arguments:
data = read_csv('my_data_file.csv', skip_blank_lines=False,
                parse_dates=True)
Python's argument rules are precise, but pretty intuitive when you get used to them. See the end of chapter exercises for more practice.
Functions That Take Other Functions
A cool feature of Python is that functions can take other functions as arguments.
def do_to_list(working_list, working_fn, desc):
    """
    this function takes a list, a function that works on a list, and a
    description
    it applies the function to the list, then returns the result along
    with description as a string
    """
    value = working_fn(working_list)
    return f'{desc} {value}'
Now let's also make a function to use this on.
def last_elem_in_list(working_list):
    """
    returns the last element of a list.
    """
    return working_list[-1]
And try it out:
In [56]: positions = ['LW', 'C', 'RW', 'LD', 'RD', 'GK']
In [57]: do_to_list(positions, last_elem_in_list,
                    "last element in your list:")
Out[57]: 'last element in your list: GK'
In [58]: do_to_list([1, 2, 4, 8], last_elem_in_list,
                    "last element in your list:")
Out[58]: 'last element in your list: 8'
The function do_to_list can work on built in functions too.
In [59]: do_to_list(positions, len, "length of your list:")
Out[59]: 'length of your list: 6'
You can also create functions on the fly without names, usually for purposes of passing to other, flexible functions.
In [60]: do_to_list([2, 3, 7, 1.3, 5], lambda x: 3*x[0],
                    "first element in your list times 3 is:")
Out[60]: 'first element in your list times 3 is: 6'
These are called anonymous or lambda functions.
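A place you'll likely run into lambdas is with built-in functions that accept another function. As a sketch, the built-in sorted takes one through its key argument; sorting our line by last name is just an example use.

```python
first_line = ['alex ovechkin', 'nicklas backstrom', 'anthony mantha']

# sorted() calls the key function on each item and orders by the result.
# This lambda pulls out the last name, like the comprehension earlier.
by_last_name = sorted(first_line, key=lambda name: name.split(' ')[1])

print(by_last_name)
```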
Libraries are Functions and Types
There is much more to basic Python than this, but this is enough of a foundation to learn the other libraries we'll be using.
Libraries are just a collection of user defined functions and types ^7 that other people have written using Python ^8 and other libraries. That's why it's critical to understand the concepts in this section. Libraries are Python, with lists, dicts, bools, functions and all the rest.
^7 While we covered defining your own functions, we did not cover defining your own types (sometimes called classes) in Python. Working with classes is sometimes called object-oriented programming. While object-oriented programming and being able to write your own classes is sometimes helpful, it's definitely not required for everyday data analysis. I hardly ever use it myself.
^8 Sometimes they use other programming languages too. Parts of the data analysis library Pandas, for example, are written in the programming language C. But we don't have to worry about that.
os Library and path
Some libraries come built-in to Python. One example we'll use is the os (for operating system) library. To use it, we have to import it, like this:
In [61]: import os
That lets us use all the functions written in the os library. For example, we can call cpu_count to see the number of computer cores we currently have available.
In [62]: os.cpu_count()
Out[62]: 12
Libraries like os can contain sub-libraries too. The sub-library we'll use from os is path, which is useful for working with filenames. One of its main functions is join, which takes a directory (or multiple directories) and a filename and puts them together in a string. Like this:
In [63]: from os import path
In [64]: DATA_DIR = '/Users/nathan/hockey-book/data'
In [65]: path.join(DATA_DIR, 'shots.csv')
Out[65]: '/Users/nathan/hockey-book/data/shots.csv'
In [66]: os.path.join(DATA_DIR, 'shots.csv') # alt way of calling
Out[66]: '/Users/nathan/hockey-book/data/shots.csv'
With join, you don't have to worry about trailing slashes or operating system differences or anything like that. You can just replace DATA_DIR with the directory that holds the csv files that came with this book and you'll be set.
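Here's a short sketch of join with more than one directory and with a trailing separator. The directory names are placeholders, and the comparisons use os.sep (the separator for whatever operating system you're on) so the example isn't tied to one system.

```python
import os
from os import path

# Hypothetical directory names, used only to show the behavior.
joined = path.join('hockey-book', 'data', 'shots.csv')

# join inserts the operating system's separator between the parts...
matches_sep = joined == os.sep.join(['hockey-book', 'data', 'shots.csv'])

# ...and doesn't double it up if a directory already ends with one.
with_trailing = path.join('hockey-book' + os.sep, 'shots.csv')
no_double_sep = with_trailing == 'hockey-book' + os.sep + 'shots.csv'

print(matches_sep, no_double_sep)
```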