This is the second in a series of posts that teach Python and data science using NHL data.
This post is where we get into some actual coding. It's really meant to be followed along with, but you're welcome to read through it too.
All the Python code we'll write later is built upon the concepts covered here. Since we'll be using Python for nearly everything, this section touches all parts of the high level, five-step data analysis process.
Note: there will likely be more posts like this coming so if you want to be notified when new guides like this come out you can enter your email here:
Let's get started!
Python
Much of the functionality in Python comes from third party libraries (or packages), specially designed for specific tasks.
For example, the Pandas library lets us manipulate data, and BeautifulSoup is a widely used library for scraping data from websites.
But even when using third party packages, you will also be using a core set of Python features and functionality. These features, called the standard library, are built into Python.
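To make the distinction concrete, here is a minimal sketch that uses only the standard library, so there is nothing to install. The goal counts are made-up numbers for illustration, not real NHL data.

```python
# math and statistics ship with Python itself, so no install step is needed
import math
import statistics

# hypothetical goals scored over five games (made-up numbers)
goals_per_game = [2, 3, 1, 4, 0]

avg_goals = statistics.mean(goals_per_game)  # average of the list
rounded_up = math.ceil(2.4)                  # round up to the next whole number

print(avg_goals)
print(rounded_up)
```

A third party package like Pandas would need to be installed first, then imported the same way.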
Following Along
All these posts are heavy on examples and meant to be followed along with. If you haven't already, install Spyder (a free program to write and run Python code) and open it up.
Then go to:
And copy and paste everything that's there into Spyder (either temp.py or a new file). Then make sure the bottom right tab is set to 'IPython Console'. You should see something like this:
Best case you have this guide up in one monitor and Spyder (with this code) in another. Then you can follow along + run the code (highlight the line(s) you want and press F9 to send it to the REPL/console) as we work through it.
If you do that, I've included what you'll see in the REPL here. That is:
In [1]: 1 + 1
Out[1]: 2
Where the line starting with In[1] is the code you run, and Out[1] is what the REPL prints out. These are lines [1] for me because this was the first thing I entered in a new REPL session.
Don't worry if the numbers you see in In[ ] and Out[ ] don't match exactly what I'm showing here. In fact, they probably won't, because as you run the examples you should be exploring and experimenting. That's what the REPL is for.
Nor should you worry about messing anything up: if you need a fresh start, you can type reset into the REPL and it will clear out everything you've run previously. You can also type clear to clear all the printed output.
Sometimes, examples build on each other (remember, the REPL keeps track of what you've run previously), so if something isn't working, it might be relying on code you haven't run yet.
Important Parts of the Python Standard Library
Comments
As you look at part2.py you might notice a lot of lines beginning with #. These are comments. When reading your code, the computer will ignore everything from # to the end of the line.
Comments exist in all programming languages. They are a way to explain to anyone reading your code (including your future self) more about what's going on and what you were trying to do when you wrote it.
The problem with comments is it's easy for them to become out of date. This happens when you change your code and forget to update the comment.
An incorrect or misleading comment is worse than no comment. For that reason, most beginning programmers probably comment too often, especially because Python's syntax (the language related rules for writing programs) is usually pretty clear.
For example, this would be an unnecessary comment:
# print the result of 1 + 1
print(1 + 1)
Because it's not adding anything that isn't obvious by just looking at the code. It's better to use descriptive names, let your code speak for itself, and save comments for particularly tricky portions of code.
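As a rough sketch of what "let your code speak for itself" can look like, compare the two versions below. The numbers and names are hypothetical, and variables themselves are covered in the next section.

```python
# less clear: the names say nothing, so comments have to do the explaining
x = 3      # goals
y = 1      # assists
z = x + y  # points

# clearer: descriptive names make those comments unnecessary
goals = 3
assists = 1
points = goals + assists

print(points)
```

Both versions compute the same thing. The second just needs less explanation, and there is no comment to fall out of date.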
Variables
Variables are a fundamental concept in any programming language.
At their core, variables are just named pieces of information. This information can be anything from a single number to an entire dataset — the point is that they let you store and recall things easily.
The rules for naming variables differ by programming language. In Python, names can contain upper or lowercase letters, numbers and _ (underscore), but they can't start with a number.
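If you're ever unsure whether a name follows these rules, one way to check is the string method isidentifier, sketched below. The candidate names are made up for illustration.

```python
# candidate variable names to check (all hypothetical examples)
candidates = ['goals_scored', 'GoalsScored', '_goals2', '2goals', 'goals-scored']

# str.isidentifier() is True when the text follows Python's naming rules
results = {name: name.isidentifier() for name in candidates}

for name, is_valid in results.items():
    print(name, is_valid)
```

The last two fail: one starts with a number and the other contains a hyphen. Note that reserved words like if pass this check but still can't be used as variable names.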
Assigning data to variables
You assign a piece of data to a variable with an equals sign, like this:
In [1]: goals_scored = 2
Now, whenever you use goals_scored in your code, the program automatically substitutes it with 2 instead.
In [2]: goals_scored
Out[2]: 2
In [3]: 3*goals_scored
Out[3]: 6
One of the benefits of developing with a REPL is that you can type in a variable, and the REPL will evaluate (i.e. determine what it is) and print it. That's what the code above is doing. But note while goals_scored is 2, the assignment statement itself, goals_scored = 2, doesn't evaluate to anything, so the REPL doesn't print anything out.
You can update and override variables too. Going into the code below, goals_scored has a value of 2 (from the code we just ran above). So the right hand side, goals_scored + 1, is evaluated first (2 + 1 = 3), and then the result gets (re)assigned to goals_scored, overwriting the 2 it held previously.
In [4]: goals_scored = goals_scored + 1
In [5]: goals_scored
Out[5]: 3
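Updating a variable based on its own value is common enough that Python has a shorthand for it. The sketch below shows the long form and the shorthand side by side. The += operator isn't used elsewhere in this post, so treat it as an optional extra.

```python
goals_scored = 2

# long form: evaluate the right hand side, then reassign the result
goals_scored = goals_scored + 1

# shorthand that does the same thing
goals_scored += 1

print(goals_scored)
```

After both updates, goals_scored has gone from 2 to 4.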
Types
Like Excel, Python includes concepts for both numbers and text. Technically, Python distinguishes between two types of numbers: integers (whole numbers) and floats (numbers that may have decimal points), but the difference isn't important for us right now.
In [6]: penalty_minutes = 15 # int
In [7]: puck_speed = 82.5 # float
Text, called a string in Python, is wrapped in either single (') or double (") quotes. I usually just use single quotes, unless the text I want to write has a single quote in it (like O'Reilly). In that case, writing 'Ryan O'Reilly' would give an error, because Python treats the apostrophe as the end of the string.
In [8]: starting_lw = 'David Perron'
In [9]: starting_c = "Ryan O'Reilly" # this works
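If you'd rather stick with single quotes, another option is to escape the apostrophe with a backslash. A minimal sketch (the second variable name is just for illustration):

```python
# double quotes on the outside, so the apostrophe is fine
starting_c = "Ryan O'Reilly"

# single quotes on the outside, with the apostrophe escaped by a backslash
starting_c_escaped = 'Ryan O\'Reilly'

# both produce exactly the same string
print(starting_c == starting_c_escaped)
```

Either approach works. Switching to double quotes is usually easier to read.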
You can check the type of any variable with the type function.
In [10]: type(starting_lw)
Out[10]: str
In [11]: type(penalty_minutes)
Out[11]: int
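The int and float distinction mostly takes care of itself, but it can help to see where floats come from. A small sketch, reusing the variable names from above:

```python
penalty_minutes = 15  # int
puck_speed = 82.5     # float

print(type(penalty_minutes))
print(type(puck_speed))

# regular division gives a float, even when it divides evenly
minutes_per_period = penalty_minutes / 3
print(minutes_per_period)
print(type(minutes_per_period))
```

So 15 / 3 is 5.0, not 5. For most analysis work that difference won't matter.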
Keep in mind the difference between strings (quotes) and variables (no quotes). A variable is a name for a piece of information. A string (or a number) is the information.
One common thing to do with strings is to insert variables inside of them. The easiest way to do that is via f-strings.
In [12]: starters = f'{starting_c}, {starting_lw}, etc.'
In [13]: starters
Out[13]: "Ryan O'Reilly, David Perron, etc."
Note the f immediately preceding the quotation mark. Adding that tells Python you want to use variables inside your string, which you wrap in curly brackets.
f-strings were added in Python 3.6, so if they're not working for you make sure that's at least the version you're using.
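The curly brackets can hold more than plain variable names. Here is a quick sketch with made-up numbers that puts a calculation inside the brackets and uses a format spec to control the decimals:

```python
starting_c = "Ryan O'Reilly"
goals = 3  # hypothetical numbers for illustration
games = 4

# variables go inside the curly brackets
summary = f'{starting_c} has {goals} goals in {games} games'

# expressions work too; the :.2f after the colon means two decimal places
rate = f'{goals / games:.2f} goals per game'

print(summary)
print(rate)
```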
Strings also have useful methods you can use to do things to them. You invoke methods with a . and parentheses. For example, to make a string uppercase you can do:
In [14]: 'do you believe in miracles!?'.upper()
Out[14]: 'DO YOU BELIEVE IN MIRACLES!?'
Note the parentheses, e.g. upper(). That's because sometimes these take additional data. For example, the replace method takes two strings: the one you want to replace, and what you want to replace it with:
In [15]: 'Bernie Geoffrion'.replace('Bernie', 'Boom Boom')
Out[15]: 'Boom Boom Geoffrion'
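Because each method hands back a new string, you can call another method on the result right away. A minimal sketch, using the title method (which isn't covered elsewhere in this post):

```python
name = 'bernie geoffrion'

# replace returns a new string, and title is then called on that result
nickname = name.replace('bernie', 'boom boom').title()
print(nickname)

# the original string is left unchanged
print(name)
```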
There are a bunch of these string methods, most of which you won't use that often. Going through them all right now would bog down progress on more important things. But occasionally you will need one of these string methods. How should we handle this?
The problem is we're dealing with a comprehensiveness-clarity trade off. And, since anything short of Python in a Nutshell: A Desktop Quick Reference (which is 772 pages) is going to necessarily fall short on comprehensiveness, we'll do something better.
Rather than teaching you all 44 of Python's string methods, I am going to teach you how to quickly see which are available, what they do, and how to use them.
Though we're nominally talking about string methods here, this advice applies to any of the programming topics we'll cover in this book.
Interlude: How to Figure Things Out in Python
"A simple rule I taught my nine year-old today: if you can't figure something out, figure out how to figure it out." — Paul Graham
The first tool you can use to figure out your options is the REPL. In particular, the REPL's tab completion functionality. Type in a string like 'sidney crosby' then . and hit tab. You'll see all the options available to you (this is only the first page, you'll see more if you keep pressing tab).
'sidney crosby'.
capitalize() encode() format()
isalpha() isidentifier() isspace()
ljust() casefold() endswith()
format_map() isascii() islower()
Note: tab completion on a string directly like this doesn't always work in Spyder. If it's not working for you, assign 'sidney crosby' to a variable and tab complete on that. Like this:
In [16]: foo = 'sidney crosby'
Out[16]: foo.
capitalize() encode() format()
isalpha() isidentifier() isspace()
ljust() casefold() endswith()
format_map() isascii() islower()
Then, when you find something you're interested in, enter it in the REPL with a question mark after it, like 'sidney crosby'.capitalize? (or foo.capitalize? if you're doing it that way).
You'll see:
Signature: str.capitalize(self, /)
Docstring:
Return a capitalized version of the string.
More specifically, make the first character have upper case and
the rest lower case.
So, in this case, it sounds like capitalize will make the first letter uppercase and the rest of the string lowercase. Let's try it:
In [17]: 'sidney crosby'.capitalize()
Out[17]: 'Sidney crosby'
Great. Many of the items you'll be working with in the REPL have methods, and tab completion is a great way to explore what's available.
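Tab completion and the question mark are features of the IPython REPL. If you're ever in a plain Python session without them, the built-in dir function gives you similar information. A rough sketch:

```python
# dir() lists everything available on an object; names starting with an
# underscore are for internal use, so filter those out
string_methods = [m for m in dir(str) if not m.startswith('_')]

print(len(string_methods))
print(string_methods[:5])

# a method's docstring is the same text the REPL shows with ?
print(str.capitalize.__doc__)
```

The built-in help function, e.g. help(str.capitalize), prints similar documentation in a friendlier layout.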
The second strategy is more general. Maybe you want to do something that you know is string related but aren't necessarily sure where to begin or what it'd be called.
For example, maybe you've scraped some data that looks like:
In [18]: ' sidney crosby'
But you want it to be like this, i.e. without the spaces before "sidney":
In [19]: 'sidney crosby'
Here's what you should do — and I'm not trying to be glib here — Google: "python string get rid of leading white space".
When you do that, you'll see the first result is from stackoverflow and says:
"The lstrip() method will remove leading whitespaces, newline and tab characters on a string beginning."
A quick test confirms that's what we want.
In [20]: ' sidney crosby'.lstrip()
Out[20]: 'sidney crosby'
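Once you've found lstrip, tab completion will turn up its relatives rstrip and strip. A short sketch of all three, using a made-up messy string:

```python
# hypothetical scraped value with extra spaces on both ends
raw_name = '   sidney crosby  '

left_clean = raw_name.lstrip()   # removes leading whitespace only
right_clean = raw_name.rstrip()  # removes trailing whitespace only
both_clean = raw_name.strip()    # removes whitespace from both ends

# repr() shows the quotes, which makes leftover spaces easy to spot
print(repr(left_clean))
print(repr(right_clean))
print(repr(both_clean))
```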
Stackoverflow
Python — particularly the data libraries we'll be using — became popular during the golden age of stackoverflow.com, a programming question and answer site that specializes in answers to small, self-contained technical problems.
How it works: people ask questions related to programming, and other, more experienced programmers answer. The rest of the community votes, both on questions ("that's a very good question, I was wondering how to do that too") as well as answers ("this solved my problem perfectly"). In that way, common problems and the best solutions rise to the top over time. Add in Google's search algorithm, and you usually have a way to figure out exactly how to do most anything you'll want to do in a few minutes.
You don't have to ask questions yourself or vote or even make a stackoverflow account to get the benefits. Most people probably don't. But enough people do, especially when it comes to Python, that it's a great resource.
If you're used to working like this, this advice may seem obvious. Like I said, I don't mean to be glib. Instead, it's intended for anyone who might mistakenly believe "real" coders don't Google things.
As programmer-blogger Umer Mansoor writes,
Software developers, especially those who are new to the field, often ask this question... Do experienced programmers use Google frequently?
The resounding answer is YES, experienced (and good) programmers use Google... a lot. In fact, one might argue they use it more than the beginners. [that] doesn't make them bad programmers or imply that they cannot code without Google. In fact, truth is quite the opposite: Google is an essential part of their software development toolkit and they know when and how to use it.
A big reason to use Google is that it is hard to remember all those minor details and nuances especially when you are programming in multiple languages... As Einstein said: 'Never memorize something that you can look up.'
Now you know how to figure things out in Python. Back to the basics.
Bools
There are other data types besides strings and numbers. One of the most important ones is bool (for boolean).
Booleans, which exist in nearly every language, are for binary, yes or no, true or false data. While a string can have an almost unlimited number of different values, and an integer can be any whole number, bools in Python only have two possible values: True or False.
Similar to variable names, bool values lack quotes. So "True" is a string, not a bool.
Any Python expression that answers a yes or no question evaluates to a bool. For example:
# some numbers to use in our examples
In [21]: team1_goals = 2
In [22]: team2_goals = 1
# these are all bools
In [23]: team1_won = team1_goals > team2_goals
In [24]: team2_won = team2_goals > team1_goals
In [25]: teams_tied = team1_goals == team2_goals
In [26]: teams_did_not_tie = team1_goals != team2_goals
In [27]: type(team1_won)
Out[27]: bool
In [28]: teams_did_not_tie
Out[28]: True
Notice the == by teams_tied. That tests for equality. It's the double equals sign because, as we learned above, Python uses the single = to assign to a variable. This would give an error:
In [29]: teams_tied = (team1_goals = team2_goals)
...
SyntaxError: invalid syntax
So team1_goals == team2_goals will be True if those numbers are the same, False if not.
The reverse is !=, which means not equal. The expression team1_goals != team2_goals is True if the values are different, False if they're the same.
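To see these comparisons in action, including the earlier point that "True" in quotes is a string rather than a bool, here is a small sketch. The variable names on the left are just for illustration.

```python
team1_goals = 2
team2_goals = 1

# == and != give opposite answers for the same pair of values
same_score = team1_goals == team2_goals
different_score = team1_goals != team2_goals
print(same_score, different_score)

# the string 'True' is not the bool True
text_vs_bool = 'True' == True
print(text_vs_bool)
print(type('True'), type(True))

# an int and a float with the same value do compare equal
int_vs_float = 2 == 2.0
print(int_vs_float)
```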
You can manipulate bools (i.e. chain them together or negate them) using the keywords and, or, not and parentheses.
In [30]: shootout = (team1_goals > 4) and (team2_goals > 4)
In [31]: at_least_one_good_team = ((team1_goals > 4) or
                                   (team2_goals > 3))
In [32]: you_guys_are_bad = not ((team1_goals > 1) or
                                 (team2_goals > 1))
In [33]: meh = not (shootout or
                    at_least_one_good_team or
                    you_guys_are_bad)
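If it's not obvious what each keyword does, a stripped down sketch may help. It uses the same goal totals as above, with hypothetical variable names on the left:

```python
team1_goals = 2
team2_goals = 1

# and: True only when both sides are True
both_scored = (team1_goals > 0) and (team2_goals > 0)

# or: True when at least one side is True
someone_scored_six = (team1_goals > 5) or (team2_goals > 5)

# not: flips True to False and False to True
nobody_scored_six = not someone_scored_six

print(both_scored, someone_scored_six, nobody_scored_six)
```

The parentheses aren't strictly required here, but they make it clear which comparisons get evaluated before and, or and not are applied.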
if statements
Bools are used frequently; one place is with if statements. The following code assigns a string to a variable message depending on what happened.
In [34]:
if team1_won:
    message = "Nice job team 1!"
elif team2_won:
    message = "Way to go team 2!!"
else:
    message = "must have tied!"

In [35]: message
Out[35]: 'Nice job team 1!'
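Python checks the branches from top to bottom and runs only the first one that's True. To see the else branch run, here is the same logic as a standalone sketch with a made-up tied score. Note that the indented lines are what tell Python which code belongs to which branch.

```python
# hypothetical tied game
team1_goals = 3
team2_goals = 3

team1_won = team1_goals > team2_goals
team2_won = team2_goals > team1_goals

# neither bool is True, so Python falls through to else
if team1_won:
    message = 'Nice job team 1!'
elif team2_won:
    message = 'Way to go team 2!!'
else:
    message = 'must have tied!'

print(message)
```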
Notice how in the code I'm saying if team1_won, not if team1_won == True. While the latter would technically work, it's a good way to show anyone looking at your code that you don't really understand bools. team1_won is already a bool with the value True. team1_won == True is also True, and it's still a bool, so the comparison adds nothing.
Similarly, don't write team1_won == False, write not team1_won.
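Here's a quick sketch of the two styles side by side. Both give the same answer, which is the point: the comparison is extra typing for no benefit. The result variable names are just for illustration.

```python
team1_won = False

# preferred: use the bool directly
if not team1_won:
    result = 'team 1 did not win'
else:
    result = 'team 1 won'

# works, but the comparison is redundant
if team1_won == False:
    redundant_result = 'team 1 did not win'
else:
    redundant_result = 'team 1 won'

print(result == redundant_result)
```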
A boolean, like a string or number, is a basic building block type. In part 3 we'll expand on this with some container types.