Web scraping with Calvin & Hobbes – Part I

Published: March 18, 2019

Computer programming is all about writing instructions that computers can follow to achieve some task. The individual instructions are surprisingly simple, but many simple instructions can be chained together to build complex systems. That’s what software engineers do (more or less). That’s what I do at Blue Sky. I write code to help artists make animated movies.

I don’t think I’ve ever really talked about programming on this blog before–nothing technical at least–so a post like this is probably long overdue. Honestly, I consider myself to be a pretty mediocre programmer. Not bad, but not great either. I think I have a lot of skills that make me a valuable employee, but my technical skills are kind of meh compared to the true pros. The code I write works, but my solutions aren’t always elegant. You have been warned.

One thing computers are really great at is relieving us of the boring, repetitive tasks that often plague our lives. For example, if a website is hosting a bunch of images that you want to download, instead of manually visiting hundreds of different web pages to download the images (right click, “save image as”, enter file name, save, repeat), you could write some code and have a computer complete the task for you instead. A well-defined sequence of actions that needs to happen over and over again is a computer’s bread and butter. That’s what I hope to demonstrate in this series of posts.

A hackathon project

Back in 2017 I participated in a hackathon at work. What’s a hackathon? “A hackathon is a social coding event in which engineers come together to create and build out-of-the-box ideas in a short amount of time.” In this case we had 48 hours to build. There wasn’t a theme, so the possibility space was wide open. I’m a big Calvin & Hobbes fan, so I chose to make an app that could generate new Calvin & Hobbes comic strips from existing comic panels.

My first step was to get my hands on the data needed to make an app like this function. In this case, I needed digital images of Calvin & Hobbes comics. I own all the books, like real paper books, so worst case I could scan the images myself, but that’s boring, tedious, and would take forever. Instead, I found a website that was hosting all the comics. Someone had already suffered through digitizing this stuff, so why duplicate that work? The images existed, I just needed to download them.

I could have manually downloaded each strip one at a time, but that’s boring, tedious, and would take forever (again). The solution? Write a simple script to visit the web pages and download the relevant images automatically. It works faster than I ever could, and it frees me up to work on other tasks. Programming to the rescue!

I’m going to spend the rest of this post walking through parts of the script in detail. Here’s a link to the repository where you can view/copy/download the code. And here’s a direct link to the file that we’re actually talking about in this post (you should open this).

Python web scraping 101

I’m gonna walk through things in kind of a haphazard way and just talk about whatever pops into my head. My goal is to make this accessible to complete beginners. If this example sparks your interest and you have questions, feel free to reach out! I’m always happy to chat!

Comments

At 128 lines, the code might look kinda scary, but a lot of these lines are comments. Comments are messages programmers write to themselves or to other people who might read the code to explain what the code is doing. Comments are for humans. They’re not instructions that the computer needs to execute.

"""
This is a comment
"""

# This is also a comment

Minus all the comments and unnecessary whitespace, the core logic is only about 50 lines long.
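Here’s a tiny sketch of comments living next to real code. The add function is made up for this example and isn’t part of the scraper. One nuance: the triple-quoted form is technically a string that Python reads and then ignores, unless it’s the first thing inside a function, class, or file, in which case Python keeps it around as documentation (a “docstring”).

```python
# A hypothetical function, only here to illustrate comments.
def add(a, b):
    """Return the sum of a and b."""
    # Python skips this line entirely.
    return a + b  # a comment can also follow code on the same line


result = add(2, 3)
print(result)
print(add.__doc__)  # the triple-quoted text above was kept as the docstring
```

The two lines that start with # have no effect on what the function does. Only the return line does any work.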

Syntax

This script is written in Python. Python is just one of many different programming languages that exist. It’s widely regarded as one of the easiest languages to read. Certain languages are better at certain things, but general-purpose programming languages tend to have a lot in common.

Every programming language has a unique syntax: rules that govern how the code has to be written/formatted/structured in order to run. In Python, single line comments start with the “#” symbol, for instance. That’s Python’s syntax.

If we were using Swift instead (the recommended language for writing iOS applications), comments would look like this:

/*
This is a comment
*/

// This is also a comment

The syntax is different, but the meaning is the same. A comment is a comment no matter the language. Syntax things are easy to look up if you forget, so it’s not a big deal if you don’t have it all memorized. The harder part is the logic part–not how to write something but what to write in the first place.

We’re not gonna get bogged down in syntax here. All you need to know is that we’re looking at Python code in this post, and if you were to rewrite this code in another programming language, then it would look a little bit different even if the logic was the same.

Fun fact: I use Python all the time at Blue Sky because it’s one of the most popular programming languages for writing pipeline code at animation studios. If you’re interested in joining the animation industry as a technical director or software engineer, then you can’t go wrong learning Python.

Import statements

At the top of the file are things called import statements:

import os
import time
import shutil
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta

These lines bring in pieces of code written by other people for us to use. For example, there’s a module called datetime that gives us tons of functionality when we need to work with dates or times. Things like converting time zones, figuring out what day of the week it is today, etc. Developers often write useful things, package them up, and then share them so that other developers don’t have to reinvent the wheel all the time.

Most of the import statements used here are for common things like saving files and manipulating dates. The requests and BeautifulSoup libraries are special for our web scraping use case, but we’ll talk more about those in the next post.
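To make the datetime example a bit more concrete, here’s a minimal sketch that uses the same date and timedelta imports. The date and variable names are just ones I picked for illustration. They don’t come from the scraper.

```python
from datetime import date, timedelta

# An arbitrary example date: the day this post was published.
published = date(2019, 3, 18)

# weekday() returns a number from 0 (Monday) to 6 (Sunday).
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
weekday_name = day_names[published.weekday()]

# Adding a timedelta moves the date forward by that amount of time.
one_week_later = published + timedelta(days=7)

print(weekday_name)                # Monday
print(one_week_later.isoformat())  # 2019-03-25
```

We didn’t have to write any calendar math ourselves. Someone else already did that work, and the import statement lets us borrow it.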

Running a Python script

Let’s pretend that you copied the script onto your computer, and now you want to run it. How do you do that? First, you need to make sure that you have Python installed on your computer. If you’re working on a Mac, then it’s probably already installed for you. You can double-check by opening up the Terminal application and typing: which python (all lowercase, since Terminal commands are case-sensitive). The which command prints out the location of known executable files, so if you see something print out, then you’re good to go! Mine looks like /usr/bin/python. Here’s an installation guide if things aren’t set up for you yet. I recommend installing Python 3 since Python 2 is old and on the way out.

Now that you have Python installed, you can execute Python scripts! Using the Terminal application again, locate the script (wherever you saved it). Just like how you can navigate your computer’s file system using the Finder application, you can also navigate the file system using the Terminal application and Unix commands. There’s a lot of power in knowing how to navigate and control your computer from the command line. Here’s a useful guide to some of the most basic Unix commands. We’re most interested in cd (change directory), cd .. (go back one directory), and ls (show me what files are in the current directory).

Find the script: cd path/to/directory/where/script/lives/

Execute the script: python name_of_script.py

That’s pretty much all there is to it. Installing Python was the hard part.
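If you want something harmless to practice on first, here’s a small sketch of a script that reports which version of Python is running it. You could save it under any name you like (check_version.py is just a suggestion) and run it the same way as above.

```python
import sys

# sys.version_info describes the Python that is running this script.
major = sys.version_info.major
minor = sys.version_info.minor

print("Running Python %d.%d" % (major, minor))

if major < 3:
    print("This is Python 2. Consider installing Python 3.")
```

If the first number that prints is 2, then the python command on your machine points at the old version.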

Where the code starts

This part is going to be super nontechnical. Partly because I don’t want to get sidetracked for 8000 words, but mostly because this part is complicated and I don’t understand it very well myself. As developers, we work at a really high level. The code we write is more or less human readable once you know what to look for. When you execute your code though, it needs to get translated into low-level instructions that computer hardware can understand. It’s not quite 1s and 0s at this point, but it’s close. Programs known as compilers and interpreters do this part for us (in Python’s case, it’s a program called the Python interpreter). It’s complicated stuff and an entire field of study all by itself.

The good news is we don’t need to have a comprehensive understanding of this process to be effective programmers. We can let smarter people deal with the complexities of the translation process.

Here’s kinda what’s happening at a 30,000 foot level though. When you execute the script, the file gets interpreted starting with the first line down. Whitespace is important in Python, and only the lines of code that don’t have any indentation at all (they exist all the way to the left) get executed during the first pass. This is a gross oversimplification but whatever.
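Here’s a small sketch of that idea. The events list and the greet function are made up for this example. The lines that sit all the way to the left run right away, from top to bottom, while the indented line inside greet waits until something calls greet.

```python
events = []

events.append("top of file")    # not indented: runs immediately


def greet():
    # Indented: only runs when greet() is called.
    events.append("inside greet")


events.append("after the def")  # runs before the body of greet does

greet()                         # now the indented body runs

print(events)
# ['top of file', 'after the def', 'inside greet']
```

Notice that "inside greet" ends up last in the list even though it appears second in the file.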

At the top of the file we have our import statements, so all the code we’re referencing there gets brought in for us to use:

import os
import time
import shutil
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta

Next we have this line:

class CalvinAndHobbesWebCrawler(object):

Here we define the main object that we created to organize the rest of our code underneath. We give this object a name so that we can reference it later in the code: CalvinAndHobbesWebCrawler.

And that’s it until we get to the very bottom of the file:

if __name__ == '__main__':
    import sys
    import traceback
    try:
        crawler = CalvinAndHobbesWebCrawler()
        crawler.go()
    except SystemExit:
        raise
    except:
        if '--verbose' in sys.argv:
            traceback.print_exc(file=sys.stderr)
        else:
            sys.stderr.write("Error: %s\n" % sys.exc_info()[1])
        sys.exit(1)

This is probably the scariest looking part of the code, but we can safely ignore almost all of it. Three of these lines are pretty interesting though:

if __name__ == '__main__':

This is one of those not indented, all-the-way-to-the-left lines, so we execute it. What exactly are we executing here though? Well, every Python file has a special variable (more on variables in the next post) called __name__ that gets set when the file gets loaded/interpreted. Typically this __name__ variable has a value that matches the name of the current Python file (minus the .py extension). There is an exception to this rule though. If the current Python file is the source file (the file that was initially executed), then the __name__ variable gets set to a special value: '__main__'.

That’s the condition that line #115 is checking for: Is the current file the file that was executed? If yes, then execute the code contained in this block (the indented code that exists beneath this line). If no, then don’t execute this block of code. That if statement is known as a conditional. It’s one of a handful of control flow statements that let us control when certain blocks of code should be executed and when they should be skipped.

The file we’re currently looking at is the source file because it’s the file that we executed (python name_of_script.py, remember?), so the __name__ variable does equal '__main__' in this case, which means that we “enter into” this block of code and continue executing the indented lines of code.

That was a really wordy explanation, but hopefully it makes sense.
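If you’d like to poke at __name__ yourself, here’s a minimal sketch. The describe function is a hypothetical helper I wrote for this example. The os module is used only because it’s a file that we import rather than execute.

```python
import os  # a standard module, used here only as an example of an imported file


def describe(module_name):
    # Hypothetical helper: explain what a given __name__ value means.
    if module_name == "__main__":
        return "this file was executed directly"
    return "this file was imported as '%s'" % module_name


print(describe(__name__))     # depends on how this code is being run
print(describe(os.__name__))  # os was imported, so its __name__ is 'os'
```

Run this as a script and the first line should say that the file was executed directly. Import it from another file instead and the same line reports the file’s name.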

Object-oriented programming

So now we’re in the block of code that starts on line #116. There are a few more import statements here. We already know about those. Next, there’s some scary looking code that uses words like try and except. This code is doing some error handling for us so that if something breaks while the program is running, then a useful error message will print to the Terminal that describes what went wrong. I’m not going to dive into error handling in this post, but you can read more about it here if you’re feeling adventurous.
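For a taste of what try and except do, here’s a much smaller sketch than the one in the script. The safe_divide function is made up for illustration. The idea is the same though: try the risky thing, and if it breaks, write a readable message instead of crashing.

```python
import sys


def safe_divide(a, b):
    # Hypothetical helper, not part of the scraper.
    try:
        return a / b
    except ZeroDivisionError as error:
        # Dividing by zero lands here instead of stopping the program.
        sys.stderr.write("Error: %s\n" % error)
        return None


print(safe_divide(10, 2))
print(safe_divide(1, 0))  # writes an error message, then prints None
```

The second call would normally stop the program with an error. Because the division is wrapped in try, the except block gets a chance to handle the problem and the program keeps going.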

Lines #119 and #120 are the fun ones. They look simple, but they’ve got a lot of important programming concepts packed into them:

crawler = CalvinAndHobbesWebCrawler()
crawler.go()

This might be the briefest and lamest explanation of object-oriented programming ever, so buckle up.

Python is an object-oriented programming language, which means that it lets us organize our code into objects. These objects can be anything really–cars, people, recipes, telephone records–whatever makes sense to help us solve the problem at hand. Of course these objects aren’t real physical things, they’re just representations of real (or abstract) things that we’ve defined in our code.

To create a new kind of object that we can use in our code, we have to define a class for that object. Classes are like templates or blueprints that describe objects. Hmm… class… that sounds familiar. Oh yeah! It’s because we’ve actually already defined one of those in this very file! Take another look at line #9:

class CalvinAndHobbesWebCrawler(object):

This is a class I made to help us solve our comic strip web crawling problem. I mentioned that classes are like blueprints that describe objects, but what exactly do these blueprints contain? Two things mostly: data and actions.

Data is the information that an object cares about. Actions are all the things that the object is capable of doing. Most of these actions involve manipulating the object’s data in some way. In programming, these actions are commonly referred to as methods or functions.

Here’s an example: Pretend that we’re making a car racing video game. One of the objects we would need to define in our code (probably the most important object) is a car. What is a car? Well, a car has wheels and an engine and gasoline and a top speed and the number of miles it’s driven so far. All of these things could be attributes for our car object–the car related data we want to define and keep track of in our game.

What about actions? Well, a car can accelerate and brake and overheat and go in reverse. If these actions are important to our game, then we’ll want to capture them as methods within our car class definition.

Make sense? Probably not. That’s okay. Object-oriented programming is a big, big concept. It’s gonna take a few examples before it starts to sink in.

But that’s object-oriented programming in a nutshell. We create classes to define objects in our code that will help us solve some problem. Classes are like blueprints that describe the data an object contains and the actions an object can perform. Once you get used to it, it becomes a very intuitive way to organize code.
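Here’s roughly what the car example could look like as actual Python. This is a sketch with names and numbers I invented, not code from a real game. The attributes (data) get set up in a special method called __init__, and the actions are the other methods. The last few lines create a car from the class and use it, which is the part the rest of this post explains.

```python
class Car(object):
    """A hypothetical car for a racing game."""

    def __init__(self, top_speed):
        # Data: the things this car keeps track of.
        self.top_speed = top_speed
        self.speed = 0

    # Actions: the things this car can do.
    def accelerate(self, amount):
        # Speed up, but never past the top speed.
        self.speed = min(self.speed + amount, self.top_speed)

    def brake(self, amount):
        # Slow down, but never below zero.
        self.speed = max(self.speed - amount, 0)


car = Car(top_speed=120)
car.accelerate(150)
print(car.speed)  # 120, capped at the top speed
car.brake(50)
print(car.speed)  # 70
```

Both methods change the car’s own data, which is the “actions manipulate the object’s data” idea from earlier.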

Line #9 is where the blueprint for our CalvinAndHobbesWebCrawler class starts. I’ll break down the data and actions this class contains in the next post, but if we go back to line #119 really quick, we can see some of these object-oriented principles in action:

crawler = CalvinAndHobbesWebCrawler()

This is where we left off when we were tracing what happens when the script gets executed, remember? We’ve defined our class, and now we’re ready to actually do something with it.

I mentioned that classes are like blueprints. A blueprint for a car isn’t actually a car though. You can’t sit in it or drive it. But the blueprint does know how to make a car that you can sit in and drive. In this respect, I think it’s helpful to think of classes as not just blueprints but as factories as well.

A factory is set up to create a very specific thing. Different things have different requirements, so each thing requires its own unique factory to make it. Defining a class is like setting up a factory, and once we have the factory set up, we can ask it to build us an instance of the thing that we defined in the blueprint.

I feel like I’m saying “blueprint” and “factory” way too much.

Line #119 is asking our CalvinAndHobbesWebCrawler class/blueprint/factory to create an instance of the class for us that we’re calling crawler. This crawler object contains the data that we defined in the class and it can perform the actions that we also defined in the class.

crawler.go()

Line #120 calls (or executes) one of these CalvinAndHobbesWebCrawler actions. In this case we’re calling the go method because we want to execute the code that’s contained inside of that method. The go method is where the web scraping process really begins.
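To see the factory idea in miniature, here’s a sketch with a made-up Greeter class. It has nothing to do with the real crawler beyond also having a method named go. One class produces two separate instances, and each instance keeps track of its own data.

```python
class Greeter(object):
    """Hypothetical class, unrelated to the real crawler."""

    def __init__(self, name):
        self.name = name
        self.greetings_sent = 0

    def go(self):
        # Calling go() changes only the instance it was called on.
        self.greetings_sent += 1
        return "Hello from %s" % self.name


first = Greeter("Calvin")   # ask the factory for one instance
second = Greeter("Hobbes")  # and then for another

print(first.go())
print(first.go())
print(second.go())

print(first.greetings_sent)   # 2
print(second.greetings_sent)  # 1
```

Same blueprint, two different objects. Calling go on first twice doesn’t touch the counter that belongs to second.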

Part 2 coming soon!

I’m exhausted, and we haven’t even gotten into the actual logic behind the web scraping yet. No wonder people think programming is such an intimidating hobby.

If you do feel intimidated, just remember that all we’re really doing is chaining tiny actions together into something that’s larger than the sum of its parts. One baby step at a time.

The process can be daunting though, so it’s okay to feel daunted, but don’t get discouraged. Some people finish computer science degrees and still don’t fully understand all of this beginner-level stuff. Present company included. It takes time. And trial and error. And lots and lots of questions.

We’ll talk more about variables and methods and actually walk through the web scraping code in the next post. See you soon!

***

Image credit: Dimitar Belchev