This first part of this book focuses on the basic mechanics of web scraping: how to use Python to
request information from a web server, how to perform basic handling of the server’s response,
and how to begin interacting with a website in an automated fashion. By the end, you’ll be cruising
around the internet with ease, building scrapers that can hop from one domain to another, gather
information, and store that information for later use.
To be honest, web scraping is a fantastic field to get into if you want a huge payout for relatively
little up-front investment. In all likelihood, 90% of web scraping projects you’ll encounter will
draw on techniques used in just the next 6 chapters. This section covers what the general (albeit
technically savvy) public tends to think of when they think of “web scrapers”:
Retrieving HTML data from a domain name
Parsing that data for target information
Storing the target information
Optionally, moving to another page to repeat the process
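Those four steps can be sketched with nothing but Python's standard library. In this sketch, a small in-memory "website" stands in for real HTTP requests, and the page paths and contents are invented for illustration; real scrapers lean on the fetching and parsing libraries covered later in this book:

```python
from html.parser import HTMLParser

# A toy in-memory "website": page paths mapped to HTML strings.
# A real scraper would fetch each page over HTTP instead.
SITE = {
    "/index.html": '<h1>Home</h1><a href="/about.html">About</a>',
    "/about.html": '<h1>About</h1><a href="/index.html">Home</a>',
}

class PageParser(HTMLParser):
    """Collects <a href> targets and <h1> text from one page."""
    def __init__(self):
        super().__init__()
        self.links, self.headings = [], []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data)

def scrape(start_page):
    """Retrieve, parse, store, then follow links until no new pages remain."""
    stored, queue, seen = {}, [start_page], set()
    while queue:
        page = queue.pop()
        if page in seen:
            continue
        seen.add(page)
        html = SITE[page]               # 1. retrieve the HTML
        parser = PageParser()
        parser.feed(html)               # 2. parse it for target information
        stored[page] = parser.headings  # 3. store the target information
        queue.extend(parser.links)      # 4. move on to the linked pages
    return stored

print(scrape("/index.html"))
```

Swapping the `SITE` lookup for a real network request is, in essence, all that separates this toy from a working scraper.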
This will give you a solid foundation before moving on to more complex projects in Part II. Don’t
be fooled into thinking that this first section isn’t as important as some of the more advanced
projects in the second half. You will use nearly all the information in the first half of this book on
a daily basis while writing web scrapers!
Chapter 1. How the Internet Works
I have met very few people in my life who truly know how the internet works, and I am certainly
not one of them.
The vast majority of us are making do with a set of mental abstractions that allow us to use the
internet just as much as we need to. Even for programmers, these abstractions might extend only
as far as what was required for them to solve a particularly tricky problem once in their career.
Due to limitations in page count and the knowledge of the author, this chapter must also rely on
these sorts of abstractions. It describes the mechanics of the internet and web applications, to the
extent needed to scrape the web (and then, perhaps a little more).
This chapter, in a sense, describes the world in which web scrapers operate: the customs, practices,
protocols, and standards that will be revisited throughout the book.
When you type a URL into the address bar of your web browser and hit Enter, interactive text,
images, and media spring up as if by magic. This same magic is happening for billions of other
people every day. They’re visiting the same websites, using the same applications—often getting
media and text customized just for them.
And these billions of people are all using different types of devices and software applications,
written by different developers at different (often competing!) companies.
Amazingly, there is no all-powerful governing body regulating the internet and coordinating its
development with any sort of legal force. Instead, different parts of the internet are governed by
several different organizations that evolved over time on a somewhat ad hoc and opt-in basis.
Of course, choosing not to opt into the standards that these organizations publish may result in
your contributions to the internet simply...not working. If your website can’t be displayed in
popular web browsers, people likely aren’t going to visit it. If the data your router is sending can’t
be interpreted by any other router, that data will be ignored.
Web scraping is, essentially, the practice of replacing the web browser with an application of your
own design. Because of this, it’s important to understand the standards and frameworks that web
browsers are built on. As a web scraper, you must both mimic and, at times, subvert the expected
internet customs and practices.
Networking
In the early days of the telephone system, each telephone was connected by a physical wire to a
central switchboard. If you wanted to make a call to a nearby friend, you picked up the phone,
asked the switchboard operator to connect you, and the switchboard operator physically created
(via plugs and jacks) a dedicated connection between your phone and your friend’s phone.
Long-distance calls were expensive and could take minutes to connect. Placing a long-distance
call from Boston to Seattle required switchboard operators across the United States to coordinate,
creating a single enormous length of wire that directly connected your phone to the recipient’s.
Today, rather than make a telephone call over a temporary dedicated connection, we can make a
video call from our house to anywhere in the world across a persistent web of wires. The wire
doesn’t tell the data where to go; the data guides itself, in a process called packet
switching. Although many technologies over the years contributed to what we think of as “the
internet,” packet switching is really the technology that single-handedly started it all.
In a packet-switched network, the message to be sent is divided into discrete ordered packets, each
with its own sender and destination address. These packets are routed dynamically to any
destination on the network, based on that address. Rather than being forced to blindly traverse the
single dedicated connection from sender to receiver, the packets can take any path the network
chooses. In fact, packets in the same message transmission might take different routes across the
network and be reordered by the receiving computer when they arrive.
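The mechanics can be sketched in a few lines of Python; the addresses and packet format here are simplified stand-ins, not a real protocol:

```python
import random

def to_packets(message, src, dst, size=5):
    """Divide a message into discrete, ordered packets, each carrying
    its own sender and destination address plus a sequence number."""
    return [
        {"src": src, "dst": dst, "seq": i, "data": message[i:i + size]}
        for i in range(0, len(message), size)
    ]

def reassemble(packets):
    """The receiving computer reorders packets by sequence number,
    no matter what order (or route) they arrived by."""
    return "".join(p["data"] for p in sorted(packets, key=lambda p: p["seq"]))

packets = to_packets("Hello, packet switching!", src="10.0.0.1", dst="10.0.0.2")
random.shuffle(packets)  # packets may take different routes and arrive out of order
print(reassemble(packets))  # Hello, packet switching!
```

The shuffle stands in for the network's freedom to route each packet independently; the sequence numbers are what let the receiver put the message back together.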
If the old phone networks were like a zip line—taking passengers from a single point at the
top of a hill to a single destination at the bottom—then packet-switched networks are like a
highway system, where cars going to and from multiple destinations are all able to use the same
roads.
A modern packet-switching network is usually described using the Open Systems Interconnection
(OSI) model, which is composed of seven layers of routing, encoding, and error handling:
1. Physical layer
2. Data link layer
3. Network layer
4. Transport layer
5. Session layer
6. Presentation layer
7. Application layer
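As a rough orientation, each layer can be associated with technologies you may already recognize. The pairings below are approximate (real protocols don't always fit neatly into a single layer) and are included purely for illustration:

```python
# An illustrative mapping from OSI layer number to its name and some
# technologies commonly associated with it. The associations are
# approximate, not an authoritative classification.
OSI_LAYERS = {
    1: ("Physical",     ["Ethernet cabling", "WiFi radio"]),
    2: ("Data link",    ["Ethernet frames", "MAC addresses"]),
    3: ("Network",      ["IP"]),
    4: ("Transport",    ["TCP", "UDP"]),
    5: ("Session",      ["connection management"]),
    6: ("Presentation", ["TLS", "character encodings"]),
    7: ("Application",  ["HTTP", "DNS"]),
}

for number, (name, examples) in OSI_LAYERS.items():
    print(f"Layer {number} ({name}): {', '.join(examples)}")
```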
Most web application developers spend their days entirely in layer 7, the application layer. This is
also the layer where the most time is spent in this book. However, it is important to have at least
conceptual knowledge of the other layers when scraping the web. For example, TLS fingerprinting,
discussed in Chapter 17, is a web scraping detection method that involves the transport layer.
In addition, knowing about all of the layers of data encapsulation and transmission can help
troubleshoot errors in your web applications and web scrapers.
Physical Layer
The physical layer specifies how information is physically transmitted with electricity over the
Ethernet wire in your house (or on any local network). It defines things like the voltage levels that
encode 1’s and 0’s, and how fast those voltages can be pulsed. It also defines how radio waves
over Bluetooth and WiFi are interpreted.
This layer does not involve any programming or digital instructions but is based purely on physics
and electrical standards.
Data Link Layer
The data link layer specifies how information is transmitted between two nodes in a local network,
for example, between your computer and a router. It defines the beginning and ending of a single
transmission and provides for error detection and recovery if the transmission is lost or garbled.
At this layer, the packets are wrapped in an additional “digital envelope” containing routing
information and are referred to as frames. When the frame’s routing information is no longer
needed, the frame is unwrapped and its contents continue across the network as a packet.
It’s important to note that, at the data link layer, all devices on a network are receiving the same
data at all times—there’s no actual “switching” or control over where the data is going. However,
devices that the data is not addressed to will generally ignore the data and wait until they get
something that’s meant for them.
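This wrap-and-ignore behavior can be sketched as follows; the frame format and MAC-style addresses are simplified illustrations, not real Ethernet:

```python
def frame(packet, src_mac, dst_mac):
    """Wrap a packet in a 'digital envelope' with link-layer addresses."""
    return {"src_mac": src_mac, "dst_mac": dst_mac, "payload": packet}

def receive(device_mac, frames):
    """Every device on the local network sees every frame, but a device
    unwraps only the frames addressed to it (or broadcast to everyone)
    and ignores the rest."""
    return [
        f["payload"]
        for f in frames
        if f["dst_mac"] in (device_mac, "ff:ff:ff:ff:ff:ff")
    ]

frames = [
    frame({"data": "for laptop"}, src_mac="aa:aa", dst_mac="bb:bb"),
    frame({"data": "for printer"}, src_mac="aa:aa", dst_mac="cc:cc"),
]
# The laptop (bb:bb) sees both frames but keeps only its own:
print(receive("bb:bb", frames))  # [{'data': 'for laptop'}]
```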