Presentation

Web Scraping with Python, 3rd Edition

Pages
294
Uploaded on
09-08-2024
Written in
2017/2018

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web. Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

• Parse complicated HTML pages
• Develop crawlers with the Scrapy framework
• Learn methods to store the data you scrape
• Read and extract data from documents
• Clean and normalize badly formatted data
• Read and write natural languages
• Crawl through forms and logins
• Scrape JavaScript and crawl through APIs
• Use and write image-to-text software
• Avoid scraping traps and bot blockers
• Use scrapers to test your website


Part I. Building Scrapers
The first part of this book focuses on the basic mechanics of web scraping: how to use Python to
request information from a web server, how to perform basic handling of the server’s response,
and how to begin interacting with a website in an automated fashion. By the end, you’ll be cruising
around the internet with ease, building scrapers that can hop from one domain to another, gather
information, and store that information for later use.

To be honest, web scraping is a fantastic field to get into if you want a huge payout for relatively
little up-front investment. In all likelihood, 90% of web scraping projects you’ll encounter will
draw on techniques used in just the next 6 chapters. This section covers what the general (albeit
technically savvy) public tends to think of when they think of “web scrapers”:

• Retrieving HTML data from a domain name
• Parsing that data for target information
• Storing the target information
• Optionally, moving to another page to repeat the process

This will give you a solid foundation before moving on to more complex projects in Part II. Don’t
be fooled into thinking that this first section isn’t as important as some of the more advanced
projects in the second half. You will use nearly all the information in the first half of this book on
a daily basis while writing web scrapers!
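
The four steps listed above (retrieve, parse, store, and optionally move on) can be sketched in a few lines of Python. This is only an illustrative outline, not the approach the later chapters necessarily take: the `FAKE_WEB` dictionary, the `TitleAndLinkParser` class, and the `scrape` function are all hypothetical names invented for this example, and the "retrieval" step reads from an in-memory dictionary instead of a real web server so that the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the web: URL -> HTML. A real scraper would
# request these pages from a web server instead.
FAKE_WEB = {
    "http://example.com/page1": '<h1>First page</h1><a href="/page2">next</a>',
    "http://example.com/page2": "<h1>Second page</h1>",
}


class TitleAndLinkParser(HTMLParser):
    """Collects the text inside <h1> tags and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data.strip())


def scrape(start_url):
    stored = {}              # step 3: where the target information ends up
    to_visit = [start_url]
    while to_visit:
        url = to_visit.pop()
        if url in stored or url not in FAKE_WEB:
            continue
        html = FAKE_WEB[url]             # step 1: retrieve the HTML
        parser = TitleAndLinkParser()
        parser.feed(html)                # step 2: parse for target information
        parser.close()
        stored[url] = parser.titles      # step 3: store it
        # step 4: queue linked pages so the process repeats
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return stored


results = scrape("http://example.com/page1")
```

Here `results` maps each visited URL to the headings found on that page. Swapping the dictionary lookup for a real HTTP request is the main change needed to point the same loop at a live site.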


Chapter 1. How the Internet Works
I have met very few people in my life who truly know how the internet works, and I am certainly
not one of them.

The vast majority of us are making do with a set of mental abstractions that allow us to use the
internet just as much as we need to. Even for programmers, these abstractions might extend only
as far as what was required for them to solve a particularly tricky problem once in their career.

Due to limitations in page count and the knowledge of the author, this chapter must also rely on
these sorts of abstractions. It describes the mechanics of the internet and web applications, to the
extent needed to scrape the web (and then, perhaps a little more).

This chapter, in a sense, describes the world in which web scrapers operate: the customs, practices,
protocols, and standards that will be revisited throughout the book.

When you type a URL into the address bar of your web browser and hit Enter, interactive text,
images, and media spring up as if by magic. This same magic is happening for billions of other
people every day. They’re visiting the same websites, using the same applications—often getting
media and text customized just for them.

And these billions of people are all using different types of devices and software applications,
written by different developers at different (often competing!) companies.

Amazingly, there is no all-powerful governing body regulating the internet and coordinating its
development with any sort of legal force. Instead, different parts of the internet are governed by
several different organizations that evolved over time on a somewhat ad hoc and opt-in basis.

Of course, choosing not to opt into the standards that these organizations publish may result in
your contributions to the internet simply...not working. If your website can’t be displayed in
popular web browsers, people likely aren’t going to visit it. If the data your router is sending can’t
be interpreted by any other router, that data will be ignored.

Web scraping is, essentially, the practice of substituting an application of your own design for a
web browser. Because of this, it’s important to understand the standards and frameworks that web
browsers are built on. As a web scraper, you must both mimic and, at times, subvert the expected
internet customs and practices.
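
One small, concrete way a program can stand in for a browser is to send the same kind of request headers a browser would. The sketch below only builds a request object with Python's standard `urllib.request.Request` class and does not send anything over the network. The header values, including the made-up `ExampleScraper/0.1` token, are assumptions chosen for illustration, not values any particular site expects.

```python
from urllib.request import Request

# Illustrative header values only; a real browser sends many more headers.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1",
    "Accept": "text/html",
}

# Build (but do not send) a request for a placeholder URL.
request = Request("http://example.com/", headers=BROWSER_LIKE_HEADERS)

# urllib stores header names in capitalized form, e.g. "User-agent".
user_agent = request.get_header("User-agent")
method = request.get_method()
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out here so the example has no side effects.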


Networking
In the early days of the telephone system, each telephone was connected by a physical wire to a
central switchboard. If you wanted to make a call to a nearby friend, you picked up the phone,
asked the switchboard operator to connect you, and the switchboard operator physically created
(via plugs and jacks) a dedicated connection between your phone and your friend’s phone.

Long-distance calls were expensive and could take minutes to connect. Placing a long-distance
call from Boston to Seattle would result in the coordination of switchboard operators across the
United States creating a single enormous length of wire directly connecting your phone to the
recipient’s.

Today, rather than make a telephone call over a temporary dedicated connection, we can make a
video call from our house to anywhere in the world across a persistent web of wires. The wire
doesn’t tell the data where to go; the data guides itself, in a process called packet
switching. Although many technologies over the years contributed to what we think of as “the
internet,” packet switching is really the technology that single-handedly started it all.

In a packet-switched network, the message to be sent is divided into discrete ordered packets, each
with its own sender and destination address. These packets are routed dynamically to any
destination on the network, based on the destination address. Rather than being forced to blindly
traverse the single dedicated connection from sender to receiver, the packets can take any path the network
chooses. In fact, packets in the same message transmission might take different routes across the
network and be reordered by the receiving computer when they arrive.
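
The splitting and reordering described above can be simulated in a few lines of Python. This is a toy model under stated assumptions, not how any real network stack is written: the `Packet` class, the `split_into_packets` and `reassemble` functions, the 8-character packet size, and the two addresses are all invented for this example, and a seeded shuffle stands in for packets taking different routes.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    sender: str        # address of the sending machine
    destination: str   # address used to route the packet
    sequence: int      # position of this piece in the original message
    payload: str


def split_into_packets(message, sender, destination, size=8):
    """Divide a message into ordered packets of at most `size` characters."""
    return [
        Packet(sender, destination, seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Put packets back in order by sequence number and join the payloads."""
    ordered = sorted(packets, key=lambda p: p.sequence)
    return "".join(p.payload for p in ordered)


message = "Packets may arrive in any order."
packets = split_into_packets(message, "10.0.0.1", "10.0.0.2")

# Simulate different routes by delivering the packets in a shuffled order.
in_transit = packets[:]
random.Random(42).shuffle(in_transit)

received = reassemble(in_transit)
```

Because every packet carries its own sequence number, the receiver can rebuild the message no matter what order the pieces arrive in.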

If the old phone networks were like a zip line, taking passengers from a single starting point at the
top of a hill to a single destination at the bottom, then packet-switched networks are like a
highway system, where cars going to and from multiple destinations are all able to use the same
roads.

A modern packet-switching network is usually described using the Open Systems Interconnection
(OSI) model, which is composed of seven layers of routing, encoding, and error handling:

1. Physical layer
2. Data link layer
3. Network layer
4. Transport layer
5. Session layer
6. Presentation layer
7. Application layer

Most web application developers spend their days entirely in layer 7, the application layer. This is
also the layer where the most time is spent in this book. However, it is important to have at least
conceptual knowledge of the other layers when scraping the web. For example, TLS fingerprinting,
discussed in Chapter 17, is a web scraping detection method that involves the transport layer.

In addition, knowing about all of the layers of data encapsulation and transmission can help
troubleshoot errors in your web applications and web scrapers.

Physical Layer

The physical layer specifies how information is physically transmitted with electricity over the
Ethernet wire in your house (or on any local network). It defines things like the voltage levels that
encode 1s and 0s, and how fast those voltages can be pulsed. It also defines how radio waves
over Bluetooth and WiFi are interpreted.

This layer does not involve any programming or digital instructions but is based purely on physics
and electrical standards.

Data Link Layer

The data link layer specifies how information is transmitted between two nodes in a local network,
for example, between your computer and a router. It defines the beginning and ending of a single
transmission and provides for error correction if the transmission is lost or garbled.

At this layer, the packets are wrapped in an additional “digital envelope” containing routing
information and are referred to as frames. When the information in the frame is no longer needed,
it is unwrapped and sent across the network as a packet.

It’s important to note that, at the data link layer, all devices on a network are receiving the same
data at all times—there’s no actual “switching” or control over where the data is going. However,
devices that the data is not addressed to will generally ignore the data and wait until they get
something that’s meant for them.
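
The “digital envelope” idea and the ignore-what-isn’t-yours behavior can be sketched in Python as follows. This is a simplified model rather than a real link-layer implementation: the `Frame` class, the `wrap` and `receive` functions, and the three addresses are hypothetical, and the CRC-32 checksum here only detects a garbled frame (it does not correct it).

```python
import zlib
from dataclasses import dataclass


@dataclass
class Frame:
    destination: str   # local address of the intended receiver
    source: str        # local address of the sender
    payload: bytes     # the packet being carried
    checksum: int      # CRC-32 of the payload, used to detect corruption


def wrap(packet, source, destination):
    """Wrap a packet in a frame (the 'digital envelope')."""
    return Frame(destination, source, packet, zlib.crc32(packet))


def receive(frame, my_address):
    """Return the unwrapped packet, or None if the frame is for someone else."""
    if frame.destination != my_address:
        return None  # every device sees the frame; most simply ignore it
    if zlib.crc32(frame.payload) != frame.checksum:
        raise ValueError("frame was garbled in transit")
    return frame.payload


# Made-up local addresses for three devices on the same network.
LAPTOP = "aa:aa:aa:aa:aa:01"
ROUTER = "aa:aa:aa:aa:aa:02"
PRINTER = "aa:aa:aa:aa:aa:03"

frame = wrap(b"GET / HTTP/1.1", source=LAPTOP, destination=ROUTER)

seen_by_router = receive(frame, ROUTER)     # unwrapped packet
seen_by_printer = receive(frame, PRINTER)   # None: not addressed to it
```

In this model the same frame is handed to every device, and the decision to keep or ignore it is made by the receiver, which mirrors the lack of "switching" at this layer.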
