MODULE V
5.1 NETWORKED PROGRAMS
In the Internet era, many applications need to retrieve data from the web and process it. In this
section, we discuss the basics of network protocols and the Python libraries available to extract
data from the web.
HyperText Transfer Protocol (HTTP)
HTTP (HyperText Transfer Protocol) is the protocol through which we retrieve web-based data.
HTTP is an application protocol for distributed hypermedia information systems.
HTTP is the foundation of data communication for the World Wide Web.
Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text.
HTTP is the protocol used to exchange or transfer hypertext.
Consider a situation:
you try to read a socket, but the program on the other end of the socket has not sent any data;
then you must wait.
If the programs on both ends of the socket simply wait for some data without sending
anything, they will wait for a very long time.
So an important part of programs that communicate over the Internet is to have some sort of
protocol. A protocol is a set of precise rules that determines
Who will send a request, and for what purpose
What action is to be taken
What response is to be given
To send requests and receive responses, HTTP uses the GET and POST methods.
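As an illustration, the text of a minimal HTTP GET request (the same form used by the programs later in this section) can be assembled as an ordinary Python string. The request line names the method, the document being asked for, and the protocol version, and the request ends with a blank line.

```python
# A minimal HTTP/1.0 GET request, terminated by a blank line (\r\n\r\n).
request = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'

# The first line of the request carries the method, URL, and version.
method, url, version = request.split('\r\n')[0].split(' ')
print(method)   # GET
print(version)  # HTTP/1.0
```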
NOTE: To test the programs in this section, you must be connected to the Internet.
The World’s Simplest Web Browser
The built-in socket module of Python enables the programmer to make network connections and to
retrieve data over those connections in a Python program.
A socket is a bidirectional data path to a remote system.
A socket is much like a file, except that a single socket provides a two-way connection between
two programs.
You can both read from and write to the same socket.
If you write something to a socket, it is sent to the application at the other end of the socket.
If you read from the socket, you are given the data which the other application has sent.
Consider a simple program to retrieve data from a web page. To understand the program given
below, one should know the meaning of the terminology used there.
Mamatha A, Asst Prof, Dept of CSE, SVIT Page 1
Python Application Programming (15CS664) Module V
AF_INET is an address family (IP) that is used to designate the type of addresses your
socket can communicate with. When you create a socket, you must specify its address
family, and then you can use only addresses of that type with the socket.
SOCK_STREAM is a constant indicating the type of socket (TCP). It works like a file stream
and is the most reliable socket type over the network.
A port is a logical end-point of a connection. Port 80 is one of the most commonly used port
numbers; it is the standard port for HTTP in the Transmission Control Protocol (TCP) suite.
The command to retrieve the data must use CRLF (Carriage Return Line Feed) line endings, and
it must end with \r\n\r\n (a blank line, as required by the protocol specification).
The encode() method applied to a string returns the bytes representation of that string. Instead of
the encode() method, one can attach the character b at the beginning of a string literal for the same effect.
The decode() method returns a string decoded from the given bytes.
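A quick sketch of these conversions, using the same request string as the program below:

```python
# encode() converts a str into its bytes representation.
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
print(type(cmd))        # <class 'bytes'>

# The b prefix on the literal produces the same bytes object directly.
same = b'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
print(cmd == same)      # True

# decode() converts bytes back into a str.
print(cmd.decode()[:3])  # GET
```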
A socket connection between the user program and the web page is shown in the figure below.
Figure: A Socket Connection
Now, observe the following program –
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()
When we run the above program, we first receive some header information related to the web server
of the site we are contacting.
Then we receive the data written on that web page. In this program, we extract 512 bytes
of data at a time (one can use any convenient number here). The extracted data is decoded and
printed. When the length of the received data becomes less than one (that is, no more data is left
on the web page), the loop terminates.
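The loop above prints the HTTP header and the document body together. A common refinement is to collect the whole response first and then split it at the blank line (\r\n\r\n) that ends the header. The sketch below illustrates the idea on a canned response (the response text is invented for illustration) rather than a live socket:

```python
# A sketch of separating an HTTP header from the body at the blank line.
# 'response' is a made-up reply standing in for data read from a socket.
response = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light'

pos = response.find(b'\r\n\r\n')    # the blank line ends the header
header = response[:pos].decode()
body = response[pos + 4:].decode()  # skip the four bytes of \r\n\r\n

print(header.split('\r\n')[0])      # HTTP/1.1 200 OK
print(body)                         # But soft what light
```

The image-retrieval program in the next section applies exactly this find-and-trim step to binary data.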
Retrieving an Image over HTTP
In the previous section, we retrieved text data from a web page. Similar logic can be used to
retrieve images from a web page using HTTP.
In the following program, we retrieve the image data in chunks of 5120 bytes at a time, accumulate
the data in a bytes object, trim off the headers, and then store the image file on the disk.
import socket
import time

HOST = 'data.pr4e.org'    # host name
PORT = 80                 # port number
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')

count = 0
picture = b""             # empty bytes object
while True:
    data = mysock.recv(5120)      # retrieve 5120 bytes at a time
    if len(data) < 1:
        break
    time.sleep(0.25)              # slow down so the programmer can watch the retrieval
    count = count + len(data)
    print(len(data), count)       # display cumulative data retrieved
    picture = picture + data
mysock.close()

pos = picture.find(b"\r\n\r\n")   # find end of the header (two CRLF pairs)