2.2 The Web and HTTP¶

This note explains how the Web and HTTP works

The Web was the first Internet application that caught the general public's eye. It elevated the Internet from just one of many data networks to essentially the one and only data network.

Overview of HTTP¶

HyperText Transfer Protocol (HTTP), the Web's application-layer protocol, lies at the heart of the Web. It is implemented in two programs: a client program and a server program, executing on different end systems, which talk to each other by exchanging HTTP messages.

HTTP defines the structure of these messages and how the client and server exchange the messages.

First, some Web terminology:

Web page: consists of objects
object: a file, such as HTML file, JPEG image, Javascript fill, CCS style sheet, video clip, that is addressable by a single URL
Base HTML File: The base of the Web pages that references the other objects in the page with the objects' URLs.
URL: made of two components -- the hostname of the server that houses the object and the object's path name.

http://www.someSchool.edu/someDepartment/picture.gif has www.someSchool.edu as the hostname, and /someDepartment/picture.gif for a path name.

Web browsers: (Chrome, Internet Explorer) implement the client side of HTTP, in the context of the Web, we will use the words browser and client interchangeably
Web servers: implement the server side of the HTTP, house Web objects, each addressable by a URL

The general request-response behavior looks like this.

2.2.1 httprequest.png

HTTP request messages for the objects in the page to the server. The server recieves the requests and responds with HTTP response messages that contain the objects.

HTTP uses TCP as its underlying transport protocol. The HTTP client first initiates a TCP connection with the server, then when the conection is established, the browser and server process access TCP through their socket interfaces.

It is important to note that the server sends requested files to clients without storing any state information about the client. It completely forgets what it did. HTTP is said to be a stateless protocol since it maintains no information about the clients.

Non-Persistent and Persistent Connections¶

Non-persistent: when the client-server interaction is taking place over TCP each request/response pair is sent over a separate TCP connection

Persistent: when the client-server interaction is taking place over TCP each request/response pair are all sent over the same TCP connection

HTTP with Non-Persistent Connections¶

Example: Let's have a page with base HTML file and 10 JPEG images, and that all 11 of these objects reside on the same surver. The URL for the base HTML file is http://www.someSchool.edu/someDepartment/home.index.

This is what happens:

HTTP client process initiates a TCP connection to the server www.someSchool.edu on port number 80 (default port number for HTTP). There will be a socket at the client and a socket at the server associated with the TCP connection.
HTTP client sends HTTP request message to server via its socket. The request message includes the path name /someDepartment/home.index.
HTTP server process receives request message via its socket, retrieves the object /someDepartment/home.index from its storage (RAM or disk), encapsulates the object in an HTTP response message, and sends the response message to the client via its socket
HTTP server process tells TCP to close the TCP connection (it won't terminate until it knows for sure that the client has received the response message intact)
HTTP client receives the response message, then the TCP connection terminates. The message indicates the encapsulated object is an HTML file. THe client extracts the file from the response message, examines the HTML file, and finds references to the 10 JPEG objects
The first 4 steps are then repeated for each of the referenced JPEG objects.

Not that here, each TCP connection is closed after the server sends the object. HTTP/1.0 employs non-persistent TCP connections. Therefore, 11 TCP connections are generated.

Let's estimate amount of time that elapses. The round-trip time (RTT) is the time it takes for a small packet to travel from client to server and then back to the client. It includes packet-propagation delays, packet-queuing delays in intermediate routers and switches, and packet-processing delays. When a use clicks on a hyperlink, this causes the browser to initiate a TCP connection. There is a "3-way handshake" -- client sends small TCP segment to server, the server acknowledges and responds with a small TCP segment, and the client ackowledges back to the server. The first two parts of the handshake take one RTT, Then after the first two parts, the client sends the HTTP request into the TCP connection, and once this arrives at the server, the server sends the HTML file to TCP connection. This is another RTT, so it is 2 RTTs plus transmission time at the server of the HTML file.

Shortcomings of Non-Persistent connections:

Brand-new connections must be established and maintained for each requested object
TCP buffers must be allocated and TCP variables must be kept in both the client and server, which places a significant burden on the Web surver
Each object suffers a delivery delay of 2 RTTs -- one RTT to establish the TCP connection, and 1 RTT to request and receive the object.

HTTP with Persistent Connections¶

HTTP/1.1 uses persistent connections, the server leaves the TCP connection open after sending a response. So, the HTML file and the 10 images are sent over a single persistent TCP connection.

HTTP Message Format¶

Two types of HTTP messages: request, and response

HTTP Request Message¶

Typical HTTP request might look like

GET /somedir/page.html HTTP/1.1 
Host: www.someschool.edu 
Connection: close  
User-agent: Mozilla/5.0 
Accept-language: fr

The message is written in ordinary ASCII text, the message consists of 5 lines, each followed by a carriage return and a line feed. The last line is followed by an additional carriage return and line feed.

A request message can have many more lines or as few as one line.

First line is called request line, which has three parts
- Method field
  - Can take on several different values: GET, POST, HEAD, PUT, DELETE. A great majority of HTTP request messages uses GET. GET gets used when browser requests an object, with the requested object identified in the URL field
- URL field
- HTTP version field
Subsequent lines are the header lines.
- Host specifies the host on which the object resides
- Connection: close is the browser telling the server it does not want to bother with persistent connections.
- User-agent specifies the user agent is a Mozilla/5.0, a Firefox browser. You can send different versions of the same object to different types of user agents.
- Accept-language indicates that the user prefers to receive a French version of the object, if it exists, otherwise send the default version.

Here is an image of the HTTP request message:

2.2.2 httpreqmessage.png

The general format closely follows our example. The entity body is empty with GET method, but is used with POST method. POST is used when the user fills out a form, such as words to a search engine. The user still requests a Web page, but the specific contents depend on the input into the form. The entity body contains what the user entered into the form fields.

HEAD is similar to GET. It reponds with an HTTP message but it leaves out the requested object. It is used for debugging

PUT is used in conjunction with Web publishing tools. It allows a user to upload an object to a specific path on a specific Web server.

DELETE allows a user to delete an object on the Web server.

HTTP Response Message¶

Here is an image of the response message:

2.2.3 httpresponsemessage.png

There are three sections:

status line: three fields
- protocol version field
- status code
- corresponding status message
header lines
- Very similar to the header lines in the request message
entity body: the meat of the message. contains the requested object itself

There are a few status codes you might have seen. These are the status codes and corresponding status messages:

200 OK: Request succeeded and information is returned in the response
301 Moved Permanently: Requested object has been permanently moved; the new URL is specified in a Location: header. The client software will automatically retrieve the new URL
400 Bad Request: Generic error code indicating that the request could not be understood by the server
404 Not Found: Requested document does not exist on the server
505 HTTP Version Not Supported: Requested HTTP protocol version is not supported by the server.

User-Server Interaction: Cookies¶

The HTTP server is stateless, which simplifies server design and permitted engineers to develop high-performance Web servers that can handle thousands of simultaneous TCP connections. Cookies however, allow sites to keep track of users.

Cookie technology has 4 components: 1. Cookie header line in the HTTP response message 2. Cookie header lin in the HTTP request message 3. Cookie file kept on the user's end system and managed by the user's browser 4. Back-end database at the Web site

When somone contacts Amazon.com for the first time, the server creates a unique identification number and creates an entry in its back-end database that is indexed by identification number. The Amazon Web server then responds to your browser, including in the HTTP reponse a Set-cookie: header, which contains the identification number. When your browser receives the HTTP response message, it sees the Set-cookie header, then it appends a line to the special cookie file that it manages. This line includes the hostname of the server and identification number. The cookie file may have entries from other websites.

Now whenever you have activity on Amazon.com, although Amazon might not know who you are, they know exactly which pages user 1678 visited, in which order, and at what times. Thus, they can use cookies to provide its shopping cart service and other things, like recommend products. If you register an account with Amazon, then Amazon can include this information in its database, thereby associating a name to a cookie identification number.

Web Caching¶

A Web cache (or called proxy server) is a network entity that satisfies HTTP requests on behalf of an origin Web server. The Web cache has its own disk storage and keeps copies of recently requested objects in this storage.

Suppose a browser requests http://www.someschool.edu/ campus.gif. Here is what happens: 1. Browser establishes TCP connection to Web cache and sends HTTP request for the object to the Web cache 2. Web cache checks to see if it has a copy of the object stored locally. If it does, the Web cache returns the object within an HTTP response message to the client browser 3. If the Web cache does not have the object, it opens a TCP connection to the origin server, www.someschool.edu. The Web cache then sends an HTTP request for the object into the cache-to-server TCP connection. After receiving the request the origin server sends the object within an HTTP response to the Web cache 4. When the Web cache receives the object, it stores a copy in its local storage and sends a copy, within an HTTP response message, to the client browser.

It is super similar to caching in 16. Caches.

Through use of Content Distribution Networks (CDNs), Web caches are increasingly playing an important role in the Internet. A CDN company installs many geographically distributed caches throughout the Internet, thereby localizing much of the traffic.

The Conditional GET¶

The conditional GET happens if (1) the request message uses the GET method and (2) the request message includes an If-Modified-Since header line. It allows a cache to verify that its objects are up to date. We might then send

GET /fruit/kiwi.gif HTTP/1.1  
Host: www.exotiquecuisine.com 
If-modified-since: Wed, 9 Sep 2015 09:23:24

Suppose the object has not been modified since that date, then the Web server sends a response message to the cache:

HTTP/1.1 304 Not Modified  
Date: Sat, 10 Oct 2015 15:39:29 
Server: Apache/1.3.0 (Unix)

(empty entity body)

So the conditional GET makes the Web server send a response message but does not include the requested object in the response message. If it has changed, then the server will send the object.

HTTP/2¶

Standardized in 2015, HTTP/2 reduced perceived latency by enabling request and response multiplexing over a single TCP connection, provide request prioritization and server push, and provide efficient compression of HTTP header fields

It does not change HTTP methods, status codes, URLs, header fields. It simply changes how the data is formatted and transported between client and server.

The developers of Web browsers discovered that sending al lthe objects in a Web page over a single TCP connection has a Head of Line (HOL) blocking problem. If a large video clip is at the top of the Web page and bleow it are many small objects below it, the a single TCP connection will have the video clip take a long time to pass through the bottleneck link. HTTP/1.1 browsers typically work around this by opening multiple parallel TCP connections.

HTTP/2 wants to get rid of parallel TCP connections, which reduces the number of sockets that need to be open and maintained at servers, but also allows TCP congestion control to operate as intended. TCP congestion controll provides browsers an unintended incentive to use multiple parallel TCP connections rather than a single persistent connection.

HTTP/2 Framing¶

The HTTP/2 solution for HOL blocking is to break each message into small frames, and interleave the request and response messages on the same TCP connection. We might break a large video clip into 8 smaller objects, so the server can receive 9 concurrent requests from any browser wanting to see the Web page, and for each request, the server needs to send 9 competing HTTP response messages to the browser.

The ability to break down an HTTP message into independent frames, interleave them, and then reassemble them on the other end is the single most important enhancement of HTTP/2

Response Message Prioritization and Server Pushing¶

Message prioritization allows developers to customize the relative priority of requests to better optimize applicaiton performance. We might assign a weight between 1 and 256 to each message, where higher number indicates higher priority.

Another feature of HTTP/2: ability for server to send multiple responses for a single client request. The server can push additional objects to the client, without the client having to request each one. Instead of waiting for the HTTP requests for these objects, the server cna analyze the HTML page, identify the objects that are needed, and send them to the client before receiving explicit requests for these objects, which eliminates extra latency.

HTTP/3¶

QUIC is a new transport protocol implemented in the applicaiton layer over the bare-bones UDP protocol. QUIC has several features desireable for HTTP, such as message multiplexing (interleaving), per-stream flow control, and low-latency connection establishment.

HTTP/3 is yet a new HTTP protocol designed to operate over QUIC. As of 202, HTTP/2 is described in Internet drafts and has not yet been fully standardized. Many HTTP/2 features are subsumed by QUIC.