Advanced HTTPClient Info

Proxies

Support for proxies (including SOCKS) is fully implemented. However, using proxies in Applets is subject to a number of security restrictions (see security for more information on the various security policies and the consequences that arise from them). If you are using an http proxy then use the HTTPConnection.setProxyServer() method to set the default proxy for all connections, and HTTPConnection.setCurrentProxy() to set a proxy for the current connection only. You can also manipulate a list of hosts for which no proxy is to be used with the methods HTTPConnection.dontProxyFor() and HTTPConnection.doProxyFor().

If you are using SOCKS then the method to use is setSocksServer(). Note that both an http proxy and a SOCKS proxy can be set at the same time, in which case a request is sent via the SOCKS server to the proxy server, which in turn relays the request to the desired destination.

Some proxies will proxy for protocols other than http using http to contact the proxy itself. If you have such a proxy then you can use the HTTPClient to do requests for other protocols through the proxy. To do this you need to create an HTTPConnection to the proxy itself (i.e. don't use setCurrentProxy() or setProxyServer()) and specify the full URL of the file/article/whatever in the Get(), Put(), etc. Example: if you want to retrieve the file /pub/README via ftp from rtfm.mit.edu then you could use something like:

    HTTPConnection proxy = new HTTPConnection("my.proxy.dom", 8000);
    HTTPResponse   resp  = proxy.Get("ftp://rtfm.mit.edu/pub/README");
    ...

Timeouts

Sometimes one doesn't want to wait (almost) forever until a connection is established to the server or until the server answers. In this case a timeout can be set using the methods HTTPConnection.setDefaultTimeout() and HTTPConnection.setTimeout(). Setting a timeout will cause the client to limit the time it will spend trying to get the host's IP address and establishing a connection with the server. If this is running under JDK 1.1 or later it will also set the timeout on the socket while reading the response headers. The timeout is always disabled while reading the response body. The rationale for this is that otherwise I'd have to make all the input streams (which may be pushed onto the response input stream) reentrant, and this would include writing my own versions of GZIPInputStream etc.

Contexts

There has been the desire to run multiple independent clients within one application. This was previously hindered by the fact that the list of authorization info was shared by all instances of HTTPConnection, resulting in all clients having to use the same username and password. Starting with V0.3 you can set a context (HTTPConnection.setContext()) for each HTTPConnection. Each module which keeps information on behalf of the application (such as the cookie module, the authorization module and the redirection module) then uses a separate list for each context. In this way only instances of HTTPConnection using the same context will share information.

If no context is set a default context is used which is the same for all HTTPConnections. Therefore applications which don't need this feature can just ignore it and they will behave as before.
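
The per-context bookkeeping described above can be pictured with a small sketch. This is not HTTPClient's own code: ContextStore and its methods are hypothetical names, and the sketch only models how a module might keep one list per context plus a shared default context. Thread safety is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a module that keeps separate state for each context.
class ContextStore {
    // Shared fallback used when the caller has not set a context.
    private static final Object DEFAULT_CONTEXT = new Object();

    // context -> (name -> value); e.g. realm -> credentials in an auth module.
    private final Map<Object, Map<String, String>> lists = new HashMap<>();

    private Map<String, String> listFor(Object context) {
        Object key = (context == null) ? DEFAULT_CONTEXT : context;
        return lists.computeIfAbsent(key, k -> new HashMap<>());
    }

    void put(Object context, String name, String value) {
        listFor(context).put(name, value);
    }

    String get(Object context, String name) {
        return listFor(context).get(name);
    }

    public static void main(String[] args) {
        ContextStore store = new ContextStore();
        Object clientA = new Object();
        Object clientB = new Object();
        store.put(clientA, "realm", "alice");
        store.put(clientB, "realm", "bob");
        System.out.println(store.get(clientA, "realm")); // alice
        System.out.println(store.get(clientB, "realm")); // bob
        System.out.println(store.get(null, "realm"));    // null (default context is empty)
    }
}
```

Two clients that pass different context objects never see each other's entries, while callers that pass no context all share the default list.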

Persistent Connections (Keep-Alives)

The Hypertext Transfer Protocol originally allowed only one request per TCP connection. However, establishing a TCP connection is fairly expensive in terms of time, so some implementors of HTTP/1.0 added so-called Keep-Alives to keep a connection open after a request was completed and to allow further requests to be made over that connection. Unfortunately, this was not well defined and is broken in the face of proxies. HTTP/1.1 defines persistent connections correctly and even makes them the default.

The HTTPClient will by default try to keep a connection alive for as many requests as possible, both when talking to HTTP/1.0 and HTTP/1.1 servers. To disable persistent connections you can specify a Connection header with the value close. Example:

    NVPair[] def_hdrs = { new NVPair("Connection", "close") };
    con.setDefaultHeaders(def_hdrs);

This will disable persistent connections for all future requests (unless overridden by a Connection header on the request method call).

Keeping the connection to the server open after a request is fine as long as another request follows within a short period of time. However, when you are done you should let the library know by passing the above Connection: close header with the last request. Furthermore, to limit the length of time the connection will be held open, a timer is started after each request which will close the connection if no further requests arrive within the next 60 seconds.

Note that most of this is transparent as far as the functioning of the requests is concerned; the only difference you will notice is in the time required for a request to be sent. Also note that persistent connections are only done within the context of a given instance of HTTPConnection; so if you create two instances both pointing at the same server then they will create separate connections to the server.

Closing of Sockets

A socket is closed when one of the five following conditions occurs:

Responses are marked for close whenever the client determines that the connection should not be kept open past the end of this response. This includes the connection timing out, the server sending a Connection: close (in the case of an HTTP/1.1 server) or not sending a Connection: keep-alive (in the case of an HTTP/1.0 server), the response having no Content-length and no self-delimiting body, or the receipt of certain error status codes.
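
The header-based part of this decision can be written down as a small predicate. The following is only a sketch under stated assumptions: CloseDecision and markForClose are hypothetical names, the Connection header is assumed to carry a single token, and the timeout and error-status cases mentioned above are left out.

```java
// Hypothetical helper; models only the header-based rules from the text above.
class CloseDecision {
    /**
     * @param serverIsHttp11   true if the server answered as HTTP/1.1, false for HTTP/1.0
     * @param connectionHeader value of the Connection header, or null if absent
     * @param hasContentLength true if a Content-length header was present
     * @param selfDelimiting   true if the body marks its own end (e.g. chunked)
     * @return true if the connection should not be kept open past this response
     */
    static boolean markForClose(boolean serverIsHttp11, String connectionHeader,
                                boolean hasContentLength, boolean selfDelimiting) {
        boolean saysClose = connectionHeader != null
                && connectionHeader.trim().equalsIgnoreCase("close");
        boolean saysKeepAlive = connectionHeader != null
                && connectionHeader.trim().equalsIgnoreCase("keep-alive");

        // HTTP/1.1: persistent unless the server says "close".
        if (serverIsHttp11 && saysClose) return true;
        // HTTP/1.0: closed unless the server says "keep-alive".
        if (!serverIsHttp11 && !saysKeepAlive) return true;
        // Without a length or a self-delimiting body, only closing the
        // connection can mark the end of the response.
        return !hasContentLength && !selfDelimiting;
    }
}
```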

Pipelining

If the connection is kept open across requests then the requests may be pipelined. Pipelining here means that a new request is sent before the response to a previous request is received. It is obvious that this may speed up requests, so HTTPClient supports pipelining (at the expense of some extra code to keep track of the outstanding requests).

In spite of all the possible pipelining going on underneath, the programming model still stays simple: for every request you send you get a response back which contains the headers and data of the server's response. Now with pipelining the fields in the response aren't necessarily filled yet (i.e. the actual response headers and data haven't been read off the net), but the first call to any method in the response (e.g. a getStatusCode()) will wait till the response has actually been read and parsed. Also any previous requests will be forced to read their responses if they have not already done so (so e.g. if you send two consecutive requests and receive responses r1 and r2, calling r2.getHeader("Content-type") will first force the complete response r1 to be read before reading the response r2). All this should be completely transparent, except for the fact that invoking a method on one response may sometimes take a few seconds to complete, while the same method on a different response will return immediately with the desired info.
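
The "touching r2 forces r1 to be read first" behaviour can be modelled with a toy example. Nothing here is HTTPClient code: PipelineSketch, LazyResponse and readOrder are hypothetical names, and the "wire" is just an in-memory queue of response bodies in request order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of pipelined responses that are read lazily and in order.
class PipelineSketch {
    private final Deque<String> wire = new ArrayDeque<>();
    private final List<LazyResponse> outstanding = new ArrayList<>();
    // Records the order in which responses were actually read off the wire.
    final List<String> readOrder = new ArrayList<>();

    class LazyResponse {
        private final String name;
        private String body; // null until read off the wire

        LazyResponse(String name) { this.name = name; }

        String getBody() {
            // Responses arrive in request order, so every earlier
            // outstanding response is read before this one.
            while (body == null) {
                LazyResponse next = outstanding.remove(0);
                next.body = wire.removeFirst();
                readOrder.add(next.name);
            }
            return body;
        }
    }

    // "Sends" a request: queues the server's eventual answer and returns at once.
    LazyResponse send(String name, String serverAnswer) {
        wire.addLast(serverAnswer);
        LazyResponse r = new LazyResponse(name);
        outstanding.add(r);
        return r;
    }

    public static void main(String[] args) {
        PipelineSketch con = new PipelineSketch();
        LazyResponse r1 = con.send("r1", "first body");
        LazyResponse r2 = con.send("r2", "second body");
        System.out.println(r2.getBody());  // second body
        System.out.println(con.readOrder); // [r1, r2]
        System.out.println(r1.getBody());  // first body (already read, returns at once)
    }
}
```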

Protocol Version

The request protocol version sent is always HTTP/1.1, except in a few circumstances when a broken server is encountered, in which case the version sent reverts to HTTP/1.0. An HTTP/0.9 request is never sent.

The protocol version returned by the server is used to select between different mechanisms for persistent connections. If the server advertises itself as being HTTP/1.1 compliant then HTTP/1.1 persistent connections are used; otherwise HTTP/1.0 keep-alives are used (the difference is the tokens used in the Connection header for signaling persistence and the end of a connection). Apart from this, the only other distinction made between talking to an HTTP/1.0 or an HTTP/1.1 server is in how requests are automatically retried (retried requests with a body need to use slightly different mechanisms for determining how long to wait after sending the headers, before sending the body).

Modules

Starting with Version 0.3 the HTTPClient uses modules for a number of its functions. Each connection has a list of modules. When a request method is invoked the request is first assembled into a Request instance. Then the request handler of each module is invoked in turn with this request. This handler may modify the request (such as add headers) or even generate a response directly (such as a cache might do). Only after all handlers have been invoked (and none of them generated a response) is the request actually sent over the wire. Similarly, when a response is read the response handlers in each module are invoked in turn. They may do certain things based on the status code (such as the redirection module) or the headers (such as the cookie module), modify the response, or even generate a new request (such as in the redirection and authentication modules). If a new request is generated the process starts from the top.
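
The request half of this flow can be sketched as a simple chain. Note that SketchModule and ModuleChain are hypothetical, heavily simplified stand-ins and not the real HTTPClientModule interface; requests are reduced to a list of header strings and responses to plain strings. The response handlers would be walked in the same way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a module (NOT the real HTTPClientModule).
interface SketchModule {
    // May modify the headers. Returns a response to short-circuit the
    // chain, or null to let the next module see the request.
    String handleRequest(List<String> requestHeaders);
}

class ModuleChain {
    private final List<SketchModule> modules = new ArrayList<>();

    void add(SketchModule m) { modules.add(m); }

    String send(List<String> requestHeaders) {
        // Invoke each request handler in list order.
        for (SketchModule m : modules) {
            String generated = m.handleRequest(requestHeaders);
            if (generated != null) return generated; // e.g. a cache hit
        }
        // No module produced a response: only now would the request hit the wire.
        return "sent over the wire with headers " + requestHeaders;
    }

    public static void main(String[] args) {
        ModuleChain chain = new ModuleChain();
        chain.add(h -> { h.add("Cookie: a=b"); return null; });
        System.out.println(chain.send(new ArrayList<String>()));
    }
}
```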

The use of modules allows additional functionality to be easily added without having to modify the core code. It also allows for the easy enabling and disabling of various functions. The currently supplied modules are the AuthorizationModule, the RedirectionModule, the ContentEncodingModule, the TransferEncodingModule, the CookieModule, the ContentMD5Module, the RetryModule and the DefaultModule. These are explained in more detail further down.

Modules can be dynamically added and removed to tailor the request and response processing desired. The methods HTTPConnection.addDefaultModule(), HTTPConnection.removeDefaultModule() and HTTPConnection.getDefaultModules() manipulate and return the list of default modules which is used when a new HTTPConnection is created. Similarly, the methods HTTPConnection.addModule(), HTTPConnection.removeModule() and HTTPConnection.getModules() manipulate and return the list of current modules for a connection.

The default list of modules is initialized from the property HTTPClient.Modules. This property must be a "|" separated list of class names. If this property is not set it defaults to all the classes listed below. Normally if during class initialization any module in the list does not exist or cannot be instantiated then an Error is thrown. However, if this is being used in an Applet then the error is suppressed. This way Applets can limit the modules loaded over the net by simply not providing them (remember, they can't set properties due to security restrictions).

You may create your own modules and add them using the above methods. Any module you write must implement the HTTPClientModule interface. See the API docs for more info. Note: this interface may change. If you write a module and find the interface insufficient or difficult, please contact me. Also, if you write a module which you think might be of general usefulness and would like to make it freely available, let me know.

Here is a short description of each module.

AuthorizationModule

Authorization is briefly described in Getting Started. As mentioned, this module will handle both server and proxy authorization requests (status codes 401 and 407). In addition to the 'Basic' and 'Digest' authorization schemes, the AuthorizationModule can be made to handle other schemes as well, so long as they are not "too exotic" (i.e. they follow the simple challenge-response mechanism outlined in the http specs); this is done by setting your own AuthorizationHandler. Of course, if you need to do something more sophisticated you can always plug in your own auth module.

When confronted with an authorization request the auth module will query all known authorization info for a possible candidate (the match must be for the host, port, scheme and realm). If no suitable info is found, or if the server rejects any info found, an authorization handler is called to try and get the necessary info from the user; if the user does not give any information, or if the information she gives is also rejected, then the retrying is terminated and the last failure status returned to the caller. The default handler currently only understands requests for the 'Basic' and 'Digest' authorization schemes; you may however set your own handler via the AuthorizationInfo.setAuthHandler() method. The handler given must implement the AuthorizationHandler interface. To disable the handler completely give null for the handler. To prevent the popup box from appearing use setAllowUserInteraction(false).

A server (or proxy) may send multiple authorization challenges in the response, in which case the above algorithm is modified to go through the list of challenges in the same order as they were sent, trying to get authorization info for each challenge in the list and going to the next challenge if either no info was found or the server rejects that info. If the end of the list is reached without achieving authorization then the authorization handler is called on each challenge (in the same order) until either an authorization request is successful, the authorization handler returns null (e.g. when the Cancel button in the default popup box is activated) or the list is exhausted, in which case the response to the last failed request is returned.
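
The two passes over the challenge list can be sketched as follows. This is an illustration of the algorithm as described, not the module's code: ChallengeWalk and authorize are hypothetical names, challenges and credentials are reduced to strings, the handler is modelled as a Function that returns null on cancel, and "the server accepts this info" is modelled as a Predicate.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of the challenge-walking order described above.
class ChallengeWalk {
    /**
     * @param challenges    challenges in the order the server sent them
     * @param stored        already known authorization info, keyed by challenge
     * @param askUser       stand-in for the authorization handler; null means cancel
     * @param serverAccepts stand-in for retrying the request with the given info
     * @return the accepted info, or null if authorization failed
     */
    static String authorize(List<String> challenges, Map<String, String> stored,
                            Function<String, String> askUser,
                            Predicate<String> serverAccepts) {
        // Pass 1: try known info for each challenge, in the order sent.
        for (String challenge : challenges) {
            String info = stored.get(challenge);
            if (info != null && serverAccepts.test(info)) return info;
        }
        // Pass 2: call the handler for each challenge, in the same order.
        for (String challenge : challenges) {
            String info = askUser.apply(challenge);
            if (info == null) return null;             // handler gave up (e.g. Cancel)
            if (serverAccepts.test(info)) return info; // authorization succeeded
        }
        return null; // list exhausted; the caller returns the last failed response
    }
}
```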

RedirectionModule

This module handles the redirection status codes 301, 302, 303, 305 and 307. 301 and 307 responses are only redirected if the request method was GET or HEAD; this is because redirecting, say, a POST blindly might lead to undesired behaviour, as the circumstances leading to the POST might have changed. 302 and 303 are treated identically: the new request to the new location is done using GET (this is what many cgi's expect - they are basically directing you to a prefabricated answer). 305 is only honored if the connection is not already using a proxy.
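
The rules in the previous paragraph amount to a small decision table. A minimal sketch, assuming hypothetical names (RedirectRule, Action, decide) and leaving out everything else the module does, such as resolving the new location:

```java
// Hypothetical decision table for the status codes listed above.
class RedirectRule {
    enum Action { NOT_REDIRECTED, SAME_METHOD, AS_GET, VIA_PROXY }

    static Action decide(int status, String method, boolean alreadyUsingProxy) {
        boolean getOrHead = method.equals("GET") || method.equals("HEAD");
        switch (status) {
            case 301:
            case 307:
                // Only GET and HEAD are redirected automatically.
                return getOrHead ? Action.SAME_METHOD : Action.NOT_REDIRECTED;
            case 302:
            case 303:
                // Treated identically: the new location is fetched with GET.
                return Action.AS_GET;
            case 305:
                // Honored only if no proxy is in use yet.
                return alreadyUsingProxy ? Action.NOT_REDIRECTED : Action.VIA_PROXY;
            default:
                return Action.NOT_REDIRECTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(301, "POST", false)); // NOT_REDIRECTED
        System.out.println(decide(302, "POST", false)); // AS_GET
        System.out.println(decide(305, "GET", true));   // NOT_REDIRECTED
    }
}
```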

This module also keeps a list of permanently redirected urls (status code 301) and will preemptively redirect requests for these. This list is volatile (i.e. it will be lost when the client exits).

CookieModule

This module implements cookies as defined by Netscape's cookie spec. Whenever the server tries to set a cookie a cookie handler is invoked to see whether this cookie should be set. The default handler brings up a popup describing the cookie and allowing the user to accept or reject it; the user may also summarily accept or reject cookies from whole domains. You may substitute your own cookie handler using the setCookiePolicyHandler() method. If you set the handler to null then all cookies will be accepted. If you do not want any cookies to be accepted then either remove the CookieModule from the list of modules or set your own handler which always returns false. The handler must implement the CookiePolicyHandler interface.

ContentEncodingModule

Servers may apply various content encodings to the content. The most widely used encodings are compressions: gzip, compress and deflate. If running under JDK 1.1 or later, this module will handle the gzip and deflate content encodings by pushing an appropriate decoding stream; this means that the data read from getInputStream() will be the clear text. The Content-Encoding header is also modified appropriately.
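
"Pushing a decoding stream" can be illustrated with the standard java.util.zip classes. This is a sketch of the idea and not the module's code: DecodingSketch, pushDecoder and readAll are hypothetical names, and passing unknown encodings through untouched is an assumption made for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: wrap the raw body in a decoder chosen by Content-Encoding.
class DecodingSketch {
    static InputStream pushDecoder(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) return raw;
        String enc = contentEncoding.trim();
        if (enc.equalsIgnoreCase("gzip")) return new GZIPInputStream(raw);
        if (enc.equalsIgnoreCase("deflate")) return new InflaterInputStream(raw);
        return raw; // assumption: anything else (e.g. compress) is left as is
    }

    // Reads a stream to its end and returns the bytes as a UTF-8 string.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Build a gzip-compressed body to play the role of the server's response.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("hello".getBytes("UTF-8"));
        }
        InputStream body = pushDecoder(new ByteArrayInputStream(compressed.toByteArray()), "gzip");
        System.out.println(readAll(body)); // hello
    }
}
```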

If running under JDK 1.0.2 this module is not loaded in the first place. The simple reason for this is that I'm too lazy to write my own gzip and deflate decompression streams which are part of the java core libraries as of JDK 1.1.

You might want to consider disabling this module if you are using the HTTPClient for things like web-copying where storing the compressed document makes sense.

TransferEncodingModule

This is similar to the ContentEncoding module except that it applies to transfer encodings. It also handles gzip and deflate encodings and is also only loaded if running under JDK 1.1 or later.

ContentMD5Module

Some servers may generate a Content-MD5 header which contains an MD5 hash of the message body (after any content encoding, but before any transfer encoding is applied). If this header is present this module will push a stream which calculates the MD5 hash of the body. When the stream is closed or the end of the data is reached the calculated hash is compared to the one in the Content-MD5 header and if they don't match an IOException is thrown.
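
A verifying stream of this kind can be sketched with java.security.MessageDigest. This is not the module's code: Md5CheckingStream is a hypothetical name, the header value is assumed to be the Base64 form of the digest, and for brevity the check is done only when the end of the data is reached (not on close(), and skip()/mark() are not handled).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch: hashes the body as it is read and checks it at end of data.
class Md5CheckingStream extends FilterInputStream {
    private final MessageDigest md5;
    private final String expected; // Base64 value taken from the Content-MD5 header
    private boolean checked = false;

    Md5CheckingStream(InputStream in, String contentMd5Header) throws NoSuchAlgorithmException {
        super(in);
        this.md5 = MessageDigest.getInstance("MD5");
        this.expected = contentMd5Header.trim();
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b == -1) verify(); else md5.update((byte) b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n == -1) verify(); else md5.update(buf, off, n);
        return n;
    }

    // Compares the calculated hash with the header value, once.
    private void verify() throws IOException {
        if (checked) return;
        checked = true;
        String actual = Base64.getEncoder().encodeToString(md5.digest());
        if (!actual.equals(expected)) {
            throw new IOException("Content-MD5 mismatch: expected " + expected + ", got " + actual);
        }
    }
}
```

A caller reads from the stream as usual; a corrupted body only shows up as an IOException on the read that hits the end of the data.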

DefaultModule

This handles the response status codes 408 (request timeout) and 411 (length required).

RetryModule

This module is special. It is responsible for automatically retrying requests which were aborted due to an IOException on the socket. It is unlike other modules in that it is closely tied in to the core code, instead of just manipulating the request and response structures as other modules do. The code in this module could have been put in with the rest of the core code, but moving it to a module has the advantage that this automatic retrying of requests may be disabled using the standard mechanism of removing modules.

Ordering the Modules

The handlers in the modules are invoked in the order the modules are placed in the list. Because of certain constraints between modules this order is important. The default order for the supplied modules is:

  1. RetryModule
  2. CookieModule
  3. RedirectionModule
  4. AuthorizationModule
  5. DefaultModule
  6. TransferEncodingModule
  7. ContentMD5Module
  8. ContentEncodingModule

However, the constraints impose only a partial ordering, so that the above order may be changed as long as the following restrictions are observed:

Properties recognized by HTTPClient

There are a number of properties which are used by the HTTPClient. Most are documented somewhere in the api docs. Some of the properties may contain a list of elements, in which case the elements are separated by vertical bars ("|"). White space is ignored, except that a "| |" produces an empty element whereas "||" is treated like a single delimiter (i.e. "|"). Here is a summary of all properties recognized:

http.proxyHost
http.proxyPort
Read by HTTPConnection. Used to specify the http proxy to use. See setProxyServer() for more info. These are the same properties that are used by Sun's JDK 1.1 (and later).
proxySet
proxyHost
proxyPort
Read by HTTPConnection. Obsolete. Used by Sun's JDK 1.0.2. If http.proxyHost is not set and proxySet is true, then the default proxy is set using the values in proxyHost and proxyPort.
HTTPClient.nonProxyHosts
Read by HTTPConnection. Used to specify a list of hosts for which no http proxy is to be used. See dontProxyFor() for more info.
http.nonProxyHosts
Read by HTTPConnection. Used to specify a list of hosts for which no http proxy is to be used. See dontProxyFor() for more info. This is the same property that is used by Sun's JDK 1.1 (and later).
HTTPClient.socksHost
Read by HTTPConnection. Used to specify the SOCKS proxy host. See setSocksServer() for more info.
HTTPClient.socksPort
Read by HTTPConnection. Used to specify the SOCKS proxy port. See setSocksServer() for more info.
HTTPClient.socksVersion
Read by HTTPConnection. Used to specify the SOCKS proxy version. See setSocksServer() for more info.
HTTPClient.Modules
Read by HTTPConnection. Used to define the default list of modules. See HTTPConnection.addDefaultModule() for more info.
HTTPClient.disable_pipelining
Read by HTTPConnection. Used to disable all pipelining. This should never be needed, but you may encounter a server which displays problems when pipelining requests. Setting this property to true will cause the HTTPClient to stall each request until the headers from the response to the previous request have been received and parsed.
HTTPClient.cookies.hosts.accept
Read by CookieModule. Used to initialize the list of hosts and domains from which to always accept cookies. See setCookiePolicyHandler() for more info.
HTTPClient.cookies.hosts.reject
Read by CookieModule. Used to initialize the list of hosts and domains from which to always reject cookies. See setCookiePolicyHandler() for more info.
http.agent
Read by HttpURLConnection. If set then the "User-Agent" header is set to this property's value.
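
The splitting rule for list-valued properties ("||" acts like a single delimiter, "| |" yields an empty element) happens to match what java.util.StringTokenizer does when each token is trimmed. The following is only a sketch of that rule, not the library's parser: PropertyListSketch and split are hypothetical names, the host names are made up, and "white space is ignored" is read here as trimming around each element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of splitting a "|" separated property value.
class PropertyListSketch {
    static List<String> split(String value) {
        List<String> elements = new ArrayList<>();
        // StringTokenizer collapses adjacent delimiters, so "||" acts like "|".
        StringTokenizer tok = new StringTokenizer(value, "|");
        while (tok.hasMoreTokens()) {
            // Trimming drops surrounding white space; the blank between
            // the bars of "| |" therefore becomes an empty element.
            elements.add(tok.nextToken().trim());
        }
        return elements;
    }

    public static void main(String[] args) {
        System.out.println(split("a.example | b.example"));       // [a.example, b.example]
        System.out.println(split("a.example||b.example"));        // [a.example, b.example]
        System.out.println(split("a.example| |b.example").size()); // 3
    }
}
```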

HTTP Headers

All request methods accept optional headers to be sent with the request. Here is a list of possible request and response headers as defined in the HTTP/1.1 spec. I have added some comments to some of them, but for further info I recommend getting the specs (every header is described in a paragraph of its own in the spec, so you can read just the part that interests you and ignore the rest).

Request Headers

Response Headers

Further Reading

* General HTTP Info at W3C
* HTTP/1.0 Spec (RFC 1945)
* HTTP/1.1 Spec (RFC 2068)


Ronald Tschalär / 30. January 1998 / ronald@innovation.ch.