.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
. ds C`
. ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{
. if \nF \{
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. if !\nF==2 \{
. nr % 0
. nr F 2
. \}
. \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "lwptut 3"
.TH lwptut 3 "2019-05-06" "perl v5.16.3" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
lwptut \-\- An LWP Tutorial
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\s-1LWP \s0(short for \*(L"Library for \s-1WWW\s0 in Perl\*(R") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl
module-distributions, each of \s-1LWP\s0's component modules comes with
documentation that is a complete reference to its interface. However,
there are so many modules in \s-1LWP\s0 that it's hard to know where to start
looking for information on how to do even the simplest, most common
things.
.PP
Really introducing you to using \s-1LWP\s0 would require a whole book \*(-- a book
that just happens to exist, called \fIPerl & \s-1LWP\s0\fR. But this article
should give you a taste of how you can go about some common tasks with
\&\s-1LWP.\s0
.SS "Getting documents with LWP::Simple"
.IX Subsection "Getting documents with LWP::Simple"
If you just want to get what's at a particular \s-1URL,\s0 the simplest way
to do it is to use LWP::Simple's functions.
.PP
In a Perl program, you can call its \f(CW\*(C`get($url)\*(C'\fR function. It will try
getting that \s-1URL\s0's content. If it works, then it'll return the
content; but if there's some error, it'll return undef.
.PP
.Vb 2
\& my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq;
\& # Just an example: the URL for the most recent /Fresh Air/ show
\&
\& use LWP::Simple;
\& my $content = get $url;
\& die "Couldn\*(Aqt get $url" unless defined $content;
\&
\& # Then go do things with $content, like this:
\&
\& if($content =~ m/jazz/i) {
\& print "They\*(Aqre talking about jazz today on Fresh Air!\en";
\& }
\& else {
\& print "Fresh Air is apparently jazzless today.\en";
\& }
.Ve
.PP
The handiest variant on \f(CW\*(C`get\*(C'\fR is \f(CW\*(C`getprint\*(C'\fR, which is useful in Perl
one-liners. If it can get the page whose \s-1URL\s0 you provide, it sends it
to \s-1STDOUT\s0; otherwise it complains to \s-1STDERR.\s0
.PP
.Vb 1
\& % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq"
.Ve
.PP
That is the \s-1URL\s0 of a plain text file that lists new files in \s-1CPAN\s0 in
the past two weeks. You can easily make it part of a tidy little
shell command, like this one that mails you the list of new
\&\f(CW\*(C`Acme::\*(C'\fR modules:
.PP
.Vb 2
\& % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq" \e
\& | grep "/by\-module/Acme" | mail \-s "New Acme modules! Joy!" $USER
.Ve
.PP
There are other useful functions in LWP::Simple, including one function
for running a \s-1HEAD\s0 request on a \s-1URL \s0(useful for checking links, or
getting the last-revised time of a \s-1URL\s0), and two functions for
saving/mirroring a \s-1URL\s0 to a local file. See the LWP::Simple
documentation for the full details, or chapter 2 of \fIPerl
& \s-1LWP\s0\fR for more examples.
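.PP
As a rough sketch of those other functions (the local filename
\&\fIRECENT.txt\fR is only an illustration), \f(CW\*(C`head\*(C'\fR returns a list of header
values, or an empty list on failure, and \f(CW\*(C`mirror\*(C'\fR returns the \s-1HTTP\s0
status code of its request:
.PP
.Vb 2
\& use LWP::Simple;
\& my $url = \*(Aqhttp://www.cpan.org/RECENT\*(Aq;
\&
\& # head() returns an empty list if the request fails
\& my($type, $length, $mod_time) = head($url)
\&   or die "Couldn\*(Aqt HEAD $url";
\& print "Last modified: ", scalar(localtime $mod_time), "\en"
\&   if defined $mod_time;
\&
\& # mirror() re\-fetches only if the remote copy has changed
\& my $status = mirror($url, \*(AqRECENT.txt\*(Aq);
\& if($status == 304) {
\&   print "Local copy is already up to date.\en";
\& }
\& elsif(is_success($status)) {
\&   print "Saved a fresh copy.\en";
\& }
\& else {
\&   die "Couldn\*(Aqt mirror $url \-\- status $status";
\& }
.Ve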
.SS "The Basics of the \s-1LWP\s0 Class Model"
.IX Subsection "The Basics of the LWP Class Model"
LWP::Simple's functions are handy for simple cases, but they
don't support cookies or authorization, don't support setting header
lines in the \s-1HTTP\s0 request, and generally don't support reading header lines
in the \s-1HTTP\s0 response (notably the full \s-1HTTP\s0 error message, in case of an
error). To get at all those features, you'll have to use the full \s-1LWP\s0
class model.
.PP
While \s-1LWP\s0 consists of dozens of classes, the main two that you have to
understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent
is a class for \*(L"virtual browsers\*(R" which you use for performing requests,
and HTTP::Response is a class for the responses (or error messages)
that you get back from those requests.
.PP
The basic idiom is \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR, or more fully
illustrated:
.PP
.Vb 1
\& # Early in your program:
\&
\& use LWP 5.64; # Loads all important LWP classes, and makes
\& # sure your version is reasonably recent.
\&
\& my $browser = LWP::UserAgent\->new;
\&
\& ...
\&
\& # Then later, whenever you need to make a GET request:
\& my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq;
\&
\& my $response = $browser\->get( $url );
\& die "Can\*(Aqt get $url \-\- ", $response\->status_line
\& unless $response\->is_success;
\&
\& die "Hey, I was expecting HTML, not ", $response\->content_type
\& unless $response\->content_type eq \*(Aqtext/html\*(Aq;
\& # or whatever content\-type you\*(Aqre equipped to deal with
\&
\& # Otherwise, process the content somehow:
\&
\& if($response\->decoded_content =~ m/jazz/i) {
\& print "They\*(Aqre talking about jazz today on Fresh Air!\en";
\& }
\& else {
\& print "Fresh Air is apparently jazzless today.\en";
\& }
.Ve
.PP
There are two objects involved: \f(CW$browser\fR, which holds an object of
class LWP::UserAgent, and then the \f(CW$response\fR object, which is of
class HTTP::Response. You really need only one browser object per
program; but every time you make a request, you get back a new
HTTP::Response object, which will have some interesting attributes:
.IP "\(bu" 4
A status code indicating
success or failure
(which you can test with \f(CW\*(C`$response\->is_success\*(C'\fR).
.IP "\(bu" 4
An \s-1HTTP\s0 status
line that is hopefully informative if there's failure (which you can
see with \f(CW\*(C`$response\->status_line\*(C'\fR,
returning something like \*(L"404 Not Found\*(R").
.IP "\(bu" 4
A \s-1MIME\s0 content-type like \*(L"text/html\*(R", \*(L"image/gif\*(R",
\&\*(L"application/xml\*(R", etc., which you can see with
\&\f(CW\*(C`$response\->content_type\*(C'\fR.
.IP "\(bu" 4
The actual content of the response, in \f(CW\*(C`$response\->decoded_content\*(C'\fR.
If the response is \s-1HTML,\s0 that's where the \s-1HTML\s0 source will be; if
it's a \s-1GIF,\s0 then \f(CW\*(C`$response\->decoded_content\*(C'\fR will be the binary
\&\s-1GIF\s0 data.
.IP "\(bu" 4
And dozens of other convenient and more specific methods that are
documented in the docs for HTTP::Response, and its superclasses
HTTP::Message and HTTP::Headers.
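.PP
To make that list concrete, here is a small sketch that prints a few of
those attributes for one response. It assumes the \f(CW$browser\fR and \f(CW$url\fR
from the example above; the choice of the Last-Modified header is only
an illustration, and a server is free to omit it.
.PP
.Vb 1
\& my $response = $browser\->get($url);
\&
\& print "Code: ", $response\->code, "\en";       # like 200 or 404
\& print "Message: ", $response\->message, "\en"; # like "OK" or "Not Found"
\& print "Type: ", $response\->content_type, "\en";
\&
\& # header() returns undef if the server didn\*(Aqt send that header
\& my $modified = $response\->header(\*(AqLast\-Modified\*(Aq);
\& print "Last modified: $modified\en" if defined $modified;
\&
\& # All the response headers, as one block of text
\& print $response\->headers\->as_string;
.Ve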
.SS "Adding Other \s-1HTTP\s0 Request Headers"
.IX Subsection "Adding Other HTTP Request Headers"
The most commonly used syntax for requests is \f(CW\*(C`$response =
$browser\->get($url)\*(C'\fR, but in truth, you can add extra \s-1HTTP\s0 header
lines to the request by adding a list of key-value pairs after the \s-1URL,\s0
like so:
.PP
.Vb 1
\& $response = $browser\->get( $url, $key1, $value1, $key2, $value2, ... );
.Ve
.PP
For example, here's how to send some commonly used headers, in case
you're dealing with a site that would otherwise reject your request:
.PP
.Vb 6
\& my @ns_headers = (
\& \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq,
\& \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq,
\& \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq,
\& \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq,
\& );
\&
\& ...
\&
\& $response = $browser\->get($url, @ns_headers);
.Ve
.PP
If you weren't reusing that array, you could just go ahead and do this:
.PP
.Vb 6
\& $response = $browser\->get($url,
\& \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq,
\& \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq,
\& \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq,
\& \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq,
\& );
.Ve
.PP
If you were only ever changing the 'User\-Agent' line, you could just change
the \f(CW$browser\fR object's default line from \*(L"libwww\-perl/5.65\*(R" (or the like)
to whatever you like, using the LWP::UserAgent \f(CW\*(C`agent\*(C'\fR method:
.PP
.Vb 1
\& $browser\->agent(\*(AqMozilla/4.76 [en] (Win98; U)\*(Aq);
.Ve
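.PP
In the same way, if you want some other header sent with every request,
one option is the LWP::UserAgent \f(CW\*(C`default_header\*(C'\fR method. This is a minimal
sketch, and it assumes an \s-1LWP\s0 version recent enough to provide that method:
.PP
.Vb 2
\& # Sent with every later request made through $browser
\& $browser\->default_header( \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq );
\&
\& $response = $browser\->get($url);
.Ve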
.SS "Enabling Cookies"
.IX Subsection "Enabling Cookies"
A default LWP::UserAgent object acts like a browser with its cookie
support turned off. There are various ways of turning it on, all of
which involve setting
its \f(CW\*(C`cookie_jar\*(C'\fR attribute. A \*(L"cookie jar\*(R" is an object representing
a little database of all
the \s-1HTTP\s0 cookies that a browser knows about. It can correspond to a
file on disk, or it can be
an in-memory object that starts out empty, and whose collection of
cookies will disappear once the program is finished running.
.PP
To give a browser an in-memory empty cookie jar, you set its \f(CW\*(C`cookie_jar\*(C'\fR
attribute like so:
.PP
.Vb 2
\& use HTTP::CookieJar::LWP;
\& $browser\->cookie_jar( HTTP::CookieJar::LWP\->new );
.Ve
.PP
To save a cookie jar to disk, see \*(L"dump_cookies\*(R" in HTTP::CookieJar.
To load cookies from disk into a jar, see \*(L"load_cookies\*(R" in HTTP::CookieJar.
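.PP
As a minimal sketch of how those two methods might fit together, the
following keeps cookies across runs in a plain text file. The filename
\&\fIcookies.txt\fR and the one-cookie-per-line layout are assumptions made
for this example, not something that HTTP::CookieJar requires:
.PP
.Vb 2
\& use HTTP::CookieJar::LWP;
\& my $cookie_file = \*(Aqcookies.txt\*(Aq; # hypothetical filename
\&
\& my $jar = HTTP::CookieJar::LWP\->new;
\& if( open my $in, \*(Aq<\*(Aq, $cookie_file ) {
\&   chomp( my @saved = <$in> );
\&   close $in;
\&   $jar\->load_cookies(@saved); # restore cookies from the last run
\& }
\& $browser\->cookie_jar($jar);
\&
\& # ... make requests with $browser ...
\&
\& open my $out, \*(Aq>\*(Aq, $cookie_file
\&   or die "Can\*(Aqt write $cookie_file: $!";
\& print {$out} "$_\en" for $jar\->dump_cookies( { persistent => 1 } );
\& close $out or die "Can\*(Aqt close $cookie_file: $!";
.Ve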
.SS "Posting Form Data"
.IX Subsection "Posting Form Data"
Many \s-1HTML\s0 forms send data to their server using an \s-1HTTP POST\s0 request, which
you can send with this syntax:
.PP
.Vb 7
\& $response = $browser\->post( $url,
\& [
\& formkey1 => value1,
\& formkey2 => value2,
\& ...
\& ],
\& );
.Ve
.PP
Or if you need to send \s-1HTTP\s0 headers:
.PP
.Vb 9
\& $response = $browser\->post( $url,
\& [
\& formkey1 => value1,
\& formkey2 => value2,
\& ...
\& ],
\& headerkey1 => value1,
\& headerkey2 => value2,
\& );
.Ve
.PP
For example, the following program makes a search request to AltaVista
(by sending some form data via an \s-1HTTP POST\s0 request), and extracts from
the \s-1HTML\s0 the report of the number of matches:
.PP
.Vb 4
\& use strict;
\& use warnings;
\& use LWP 5.64;
\& my $browser = LWP::UserAgent\->new;
\&
\& my $word = \*(Aqtarragon\*(Aq;
\&
\& my $url = \*(Aqhttp://search.yahoo.com/yhs/search\*(Aq;
\& my $response = $browser\->post( $url,
\& [ \*(Aqq\*(Aq => $word, # the Altavista query string
\& \*(Aqfr\*(Aq => \*(Aqaltavista\*(Aq, \*(Aqpg\*(Aq => \*(Aqq\*(Aq, \*(Aqavkw\*(Aq => \*(Aqtgz\*(Aq, \*(Aqkl\*(Aq => \*(AqXX\*(Aq,
\& ]
\& );
\& die "$url error: ", $response\->status_line
\& unless $response\->is_success;
\& die "Weird content type at $url \-\- ", $response\->content_type
\& unless $response\->content_is_html;
\&
\& if( $response\->decoded_content =~ m{([0\-9,]+)(?:<.*?>)? results for} ) {
\& # The substring will be like "996,000</strong> results for"
\& print "$word: $1\en";
\& }
\& else {
\& print "Couldn\*(Aqt find the match\-string in the response\en";
\& }
.Ve
.SS "Sending \s-1GET\s0 Form Data"
.IX Subsection "Sending GET Form Data"
Some \s-1HTML\s0 forms convey their form data not by sending the data
in an \s-1HTTP POST\s0 request, but by making a normal \s-1GET\s0 request with
the data stuck on the end of the \s-1URL. \s0 For example, if you went to
\&\f(CW\*(C`www.imdb.com\*(C'\fR and ran a search on \*(L"Blade Runner\*(R", the \s-1URL\s0 you'd see
in your browser window would be:
.PP
.Vb 1
\& http://www.imdb.com/find?s=all&q=Blade+Runner
.Ve
.PP
To run the same search with \s-1LWP,\s0 you'd use this idiom, which involves
the \s-1URI\s0 class:
.PP
.Vb 3
\& use URI;
\& my $url = URI\->new( \*(Aqhttp://www.imdb.com/find\*(Aq );
\& # makes an object representing the URL
\&
\& $url\->query_form( # And here the form data pairs:
\& \*(Aqq\*(Aq => \*(AqBlade Runner\*(Aq,
\& \*(Aqs\*(Aq => \*(Aqall\*(Aq,
\& );
\&
\& my $response = $browser\->get($url);
.Ve
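.PP
If you want to check what \f(CW\*(C`query_form\*(C'\fR built, you can print the \s-1URI\s0
object, since it turns into the \s-1URL\s0 string when used as a string. A small
sketch, continuing from the code above:
.PP
.Vb 2
\& print "$url\en";
\& # prints http://www.imdb.com/find?q=Blade+Runner&s=all
.Ve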
.PP
See chapter 5 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1HTML\s0 forms
and of form data, and chapters 6 through 9 for a longer discussion of
extracting data from \s-1HTML.\s0
.SS "Absolutizing URLs"
.IX Subsection "Absolutizing URLs"
The \s-1URI\s0 class that we just mentioned above provides all sorts of methods
for accessing and modifying parts of URLs (such as asking what sort of \s-1URL\s0 it
is with \f(CW\*(C`$url\->scheme\*(C'\fR, and asking what host it refers to with \f(CW\*(C`$url\->host\*(C'\fR, and so on), as described in the docs for the \s-1URI\s0
class. However, the methods of most immediate interest
are the \f(CW\*(C`query_form\*(C'\fR method seen above, and now the \f(CW\*(C`new_abs\*(C'\fR method
for taking a probably-relative \s-1URL\s0 string (like \*(L"../foo.html\*(R") and getting
back an absolute \s-1URL \s0(like \*(L"http://www.perl.com/stuff/foo.html\*(R"), as
shown here:
.PP
.Vb 2
\& use URI;
\& $abs = URI\->new_abs($maybe_relative, $base);
.Ve
.PP
For example, consider this program that matches URLs in the \s-1HTML\s0
list of new modules in \s-1CPAN:\s0
.PP
.Vb 4
\& use strict;
\& use warnings;
\& use LWP;
\& my $browser = LWP::UserAgent\->new;
\&
\& my $url = \*(Aqhttp://www.cpan.org/RECENT.html\*(Aq;
\& my $response = $browser\->get($url);
\& die "Can\*(Aqt get $url \-\- ", $response\->status_line
\& unless $response\->is_success;
\&
\& my $html = $response\->decoded_content;
\& while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) {
\& print "$1\en";
\& }
.Ve
.PP
When run, it emits output that starts out something like this:
.PP
.Vb 7
\& MIRRORING.FROM
\& RECENT
\& RECENT.html
\& authors/00whois.html
\& authors/01mailrc.txt.gz
\& authors/id/A/AA/AASSAD/CHECKSUMS
\& ...
.Ve
.PP
However, if you actually want to have those be absolute URLs, you
can use the \s-1URI\s0 module's \f(CW\*(C`new_abs\*(C'\fR method, by changing the \f(CW\*(C`while\*(C'\fR
loop to this:
.PP
.Vb 3
\& while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) {
\& print URI\->new_abs( $1, $response\->base ) ,"\en";
\& }
.Ve
.PP
(The \f(CW\*(C`$response\->base\*(C'\fR method from HTTP::Message
is for returning what \s-1URL\s0
should be used for resolving relative URLs \*(-- it's usually just
the same as the \s-1URL\s0 that you requested.)
.PP
That program then emits nicely absolute URLs:
.PP
.Vb 7
\& http://www.cpan.org/MIRRORING.FROM
\& http://www.cpan.org/RECENT
\& http://www.cpan.org/RECENT.html
\& http://www.cpan.org/authors/00whois.html
\& http://www.cpan.org/authors/01mailrc.txt.gz
\& http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
\& ...
.Ve
.PP
See chapter 4 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1URI\s0 objects.
.PP
Of course, using a regexp to match hrefs is a bit simplistic, and for
more robust programs, you'll probably want to use an HTML-parsing module
like HTML::LinkExtor or HTML::TokeParser or even maybe
HTML::TreeBuilder.
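.PP
As a sketch of that more robust approach, here is roughly how the loop
above might look with HTML::LinkExtor. It assumes the \f(CW$html\fR and
\&\f(CW$response\fR variables from the program above, and it passes the base \s-1URL\s0
to the parser so that the links come back already absolutized:
.PP
.Vb 1
\& use HTML::LinkExtor;
\&
\& # No callback, so the parser just collects the links it finds
\& my $parser = HTML::LinkExtor\->new( undef, $response\->base );
\& $parser\->parse($html);
\& $parser\->eof;
\&
\& foreach my $link ( $parser\->links ) {
\&   my($tag, %attr) = @$link; # like ("a", href => $url_object)
\&   next unless $tag eq \*(Aqa\*(Aq and defined $attr{href};
\&   print $attr{href}, "\en";
\& }
.Ve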
.SS "Other Browser Attributes"
.IX Subsection "Other Browser Attributes"
LWP::UserAgent objects have many attributes for controlling how they
work. Here are a few notable ones:
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->timeout(15);\*(C'\fR
.Sp
This sets this browser object to give up on requests that don't answer
within 15 seconds.
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->protocols_allowed( [ \*(Aqhttp\*(Aq, \*(Aqgopher\*(Aq] );\*(C'\fR
.Sp
This sets this browser object to not speak any protocols other than \s-1HTTP\s0
and gopher. If it tries accessing any other kind of \s-1URL \s0(like an \*(L"ftp:\*(R"
or \*(L"mailto:\*(R" or \*(L"news:\*(R" \s-1URL\s0), then it won't actually try connecting, but
instead will immediately return an error code 500, with a message like
\&\*(L"Access to 'ftp' URIs has been disabled\*(R".
.IP "\(bu" 4
\&\f(CW\*(C`use LWP::ConnCache; $browser\->conn_cache(LWP::ConnCache\->new());\*(C'\fR
.Sp
This tells the browser object to try using the \s-1HTTP/1.1 \s0\*(L"Keep-Alive\*(R"
feature, which speeds up requests by reusing the same socket connection
for multiple requests to the same server.
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->agent( \*(AqSomeName/1.23 (more info here maybe)\*(Aq )\*(C'\fR
.Sp
This changes how the browser object will identify itself in
the default \*(L"User-Agent\*(R" line in its \s-1HTTP\s0 requests. By default,
it'll send \*(L"libwww\-perl/\fIversionnumber\fR\*(R", like
\&\*(L"libwww\-perl/5.65\*(R". You can change that to something more descriptive
like this:
.Sp
.Vb 1
\& $browser\->agent( \*(AqSomeName/3.14 (contact@robotplexus.int)\*(Aq );
.Ve
.Sp
Or if need be, you can go in disguise, like this:
.Sp
.Vb 1
\& $browser\->agent( \*(AqMozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)\*(Aq );
.Ve
.IP "\(bu" 4
\&\f(CW\*(C`push @{ $browser\->requests_redirectable }, \*(AqPOST\*(Aq;\*(C'\fR
.Sp
This tells this browser to obey redirection responses to \s-1POST\s0 requests
(like most modern interactive browsers), even though the \s-1HTTP RFC\s0 says
that should not normally be done.
.PP
For more options and information, see the full documentation for
LWP::UserAgent.
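.PP
Several of these attributes can also be given as options when the
browser object is created. The following is a sketch; the particular
values (and the program name in the agent string) are arbitrary:
.PP
.Vb 7
\& use LWP::UserAgent;
\& my $browser = LWP::UserAgent\->new(
\&   timeout => 15,
\&   agent => \*(AqSomeName/1.23\*(Aq, # hypothetical name
\&   protocols_allowed => [ \*(Aqhttp\*(Aq, \*(Aqhttps\*(Aq ],
\&   keep_alive => 1, # sets up an LWP::ConnCache for us
\& );
.Ve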
.SS "Writing Polite Robots"
.IX Subsection "Writing Polite Robots"
If you want to make sure that your LWP-based program respects \fIrobots.txt\fR
files and doesn't make too many requests too fast, you can use the LWP::RobotUA
class instead of the LWP::UserAgent class.
.PP
The LWP::RobotUA class is just like LWP::UserAgent, and you can use it like so:
.PP
.Vb 3
\& use LWP::RobotUA;
\& my $browser = LWP::RobotUA\->new(\*(AqYourSuperBot/1.34\*(Aq, \*(Aqyou@yoursite.com\*(Aq);
\& # Your bot\*(Aqs name and your email address
\&
\& my $response = $browser\->get($url);
.Ve
.PP
But LWP::RobotUA adds these features:
.IP "\(bu" 4
If the \fIrobots.txt\fR on \f(CW$url\fR's server forbids you from accessing
\&\f(CW$url\fR, then the \f(CW$browser\fR object (assuming it's of class LWP::RobotUA)
won't actually request it, but instead will give you back (in \f(CW$response\fR) a 403 error
with a message \*(L"Forbidden by robots.txt\*(R". That is, if you have this line:
.Sp
.Vb 2
\& die "$url \-\- ", $response\->status_line, "\enAborted"
\& unless $response\->is_success;
.Ve
.Sp
then the program would die with an error message like this:
.Sp
.Vb 2
\& http://whatever.site.int/pith/x.html \-\- 403 Forbidden by robots.txt
\& Aborted at whateverprogram.pl line 1234
.Ve
.IP "\(bu" 4
If this \f(CW$browser\fR object sees that the last time it talked to
\&\f(CW$url\fR's server was too recently, then it will pause (via \f(CW\*(C`sleep\*(C'\fR) to
avoid making too many requests too often. The pause is
one minute by default, but you can control it with the \f(CW\*(C`$browser\->delay( \f(CIminutes\f(CW )\*(C'\fR attribute.
.Sp
For example, this code:
.Sp
.Vb 1
\& $browser\->delay( 7/60 );
.Ve
.Sp
\&...means that this browser will pause when it needs to avoid talking to
any given server more than once every 7 seconds.
.PP
For more options and information, see the full documentation for
LWP::RobotUA.
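.PP
Putting those two features together, a polite fetching loop might look
like the following sketch. The bot name, the email address, and the
addresses in \f(CW@urls\fR are placeholders:
.PP
.Vb 3
\& use LWP::RobotUA;
\& my $browser = LWP::RobotUA\->new(\*(AqYourSuperBot/1.34\*(Aq, \*(Aqyou@yoursite.com\*(Aq);
\& $browser\->delay( 10/60 ); # at most one request every 10 seconds
\&
\& my @urls = (
\&   \*(Aqhttp://whatever.site.int/pith/x.html\*(Aq,
\&   \*(Aqhttp://whatever.site.int/pith/y.html\*(Aq,
\& );
\& foreach my $url (@urls) {
\&   my $response = $browser\->get($url); # may sleep before requesting
\&   unless( $response\->is_success ) {
\&     warn "Skipping $url \-\- ", $response\->status_line, "\en";
\&     next;
\&   }
\&   print "Got $url (", length( $response\->content ), " bytes)\en";
\& }
.Ve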
.SS "Using Proxies"
.IX Subsection "Using Proxies"
In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols. This is most
commonly the case when your \s-1LWP\s0 program is running (or could be running)
on a machine that is behind a firewall.
.PP
To make a browser object use proxies that are defined in the usual
environment variables (\f(CW\*(C`HTTP_PROXY\*(C'\fR, etc.), just call the \f(CW\*(C`env_proxy\*(C'\fR method
on a user-agent object before you go making any requests on it.
Specifically:
.PP
.Vb 2
\& use LWP::UserAgent;
\& my $browser = LWP::UserAgent\->new;
\&
\& # And before you go making any requests:
\& $browser\->env_proxy;
.Ve
.PP
For more information on proxy parameters, see the LWP::UserAgent
documentation, specifically the \f(CW\*(C`proxy\*(C'\fR, \f(CW\*(C`env_proxy\*(C'\fR,
and \f(CW\*(C`no_proxy\*(C'\fR methods.
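.PP
If the environment variables aren't set, you can name a proxy
explicitly with the \f(CW\*(C`proxy\*(C'\fR method instead. In this sketch, the proxy
host \f(CW\*(C`proxy.mysite.int\*(C'\fR and its port are made up for the example:
.PP
.Vb 2
\& # Send http and ftp requests through this proxy...
\& $browser\->proxy( [\*(Aqhttp\*(Aq, \*(Aqftp\*(Aq], \*(Aqhttp://proxy.mysite.int:8080/\*(Aq );
\&
\& # ...except requests to these domains, which go direct
\& $browser\->no_proxy( \*(Aqlocalhost\*(Aq, \*(Aqmysite.int\*(Aq );
.Ve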
.SS "\s-1HTTP\s0 Authentication"
.IX Subsection "HTTP Authentication"
Many web sites restrict access to documents by using \*(L"\s-1HTTP\s0
Authentication\*(R". This isn't just any form of \*(L"enter your password\*(R"
restriction, but is a specific mechanism where the \s-1HTTP\s0 server sends the
browser an \s-1HTTP\s0 code that says \*(L"That document is part of a protected
\&'realm', and you can access it only if you re-request it and add some
special authorization headers to your request\*(R".
.PP
For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with \s-1HTTP\s0 Authentication, and then publicly stating the username
and password (at \f(CW\*(C`http://www.unicode.org/mail\-arch/\*(C'\fR) \*(-- namely
username \*(L"unicode-ml\*(R" and password \*(L"unicode\*(R".
.PP
For example, consider this \s-1URL,\s0 which is part of the protected
area of the web site:
.PP
.Vb 1
\& http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
.Ve
.PP
If you access that with a browser, you'll get a prompt
like
\&\*(L"Enter username and password for 'Unicode\-MailList\-Archives' at server
\&'www.unicode.org'\*(R".
.PP
In \s-1LWP,\s0 if you just request that \s-1URL,\s0 like this:
.PP
.Vb 2
\& use LWP;
\& my $browser = LWP::UserAgent\->new;
\&
\& my $url =
\& \*(Aqhttp://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html\*(Aq;
\& my $response = $browser\->get($url);
\&
\& die "Error: ", $response\->header(\*(AqWWW\-Authenticate\*(Aq) || \*(AqError accessing\*(Aq,
\& # (\*(AqWWW\-Authenticate\*(Aq is the realm\-name)
\& "\en ", $response\->status_line, "\en at $url\en Aborting"
\& unless $response\->is_success;
.Ve
.PP
Then you'll get this error:
.PP
.Vb 4
\& Error: Basic realm="Unicode\-MailList\-Archives"
\& 401 Authorization Required
\& at http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
\& Aborting at auth1.pl line 9. [or wherever]
.Ve
.PP
\&...because the \f(CW$browser\fR doesn't know the username and password
for that realm (\*(L"Unicode-MailList-Archives\*(R") at that host
(\*(L"www.unicode.org\*(R"). The simplest way to let the browser know about this
is to use the \f(CW\*(C`credentials\*(C'\fR method to let it know about a username and
password that it can try using for that realm at that host. The syntax is:
.PP
.Vb 5
\& $browser\->credentials(
\& \*(Aqservername:portnumber\*(Aq,
\& \*(Aqrealm\-name\*(Aq,
\& \*(Aqusername\*(Aq => \*(Aqpassword\*(Aq
\& );
.Ve
.PP
In most cases, the port number is 80, the default \s-1TCP/IP\s0 port for \s-1HTTP\s0; and
you usually call the \f(CW\*(C`credentials\*(C'\fR method before you make any requests.
For example:
.PP
.Vb 5
\& $browser\->credentials(
\& \*(Aqreports.mybazouki.com:80\*(Aq,
\& \*(Aqweb_server_usage_reports\*(Aq,
\& \*(Aqplinky\*(Aq => \*(Aqbanjo123\*(Aq
\& );
.Ve
.PP
So if we add the following to the program above, right after the \f(CW\*(C`$browser = LWP::UserAgent\->new;\*(C'\fR line...
.PP
.Vb 5
\& $browser\->credentials( # add this to our $browser \*(Aqs "key ring"
\& \*(Aqwww.unicode.org:80\*(Aq,
\& \*(AqUnicode\-MailList\-Archives\*(Aq,
\& \*(Aqunicode\-ml\*(Aq => \*(Aqunicode\*(Aq
\& );
.Ve
.PP
\&...then when we run it, the request succeeds, instead of causing the
\&\f(CW\*(C`die\*(C'\fR to be called.
.SS "Accessing \s-1HTTPS\s0 URLs"
.IX Subsection "Accessing HTTPS URLs"
When you access an \s-1HTTPS URL,\s0 it'll work for you just like an \s-1HTTP URL\s0
would \*(-- if your \s-1LWP\s0 installation has \s-1HTTPS\s0 support (via an appropriate
Secure Sockets Layer library). For example:
.PP
.Vb 8
\& use LWP;
\& my $url = \*(Aqhttps://www.paypal.com/\*(Aq; # Yes, HTTPS!
\& my $browser = LWP::UserAgent\->new;
\& my $response = $browser\->get($url);
\& die "Error at $url\en ", $response\->status_line, "\en Aborting"
\& unless $response\->is_success;
\& print "Whee, it worked! I got that ",
\& $response\->content_type, " document!\en";
.Ve
.PP
If your \s-1LWP\s0 installation doesn't have \s-1HTTPS\s0 support set up, then the
response will be unsuccessful, and you'll get this error message:
.PP
.Vb 3
\& Error at https://www.paypal.com/
\& 501 Protocol scheme \*(Aqhttps\*(Aq is not supported
\& Aborting at paypal.pl line 7. [or whatever program and line]
.Ve
.PP
If your \s-1LWP\s0 installation \fIdoes\fR have \s-1HTTPS\s0 support installed, then the
response should be successful, and you should be able to consult
\&\f(CW$response\fR just like with any normal \s-1HTTP\s0 response.
.PP
For information about installing \s-1HTTPS\s0 support for your \s-1LWP\s0
installation, see the helpful \fI\s-1README.SSL\s0\fR file that comes in the
libwww-perl distribution.
.SS "Getting Large Documents"
.IX Subsection "Getting Large Documents"
When you're requesting a large (or at least potentially large) document,
a problem with the normal way of using the request methods (like \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR) is that the response object in
memory will have to hold the whole document \*(-- \fIin memory\fR. If the
response is a thirty megabyte file, this is likely to be quite an
imposition on this process's memory usage.
.PP
A notable alternative is to have \s-1LWP\s0 save the content to a file on disk,
instead of saving it up in memory. This is the syntax to use:
.PP
.Vb 3
\& $response = $browser\->get($url,
\& \*(Aq:content_file\*(Aq => $filespec,
\& );
.Ve
.PP
For example,
.PP
.Vb 3
\& $response = $browser\->get(\*(Aqhttp://search.cpan.org/\*(Aq,
\& \*(Aq:content_file\*(Aq => \*(Aq/tmp/sco.html\*(Aq
\& );
.Ve
.PP
When you use this \f(CW\*(C`:content_file\*(C'\fR option, the \f(CW$response\fR will have
all the normal header lines, but \f(CW\*(C`$response\->content\*(C'\fR will be
empty. Errors writing to the content file (for example due to
permission denied or the filesystem being full) will be reported via
the \f(CW\*(C`Client\-Aborted\*(C'\fR or \f(CW\*(C`X\-Died\*(C'\fR response headers, and not the
\&\f(CW\*(C`is_success\*(C'\fR method:
.PP
.Vb 2
\& if ( ($response\->header(\*(AqClient\-Aborted\*(Aq) || \*(Aq\*(Aq) eq \*(Aqdie\*(Aq ) {
\&   # handle error ...
\& }
.Ve
.PP
Note that this \*(L":content_file\*(R" option isn't supported under older
versions of \s-1LWP,\s0 so you should consider adding \f(CW\*(C`use LWP 5.66;\*(C'\fR to check
the \s-1LWP\s0 version, if you think your program might run on systems with
older versions.
.PP
If you need to be compatible with older \s-1LWP\s0 versions, then use
this syntax, which does the same thing:
.PP
.Vb 2
\& use HTTP::Request::Common;
\& $response = $browser\->request( GET($url), $filespec );
.Ve
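.PP
Another way to avoid holding the whole document in memory is to have
\&\s-1LWP\s0 hand you the content a chunk at a time, through the \f(CW\*(C`:content_cb\*(C'\fR
option. This sketch just counts bytes; it assumes a \f(CW$browser\fR object
like the ones above, and what you do with each chunk is up to you:
.PP
.Vb 9
\& my $bytes = 0;
\& $response = $browser\->get( \*(Aqhttp://search.cpan.org/\*(Aq,
\&   \*(Aq:content_cb\*(Aq => sub {
\&     my($chunk, $res, $protocol) = @_;
\&     $bytes += length $chunk; # or parse it, or write it somewhere
\&   },
\& );
\& die "Error: ", $response\->status_line unless $response\->is_success;
\& print "Received $bytes bytes\en";
.Ve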
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Remember, this article is just the most rudimentary introduction to
\&\s-1LWP\s0 \*(-- to learn more about \s-1LWP\s0 and LWP-related tasks, you really
must read from the following:
.IP "\(bu" 4
LWP::Simple \*(-- simple functions for getting/heading/mirroring URLs
.IP "\(bu" 4
\&\s-1LWP\s0 \*(-- overview of the libwww-perl modules
.IP "\(bu" 4
LWP::UserAgent \*(-- the class for objects that represent \*(L"virtual browsers\*(R"
.IP "\(bu" 4
HTTP::Response \*(-- the class for objects that represent the response to
an \s-1LWP\s0 request, as in \f(CW\*(C`$response = $browser\->get(...)\*(C'\fR
.IP "\(bu" 4
HTTP::Message and HTTP::Headers \*(-- classes that provide more methods
to HTTP::Response.
.IP "\(bu" 4
\&\s-1URI\s0 \*(-- class for objects that represent absolute or relative URLs
.IP "\(bu" 4
URI::Escape \*(-- functions for URL-escaping and URL-unescaping strings
(like turning \*(L"this & that\*(R" to and from \*(L"this%20%26%20that\*(R").
.IP "\(bu" 4
HTML::Entities \*(-- functions for HTML-escaping and HTML-unescaping strings
(like turning \*(L"C. & E. Bronte\*:\*(R" to and from \*(L"C. &amp; E. Bront&euml;\*(R")
.IP "\(bu" 4
HTML::TokeParser and HTML::TreeBuilder \*(-- classes for parsing \s-1HTML\s0
.IP "\(bu" 4
HTML::LinkExtor \*(-- class for finding links in \s-1HTML\s0 documents
.IP "\(bu" 4
The book \fIPerl & \s-1LWP\s0\fR by Sean M. Burke. O'Reilly & Associates,
2002. \s-1ISBN: 0\-596\-00178\-9, \s0<http://oreilly.com/catalog/perllwp/>. The
whole book is also available free online:
<http://lwp.interglacial.com>.
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2002, Sean M. Burke. You can redistribute this document and/or
modify it, but only under the same terms as Perl itself.
.SH "AUTHOR"
.IX Header "AUTHOR"
Sean M. Burke \f(CW\*(C`sburke@cpan.org\*(C'\fR