.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{
.    if \nF \{
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "XML::SAX::Intro 3"
.TH XML::SAX::Intro 3 "2019-06-13" "perl v5.16.3" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
XML::SAX::Intro \- An Introduction to SAX Parsing with Perl
.SH "Introduction"
.IX Header "Introduction"
\&\s-1XML::SAX\s0 is a new way to work with \s-1XML\s0 Parsers in Perl. In this article
we'll discuss why you should be using \s-1SAX,\s0 why you should be using
\&\s-1XML::SAX,\s0 and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback, or push-based,
parsing, but if you are unfamiliar with these techniques then a good
place to start is Kip Hampton's excellent series of articles on \s-1XML\s0.com.
.SH "Replacing XML::Parser"
.IX Header "Replacing XML::Parser"
The de-facto way of parsing \s-1XML\s0 under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and \s-1XS\s0 wrapper around
the expat \s-1XML\s0 parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly it is a proprietary \s-1API,\s0 designed before the \s-1SAX API\s0 was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately leads to code like:
.PP
.Vb 6
\&  sub handle_start {
\&    my ($e, $el, %attrs) = @_;
\&    if ($el eq \*(Aqfoo\*(Aq) {
\&      $e\->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
\&    }
\&  }
.Ve
.PP
As you can see, we're using the \f(CW$e\fR object to hold our state
information, which is a bad idea because we don't own that object \- we
didn't create it. It's an internal object of XML::Parser, that happens
to be a hashref. We could all too easily overwrite XML::Parser internal
state variables by using this, or Clark could change it to an array ref
(not that he would, because it would break so much code, but he could).
.PP
The only way currently with XML::Parser to safely maintain state is to
use a closure:
.PP
.Vb 2
\&  my $state = MyState\->new();
\&  $parser\->setHandlers(Start => sub { handle_start($state, @_) });
.Ve
.PP
This closure traps the \f(CW$state\fR variable, which now gets passed as the
first parameter to your callback. Unfortunately very few people use
this technique, as it is not documented in the XML::Parser \s-1POD\s0 files.
.PP
Another reason you might not want to use XML::Parser is because you
need some feature that it doesn't provide (such as validation), or you
might need to use a library that doesn't use expat, due to it not being
installed on your system, or due to having a restrictive \s-1ISP.\s0 Using \s-1SAX\s0
allows you to work around these restrictions.
.SH "Introducing SAX"
.IX Header "Introducing SAX"
\&\s-1SAX\s0 stands for the Simple \s-1API\s0 for \s-1XML.\s0 And simple it really is.
Constructing a \s-1SAX\s0 parser and passing events to handlers is done as
simply as:
.PP
.Vb 2
\&  use XML::SAX;
\&  use MySAXHandler;
\&  
\&  my $parser = XML::SAX::ParserFactory\->parser(
\&        Handler => MySAXHandler\->new
\&  );
\&  
\&  $parser\->parse_uri("foo.xml");
.Ve
.PP
The important concept to grasp here is that \s-1SAX\s0 uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that you can support other underlying
parser implementations for different feature sets. This is one thing
that XML::Parser has always sorely lacked.
.PP
In the code above we see the parse_uri method used, but we could
have equally well
called parse_file, parse_string, or \fIparse()\fR. Please see XML::SAX::Base
for what these methods take as parameters, but don't be fooled into
believing parse_file takes a filename. No, it takes a file handle, a
glob, or a subclass of IO::Handle. Beware.
.PP
\&\s-1SAX\s0 works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:
.PP
.Vb 2
\&  package MySAXHandler;
\&  use base qw(XML::SAX::Base);
\&  
\&  sub start_document {
\&    my ($self, $doc) = @_;
\&    # process document start event
\&  }
\&  
\&  sub start_element {
\&    my ($self, $el) = @_;
\&    # process element start event
\&  }
.Ve
.PP
Now, when we instantiate this as above, and parse some \s-1XML\s0 with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:
.PP
.Vb 1
\&  $object\->start_element($el);
.Ve
.PP
Notice how this is different to XML::Parser's calling style, which
calls:
.PP
.Vb 1
\&  start_element($e, $name, %attribs);
.Ve
.PP
It is this difference between function calling and method calling that
allows you to subclass \s-1SAX\s0 handlers, which contributes to \s-1SAX\s0 being a
powerful solution.
.PP
As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example) thus freeing you of the concerns listed
above. It is also an improvement in maintainability \- you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.
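.PP
As a minimal sketch of keeping state in \f(CW$self\fR, the handler below counts
the elements it sees. To stay self-contained it does not load XML::SAX at
all: the class name ElementCounter and its count method are hypothetical,
and the events are delivered by hand in the shape a parser would use. In
real code the class would normally inherit from XML::SAX::Base and be
handed to XML::SAX::ParserFactory as shown earlier.
.PP
```perl
use strict;
use warnings;

package ElementCounter;    # hypothetical handler class

sub new {
    my ($class) = @_;
    # All state lives in our own object, never in the parser's.
    return bless { seen => {} }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{seen}{ $el->{Name} }++;
}

sub count {
    my ($self, $name) = @_;
    return $self->{seen}{$name} || 0;
}

package main;

my $handler = ElementCounter->new;

# Stand-in for a parser: deliver start_element events by hand.
$handler->start_element({ Name => $_ }) for qw(foo bar foo);

print "foo elements: ", $handler->count('foo'), "\n";
```
.PP
Because the counts belong to the handler object, two handlers can run
side by side without interfering with each other or with the parser.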
.PP
\&\s-1SAX\s0 parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or we can pass the
handler directly in the call to one of the parse methods:
.PP
.Vb 4
\&  $parser\->parse(Handler => $handler, 
\&                 Source => { SystemId => "foo.xml" });
\&  # or...
\&  $parser\->parse_file($fh, Handler => $handler);
.Ve
.PP
This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a time
consuming process).
.SH "Callback Parameters"
.IX Header "Callback Parameters"
The only other thing you need to know to understand basic \s-1SAX\s0 is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed as multiple options to the
callbacks, so for example the Start callback would be called as
my_start($e, \f(CW$name\fR, \f(CW%attributes\fR), and the \s-1PI\s0 callback would be called
as my_processing_instruction($e, \f(CW$target\fR, \f(CW$data\fR). In \s-1SAX,\s0 every
callback is passed a hash reference, containing entries that define our
\&\*(L"node\*(R". The key callbacks and the structures they receive are:
.SS "start_element"
.IX Subsection "start_element"
The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:
.IP "LocalName" 4
.IX Item "LocalName"
The name of the element minus any namespace prefix it may
have come with in the document.
.IP "NamespaceURI" 4
.IX Item "NamespaceURI"
The \s-1URI\s0 of the namespace associated with this element,
or the empty string for none.
.IP "Attributes" 4
.IX Item "Attributes"
A set of attributes as described below.
.IP "Name" 4
.IX Item "Name"
The name of the element as it was seen in the document (i.e.
including any prefix associated with it).
.IP "Prefix" 4
.IX Item "Prefix"
The prefix used to qualify this element's namespace, or the 
empty string if none.
.PP
The \fBAttributes\fR are a hash reference, keyed by what we have called
\&\*(L"James Clark\*(R" notation. This means that the attribute name has been
expanded to include any associated namespace \s-1URI,\s0 and put together as
{ns}name, where \*(L"ns\*(R" is the expanded namespace \s-1URI\s0 of the attribute if
and only if the attribute had a prefix, and \*(L"name\*(R" is the LocalName of
the attribute.
.PP
The value of each entry in the attributes hash is another hash
structure consisting of:
.IP "LocalName" 4
.IX Item "LocalName"
The name of the attribute minus any namespace prefix it may have
come with in the document.
.IP "NamespaceURI" 4
.IX Item "NamespaceURI"
The \s-1URI\s0 of the namespace associated with this attribute. If the 
attribute had no prefix, then this consists of just the empty string.
.IP "Name" 4
.IX Item "Name"
The attribute's name as it appeared in the document, including any 
namespace prefix.
.IP "Prefix" 4
.IX Item "Prefix"
The prefix used to qualify this attribute's namespace, or the 
empty string if none.
.IP "Value" 4
.IX Item "Value"
The value of the attribute.
.PP
So a full example, as output by Data::Dumper might be:
.PP
.Vb 1
\&  ....
.Ve
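.PP
As a hypothetical illustration, the sketch below builds an element
structure by hand from the field descriptions in this section and dumps
it. The element name, prefix, and namespace URI are invented for the
example, the prefix is assumed to have been declared on an ancestor
element, and a real parser may supply additional keys.
.PP
```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical structure for <x:item x:id="42" class="big">, assuming
# the prefix "x" was bound to http://example.com/ns on an ancestor.
my $ns = 'http://example.com/ns';
my $el = {
    Name         => 'x:item',
    LocalName    => 'item',
    Prefix       => 'x',
    NamespaceURI => $ns,
    Attributes   => {
        '{' . $ns . '}id' => {
            Name => 'x:id', LocalName => 'id', Prefix => 'x',
            NamespaceURI => $ns, Value => '42',
        },
        '{}class' => {
            Name => 'class', LocalName => 'class', Prefix => '',
            NamespaceURI => '', Value => 'big',
        },
    },
};

$Data::Dumper::Sortkeys = 1;
print Dumper($el);

# Attribute lookup uses the {ns}name key; unprefixed names use {}.
print $el->{Attributes}{'{}class'}{Value}, "\n";
```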
.SS "end_element"
.IX Subsection "end_element"
The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element (do
note however that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element). There is no simple
way in \s-1SAX\s0 to determine whether the parser in fact saw an empty element or a
start and end element with no content.
.PP
The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. One must note though that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.
.SS "characters"
.IX Subsection "characters"
The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a \s-1CDATA\s0 section, and is also called
in other situations. A \s-1SAX\s0 parser has to make no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
\&\s-1XML\s0 document \- it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the \s-1SAX\s0 developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).
.PP
The characters handler is called with a very simple structure \- a hash
reference consisting of just one entry:
.IP "Data" 4
.IX Item "Data"
The text data that was received.
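.PP
One way the bundling technique can look is sketched below: text is
appended to a buffer in characters and only used once end_element
arrives. The class name TitleCollector and the title element are
hypothetical, the events are delivered by hand rather than by a parser,
and the sketch ignores elements nested inside the one being collected.
.PP
```perl
use strict;
use warnings;

package TitleCollector;    # hypothetical handler class

sub new { return bless { buffer => '', titles => [] }, shift }

sub start_element {
    my ($self, $el) = @_;
    # Start with an empty buffer for each element we care about.
    $self->{buffer} = '' if $el->{Name} eq 'title';
}

sub characters {
    my ($self, $chars) = @_;
    # May be called many times for one run of text, so only append.
    $self->{buffer} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    # The text is complete only once the closing tag arrives.
    push @{ $self->{titles} }, $self->{buffer} if $el->{Name} eq 'title';
}

package main;

my $h = TitleCollector->new;

# Stand-in for a parser that splits "SAX Intro" into three chunks.
$h->start_element({ Name => 'title' });
$h->characters({ Data => $_ }) for 'SAX', ' ', 'Intro';
$h->end_element({ Name => 'title' });

print "$h->{titles}[0]\n";
```
.PP
The handler gives the same result whether the text arrives in one call
or in many, which is the point of the technique.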
.SS "comment"
.IX Subsection "comment"
The comment callback is called for comment text. Unlike with
\&\f(CW\*(C`characters()\*(C'\fR, the comment callback \fImust\fR be invoked just once for an
entire comment string. It receives a single simple structure \- a hash
reference containing just one entry:
.IP "Data" 4
.IX Item "Data"
The text of the comment.
.SS "processing_instruction"
.IX Subsection "processing_instruction"
The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the \s-1XML\s0 specification.
.PP
The handler is passed a structure containing just two entries:
.IP "Target" 4
.IX Item "Target"
The target of the processing instruction.
.IP "Data" 4
.IX Item "Data"
The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data element. 
For example <?wiggle?> is a perfectly valid processing instruction.
.SH "Tip of the iceberg"
.IX Header "Tip of the iceberg"
What we have discussed above is really the tip of the \s-1SAX\s0 iceberg. And
so far it looks like there's not much of interest to \s-1SAX\s0 beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.
.PP
People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's reason to all this madness. And that reason is \s-1SAX\s0
Filters.
.PP
As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how \s-1SAX\s0 filters work. It's
Unix pipes for the 21st century!
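.PP
The idea can be sketched with two hand-rolled classes: a filter that
takes a Handler option, changes the data, and passes it on, and a final
handler that collects the result. Both class names are hypothetical and
only the characters event is forwarded. This is an illustration of the
concept under those assumptions, not how a production filter should be
written; a real filter would subclass XML::SAX::Base so that every event
is forwarded correctly.
.PP
```perl
use strict;
use warnings;

package UpperCaseFilter;    # hypothetical filter class

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

sub characters {
    my ($self, $chars) = @_;
    # Pass a modified copy downstream; leave the caller's data alone.
    $self->{Handler}->characters({ Data => uc($chars->{Data}) });
}

package Collector;          # hypothetical end-of-chain handler

sub new { return bless { text => '' }, shift }

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};
}

package main;

my $sink   = Collector->new;
my $filter = UpperCaseFilter->new(Handler => $sink);

# Stand-in for a parser whose Handler is the filter.
$filter->characters({ Data => 'hello ' });
$filter->characters({ Data => 'world' });

print "$sink->{text}\n";
```
.PP
The upstream caller only ever sees the filter, and the filter only ever
sees its own Handler, so stages can be added or removed without changing
their neighbours.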
.PP
There are two downsides to this. Firstly, writing \s-1SAX\s0 filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes.
So making sure your filter does the right thing can be tricky.
Secondly, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.
.PP
Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!
.PP
The first module is XML::SAX::Base. This is a \s-1VITAL SAX\s0 module that
acts as a base class for all \s-1SAX\s0 parsers and filters. It provides an
abstraction away from calling the handler methods, that makes sure your
filter or parser does the right thing, and it does it \s-1FAST.\s0 So, if you
ever need to write a \s-1SAX\s0 filter (and if you're processing \s-1XML \-\s0> \s-1XML\s0
or \s-1XML \-\s0> \s-1HTML\s0 then you probably do), you should write it as
a subclass of XML::SAX::Base. Really \- this is advice not to be ignored
lightly. I will not go into the details of writing a \s-1SAX\s0 filter here.
Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
his article on \s-1XML\s0.com here <\s-1URI\s0>.
.PP
To construct \s-1SAX\s0 pipelines, Barrie Slaymaker, a long time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever \s-1SAX\s0 filter-type modules, with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex \s-1SAX\s0
filter pipelines.
.PP
.Vb 4
\&  use XML::SAX::ParserFactory;
\&  use XML::SAX::Filter1;
\&  use XML::SAX::Filter2;
\&  use XML::SAX::Writer;
\&  
\&  my $output_string;
\&  my $writer = XML::SAX::Writer\->new(Output => \e$output_string);
\&  my $filter2 = XML::SAX::Filter2\->new(Handler => $writer);
\&  my $filter1 = XML::SAX::Filter1\->new(Handler => $filter2);
\&  my $parser = XML::SAX::ParserFactory\->parser(Handler => $filter1);
\&  
\&  $parser\->parse_uri("foo.xml");
.Ve
.PP
This is a lot easier with XML::SAX::Machines:
.PP
.Vb 1
\&  use XML::SAX::Machines qw(Pipeline);
\&  
\&  my $output_string;
\&  my $parser = Pipeline(
\&        XML::SAX::Filter1 => XML::SAX::Filter2 => \e$output_string
\&        );
\&  
\&  $parser\->parse_uri("foo.xml");
.Ve
.PP
One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).
.PP
Just a final tip. If you ever get stuck and are confused about what is
being passed from one \s-1SAX\s0 filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the \s-1SAX\s0 stream of events as it goes by. Usage is
really very simple: just call your Perl script that uses \s-1SAX\s0 as follows:
.PP
.Vb 1
\&  $ perl \-d:TraceSAX <scriptname>
.Ve
.PP
And preferably pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.
.SH "AUTHOR"
.IX Header "AUTHOR"
Matt Sergeant, matt@sergeant.org
.PP
\&\f(CW$Id\fR$