SAX

SAX stands for Simple API for XML, and was originally a Java API for reading XML. (Full details at http://saxproject.org). SAX implementations exist for most common modern computer languages.

FoX includes a SAX implementation, which translates most of the Java API into Fortran, and makes it accessible to Fortran programs, enabling them to read in XML documents in a fashion as close and familiar as possible to other languages.

SAX is a stream-based, event callback API. Conceptually, running a SAX parser over a document results in the parser generating events as it encounters different XML components and sending those events to the main program, which can read them and take suitable action.

Events

Events are generated when the parser encounters, for example, an element opening tag, or some text, and most events carry some data with them - the name of the tag, or the contents of the text.

The full list of events is quite extensive, and may be seen below. Most users, though, are unlikely to need more than the five most common events, documented here.

startDocument - generated when the parser starts reading the document. No accompanying data.
endDocument - generated when the parser reaches the end of the document. No accompanying data.
startElement - generated by an element opening tag. Accompanied by tag name, namespace information, and a list of attributes.
endElement - generated by an element closing tag. Accompanied by tag name, and namespace information.
characters - generated by text between tags. Accompanied by contents of text.

Given these events and accompanying information, a program can extract data from an XML document.
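As a rough illustration of the five events above, the following sketch registers one handler for each and prints a line whenever it is called. The handler interfaces are those listed in the API section of this document; the module name trace_events, the program name, and the sample XML string are made up for the example. The calls used to start the parser are explained in the sections that follow.

```fortran
module trace_events
  use FoX_sax
  implicit none
contains

  subroutine startDocument_handler()
    print*, "startDocument"
  end subroutine startDocument_handler

  subroutine endDocument_handler()
    print*, "endDocument"
  end subroutine endDocument_handler

  subroutine startElement_handler(namespaceURI, localName, name, attributes)
    character(len=*), intent(in)   :: namespaceURI
    character(len=*), intent(in)   :: localName
    character(len=*), intent(in)   :: name
    type(dictionary_t), intent(in) :: attributes
    ! Report the tag name and how many attributes came with it.
    print*, "startElement: ", name, " attributes: ", getLength(attributes)
  end subroutine startElement_handler

  subroutine endElement_handler(namespaceURI, localName, name)
    character(len=*), intent(in) :: namespaceURI
    character(len=*), intent(in) :: localName
    character(len=*), intent(in) :: name
    print*, "endElement: ", name
  end subroutine endElement_handler

  subroutine characters_handler(chunk)
    character(len=*), intent(in) :: chunk
    print*, "characters: ", chunk
  end subroutine characters_handler

end module trace_events

program trace
  use FoX_sax
  use trace_events
  implicit none
  type(xml_t) :: xp

  ! Sample document, chosen only for illustration.
  call open_xml_string(xp, '<greeting lang="en">Hello</greeting>')
  call parse(xp, startDocument_handler=startDocument_handler, &
                 endDocument_handler=endDocument_handler,     &
                 startElement_handler=startElement_handler,   &
                 endElement_handler=endElement_handler,       &
                 characters_handler=characters_handler)
  call close_xml_t(xp)
end program trace
```

For this input one would expect a startDocument line first, then a startElement, one or more characters events, an endElement, and finally endDocument.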

Invoking the parser.

Any program using the FoX SAX parser must a) use the FoX_sax module, and b) declare a derived type variable to hold the parser, like so:

   use FoX_sax
   type(xml_t) :: xp

The FoX SAX parser then works by requiring the programmer to write a module containing subroutines to receive any of the events they are interested in, and passing these subroutines to the parser.

Firstly, the parser must be initialized by passing it XML data. This can be done either by giving the name of a file, which the parser will open and read, or by passing a string containing an XML document. Thus:

  call open_xml_file(xp, "input.xml", iostat)

The iostat variable will report back any errors in opening the file.
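A minimal sketch of checking iostat before going any further is shown here. It relies on the statement in the API section that a value of 0 indicates success; the program name and the message text are illustrative only.

```fortran
program open_checked
  use FoX_sax
  implicit none
  type(xml_t) :: xp
  integer :: iostat

  call open_xml_file(xp, "input.xml", iostat)
  if (iostat /= 0) then
    ! Non-zero means the file could not be opened; do not try to parse.
    print*, "Could not open input.xml; iostat =", iostat
    stop
  end if

  ! ... the call to parse would go here ...

  call close_xml_t(xp)
end program open_checked
```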

Alternatively,

  call open_xml_string(xp, XMLstring)

where XMLstring is a character variable.
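As a sketch of the string route, the program below parses a short in-memory document and prints the name of each element. The module name element_names and the contents of XMLstring are invented for the example; the handler follows the startElement_handler interface given in the API section.

```fortran
module element_names
  use FoX_sax
  implicit none
contains

  subroutine startElement_handler(namespaceURI, localName, name, attributes)
    character(len=*), intent(in)   :: namespaceURI
    character(len=*), intent(in)   :: localName
    character(len=*), intent(in)   :: name
    type(dictionary_t), intent(in) :: attributes
    print*, "element: ", name
  end subroutine startElement_handler

end module element_names

program parse_string
  use FoX_sax
  use element_names
  implicit none
  type(xml_t) :: xp
  ! Illustrative document held entirely in a character variable.
  character(len=*), parameter :: XMLstring = &
    '<inventory><item id="1"/><item id="2"/></inventory>'

  call open_xml_string(xp, XMLstring)
  call parse(xp, startElement_handler=startElement_handler)
  call close_xml_t(xp)
end program parse_string
```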

To now run the parser over the file, you simply do:

 call parse(xp, list_of_event_handlers)

And once you're finished, you can close the file, and clean up the parser, with:

 call close_xml_t(xp)

Options to parser

It is unlikely that most users will need to operate any of these options, but the following are available for use; all are optional boolean arguments to parse.

namespaces
Does namespace processing occur? Default is .true., and if on, then any non-namespace-well-formed documents will be rejected, and namespace URI resolution will be performed according to the version of XML in question. If off, then documents will be processed without regard for namespace well-formedness, and no namespace URI resolution will be performed.

namespace_prefixes
Are xmlns attributes reported through the SAX parser? Default is .false.; all such attributes are removed by the parser, and transparent namespace URI resolution is performed. If on, then such attributes will be reported, and treated according to the value of xmlns_uris below. (If namespaces is .false., this flag has no effect.)

validate
Should validation be performed? Default is .false.: no validation checks are made, and the influence of the DTD on the XML Infoset is ignored. (Ill-formed DTDs will still cause fatal errors, of course.) If .true., then validation will be performed, and the Infoset modified accordingly.

xmlns_uris
Should xmlns attributes have a namespace of http://www.w3.org/2000/xmlns/? Default is .false.: if such attributes are reported, they have no namespace. If .true., then they are supplied with the appropriate namespace. (If namespaces or namespace_prefixes is .false., then this flag has no effect.)
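A sketch of passing these options, assuming they are given as keyword arguments with exactly the names listed above, alongside the event handlers. The handler module is the simple characters example described under "Receiving events"; the program name is made up.

```fortran
module event_handling
  use FoX_sax
  implicit none
contains

  subroutine characters_handler(chars)
    character(len=*), intent(in) :: chars
    print*, chars
  end subroutine characters_handler

end module event_handling

program XMLvalidate
  use FoX_sax
  use event_handling
  implicit none
  type(xml_t) :: xp

  call open_xml_file(xp, 'input.xml')
  ! Turn validation on and namespace processing off for this run.
  call parse(xp, characters_handler=characters_handler, &
             validate=.true., namespaces=.false.)
  call close_xml_t(xp)
end program XMLvalidate
```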

Receiving events

To receive events, you must construct a module containing event handling subroutines. These are subroutines of a prescribed form - the input & output is predetermined by the requirements of the SAX interface, but the body of the subroutine is up to you.

The required forms are shown in the API documentation below, but here are some simple examples.

To receive notification of character events, you must write a subroutine which takes as input one string, which will contain the characters received. So:

module event_handling
  use FoX_sax
contains

  subroutine characters_handler(chars)
    character(len=*), intent(in) :: chars

    print*, chars
  end subroutine
end module

That does very little - it simply prints out the data it receives. However, since the subroutine is in a module, you can save the data to a module variable, and manipulate it elsewhere; alternatively you can choose to call other subroutines based on the input.

So, a complete program which reads in all the text from an XML document looks like this:

module event_handling
  use FoX_sax
contains

  subroutine characters_handler(chars)
    character(len=*), intent(in) :: chars

    print*, chars
  end subroutine
end module

program XMLreader
  use FoX_sax
  use event_handling
  type(xml_t) :: xp
  call open_xml_file(xp, 'input.xml')
  call parse(xp, characters_handler=characters_handler)
  call close_xml_t(xp)
end program
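As a variation on the program above, the handler can store what it receives in a module variable rather than printing it. The sketch below simply counts the characters reported; the names count_text, n_chars and XMLcounter are invented for the example.

```fortran
module count_text
  use FoX_sax
  implicit none
  ! Module variable: visible to the handler and to the main program.
  integer, save :: n_chars = 0
contains

  subroutine characters_handler(chars)
    character(len=*), intent(in) :: chars
    ! Accumulate the length of every chunk of text received.
    n_chars = n_chars + len(chars)
  end subroutine characters_handler

end module count_text

program XMLcounter
  use FoX_sax
  use count_text
  implicit none
  type(xml_t) :: xp

  call open_xml_file(xp, 'input.xml')
  call parse(xp, characters_handler=characters_handler)
  call close_xml_t(xp)

  print*, "Characters of text read:", n_chars
end program XMLcounter
```

Note that the count includes whitespace between tags, since all character data is reported.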

Attribute dictionaries.

The other event likely to be used most often is the startElement event. Handling it involves writing a subroutine which takes as input three strings (which are the namespace URI, local name, and fully qualified name of the tag) and a dictionary of attributes.

An attribute dictionary is essentially a set of key:value pairs - where the key is the attribute's name, and the value is its value. (When considering namespaces, each attribute also has a URI and localName.)

Full details of all the dictionary-manipulation routines are given in AttributeDictionaries, but here we shall show the most common.

getLength(dictionary) - returns the number of entries in the dictionary (the number of attributes declared).
hasKey(dictionary, qName) (where qName is a string) returns .true. or .false. depending on whether an attribute named qName is present.
hasKey(dictionary, URI, localname) (where URI and localname are strings) returns .true. or .false. depending on whether an attribute with the appropriate URI and localname is present.
getQName(dictionary, i) (where i is an integer) returns a string containing the key of the ith dictionary entry (i.e. the name of the ith attribute).
getValue(dictionary, i) (where i is an integer) returns a string containing the value of the ith dictionary entry (i.e. the value of the ith attribute).
getValue(dictionary, URI, localname) (where URI and localname are strings) returns a string containing the value of the attribute with the appropriate URI and localname (if it is present).

So, a simple subroutine to receive a startElement event would look like:

module event_handling
  use FoX_sax

contains

 subroutine startElement_handler(URI, localname, name,attributes)
   character(len=*), intent(in)   :: URI  
   character(len=*), intent(in)   :: localname
   character(len=*), intent(in)   :: name 
   type(dictionary_t), intent(in) :: attributes

   integer :: i

   print*, name

   do i = 1, getLength(attributes)
      print*, getQName(attributes, i), '=', getValue(attributes, i)
   enddo

  end subroutine startElement_handler
end module

program XMLreader
 use FoX_sax
 use event_handling
 type(xml_t) :: xp
 call open_xml_file(xp, 'input.xml')
 call parse(xp, startElement_handler=startElement_handler)
 call close_xml_t(xp)
end program

Again, this does nothing but print out the name of the element, and the names and values of all of its attributes. However, by using module variables, or calling other subroutines, the data could be manipulated further.
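For instance, a handler might look for one particular attribute instead of printing them all. The sketch below uses only the dictionary routines listed earlier (hasKey, getLength, getQName, getValue) to report elements carrying an attribute called id; the attribute name and the module name find_id are chosen purely for illustration.

```fortran
module find_id
  use FoX_sax
  implicit none
contains

  subroutine startElement_handler(URI, localname, name, attributes)
    character(len=*), intent(in)   :: URI
    character(len=*), intent(in)   :: localname
    character(len=*), intent(in)   :: name
    type(dictionary_t), intent(in) :: attributes

    integer :: i

    ! Nothing to do unless this element has an "id" attribute.
    if (.not. hasKey(attributes, "id")) return

    ! Find the matching entry by name and print its value.
    do i = 1, getLength(attributes)
      if (getQName(attributes, i) == "id") then
        print*, name, " has id ", getValue(attributes, i)
        exit
      end if
    end do
  end subroutine startElement_handler

end module find_id
```

This module could be used from a main program in the same way as event_handling, by passing startElement_handler=startElement_handler to parse.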

Error handling

The SAX parser detects all XML well-formedness errors (and optionally validation errors). By default, when it encounters an error, it will simply halt the program with a suitable error message. However, it is possible to pass in an error handling subroutine if some other behaviour is desired - for example it may be nice to report the error to the user, finish parsing, and carry on with some other task.

In any case, once an error is encountered, the parser will finish. There is no way to continue reading past an error. (This means that all errors are treated as fatal errors, in the terminology of the XML standard).

An error handling subroutine works in the same way as any other event handler, with the event data being an error message. Thus, you could write:

subroutine fatalError_handler(msg)
  character(len=*), intent(in) :: msg

  print*, "The SAX parser encountered an error:"
  print*, msg
  print*, "Never mind, carrying on with the rest of the calculation."
end subroutine
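A sketch of how such a handler might be wired in, assuming it is passed to parse by keyword under the name fatalError_handler, as in the list of callbacks in the API section. The input string is deliberately not well-formed; the module and program names are invented.

```fortran
module error_handling
  implicit none
contains

  subroutine fatalError_handler(msg)
    character(len=*), intent(in) :: msg
    print*, "The SAX parser encountered an error:"
    print*, msg
  end subroutine fatalError_handler

end module error_handling

program XMLchecker
  use FoX_sax
  use error_handling
  implicit none
  type(xml_t) :: xp

  ! Mismatched tags: <b> is never closed before </a>.
  call open_xml_string(xp, "<a><b></a>")
  call parse(xp, fatalError_handler=fatalError_handler)
  call close_xml_t(xp)

  ! With a handler supplied, control is expected to return here
  ! instead of the program being halted by the parser.
  print*, "Parsing finished; carrying on with the rest of the program."
end program XMLchecker
```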

Stopping the parser.

The parser can be stopped at any time. From within one of the callback subroutines, simply do:

call stop_parser(xp)

(where xp is the XML parser object). The current callback function will be completed, then the parser will be stopped, and control will return to the main program, the parser having finished.
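Since the callback needs access to xp, one possible arrangement (an assumption of this sketch, not something the interface requires) is to hold the parser object in the same module as the handlers. The example stops as soon as an element named title is seen; the element name and the module and program names are illustrative.

```fortran
module first_title
  use FoX_sax
  implicit none
  ! Parser handle kept at module level so the handler can reach it.
  type(xml_t), save :: xp
contains

  subroutine startElement_handler(namespaceURI, localName, name, attributes)
    character(len=*), intent(in)   :: namespaceURI
    character(len=*), intent(in)   :: localName
    character(len=*), intent(in)   :: name
    type(dictionary_t), intent(in) :: attributes

    if (name == "title") then
      print*, "Found a title element; stopping the parser."
      call stop_parser(xp)
    end if
  end subroutine startElement_handler

end module first_title

program stop_early
  use FoX_sax
  use first_title
  implicit none

  call open_xml_file(xp, 'input.xml')
  call parse(xp, startElement_handler=startElement_handler)
  call close_xml_t(xp)
end program stop_early
```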

Full API

Derived types

There is one derived type, xml_t. This is entirely opaque, and is used as a handle for the parser.

Subroutines

There are four subroutines:

open_xml_file

  type(xml_t), intent(inout) :: xp
  character(len=*), intent(in) :: string
  integer, intent(out), optional :: iostat

This opens a file. xp is initialized, and prepared for parsing. string must contain the name of the file to be opened. iostat reports on the success of opening the file. A value of 0 indicates success.

open_xml_string

  type(xml_t), intent(inout) :: xp
  character(len=*), intent(in) :: string

This prepares to parse a string containing XML data. xp is initialized. string must contain the XML data.

close_xml_t

  type(xml_t), intent(inout) :: xp

This closes down the parser (and closes the file, if input was coming from a file.) xp is left uninitialized, ready to be used again if necessary.

parse

  type(xml_t), intent(inout) :: xp
  external :: list of event handlers
  logical, optional, intent(in) :: validate

This tells xp to start parsing its document.

(Advanced: See above for the list of options that the parse subroutine may take.)

The full list of event handlers is in the next section. To use them, the interface must be placed in a module, and the body of the subroutine filled in as desired; then it should be specified as an argument to parse as:

  name_of_event_handler = name_of_user_written_subroutine

Thus a typical call to parse might look something like:

  call parse(xp, startElement_handler = mystartelement, endElement_handler = myendelement, characters_handler = mychars)

where mystartelement, myendelement, and mychars are all subroutines written by you according to the interfaces listed below.

Callbacks.

All of the callbacks specified by SAX 2 are implemented. Documentation of the SAX 2 interfaces is available in the JavaDoc at http://saxproject.org, but as the interfaces needed adjustment for Fortran, they are listed here.

For documentation on the meaning of the callbacks and of their arguments, please refer to the Java SAX documentation.

characters_handler

  subroutine characters_handler(chunk)
    character(len=*), intent(in) :: chunk
  end subroutine characters_handler

Triggered when some character data is read from between tags.

NB All character data is reported, including whitespace. Thus you will probably get a lot of whitespace-only characters events in a typical XML document.

NB It is not required that a single chunk of character data all come as one event - it may come as multiple consecutive events. You should concatenate the results of consecutive characters events before processing.
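One way to do this concatenation is to append each chunk to a buffer and only act on the buffer when the element closes. The sketch below uses a fixed-length buffer, and makes two simplifying assumptions that are not part of FoX: text longer than max_len is silently truncated, and text is only of interest inside leaf elements (the buffer is reset at every opening tag). All names other than the handler interfaces are invented for the example.

```fortran
module collect_text
  use FoX_sax
  implicit none
  integer, parameter :: max_len = 1024          ! hypothetical buffer size
  character(len=max_len), save :: buffer = ""
  integer, save :: used = 0                     ! characters stored so far
  ! Characters treated as whitespace: space, tab, line feed, carriage return.
  character(len=4), parameter :: ws = " " // achar(9) // achar(10) // achar(13)
contains

  subroutine startElement_handler(namespaceURI, localName, name, attributes)
    character(len=*), intent(in)   :: namespaceURI
    character(len=*), intent(in)   :: localName
    character(len=*), intent(in)   :: name
    type(dictionary_t), intent(in) :: attributes
    used = 0                                    ! start collecting afresh
  end subroutine startElement_handler

  subroutine characters_handler(chunk)
    character(len=*), intent(in) :: chunk
    integer :: n
    ! Append as much of this chunk as still fits in the buffer.
    n = min(len(chunk), max_len - used)
    if (n > 0) buffer(used+1:used+n) = chunk(1:n)
    used = used + n
  end subroutine characters_handler

  subroutine endElement_handler(namespaceURI, localName, name)
    character(len=*), intent(in) :: namespaceURI
    character(len=*), intent(in) :: localName
    character(len=*), intent(in) :: name
    ! Process the accumulated text, skipping whitespace-only content.
    if (verify(buffer(1:used), ws) > 0) print*, name, ": ", buffer(1:used)
    used = 0
  end subroutine endElement_handler

end module collect_text

program XMLtext
  use FoX_sax
  use collect_text
  implicit none
  type(xml_t) :: xp

  call open_xml_file(xp, 'input.xml')
  call parse(xp, startElement_handler=startElement_handler, &
                 characters_handler=characters_handler,     &
                 endElement_handler=endElement_handler)
  call close_xml_t(xp)
end program XMLtext
```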

endDocument_handler

  subroutine endDocument_handler()
  end subroutine endDocument_handler

Triggered when the parser reaches the end of the document.

endElement_handler

  subroutine endElement_handler(namespaceURI, localName, name)
    character(len=*), intent(in) :: namespaceURI
    character(len=*), intent(in) :: localName
    character(len=*), intent(in) :: name
  end subroutine endElement_handler

Triggered by a closing tag.

endPrefixMapping_handler

  subroutine endPrefixMapping_handler(prefix)
    character(len=*), intent(in) :: prefix
  end subroutine endPrefixMapping_handler

Triggered when a namespace prefix mapping goes out of scope.

ignorableWhitespace_handler

  subroutine ignorableWhitespace_handler(chars)
    character(len=*), intent(in) :: chars
  end subroutine ignorableWhitespace_handler

Triggered when whitespace is encountered within an element declared as having no PCDATA. (Only active in validating mode.)

processingInstruction_handler

  subroutine processingInstruction_handler(name, content)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: content
  end subroutine processingInstruction_handler

Triggered by a Processing Instruction.

skippedEntity_handler

  subroutine skippedEntity_handler(name)
    character(len=*), intent(in) :: name
  end subroutine skippedEntity_handler

Triggered when either an external entity, or an undeclared entity, is skipped.

startDocument_handler

  subroutine startDocument_handler()
  end subroutine startDocument_handler

Triggered when the parser starts reading the document.

startElement_handler

  subroutine startElement_handler(namespaceURI, localName, name, attributes)
    character(len=*), intent(in) :: namespaceURI
    character(len=*), intent(in) :: localName
    character(len=*), intent(in) :: name
    type(dictionary_t), intent(in) :: attributes
  end subroutine startElement_handler

Triggered when an opening tag is encountered. (See the attribute dictionaries section above for documentation on handling attribute dictionaries.)

startPrefixMapping_handler

  subroutine startPrefixMapping_handler(namespaceURI, prefix)
    character(len=*), intent(in) :: namespaceURI
    character(len=*), intent(in) :: prefix
  end subroutine startPrefixMapping_handler

Triggered when a namespace prefix mapping comes into scope.

notationDecl_handler

  subroutine notationDecl_handler(name, publicId, systemId)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: publicId
    character(len=*), intent(in) :: systemId
  end subroutine notationDecl_handler

Triggered when a NOTATION declaration is made in the DTD.

unparsedEntityDecl_handler

  subroutine unparsedEntityDecl_handler(name, publicId, systemId, notation)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: publicId
    character(len=*), intent(in) :: systemId
    character(len=*), intent(in) :: notation
  end subroutine unparsedEntityDecl_handler

Triggered when an unparsed entity is declared.

error_handler

  subroutine error_handler(msg)
    character(len=*), intent(in) :: msg
  end subroutine error_handler

Triggered when an error is encountered in parsing. Parsing will continue after this event.

fatalError_handler

  subroutine fatalError_handler(msg)
    character(len=*), intent(in) :: msg
  end subroutine fatalError_handler

Triggered when a fatal error is encountered in parsing. Parsing will cease after this event.

warning_handler

  subroutine warning_handler(msg)
    character(len=*), intent(in) :: msg
  end subroutine warning_handler

Triggered when a parser warning is generated. Parsing will continue after this event.

attributeDecl_handler

  subroutine attributeDecl_handler(eName, aName, type, mode, value)
    character(len=*), intent(in) :: eName
    character(len=*), intent(in) :: aName
    character(len=*), intent(in) :: type
    character(len=*), intent(in) :: mode
    character(len=*), intent(in) :: value
  end subroutine attributeDecl_handler

Triggered when an attribute declaration is encountered in the DTD.

elementDecl_handler

  subroutine elementDecl_handler(name, model)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: model
  end subroutine elementDecl_handler

Triggered when an element declaration is encountered in the DTD.

externalEntityDecl_handler

  subroutine externalEntityDecl_handler(name, publicId, systemId)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: publicId
    character(len=*), intent(in) :: systemId
  end subroutine externalEntityDecl_handler

Triggered when a parsed external entity is declared in the DTD.

internalEntityDecl_handler

  subroutine internalEntityDecl_handler(name, value)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: value
  end subroutine internalEntityDecl_handler

Triggered when an internal entity is declared in the DTD.

comment_handler

  subroutine comment_handler(comment)
    character(len=*), intent(in) :: comment
  end subroutine comment_handler

Triggered when a comment is encountered.

endCdata_handler

  subroutine endCdata_handler()
  end subroutine endCdata_handler

Triggered by the end of a CData section.

endDTD_handler

  subroutine endDTD_handler()
  end subroutine endDTD_handler

Triggered by the end of a DTD.

endEntity_handler

  subroutine endEntity_handler(name)
    character(len=*), intent(in) :: name
  end subroutine endEntity_handler

Triggered at the end of entity expansion.

startCdata_handler

  subroutine startCdata_handler()
  end subroutine startCdata_handler

Triggered by the start of a CData section.

startDTD_handler

  subroutine startDTD_handler(name, publicId, systemId)
    character(len=*), intent(in) :: name
    character(len=*), intent(in) :: publicId
    character(len=*), intent(in) :: systemId
  end subroutine startDTD_handler

Triggered by the start of a DTD section.

startEntity_handler

  subroutine startEntity_handler(name)
    character(len=*), intent(in) :: name
  end subroutine startEntity_handler

Triggered by the start of entity expansion.

Exceptions.

The FoX SAX implementation implements all of XML 1.0 and 1.1; all of XML Namespaces 1.0 and 1.1; xml:id and xml:base.

Although FoX tries very hard to work to the letter of the XML and SAX standards, it falls short in a few areas.

FoX will only process documents consisting of nothing but US-ASCII data. It will accept documents labelled with any single byte character set which is identical to US-ASCII in its lower 7 bits (for example, any of the ISO-8859 charsets, or UTF-8) but an error will be generated as soon as any character outside US-ASCII is encountered. (This includes non-ASCII characters present only by character entity reference.)
As a corollary, UTF-16 documents of any endianness will also be rejected.

(It is impossible to implement IO of non-ASCII documents in a portable fashion using standard Fortran 95, and it is impossible to handle non-ASCII data internally using standard Fortran strings. A fully unicode-capable FoX version is under development, but requires Fortran 2003. Please enquire for further details if you're interested.)

FoX has no network capabilities. Therefore, when external entities are referenced, any entities not available on the local filesystem will not be accessed. (Specifically, any entities whose URI reference includes a scheme component, where that scheme is not file, will be skipped.)

Beyond this, any aspects of the listed XML standards to which FoX fails to do justice are bugs.

What of Java SAX 2 is not included in FoX?

The differences between Java and Fortran mean that none of the SAX APIs can be copied directly. However, FoX offers data types, subroutines, and interfaces covering most of the facilities offered by SAX. Where it does not, this is mentioned here.

org.xml.sax:

Querying/setting of feature flags/property values for the XML parser. The effect of a subset of these may be accessed by options to the parse subroutine.
XML filters - Java SAX makes it possible to write filters to intercept the flow of events. FoX does not support this.
Entity resolution - SAX 2 exports an interface to the application for entity resolution, but FoX does not - all entities are resolved within the parser.
Locator - SAX 2 offers an interface to export information regarding object locations within the document; FoX does not.
XMLReader - FoX only offers the parse() method - no other methods really make sense in Fortran.
AttributeList/DocumentHandler/Parser - FoX only offers namespace aware attributes, not the pre-namespace SAX-1 versions.

org.xml.sax.ext:

EntityResolver2 - not implemented
Locator2 - not implemented

org.xml.sax.helpers:

None of these helper methods are implemented.