Introduction
It is often useful to process many documents in a single batch.
This example describes a pattern for batch processing. It can be
adapted to more sophisticated tasks, such as batch conversion of a website from HTML to XHTML.
Specification
This example batch process recursively finds all system.xml files below a given directory. For each file found, it
counts the number of elements the file contains. The results are presented in an HTML table. This assumes that the process
is started by a web-browser request, but it could equally be initiated through the command-line transport, email or any other
available transport.
Overview
Below is a dpml batch process. It contains detailed comments that will help you adapt it to your needs.
Here we summarize its three stages and suggest ways that the pattern can be modified.
Prepare resource list
Any batch process requires a list of resources to process. In the example we use the fls accessor to provide
a document that lists the system.xml files below a root directory. You can change the root and filter to suit your system. Alternatively,
you could supply a hand-crafted source document with the URIs of the files you want to process as elements. Note that since NetKernel provides
a URI resolver infrastructure, you can source resources from anywhere, not just the local filesystem. For example, you could use this pattern
as the basis for a web-bot.
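As a sketch of the hand-crafted alternative, the fls instruction in the listing could be replaced by a copy instruction that places a literal document in var:fls. This assumes, as the loop's /descendant::uri[1] expression suggests, that only the uri elements matter to the rest of the process. The resources root element name and both URIs are hypothetical placeholders.

```xml
<comment>
  Hand-crafted resource list.
  The root element name and the URIs are hypothetical,
  replace them with the resources you want to process.
</comment>
<instr>
  <type>copy</type>
  <operand>
    <resources>
      <uri>file:///path/to/first/system.xml</uri>
      <uri>file:///path/to/second/system.xml</uri>
    </resources>
  </operand>
  <target>var:fls</target>
</instr>
```

The copy instruction with a literal operand follows the same form the listing uses to prepare var:results.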
Iterate over resource list
The example process iterates over the resource list generated by fls. Each URI element in the source document is used as the target resource
in the inner batch process. In our example we perform an xquery operation to count the elements. We've used an XML process
by way of example, but you could do anything at all here; your resource could even be non-XML. A useful example is to use
XHTMLTidy to batch convert HTML files to valid XHTML files.
In our example we do not change the target resource. You could just as easily write the results of your process back to the
target resource URI, but take care, since this is permanent and unrecoverable. It's usually a good idea to first execute a test process that
simply logs the result, to make sure your process is working as you expected. Once everything works you can add the URI writeback.
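A minimal sketch of such a test step, reusing the log instruction that the listing already applies to var:uri. Placed directly after the xquery instruction, it would log each var:result fragment so you can inspect it before adding any writeback. That log handles var:result in the same way as var:uri is an assumption.

```xml
<comment>
  Test step: log the per-file result instead of writing it back.
  Assumes log accepts var:result just as it accepts var:uri.
</comment>
<instr>
  <type>log</type>
  <operand>var:result</operand>
</instr>
```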
Show results
In our example we accumulate the results of each process in a variable. The result could be used for more extensive reporting,
including any exceptions that might have occurred. To keep our example simple we've only provided basic exception processing.
Finally, the results are styled and presented as an HTML table. You could instead write them to a file or use them as the start of another process.
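For orientation, the accumulated var:results document should have roughly the following shape after two iterations, judging from the fragment the xquery returns and the result/uri and result/count selections in the stylesheet. The paths and counts are illustrative placeholders, not real output.

```xml
<!-- Illustrative shape only: URIs and counts are made up -->
<results>
  <result>
    <uri>file:///path/to/first/system.xml</uri>
    <count>12</count>
  </result>
  <result>
    <uri>file:///path/to/second/system.xml</uri>
    <count>7</count>
  </result>
</results>
```

Any reporting stage you add in place of the HTML table can rely on one result element per processed file.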
Executing this process
To execute this dpml process we need to create a host module.
- Use the new module wizard to create and install a new module. Choose the default settings,
ensuring that your module supports dpml. Make sure you choose to import the module into the Front end fulcrum; this will make it available on
localhost port 8080 by default. Your new module will be located in
<install>/modules/your_module_name/.
- The example process uses the fls accessor supplied by the ext_sys module and the xquery accessor
supplied by ext_xquery. These modules must be imported into your module by
adding the following two imports to the mapping section of the module.xml definition located in the root directory of your module.
<mapping>
...Existing Imports...
<import>
<uri>urn:org:ten60:netkernel:ext:sys</uri>
</import>
<import>
<uri>urn:org:ten60:netkernel:ext:xquery</uri>
</import>
</mapping>
You must now do a cold restart to pick up the module changes.
- Finally, copy the batch process listing below to a file batch.idoc in the resources/ directory of your module. You should edit the
fls instruction to point to a different root directory and, if you wish, change the regex filter to match different file names.
- You can start the process by requesting the URI http://localhost:8080/batch.idoc with a web browser.
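As an illustration of the edit described in the third step, the fls instruction might be changed as follows to list module.xml files instead. The root path is a hypothetical placeholder, and the filter assumes, as the text above states, that it is treated as a regular expression.

```xml
<comment>
  Edited fls instruction.
  The root is a hypothetical placeholder, point it at a
  directory on your own filesystem.
  The filter escapes the dot so that it only matches
  names ending in module.xml.
</comment>
<instr>
  <type>fls</type>
  <operator>
    <fls>
      <root>file:///path/to/your/root/</root>
      <filter>.*module\.xml</filter>
      <recursive />
      <uri />
    </fls>
  </operator>
  <target>var:fls</target>
</instr>
```

The rest of the listing can stay unchanged, since it only depends on the uri elements in var:fls.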
Deadlock Detector Exception
Searching the filesystem can take a long time, depending on how deep your filesystem tree is. You may encounter a NetKernel Deadlock Detector exception
if the fls search takes a very long time. This is thrown because the Kernel monitors all scheduled requests, and if no response
is received after a set interval the Kernel kills the request and issues an Error. In a web application this is very valuable, but
it can be unhelpful for batch processing! You can increase the deadlock
detection period here.
<idoc>
  <seq>
    <comment>
      ******************************************
      A Batch Processing Pattern.
      This example finds all system.xml documents and
      counts the number of elements they contain. You
      can adapt it to suit your needs.
      ******************************************
    </comment>
    <comment>
      ***********
      Use File LS accessor to list files.
      o Modify the root for your filesystem
      o Modify the filter regex to target other XML files
      The result is a tree of matching resources each with a
      uri element containing the URI of the resource. We'll
      use this as the source for the batch process.
      ***********
    </comment>
    <instr>
      <type>fls</type>
      <operator>
        <fls>
          <root>file:///home/pjr/dev/</root>
          <filter>.*system.xml</filter>
          <recursive />
          <uri />
        </fls>
      </operator>
      <target>var:fls</target>
    </instr>
    <comment>
      *************
      Prepare a results document
      **************
    </comment>
    <instr>
      <type>copy</type>
      <operand>
        <results />
      </operand>
      <target>var:results</target>
    </instr>
    <comment>
      ***********
      Start batch processing loop
      ***********
    </comment>
    <while>
      <comment>
        ***********
        Loop condition - do processing sequence while there's
        a file URI left to process
        ***********
      </comment>
      <cond>
        <instr>
          <type>xpatheval</type>
          <operand>var:fls</operand>
          <operator>
            <xpath>/descendant::uri[1]</xpath>
          </operator>
          <target>this:cond</target>
        </instr>
      </cond>
      <seq>
        <comment>
          ***********
          Copy the URI fragment to a variable
          and log it to show progress
          ***********
        </comment>
        <instr>
          <type>copy</type>
          <operand>var:fls#xpointer(/descendant::uri[1])</operand>
          <target>var:uri</target>
        </instr>
        <instr>
          <type>log</type>
          <operand>var:uri</operand>
        </instr>
        <comment>
          ***********
          Main Process - We could do anything we liked here
          including executing another dpml process or modifying
          the target file in some way. Here we simply count
          the elements in the file.
          ***********
        </comment>
        <instr>
          <type>xquery</type>
          <operator>
            <xquery>
              (:
              *********
              Declare the external URI variable and
              extract the file URI to $file variable
              *********
              :)
              declare variable $uri as node() external;
              declare variable $file {$uri/uri/text()};
              (:
              *******
              Return a fragment:
              Quote back the URI fragment and add a
              count element with the number of
              elements contained in the target document
              *******
              :)
              <result>
                {$uri}
                <count>
                  {count(doc($file)/descendant::*)}
                </count>
              </result>
            </xquery>
          </operator>
          <uri>var:uri</uri>
          <target>var:result</target>
        </instr>
        <comment>
          ***********
          Append the xquery result to our cumulative
          var:results document
          ***********
        </comment>
        <instr>
          <type>stm</type>
          <operand>var:results</operand>
          <operator>
            <stm:group xmlns:stm="http://1060.org/stm">
              <stm:append xpath="/results">
                <stm:param xpath="/result:sequence/result:element/result" />
              </stm:append>
            </stm:group>
          </operator>
          <param>var:result</param>
          <target>var:results</target>
        </instr>
        <comment>
          **********
          Exception: Catch any processing exceptions...
          **********
        </comment>
        <exception>
          <comment>
            **********
            Since this is a simple example we'll just log the exception,
            you can add more extensive error handling for
            your process if required...
            **********
          </comment>
          <instr>
            <type>log</type>
            <operand>this:exception</operand>
          </instr>
        </exception>
        <comment>
          ***********
          Remove the first URI from the file listing before starting
          next iteration of the loop. If this isn't done we'll have an
          infinite loop!!!
          ***********
        </comment>
        <instr>
          <type>stm</type>
          <operand>var:fls</operand>
          <operator>
            <stm:group xmlns:stm="http://1060.org/stm">
              <stm:delete xpath="/descendant::uri[1]" />
            </stm:group>
          </operator>
          <target>var:fls</target>
        </instr>
      </seq>
    </while>
    <comment>
      ***********
      All done. Style the results for presentation.
      ***********
    </comment>
    <instr>
      <type>xslt</type>
      <operand>var:results</operand>
      <operator>
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
          <xsl:output method="html" />
          <xsl:template match="/results">
            <html>
              <body>
                <h1>Batch Results</h1>
                <table>
                  <tr bgcolor="#aaaaaa">
                    <td>file</td>
                    <td>elements</td>
                  </tr>
                  <xsl:for-each select="result">
                    <tr>
                      <td>
                        <xsl:value-of select="uri" />
                      </td>
                      <td>
                        <xsl:value-of select="count" />
                      </td>
                    </tr>
                  </xsl:for-each>
                </table>
              </body>
            </html>
          </xsl:template>
        </xsl:stylesheet>
      </operator>
      <target>this:response</target>
    </instr>
  </seq>
</idoc>