Phil Windley blogs a talk by Mike Kruckenberg:
Mike specifically concerned with translating documents for web and print (namely PDF). They created a document standard with a Schema and developed templates for XML authoring application to make creating the documents easy. They created an customized XML authoring environment from an off the shelf tool that was essentially the destination for any conversion process. They also provided an online tool for people who didn’t have access to the authoring tool.
Existing HTML documents were cleaned up with Tidy and then a homegrown tool translated the cleaned-up HTML to XML. Once the XML was valid, the XML document was put into the database. For MS WOrd document, they tried a bunch of things, wvWare, saving as HTML, saving as RTF, and third party stand alone tools. They’re looking forward to WordML. PowerPoint is a big tool for faculty, so it had to be easy to convert to an XML document. For PowerPoint, they have a service which create an XML document from the text and save JPEGs out and wraps everything up in XML.
The transform is done using the libxml2 and libxslt libraries fro Gnome because the have good performance and command line and Perl interfaces. xmllint validates XML against a DTD. xsltproc renders XML as HTML.
Just rendering HTML isn’t the goal however. The goal is to render HTML and PDF for print. Mike and his team used FOP and XSL:FO to create PDF.