Development Seed Blog

Word2Web and XSLT_book Module Released

Better Integrating Drupal into Your Existing Editorial Workflow

Serious content-producing organizations have evolved large-scale, specialized processes for their documents. Often they manage revisions and notes in Microsoft Word, then transfer content to PDF and print under strict editorial guidelines. Traditional copy-and-paste web publishing, in contrast, interrupts that existing system. While rethinking a current client’s web workflow, we took an interest in this problem and devised a new solution: two modules, word2web and xslt_book, which together combine the advantages of Word with the power of Drupal. Here is a screencast showing the two new modules.

The first challenge we encountered was the output of Word’s “Export to HTML...” function, which provides only a loose approximation of valid HTML. Fortunately, Word’s output is valid XML, so we were able to clean proprietary tags from the markup using XSL transformations: XML stylesheets that selectively preserve certain elements and reorganize others. At this stage, we could enable word2web and, instead of copying and pasting into a field, users could upload a Word document and have clean HTML generated programmatically.
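To make the cleanup idea concrete, here is a minimal sketch in Python. It is not the word2web code, which relies on XSL stylesheets; it swaps in a plain tree walk to show the same selective-preservation idea. The sample markup, the `KEEP_ATTRS` whitelist, and the `clean` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample of Word-style markup (well-formed XML).
WORD_HTML = """<html xmlns:o="urn:schemas-microsoft-com:office:office">
<body>
<h1 class="MsoHeading1" style="margin:0">Introduction</h1>
<p class="MsoNormal">Hello <b>world</b><o:p></o:p></p>
</body>
</html>"""

# Assumed whitelist of attributes worth keeping; everything else is dropped.
KEEP_ATTRS = {"href", "src", "alt", "name"}


def clean(src):
    """Return a cleaned copy of src, or None if src is a namespaced element."""
    # ElementTree spells namespaced tags as "{uri}local", e.g. Word's <o:p>.
    if src.tag.startswith("{"):
        return None
    attrs = {k: v for k, v in src.attrib.items() if k in KEEP_ATTRS}
    dst = ET.Element(src.tag, attrs)
    dst.text = src.text
    last = None
    for child in src:
        kept = clean(child)
        tail = child.tail or ""
        if kept is not None:
            kept.tail = tail
            dst.append(kept)
            last = kept
        elif last is not None:
            # Keep the text that followed a dropped element.
            last.tail = (last.tail or "") + tail
        else:
            dst.text = (dst.text or "") + tail
    return dst


cleaned = clean(ET.fromstring(WORD_HTML))
print(ET.tostring(cleaned, encoding="unicode"))
```

A real stylesheet would also reorganize elements, which this sketch does not attempt.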

However, that was just the beginning. The client’s legacy site presented documents in a manner much like a Book in Drupal, and we initially considered migrating to that content type. The management of a book node, though, is quite different from the management of this content: translations do not need to be managed page by page, and revisions should refer to the entire text of the document. Thus we created a display module for word2web: xslt_book. It again uses XSL transformations, this time to present the simple HTML of word2web in a book-like format, with a nicely formatted table of contents and convenient paging navigation. In addition, xslt_book parses and reformats the footnotes preserved by word2web.
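As a rough illustration of how a table of contents can be derived from heading elements, here is a small Python sketch. It is an assumption-laden stand-in rather than xslt_book’s XSL: the `HeadingCollector` class and `table_of_contents` function are hypothetical names, and the output is a plain indented list rather than themed markup.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, title) pairs for h1 through h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            title = "".join(self._buf).strip()
            self.headings.append((self._level, title))
            self._level = None


def table_of_contents(html):
    """Return one indented line per heading, two spaces per nesting level."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return ["  " * (level - 1) + title for level, title in parser.headings]


sample = "<h1>Intro</h1><p>x</p><h2>Scope</h2><p>y</p><h1>Methods</h1>"
for line in table_of_contents(sample):
    print(line)
```

The same heading list could also drive paging, for example by starting a new page at each top-level heading.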

Finally, we implemented a feature that brings all of the Word document’s content into Drupal: support for embedded images. After a new xslt_book-enabled node is uploaded, the word2web module searches the markup for referenced images and provides a page of upload controls that direct the user to upload each referenced image. Setup for this function is slightly more complex than for the rest of the modules, and image support is optional in both: to use images you’ll need CCK, ImageField, and a content type named “image” with an ImageField named “imagefile”.
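The image search step can be pictured with a short Python sketch. This is only an assumed approximation of what word2web does: `ImageCollector` and `referenced_images` are hypothetical names, and the sample paths simply follow the filename_files folder convention of Word’s HTML export.

```python
from html.parser import HTMLParser
import posixpath


class ImageCollector(HTMLParser):
    """Record the src of every <img> tag, in document order, without repeats."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are routed here as well by HTMLParser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.sources:
                self.sources.append(src)


def referenced_images(html):
    """Return the file names a user would be asked to upload."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return [posixpath.basename(src) for src in parser.sources]


sample = (
    '<p><img src="report_files/image001.png">'
    '<img src="report_files/image002.jpg"/>'
    '<img src="report_files/image001.png"></p>'
)
print(referenced_images(sample))
```

Each returned name would correspond to one upload control on the follow-up page.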

These two modules not only enable large organizations to integrate Drupal into their workflow, but also create an opportunity for more progress in content management. The two are loosely coupled: other modules can easily format input to be handled by xslt_book, and others still can display the XML data differently. The possibilities are broad.

You can download word2web on Drupal.org, as well as xslt_book.

Word Formatting Outline:

  • Use the styled headers available in Word for sections: xslt_book works best with headers up to six levels deep. These are in the “Styles” panel on Mac, and on the toolbar in Word 2003.
  • Use the “Save as HTML” option to save your document, and don’t select “filtered” on Windows or “Save only display information into HTML” on Mac.
  • The exported data will have a folder called filename_files, which will include all images originally embedded in the .doc file

Patches are always welcome.

Comments
upload as zip

Looks great. I just have one note:

It would be great to let people upload a tgz/zip, identify the doc file, and automatically attach the images. PEAR/Tar.php can easily uncompress a tgz on the server side. That would make uploading with images much more seamless :)

Hey Gábor - that would be a great way to take out a kink in the process (especially when these documents have tens of images). I think that in developing it, we initially thought that it might be too much to expect all content editors to be able to zip up their documents, but I think a good portion can. I'll try it out - we'll definitely get an updated public version out by the end of this project!
