
Web and print exist as two solitudes; printed web pages often disappoint, while converting print documents into good Web pages is hard work. A wiki makes it easy for authors to create rich web content, but is little help if readers wish to print the result. Wikipublisher lets readers turn wiki pages or page collections into print with a couple of clicks, with a quality better than 99.99% of word processing documents. This dramatically lowers the time and cost of creating online and print versions of the same content, with no loss of quality in either medium. Using Wikipublisher, a reader can turn wiki content into a letter, an article, a report, or a book.

Fast, cheap, or good. Which 2 do you want? — Project Management 101

Introduction

Is a word processor still necessary if you have a wiki?

Using a wiki is a great convenience for collaborative authoring, making it easy for authors to work together wherever they may be, as long as they have access to a web browser. This is fine if the result you want is web pages, but what if readers wish to print these? Most people reading more than a page of text will print it; for example, [2] describes a number of techniques for making printed web pages more usable. Nevertheless, reading a printed web page is usually a disappointing experience; we have been conditioned to have low expectations of printing from the Web. Even if the printed result is “good enough” we can still only print one page at a time, unless the site has deliberately created multi-page articles with a combined “printable view” of the content.

Wikipublisher changes this, by giving wiki administrators a way to turn individual pages or page collections into a document suitable for printing. By using LATEX as its typesetting engine, it produces printed output of the highest possible quality, superior to anything that can be achieved using CSS or indeed with most word processors. By using the wiki source file for both Web and print, teams maintain one authoritative version of the source.

The project came about because Affinity Limited, an information management consultancy based in Wellington, New Zealand, started using wiki software as a lightweight way to improve communication and collaboration with its clients. The most common feedback we received was, “This is great, but I need to print the page and give it to my manager, and it looks terrible. Can you convert it to Microsoft Word, please?” Automated conversion to a printable form seemed like a better idea and Wikipublisher is the result. The project was conceived in 2004 and we released the first version of the Wikipublisher system in 2005.

In this experience report, Wikipublisher Design describes the Wikipublisher system including the architecture and implementation. Wikipublisher Features gives an overview of the features we think others will find particularly interesting. Discussion discusses the strengths and weaknesses of our solution, what some users have had to say about the system, some of the challenges we had to overcome and the surprises we encountered. Related Work presents related work and Future Work lists three wishes for the future of Wikipublisher. Finally, Conclusions draws some conclusions. Naturally, the authors collaborating to write this paper used Wikipublisher to do so. The project home page is at http://www.wikipublisher.org/.

Wikipublisher Design

Wikipublisher is a solution for printing high quality documents from wiki pages. The idea is to have one web-based authoritative source using the features a wiki engine is capable of and then have the option to print the content of the wiki page(s) on demand. We now describe the architecture and implementation of Wikipublisher and illustrate the features of the system.

Architecture

Architecture shows the architecture of Wikipublisher. The core architectural decision was to treat generating web pages and generating print pages as separate services. This means that one print server can potentially support many web page servers — printing is in most cases a low volume activity compared to browsing, so it is inappropriate to burden the web page server with print duties. We then need to create a print API that lets the web server expose its content in a way that the print server can process. As a result, the print server can work with any web content management system able to support the print API. It also promotes a more rigorous separation of the underlying content from its presentation in different media, making a wiki an ideal content server.

Architecture
Wikipublisher Architecture

Using a web browser, a reader submits a form to the Wikibook PDF server which says, “If you issue this http request, you will receive a stream of Wikibook XML; convert it to LATEX and PDF, then give me back the result.” The wiki administrator has taught the wiki server that, “If you receive an http request in this format, forget everything you know about wiki to HTML and do wiki to XML instead.” This means the wiki server doesn’t need to know how to print (but does need to know where to get a printing service) and the Wikibook PDF server doesn’t need to know wiki markup. To close the loop, the wiki server needs to give the reader a form to, “Tell the Wikibook server (at this address) to issue this http request.”
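As a rough illustration of this exchange, the sketch below builds the request a reader's form might submit to the PDF server. The host names and the parameter names `action=xml`, `source` and `type` are assumptions made for the example; they are not taken from the Wikipublisher software.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_print_request(pdf_server: str, wiki_page_url: str, doc_type: str = "article") -> str:
    """Build the URL that a reader's print form would submit to the PDF server."""
    # Assumption: the wiki switches from HTML to XML output when it sees
    # "action=xml" (a hypothetical parameter name).
    separator = "&" if "?" in wiki_page_url else "?"
    xml_source = f"{wiki_page_url}{separator}action=xml"
    # Assumption: the PDF server reads the XML source URL and the document
    # type from "source" and "type" (hypothetical parameter names).
    query = urlencode({"source": xml_source, "type": doc_type})
    return f"{pdf_server}?{query}"


request_url = build_print_request(
    "http://pdf.example.org/wikibook",
    "http://wiki.example.org/Main/HomePage",
)
# The PDF server would decode the query and fetch the XML from "source".
params = parse_qs(urlsplit(request_url).query)
print(params["source"][0])
```

The point of the design is visible in the sketch: the wiki only has to hand out a URL, and the PDF server only has to fetch whatever that URL returns.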

Implementation

There are several ways to convert wiki-sourced content into a printable form. The Wikipublisher project set a number of design goals (in the form of constraints) which led us to choose the architecture we used:

  1. the quality must be at least as good as that produced from a desk-top publishing package
  2. an author must be able to use regular wiki markup and let the publishing engine interpret it for Web or print
  3. a reader must be able to generate a print version of any page collection, as well as any individual page

We chose LATEX as the typesetting system after eliminating the other candidate: XSL FO. At the time we looked (December 2004), we could not find any books on FO that had been published using FO. On the other hand, all the books on LATEX were published using LATEX. It seemed to us that choosing FO would introduce unnecessary risk and that the risk would make its presence felt as unexpected costs or worse, insurmountable implementation problems. On the other hand, LATEX appeared to do everything we could think of. And so it has proved.

We then had to decide which markup to work from for typesetting and where in the pipeline to use LATEX:

  1. Do we work from wiki markup or from the HTML generated from wiki markup?
  2. Do we translate markup directly into LATEX or into an intermediate form first?

Implementation shows the tool suite approach adopted for Wikipublisher. Wiki markup is translated into an intermediate print-oriented XML form, and then transformed into LATEX. The reasons were largely pragmatic — we built on top of things that already worked. The tbook system is a free software project for converting XML documents into LATEX using XSLT, so if we could convert wiki markup into XML, we could use tbook to typeset it. The PmWiki project is a markup agnostic wiki engine (almost), which lets a site administrator redefine the markup translation rules.

Implementation
Wikipublisher Implementation Tool Suite

So we wrote a plug-in for PmWiki that replaces all the wiki to HTML translation rules with wiki to XML rules. Of course, we found that the wiki markup had rules for which there were no equivalents in the tbook DTD and hence no XML to LATEX translations. We added a range of extensions to the tbook DTD, style files and XSLT, and called the resulting XML to LATEX conversion service Wikibook. The plug-in also provides a “print metadata manager” which lets authors and readers customise the way the print output is presented, by passing configuration parameters to the Wikibook PDF server.

The XML generated from wiki markup needs to be a valid Wikibook document. This means the design needs to provide a way of removing any HTML litter, for example from markup rules that the plug-in does not support. We do this by adding a tbook namespace qualifier to all the wiki to XML translation rules. This means any unqualified tags in the output must be residual HTML, so we remove these during a post-evaluation stage, then we remove the tbook namespace qualifier to produce a document in Wikibook XML.
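The two-pass clean-up can be sketched as follows. This is a minimal illustration using regular expressions, assuming the qualifier is written as a `tbook:` prefix on tag names; the real plug-in is a PmWiki extension and may work quite differently. A regex approach like this would also stumble on attribute values that contain `>`.

```python
import re

NS = "tbook"  # assumed spelling of the namespace qualifier


def strip_html_litter(xml: str, ns: str = NS) -> str:
    """Remove unqualified (residual HTML) tags, then drop the qualifier."""
    # Pass 1: delete every opening, closing or self-closing tag whose name
    # does not carry the namespace prefix. Text content is left in place.
    unqualified_tag = re.compile(rf"</?(?!{ns}:)[A-Za-z][^<>]*>")
    cleaned = unqualified_tag.sub("", xml)
    # Pass 2: remove the prefix from the tags that survived.
    return re.sub(rf"(</?){ns}:", r"\1", cleaned)


sample = "<tbook:p>Hello <b>bold</b> <tbook:em>world</tbook:em><br/></tbook:p>"
print(strip_html_litter(sample))
```

Qualifying the generated tags first is what makes the second step safe: anything unqualified can be discarded without having to list every HTML element that might leak through.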

Readers of print documents expect that they will follow the design patterns developed and refined over the centuries since the invention of the printing press. It is not enough just to take the content and wrap it in a print-oriented layout template; there are other, more subtle conventions that we need to apply to the content itself. For example:

  • smart quotes — Wikipublisher treats straight quote marks as markup characters and “smartens” them into the equivalent HTML entities (including exceptions such as ’phone and prime marks such as 6′ 30″)
  • smart hyphens — it recognises em dash and en dash markup — such as 1–10 and Wellington–Picton — and again translates these into the equivalent entities
  • smart separator — a horizontal rule offers an opportunity for a small typographic flourish

These are examples of the project’s informal motto: the better Wikipublisher does its job, the less people notice it.
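A simplified sketch of the quote and dash conventions is shown below. It assumes `--` and `---` as the en dash and em dash markup and emits Unicode characters rather than HTML entities; both choices are assumptions for illustration. It also ignores the exceptions mentioned above: a leading apostrophe as in ’phone would be treated as an opening quote, and prime marks are not handled.

```python
import re


def smarten(text: str) -> str:
    """Convert straight quotes and dash markup to typographic characters."""
    # Dashes: handle the longer markup first so "---" is not read as "--" + "-".
    text = text.replace("---", "\u2014").replace("--", "\u2013")
    # A double quote at the start of the text, or after whitespace or an
    # opening bracket, opens a quotation; any remaining one closes it.
    text = re.sub(r'(^|[\s(\[])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")
    # Single quotes follow the same rule; the remainder are closing quotes
    # or apostrophes, which share the same character.
    text = re.sub(r"(^|[\s(\[])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text


print(smarten('"Hello," she said -- it\'s pages 1--10.'))
```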


We next examine some of the Wikipublisher features.

Wikipublisher Features

== need more detail here and have some nice groupings such as Chapter 1 (Capabilities) from the Wikipublisher user guide

Comprehensive feature set

The Wikipublisher system handles headings, text, links, lists, tables, images, cross-references, equations, footnotes, and citations (among other markup rules). Output can be a letter, an article, a report, or a book. A reader can customise the output on demand, such as change the paper size, change the fonts, generate a table of contents (including an optional “minitoc” for each chapter of a book), set images with “side captions”, or request a cover page for an article. If a page contains a low resolution “thumbnail” image, linked to a high resolution version, Wikipublisher will automatically use the high resolution version, and scale it to fit the space. It will automatically work out the width of table columns, it will split long tables across pages, and can rotate wide tables and images.

This page is itself a good illustration of the range of features supported.


Book Publishing Options

Structure determines presentation

In turning wiki input into Wikibook XML output, we go to great lengths to establish a document hierarchy (chapter, section, subsection, and so on). We use lists of page names to publish a collection and let the list level define each page’s place in the output hierarchy. Then we analyse the headings on each page and slot these into the next available level. Suppose pages A and B are both chapters in a book. The author of page A might have used heading 1 and heading 3 markup, while the author of page B used heading 2 and heading 4 markup. Wikipublisher will map both pages’ headings into sections and subsections. Of course, page B could also be defined as a section in a different book; in this case, its headings become subsections and subsubsections. Page A could be published separately as an article; in this case, its headings form the sections and subsections.
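The page A and page B example can be worked through in a few lines. The sketch assumes that "next available level" means ranking the distinct heading levels used on a page and placing them directly below the page's own level; the function and level names are hypothetical, not those of the plug-in.

```python
LEVELS = ["chapter", "section", "subsection", "subsubsection", "paragraph"]


def map_headings(page_level: int, heading_levels: list[int]) -> list[str]:
    """Map a page's wiki heading levels onto the document hierarchy.

    page_level is the index in LEVELS of the page itself (0 for a chapter,
    1 for a section); headings are slotted into the levels below it.
    """
    # Rank the distinct heading levels actually used on the page, so that
    # headings 1 and 3 behave the same as headings 2 and 4.
    rank = {level: i for i, level in enumerate(sorted(set(heading_levels)))}
    deepest = len(LEVELS) - 1
    return [LEVELS[min(page_level + 1 + rank[h], deepest)] for h in heading_levels]


print(map_headings(0, [1, 3, 3, 1]))  # page A as a chapter
print(map_headings(0, [2, 4]))        # page B as a chapter
print(map_headings(1, [2, 4]))        # page B as a section of another book
```

Publishing page A as an article corresponds to the first call as well, since its headings again become sections and subsections.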

Wikipublisher recognises markup for many semantically distinct types of links:

  • links to wiki pages on the same Web site
  • links to external Web pages
  • cross-references to another section of the same document
  • footnotes
  • references to citations
  • references to equations and “floats” (figures, tables and floating blocks)

References to citations work the wiki way. If the citation exists, the reference links to it; if not, the author sees an Edit link which, when clicked, opens an edit form to define the citation elements. Writers can choose between numerical and Author–Year reference styles. LATEX is of course very good at processing bibliography data. However, the Wikibook XML generates all the references and citations fully formed, so the Wikibook PDF server has to teach LATEX how to handle bibliographies and just print what Wikipublisher gives it, rather than using BIBTEX. The one thing that LATEX has to remember (or rather, Wikipublisher reminds it) is that if hyperref’s colorlinks option is turned on, references to citations are presented in the right colour. Those familiar with the natbib package may wish to note that Wikipublisher’s reference markup includes many options to control the display of the text linking the reference to the citation.

Links to external Web pages often have long URLs. We automatically make all punctuation characters discretionary, thereby allowing a long URL to split over 2 or more lines if necessary. To avoid possible ambiguity, we put the punctuation character after the line break. We print the link text in the body and the URL in a footnote.
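One way to get this effect in LATEX is to place a break opportunity before each punctuation character, so that the character starts the new line if a break is taken. The sketch below uses the standard `\allowbreak` command for that purpose; this is a stand-in chosen for illustration, not necessarily the mechanism Wikibook uses, and the escaping table covers only a few common characters.

```python
# Minimal escaping for characters that are special in LaTeX (not exhaustive).
LATEX_SPECIALS = {
    "%": r"\%",
    "#": r"\#",
    "&": r"\&",
    "_": r"\_",
    "~": r"\textasciitilde{}",
}


def breakable_url(url: str) -> str:
    """Return LaTeX source for a URL that may break before any punctuation."""
    pieces = []
    for ch in url:
        escaped = LATEX_SPECIALS.get(ch, ch)
        if ch.isalnum():
            pieces.append(escaped)
        else:
            # The break point precedes the punctuation, so the character
            # appears after the line break, as described above.
            pieces.append(r"\allowbreak{}" + escaped)
    return "".join(pieces)


print(breakable_url("http://a.org/x_y"))
```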

Typesetting conventions

When converting web pages to print, the typesetting engine automatically applies standard conventions for printed material. For a given input, it optimises the quality of the printed output and applies the rules of typesetting consistently to every page. This means authors can focus on content, rather than presentation. It also means authors do not need to be typesetting experts to produce professional-looking printed documents from their web page collections. The following are among the more common conventions used:

  • captions are placed above tables and below figures
  • images and tables “float” to the top of the next page if there is insufficient room; the text following flows back around the floated object
  • captions for images floated left or right on the web are on the right of the image on recto pages and the left on verso pages

While the typesetting engine maximises quality for a given input, authors may find the output unsatisfactory in some cases. The generally recommended solution is to adjust the input, rather than trying to change the output rules. This is because:

  • unless we know the rules, we can’t make an informed decision about when to break them — most people are not typesetting experts
  • in a collaborative authoring environment, it is almost impossible to teach everyone the conventions
  • computers are much better than people at applying conventions consistently

Those who are used to controlling the look of their outputs have to learn to relax, go with the flow and resist the temptation to fiddle. Wikipublisher’s developers do not claim to be typesetting experts; the pages look the way they do because that is the way LATEX composes them. Authors coming from a word processing background may take some time to get used to this approach.

Discussion

== need to say something about the history of the project and the motivation which includes the SSC for e-govt standards

In the past five years, 340 wiki sites have registered to use the Wikipublisher system, whether publicly available or behind a firewall. The PDF Server has been downloaded 680 times in the past three and a half years, an average of about 15 per month. Most users have found the Wikipublisher system through the PublishPDF extension.

Some of our users have had the following to say about Wikipublisher:

This is amazing. You guys have quite possibly revolutionized the usage of a wiki and what it means to collaboratively create and maintain documents. Major, major kudos to you. — Krista Stellar
May I say that I’m very impressed with PublishPDF, especially your instructions on getting the server-side up and running — that was easy. … It seems to have the edge on other variants with its intelligent mining of information (e.g. Wiki trail) and the high quality visualisation enhancing readability. — Steve Crisp

Finally even the inventor of wikis had this to say:

This is very cool. — Ward Cunningham

not sure where this next paragraph goes but is definitely discussion, previously it was the last sentence in conclusion

However, the real test will come when people start to apply the Wikipublisher engine to typeset and publish a wide range of existing online content, rather than using it to support creating and publishing new content.

In producing print documents, most people are accustomed to making a trade-off between the convenience of a word processor and the quality of a desk-top publishing system. Most choose convenience, with the unfortunate result that typographic mediocrity has become entrenched in our culture. One of the big reasons for the popularity of wikis is their convenience. Wikipublisher lets us combine the convenience of a wiki with the typesetting quality of the finest desk-top publishing software. Because the system embeds good typesetting practices in the software, the quality comes free. The Wikipublisher system breaks the first law of project management, delivering fast, cheap, and good.

Over the last few years, we and our users have learnt many things about document layout practice. There is always room for improvement and always more to learn. By making typesetting a shared web-based service, this knowledge automatically becomes available to everyone who uses the Wikipublisher system. The rising tide of knowledge truly lifts all boats. This cannot happen when typesetting is a personal computer service. Even if, for example, we improve a document template, we have to distribute it to everyone and it does nothing for our existing documents. On the other hand, because Wikipublisher forces a separation between content in wiki markup and its presentation either for the Web or print, improving the print presentation automatically upgrades the layout of all content everywhere, the next time it’s accessed.

For example, writers no longer need to know that the convention for use of footnotes is that the footnote reference character appears after a punctuation mark such as a comma or a full stop. Wikipublisher just makes sure it happens. If a reader chooses to publish a page using vertical spaces between paragraphs (rather than indents), Wikipublisher will automatically change the formatting of footnotes to hang the number into the left margin.
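The footnote convention can be illustrated with a small text transformation. The sketch assumes footnotes are written inline as `[^footnote text^]`, which is an assumption about the markup made for this example, and it only shows the reordering step, not how Wikipublisher actually implements it.

```python
import re

# A footnote marker immediately followed by a punctuation mark. The inner
# pattern stops at the first "^]" so one match never spans two footnotes.
FOOTNOTE_THEN_PUNCT = re.compile(r"(\[\^(?:(?!\^\]).)*\^\])([,.;:!?])")


def place_footnotes(text: str) -> str:
    """Move each footnote marker so that it follows adjacent punctuation."""
    return FOOTNOTE_THEN_PUNCT.sub(r"\2\1", text)


print(place_footnotes("as shown[^See Smith.^], it works[^Note^]."))
```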

The biggest surprise has been the implacable resistance to wiki markup from so many otherwise sensible people. For many, “It’s not WYSIWYG” is a show-stopper. For busy people, having to learn something new is a significant barrier to entry. Fortunately, there are enough people who are over this that we no longer bother with the objections. If people do not wish to use Wikipublisher because it requires them to learn something new, then they can use something else. Eventually, they will all retire. Alternatively, they can of course commission development of a Wikipublisher API for a wiki that supports WYSIWYG.

Related Work

We are aware of the following different approaches for printing web pages: CSS support, built-in web browser support, extensions to web browsers, web specific printing solutions, and extensions to wiki engines.

CSS Support

  • Print CSS style

Built-in Browser Support

Extensions to Web browsers

  • Aardvark http://karmatics.com/aardvark/
  • PrintMonkey [1] is a prototype Firefox extension, implemented in JavaScript on the Greasemonkey engine, which allows user scripts to get data from any URL.

Web Specific Solutions

  • Prince XML
  • PDF Press

Extensions to Wiki engines

Future Work

  • platform portability
  • plug-ins for different wiki engines
  • add more flexible output (memoir class) such as a wider choice of fonts, different latex style classes, different citation styles

The typesetting system is functionally complete for our own purposes, but offers several fruitful avenues for further development. The Wikipublisher project sees three areas where future work is desirable, if there is sufficient customer demand. They all focus on growing the community.

Wikibook on Microsoft Windows™

To date, people are running the Wikibook PDF server on various flavours of GNU/Linux plus Mac OS X. The project regularly receives e-mails from people who want to know whether it will run on Windows. We reply that as far as we know, all the software they will need is available for Windows, but we do not know of any Windows installations. We offer to help them work through any issues they may encounter and to document the Windows installation process. We have yet to hear back from any of these correspondents.

Our hope is that one day soon, somebody will value the software enough to give back to the project a documented Windows installation process. As far as we know, this is not hard to do. Meanwhile, running the software on a Windows server will probably continue to be the most requested feature.

Other content management engines

If we can teach PmWiki to output Wikibook XML, it should be feasible to add the same API capability to engines such as other wikis and blogs, which also support third-party plug-ins. Again, the project has received several enquiries, in particular about support for MediaWiki, but to date none has turned into a real project. Given the scale of the undertaking and in the absence of a customer willing to provide time or money, we have been reluctant to embark on this.

Such plug-ins need to add 4 pieces of functionality:

  • translate the markup into suitable Wikibook XML
  • provide wrappers, including configurable print metadata, for the various document types
  • assemble collections of Web pages into a single document
  • present the reader with an interface to the Wikibook PDF server
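The four responsibilities listed above could be captured as an interface that each engine-specific plug-in implements. The class and method names below are hypothetical and serve only to make the division of work concrete.

```python
from abc import ABC, abstractmethod


class WikibookPlugin(ABC):
    """Hypothetical contract for a content engine that feeds the PDF server."""

    @abstractmethod
    def to_wikibook_xml(self, markup: str) -> str:
        """Translate the engine's markup into Wikibook XML."""

    @abstractmethod
    def wrap_document(self, body_xml: str, doc_type: str, metadata: dict[str, str]) -> str:
        """Wrap content for a document type, with configurable print metadata."""

    @abstractmethod
    def assemble_collection(self, page_names: list[str]) -> str:
        """Combine a collection of pages into a single document."""

    @abstractmethod
    def print_form(self, pdf_server_url: str, page_name: str) -> str:
        """Return the form that sends the reader's request to the PDF server."""


# The base class cannot be instantiated until all four pieces are supplied.
print(sorted(WikibookPlugin.__abstractmethods__))
```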

Wikipublisher’s restriction to PmWiki is perhaps the greatest barrier to its wider adoption.

User-specified LATEX classes

In an ideal world, an author could instruct the Wikibook PDF server to typeset their content using any valid LATEX class file (as long as it is reachable with an http request). The current Wikibook DTD defines 4 distinct document types: letter, article, report and book. The wiki plug-in makes sure the wiki produces Wikibook XML that complies with the requested DTD. To support user-defined classes, Wikipublisher would have to make sure that the document type used is compatible with the specified class. As a simple example of something that would in all likelihood go terribly wrong, an author may request output as a set of presentation slides.

Much careful design work is needed to ensure a robust solution. Fortunately, the current system is sufficiently flexible that few people have asked for additional document classes. On the other hand, it would have been really useful to load the correct ACM template for this paper! As it was, the authors exported the raw LATEX as an article and manually converted this to use a different class.

Conclusions

In this paper, we have shown that you no longer need a word processor to author letters, articles, reports, and books. We have described:

  1. an architecture for a typesetting system to print Web content expressed using wiki markup
  2. an implementation of the architecture for one wiki engine, PmWiki, to convert wiki markup into XML
  3. extensions to the tbook system to support richer data structures and make typesetting a Web-based service

One of the advantages of using a wiki as a front-end to a typesetting engine is that people assume “it will just work”. Wiki markup is so simple and powerful that we are used to focusing on the message and letting the wiki take care of presentation. On the other hand when we use a word processor, we generally spend a lot of time tweaking to get the presentation right. Even using LATEX, we have to debug the document when LATEX encounters an error in the input. When the Wikipublisher project released its software into the wild, it quickly became clear that there is a universal expectation of zero errors. Web browsers are tolerant and this sets an expectation that if a reader requests a PDF, then the Wikibook server will deliver a PDF every single time.

As far as Wikipublisher’s users are concerned, if they do not get a PDF when they ask for one, this is a software bug and the Wikipublisher project needs to fix it. Contrast this with the expectations of LATEX users.

We have thus worked really hard to make XML generation and Wikibook transformation as robust as possible. What this means is that consistent presentation of printed outputs is completely automatic — not just within a document type (all reports have the same look), but different document types are all recognisably part of the same family. So businesses such as professional services firms, which typically produce a large number of documents of a small number of document types, can get a consistent look (a house style) at minimal cost. There is a huge quality advantage when we shift typesetting from the desk-top to the server, because we eliminate local stylistic variations.

So within the constraints imposed by use of pre-defined document classes, Wikipublisher delivers:

  • fast — instant Web pages with print-on-demand typesetting
  • cheap — all you need is a web browser and an Internet connection
  • and good — reliable, consistent output of the highest possible quality

Acknowledgements

Donald Gordon wrote the Perl script that drives the PDF server and taught the XSL script many new LATEX tricks. He did the back-end integration that means we can give the PDF server any URL that returns a stream of Wikibook XML and get back a typeset PDF suitable for printing. He also wrote the PmWiki handlers to translate wiki tables and wiki styles into useful XML. Darren Willis wrote the server-side components that teach LATEX how to process the bibliography XML created from wiki citation markup.

Other Stuff

Possible Titles

  1. Wiki Publisher for Fast, Cheap, and Good Print on Demand
  2. Wiki Publisher for Print on Demand
  3. Wiki Publisher: From Wiki to Print
  4. Wiki Publisher: Typesetting Wikis for Print
  5. Wiki Publisher: High Quality Print on Demand

References for Wikipublisher: A Print-on-Demand Wiki

[1]   PrintMonkey: Giving Users a Grip on Printing the Web, http://portal.acm.org/citation.cfm?id=1410140.1410189, accessed on 01 March 2009.

[2]   Smashing Magazine, 2007. Printing the Web: Solutions and Techniques, http://www.smashingmagazine.com/2007/02/21/printing-the-web-solutions-and-techniques/, accessed on 19 March 2009.

Creative Commons License
Page last modified on 22 March 2009 at 11:39 AM