DSS: A Three-Layer Approach to XML Schema Construction using Literate Programming

Kevin Reiss [University of Illinois, Graduate School of Library and Information Science]

1 Introduction

XML is often used to encode documentation for software and other technical information. Given this, it is somewhat surprising that XML developers have not widely adopted a general-purpose documentation system for XML schemas. This project proposes to adopt the methodology of Knuth's Literate Programming (LitProg) [Knuth 1984] for XML schema construction. A LitProg system is characterized by a single source document (a web) that contains both the prose documentation and the source code of a piece of software. The LitProg environment processes the web in two steps: tangle produces source code files, and weave produces typeset documentation.

This proposed LitProg system, called DSS, will consist of three layers: documentation, syntax and semantic. The semantic layer will be implemented using the BECHAMEL [Dubin, et al 2003] system. BECHAMEL represents, in machine-readable form, the document semantics of XML/SGML document instances. Document semantics are the objects that are identified by a markup language, and the properties of and relations between those objects. This new layer can potentially give application developers a higher level interface to an XML document than those provided by current models such as XPath, the DOM, or SAX.

2 XML & Literate Programming

2.1 History

There is a long, successful history of LitProg in the SGML/XML community. LitProg forms the basis for one of the most successful SGML/XML projects, the TEI (Text Encoding Initiative). The TEI Guidelines and DTD are both generated from the same ODD (One Document Does It All) source files. Michael Sperberg-McQueen's Sweb, based on experience in developing the ODD format, is a general-purpose LitProg system that uses SGML to encode a web [Sperberg-McQueen 1996]. Interestingly, the base tagset for Sweb can be any SGML document type: the user need only incorporate a few LitProg-specific tags into the chosen tagset.

2.2 The Current State of Schema Documentation

The high-quality, structured documentation produced for the TEI by the ODD system is unique among SGML/XML applications. Though there are some other examples of detailed approaches to schema documentation (such as the online guide to Docbook), the vast majority of applications rely upon ad hoc or loosely structured comments within the DTD or schema. These comments are often run through documentation generators to document the elements and attributes. Such filters are available in commercial XML editors such as XMLSpy and XMetal, or as command-line utilities. The reference material these tools produce is similar to Javadoc. These approaches create documentation that, while usable as a quick reference, is inadequate for those who need a detailed and thorough discussion of a content model within a schema.

3 Proposed DSS Application Design

3.1 The DSS Layers

3.2 Potential Benefits

  1. Enhanced Documentation: The three layers can produce enhanced documentation that includes both quick reference material and detailed explanations and examples of a schema's content models.
  2. Improved Document Type Modeling: Document semantics will provide a new high-level formalism for markup language designers.
  3. Improved Schema Design: Programmers who use LitProg tools often feel that software created with such tools is better structured and more extensible than software created using standard tools.
  4. Simpler XML Processing: Machine-readable document semantics decrease the time it takes to create XML processing applications. The creation of DOM- or SAX-based programs and XSLT stylesheets could be simplified and partially automated with these semantics. For a full treatment of the benefits of document semantics see [Renear et. al 2002].
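
The idea behind item 4 can be illustrated with a minimal Python sketch. It is not BECHAMEL code and does not use BECHAMEL's API; the `Paragraph` class, the `para` element name, and the use of `xml:lang` for the language property are assumptions made only for illustration. The point is that once nodes have been mapped to objects with properties, application code can query the objects instead of walking the tree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Namespace-qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


@dataclass
class Paragraph:
    """Hypothetical semantic object with two string properties."""
    text: str
    language: str


def build_paragraphs(xml_text, default_language="en"):
    """Map every <para> node to one Paragraph object (illustrative rule)."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for node in root.iter("para"):
        # itertext() flattens inline markup; split/join normalizes whitespace.
        text = " ".join("".join(node.itertext()).split())
        language = node.get(XML_LANG, default_language)
        paragraphs.append(Paragraph(text=text, language=language))
    return paragraphs


SAMPLE = """<doc>
  <para xml:lang="en">A <em>short</em> paragraph.</para>
  <para xml:lang="de">Ein kurzer Absatz.</para>
</doc>"""

paragraphs = build_paragraphs(SAMPLE)
# Application code works with objects and properties, not with nodes.
german = [p.text for p in paragraphs if p.language == "de"]
```

Here the selection of German paragraphs never mentions elements, attributes, or inline markup, which is the kind of higher-level interface the semantic layer is meant to offer.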

3.3 Goal: XML-In, XML-Out

During processing, well-formed XML should be both the input and the output. This will enable the three-layered system to produce output that conforming XML software can digest. The only two exceptions to the XML-in/XML-out rule are the non-XML syntax of XML 1.0 DTDs and the current Prolog syntax of BECHAMEL. Tags for wrapping DTD and Prolog declarations will be provided in the core DSS schema. An alternative for the semantic layer will become available once BECHAMEL is able to serialize rules as RDF, which is expected shortly. See Figure 1 for input/output formats and processing steps. See Figure 3 for a sample DSS source document. Figure 2 shows the modular structure of the schema for DSS source documents. Figure 4 shows a fragment of the DSS XML Schema.
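
A minimal Python sketch of the wrapping idea follows. The wrapper element names `dtd` and `prolog` are hypothetical, since the names used by the core DSS schema are not given here. The sketch shows that a DTD declaration stored as element text is escaped on output, so the source document stays well-formed, and that a tangle step can recover the declaration unchanged when it reads the document back.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element names; the actual DSS names are not specified.
DTD_WRAPPER = "dtd"
PROLOG_WRAPPER = "prolog"


def wrap(tag, declaration):
    """Store a non-XML declaration as the text content of a wrapper element."""
    element = ET.Element(tag)
    element.text = declaration
    return element


dtd_decl = "<!ELEMENT p (#PCDATA | em)*>"
prolog_decl = "declare_class(paragraph)"

code = ET.Element("code")
code.append(wrap(DTD_WRAPPER, dtd_decl))
code.append(wrap(PROLOG_WRAPPER, prolog_decl))

# Serializing escapes the markup characters in the DTD text,
# so the result is still well-formed XML.
serialized = ET.tostring(code, encoding="unicode")

# Parsing the serialized form gives back the declarations unchanged.
reparsed = ET.fromstring(serialized)
recovered_dtd = reparsed.find(DTD_WRAPPER).text
recovered_prolog = reparsed.find(PROLOG_WRAPPER).text
```

An author writing the source by hand would more likely use a CDATA section than entity escapes; either form parses to the same text.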

Figure 1

Processing steps

3.4 Application Characteristics

  1. A Language Independent Approach: Either a DTD, W3C XML Schema or RELAX NG Schema can be output as the Syntax Layer. A schema conversion tool will be employed to produce alternate versions of a particular schema. DSS will only support the creation of schemas that seek to model classes of documents. This excludes schema languages like rule-based Schematron and object-oriented RDF Schema.
  2. XSLT: Tangle and weave will be implemented using XSLT.
  3. Modular Schema: A modular schema will provide structure for the DSS source files. See Figure 2 for the modularization and Figure 3 for an example document.
  4. Use of Existing Documentation Tagsets: The prose documentation of content models will be marked up with an XML tagset chosen by the user. Several common choices (TEI, Docbook, and XHTML) will be available as built-in modules.
  5. BECHAMEL: The semantic layer will be implemented using BECHAMEL. Although rules are currently written in Prolog, upcoming modifications to the system will allow them to be serialized as RDF.
  6. SVG: Visual representations and diagrams will be output as SVG.

Figure 2

Modular DSS Schema Structure

4 Future Work

4.1 Planned Implementation

An implementation using the tools and specifications listed under the Syntax and Software headings is planned. Please share your comments and suggestions. The project website is
http://www.isrl.uiuc.edu/~kmreiss/projects/xmldoc

4.2 Syntax

Figure 3

DSS Document Instance


      
<?xml version="1.0"?>
<schema syntax="RNG" documentation="XHTML" semantics="BECHAMEL"
  xmlns:rng="http://relaxng.org/ns/structure/1.0"
  xmlns:doc="http://www.normanwalsh.com/documentation/1.0"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:bec="http://www.isrl.uiuc.edu/eprg/bechamel/1.0"
  xmlns="http://www.isrl.uiuc.edu/eprg/schemadoc/1.0">
  <!-- Meta-information for schema -->
  <schemainfo>
    <version>0.01</version>
    <application>Schema for office documents.</application>
  </schemainfo>
  <tagset>
    <elements>
      <element id="e.20">
        <!-- documentation layer -->
        <reference>
          <name>paragraph</name>
          <xhtml:p>A paragraph encodes a <xhtml:em>block</xhtml:em> of text in
          a document.  See this <xhtml:a
          href="http://www.paragraphs.com">discussion</xhtml:a> of
          paragraphs.  Paragraphs often contain block level markup.
          Paragraphs should not be used for any number of reasons; see
          <xhtml:a href="#l.21">list</xhtml:a> for details.</xhtml:p>
        </reference>
        <!-- syntax layer -->
        <code>
          <rng:element name="p">
            <rng:ref name="block.level.atts"/>
            <rng:interleave>
              <rng:ref name="inline.level.elements"/>
              <rng:ref name="linking.elements"/>
            </rng:interleave>
          </rng:element>
        </code>
        <!-- semantic layer -->
        <semantics>
          <bec:class>declare_class(paragraph)</bec:class>
          <bec:property>declare_property(paragraph, text, string)</bec:property>
          <bec:property>declare_property(paragraph, language, string)</bec:property>
          <!-- truncated -->
          <bec:mrule>mrule :- node_exists(Pnode, para),
            not(exists(_,paragraph,[Pnode])),
            construct(paragraph,[Pnode],_).</bec:mrule>
        </semantics>
      </element>
    </elements>
  </tagset>
</schema>
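
The two processing steps can be sketched over a cut-down version of the document above. The project plans to implement tangle and weave in XSLT; the Python below is only an illustration of what each step would collect. The function names `tangle` and `weave`, the choice to wrap each element pattern in a `define` named after the element, and the `h2` heading per entry are all assumptions, not part of the DSS design.

```python
import copy
import xml.etree.ElementTree as ET

DSS = "http://www.isrl.uiuc.edu/eprg/schemadoc/1.0"
RNG = "http://relaxng.org/ns/structure/1.0"
XHTML = "http://www.w3.org/1999/xhtml"
NS = {"dss": DSS, "rng": RNG, "xhtml": XHTML}

# Cut-down source document modeled on Figure 3.
WEB = f"""<schema xmlns="{DSS}" xmlns:rng="{RNG}" xmlns:xhtml="{XHTML}">
  <tagset><elements>
    <element id="e.20">
      <reference>
        <name>paragraph</name>
        <xhtml:p>A paragraph encodes a <xhtml:em>block</xhtml:em> of text.</xhtml:p>
      </reference>
      <code>
        <rng:element name="p">
          <rng:ref name="inline.level.elements"/>
        </rng:element>
      </code>
    </element>
  </elements></tagset>
</schema>"""


def tangle(web):
    """Collect the syntax layer into a single RELAX NG grammar element."""
    grammar = ET.Element(f"{{{RNG}}}grammar")
    for pattern in web.iterfind(".//dss:code/rng:element", NS):
        # Assumption: each pattern becomes a define named after its element.
        define = ET.SubElement(grammar, f"{{{RNG}}}define", name=pattern.get("name"))
        define.append(copy.deepcopy(pattern))
    return grammar


def weave(web):
    """Collect the documentation layer into a single XHTML body element."""
    body = ET.Element(f"{{{XHTML}}}body")
    for reference in web.iterfind(".//dss:reference", NS):
        heading = ET.SubElement(body, f"{{{XHTML}}}h2")
        heading.text = reference.findtext("dss:name", default="", namespaces=NS)
        for child in reference:
            # Copy only the children that belong to the documentation tagset.
            if child.tag.startswith(f"{{{XHTML}}}"):
                body.append(copy.deepcopy(child))
    return body


web = ET.fromstring(WEB)
grammar = tangle(web)
body = weave(web)
```

Both outputs are XML trees, in keeping with the XML-in, XML-out goal. A real tangle step would also have to supply a `start` pattern and resolve names such as `inline.level.elements`, which this sketch leaves undefined.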

Figure 4

DSS Schema Fragment


      
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <include href="tei.2.rng"/>
  <include href="xhtml.rng"/>
  <include href="docbook.rng"/>

  <start>
    <ref name="dss"/>
  </start>

  <define name="dss">
    <element name="dss" ns="http://www.isrl.uiuc.edu/eprg/schemadoc/1.0">
      <ref name="schemainfo"/>
      <ref name="tagset"/>
    </element>
  </define>

  <define name="reference">
    <element name="reference" ns="http://www.isrl.uiuc.edu/eprg/schemadoc/1.0">
      <ref name="name"/>
      <choice>
        <ref name="docbook.block.level"/>
        <ref name="html.block.level"/>
        <ref name="tei.block.level"/>
      </choice>
    </element>
  </define>
</grammar>
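
Because the schema is modular, a fragment like the one above refers to patterns that must be supplied elsewhere, either by another part of the DSS schema or by an included documentation module. The following Python sketch is a hypothetical helper, not part of the planned toolset. It lists the names that a single grammar file references but does not define, which is one simple way to see what a module depends on.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"

# Cut-down grammar modeled on Figure 4.
FRAGMENT = f"""<grammar xmlns="{RNG}">
  <include href="xhtml.rng"/>
  <start><ref name="dss"/></start>
  <define name="dss">
    <element name="dss">
      <ref name="schemainfo"/>
      <ref name="tagset"/>
    </element>
  </define>
</grammar>"""


def unresolved_refs(grammar):
    """Return names that are referenced but not defined in this grammar file."""
    defined = {d.get("name") for d in grammar.iter(f"{{{RNG}}}define")}
    referenced = {r.get("name") for r in grammar.iter(f"{{{RNG}}}ref")}
    return sorted(referenced - defined)


grammar = ET.fromstring(FRAGMENT)
includes = [i.get("href") for i in grammar.iter(f"{{{RNG}}}include")]
missing = unresolved_refs(grammar)
```

For this fragment, `missing` holds `schemainfo` and `tagset`, the two patterns that other modules would need to provide.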

4.3 Software

4.4 Questions

  1. Are multi-namespace applications too brittle?
  2. How much value does detailed documentation add to a schema?
  3. What is your experience with applications that utilize more than one type of schema?
  4. Can you see a use for machine-readable document semantics?
  5. Are there any LitProg systems that I should study?

Acknowledgments

Thanks to Allen Renear and David Dubin for their advice and assistance on this project.


Bibliography

[Coates and Rendon 2002] Anthony B. Coates and Zarella Rendon. xmLP — a Literate Programming Tool for XML & Text. In B.T. Usdin and S.R. Newcomb, editors, Proceedings of Extreme Markup Languages 2002 , Montreal, Canada, August 2002. Available online at
http://xmlp.sourceforge.net/2002/extreme/index.html

[Dubin, et al 2003] D. Dubin, C. M. Sperberg-McQueen, A. Renear, and C. Huitfeldt. A Logic Programming Environment for Document Semantics and Inference. Literary and Linguistic Computing, 18(1):39-47, 2003. Available online at
http://www3.oup.co.uk/litlin/hdb/Volume_18/Issue_01/180039.sgm.abs.html

[Knuth 1984] Donald Knuth. Literate Programming. The Computer Journal, 27(2):97-111, May 1984.

[Renear et. al 2002] Allen Renear, David Dubin, and C. M. Sperberg-McQueen. Towards a Semantics for XML Markup. In Proceedings of the 2002 ACM Symposium on Document Engineering, pages 119-126. ACM Press, 2002. Available online at
http://doi.acm.org/10.1145/585058.585081

[Sperberg-McQueen 1996] Michael Sperberg-McQueen. SWEB: an SGML Tag Set for Literate Programming. March 1996. Available online at
http://www.w3.org/People/cmsmcq/1993/sweb.html

[Walsh 2002] Norman Walsh. Literate Programming in XML. October 15, 2002. Available online at
http://nwalsh.com/docs/articles/xml2002/lp/paper.html



Kevin Reiss [University of Illinois, Graduate School of Library and Information Science]
kmreiss@isrl.uiuc.edu