XML Schema for the Flora of North America

LIS 429 Course Project Spring 03

Team Members

Primary Member
Kevin Reiss
Secondary Member
JiHye Jung (Indexing Team)

Introduction

Our goal is to increase the granularity and accuracy of the markup of the current Flora of North America (FNA) collection. We will accomplish this in the several steps outlined below.

Project Functionality

Note: This section was the initial project proposal. What was actually completed can be found in the Deliverables section.

This project's goal is to produce a new XML Schema for the FNA collection and to create programs that convert the collection's XML documents so that they conform to the new schema. Documents encoded according to this schema will support more accurate and thorough searching of the data, including the numeric values found within the species description portion of an FNA treatment.

Problem Description

Task One: Text Processing; HTML to XML or XML to better XML

First we need to develop a fairly specific intermediate DTD that will guide us in writing scripts that create new XML markup or increase the granularity of the XML markup that currently exists in our datasets. We then need to produce scripts that make our collections conform to this DTD. We need to be careful to consider encoding issues and the use or non-use of entities within the generated XML document instances.
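As one illustration of the entity question, an HTML-to-XML script might replace HTML named entities with numeric character references, so that the generated XML stays well-formed without entity declarations in the DTD. The following is only a minimal sketch under that assumption. The function name to_numeric_refs is hypothetical and is not part of the project's scripts.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(text):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        # Leave XML's predefined entities and unknown names as they are.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)
```

Unknown names are passed through rather than dropped, so a later validation step can still flag them.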

Collections

Resources and Software Tools:

Task Two: Schema Design and Data Validation

To ensure that the data generated by our automatic markup is valid, we will need to develop our final schema by taking up where task one leaves off. We can use our intermediate DTD from task one as the basis for developing scripts that appropriately tag the numeric values that exist in the data. We can then use the intermediate DTD as a guide to creating a more specific XML schema that tests the validity of the data within each tag at the data-type or pattern level, since schemas allow element content to be restricted to a particular regular expression pattern.
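The tagging and pattern-checking steps described above can be sketched roughly as follows. This is an illustration only: the measure element, its min, max, and unit attributes, and the list of units are assumptions made for the example, not the markup the final schema defines. The pattern check uses a full match because XML Schema pattern facets apply to the whole value.

```python
import re
from xml.sax.saxutils import escape

# A number, an optional "-" or en dash range, then a unit (assumed unit list).
MEASURE_RE = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?\s*(mm|cm|dm|m)\b"
)

# Stand-in for a schema pattern facet on a numeric value.
DECIMAL_PATTERN = re.compile(r"\d+(\.\d+)?")


def tag_measures(text):
    """Wrap each measurement in a hypothetical <measure> element."""
    out, pos = [], 0
    for m in MEASURE_RE.finditer(text):
        low, unit = m.group(1), m.group(3)
        high = m.group(2) or low  # a single value is a range of width zero
        out.append(escape(text[pos:m.start()]))
        out.append('<measure min="%s" max="%s" unit="%s">%s</measure>'
                   % (low, high, unit, escape(m.group(0))))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


def is_valid_decimal(value):
    """Check a value against the pattern, anchored at both ends."""
    return DECIMAL_PATTERN.fullmatch(value) is not None
```

A script of this kind would add the tags, and the schema would then reject any attribute value that does not match the declared pattern.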

Resources and Software Tools:

Task Three: Integration with Indexing Engine w/ JiHye

I will need to collaborate with JiHye, who is planning to develop a hybrid indexing engine that handles both full-text indexing and the searching of numeric range values. We will need to make sure that our schema design and generated markup are compatible with the tools that she will use. We will need to be careful about character sets, data validation, and the use or non-use of entities within our XML documents.
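To make the numeric range requirement concrete, here is a minimal sketch of a range lookup over values extracted from the markup. It is not the indexing engine's design. The Measure record, the character names, the document ids, and the sample values are all invented for illustration, and values are assumed to be normalized to millimetres beforehand.

```python
from collections import namedtuple

# One extracted measurement; low and high are assumed to be in millimetres.
Measure = namedtuple("Measure", "doc_id character low high")

# Invented sample entries, for illustration only.
SAMPLE_INDEX = [
    Measure("t1", "leaf_length", 20, 50),
    Measure("t2", "leaf_length", 60, 90),
    Measure("t3", "petal_length", 30, 30),
]


def overlaps(measure, low, high):
    """True when the stored range intersects the query range [low, high]."""
    return measure.low <= high and measure.high >= low


def range_search(index, character, low, high):
    """Return sorted ids of documents whose range for a character overlaps the query."""
    return sorted({m.doc_id for m in index
                   if m.character == character and overlaps(m, low, high)})
```

A query such as range_search(SAMPLE_INDEX, "leaf_length", 40, 65) matches any treatment whose stated range intersects 40 to 65 mm, which is why the markup needs to carry both ends of each range.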

Resources and Software Tools:

Data Flow

Our proposed data flow diagram.

Deliverables

This section comprises the actual deliverables produced by the project at the end of the semester. Due to time constraints and the loss of a group member, the automatic markup programs for FNA files discussed in the previous section were not completed.

Original FNA XML DTD

The original FNA DTD is quite simple and represents only the top-level view of an FNA treatment. Further granularity will improve the power of retrieval programs to index the FNA data. It will also provide a means to explicitly mark the numeric values of plant characteristics that occur throughout the treatments. The original DTD files can be found in the origfnadtd directory.

Final Schema w/example XML files

An XML Schema that expands the original FNA DTD has been created. The schema has a module for each major part of an FNA treatment so that it can be easily expanded if additional elements are required to increase granularity for retrieval purposes.

How to Validate Example Data Files

For information on how to validate the data files with the RELAX NG schema, see the README file in the examples directory.

For information on how to validate the data files with the W3C XML Schema, see the README file in the examplesxsd directory. Both of these operations require some Java class libraries, which are available in the utils directory. Visit that directory for more information.

Index Only DTD Files

Since Swish-ex has difficulty supporting constructs used in the final schema, a version of the schema was created in the form of a DTD. This form can be fed to Swish-ex as the configuration file for indexing XML. It is in the indexonlydtd directory.

Webvibe Client Config Files

Config files for the Webvibe client are available in the config directory. Note: These files will likely require modification before use, as I am unsure how the client would handle certain constructs in the final schema.

XSLT Transformations

An XSLT stylesheet has been created to transform instances of XML marked up according to the final schema into XHTML for display in a web browser. The stylesheet is located in the transform directory. For information on how to run the stylesheet, see the README file in that directory.

The results of the transformation can be found in the examples directory. They preserve features such as highlighted Latin names and FNA dictionary terms rendered as active links. Note: When the full data set is available in a form marked up according to the schema, care will be taken to avoid the bugs that currently exist in the dictionary term implementation with regard to overlapping words.

Software

A number of open-source Java libraries were used in the development of the schema files. The necessary jar files can be found in the utils directory. See the README for more details.

For information on Trang and Jing consult the NOTES file in the schema directory. For info on Xerces and Xalan, consult the Apache XML project at http://xml.apache.org

Schema Validation Tools

XSLT Processor

Schema Conversion Tool

Issues with XML as a Data Format for IR

Character Sets

Semi-structured Data versus Records (Swish-ex and XML)

Note: Swish-e is an open-source full-text indexing engine designed to index the content of XML, HTML, PDF, and plain text documents. The BIBE project uses a modified version of Swish-e, called Swish-ex, that indexes XML documents using a field-based strategy: it attempts to index the occurrences of particular XPath expressions that appear in a document. However, this software does not scale well to XML documents that have deep granularity or mixed content, which presents a difficulty. To gain the full benefits of deep markup of the FNA, a new indexing engine should be implemented. The eXist native XML database project could be used as a starting point for an XML-aware indexing system for IR.
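The difficulty with mixed content can be illustrated in general terms with a small sketch. It does not reproduce how Swish-ex works internally. The element names in the sample (treatment, description, measure, term) are assumed for the example. Treating an element as a flat record field captures only the text before its first child, while the full field value is spread across nested elements and the text that follows them.

```python
import xml.etree.ElementTree as ET

# A description with mixed content: text interleaved with child elements.
SAMPLE = ("<treatment><description>Leaves <measure>2-5 cm</measure> long, "
          "<term>glabrous</term> above.</description></treatment>")


def field_text_naive(root, path):
    """Record-style view: only the text before the first child element."""
    return [(el.text or "") for el in root.findall(path)]


def field_text_full(root, path):
    """Document-style view: all text inside the element, in document order."""
    return ["".join(el.itertext()) for el in root.findall(path)]


root = ET.fromstring(SAMPLE)
naive = field_text_naive(root, "description")
full = field_text_full(root, "description")
```

Here naive holds only the leading fragment of the description, while full holds the whole sentence. An indexer built around record-like fields has to choose between these views for every path it indexes, and the choice becomes harder as the markup gets deeper.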

Information Retrieval with XML from the Markup Language Designer's Point of View

Last Update: 7/27/03