Processing a List of Characters and States with Definitions: A
Perl Script to Convert the List to XML
Purpose: For Openkey, we have a Perl script that processes a list of
character groups, characters, and states (with definitions) that someone
wants for their collection. The script creates one XML file for each
individual state, character, and character group.
For instance, our botanists in the Prairie Plant and South East Trees
projects created such a list. Our script was written specifically for their
list, but if you create your own list using the same format, our
processing script should easily handle it.
We have a template in rich text format that may help people write their
own.
What is Needed to Convert a Definitions List to XML:
- Perl or ActivePerl
- A Definitions List (see formatting instructions)
- split_definitions.pl - the Perl script (copyleft)
- A directory structure to hold the XML files. For instance, our directories
were:
- ../project_name/characterGroup/xml/,
- ../project_name/character/xml/,
- ../project_name/state/xml/
- A text editor.
- An example of each type of XML you want to create. (see note 1 below)
- A person who feels comfortable editing the .pl file to change the
hard-coded XML element names.
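The directory structure listed above can be created ahead of time. The following is a minimal sketch in Python (not part of split_definitions.pl); the function name `make_output_dirs` and the `base` argument are assumptions made for illustration, while the three directory names come from the example paths above.

```python
from pathlib import Path

# The three kinds of output files; names taken from the example paths above.
KINDS = ("characterGroup", "character", "state")


def make_output_dirs(base, project_name):
    """Create <base>/<project_name>/<kind>/xml/ for each kind; return the paths."""
    created = []
    for kind in KINDS:
        path = Path(base) / project_name / kind / "xml"
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

For example, `make_output_dirs("..", "project_name")` would create the three directories shown in the list above.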
The General Format for the List
(see the template for the best instructions, but here are the main
guidelines)
- The project name should be on the first line, followed by two blank
lines before the first major character group begins.
- The file should be a text file or a rich text file.
- Two blank lines between major character groups (i.e. "Leaves" is a
major character group, "Stems" is another).
- Within a major character group, there should be a single blank
line between everything except related states. For example, there is a
single blank line between "Plant Habit and Lifestyle" and "Life Span",
but no blank line between "Annual" and "Biennial", because both are
states under "Life Span."
- After each state name, there should be a space, a hyphen, a space, and
then the definition. (See "Annual" below; also see "Bunching", which has
no definition but still has a hyphen.)
- Between two states in the same character, there should be a hard
return, but no blank line. Here is a quick example:
Project: prairieplant


Plant Habit and Lifestyle

Life Span

Annual - Normally living one year or less; growing, reproducing, and dying within one cycle of seasons. [K&P, p. 15]
Biennial - Normally living two years; germinating or forming and growing vegetatively during one cycle of seasons, then reproducing sexually and dying during the following one. [K&P, p. 21]
Perennial - Normally living more than two years, with no definite limit to its life span. [K&P, p. 79]

Woodiness

Herbaceous - Having little or no living portion of the shoot persisting aboveground from one growing season to the next, the aboveground portion being composed of relatively soft, non-woody tissue. [K&P, p. 56, modified]
Woody - With an aboveground shoot composed of relatively hard tissue that persists from one growing season to the next.

Herbaceous Plant Growth Form

Bunching -
Single upright stem -

- Only the states should be defined in the first portion of the file.
- Capitalization does not matter because our processing lower-cases
everything.
- State definitions are allowed to be complicated. For example:
Subshrub - 1) A shrub-like plant but with only the base
composed of woody tissue, the herbaceous branches dying back at the end
of each growing season. [K&P, pp. 106-107, modified] 2) A very low
shrub that sprawls on the ground; a trailing shrub. (Compare with
shrub.) [L, p. 772, modified]
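To make the format rules above concrete, here is a rough Python sketch of how such a list could be read. It is not the logic of split_definitions.pl; the function name `parse_definitions`, the returned data layout, and the handling of an optional "Project:" prefix are all assumptions. It also assumes each state sits on a single unwrapped line, and it treats any line without " -" as a character name, so it does not depend on whether a blank line follows the character name.

```python
import re

# "Name - definition", or "Name -" for a state with no definition yet.
STATE_LINE = re.compile(r"^(.+?) -(?: (.*))?$")


def parse_definitions(text):
    """Return (project, groups); each group is {"name": ..., "characters": [...]}."""
    # Two blank lines (three newlines in a row) separate the major sections.
    sections = [s for s in re.split(r"\n[ \t]*\n[ \t]*\n", text.strip()) if s.strip()]
    # First section is the project name; drop an optional "Project:" prefix.
    project = re.sub(r"^project:\s*", "", sections[0].strip(), flags=re.I).lower()
    groups = []
    for section in sections[1:]:
        lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
        group = {"name": lines[0].lower(), "characters": []}
        for line in lines[1:]:
            match = STATE_LINE.match(line)
            if match is None:
                # A line without " -" names a new character.
                group["characters"].append({"name": line.lower(), "states": []})
            elif group["characters"]:
                # Names are lower-cased; definitions keep their original case.
                group["characters"][-1]["states"].append(
                    {"name": match.group(1).strip().lower(),
                     "definition": (match.group(2) or "").strip()}
                )
        groups.append(group)
    return project, groups
```

Only the first " - " on a line splits the state name from its definition, so hyphens inside a definition (such as "pp. 106-107") are left alone.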
Example of the resulting XML files:
Three types of files: character group files, character files, and state
files.
File Name: plant_habit_and_lifestyle.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE characterGroup SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/characterGroup.dtd">
<!-- created by split_definitions.pl 2003_9_22_12_1 -->
<CharacterGroup>
  <CharacterGroupName name="plant_habit_and_lifestyle" file="/home/openkey/public_html/prairieplant/characterGroup/xml/plant_habit_and_lifestyle.xml">plant habit and lifestyle</CharacterGroupName>
  <LegalValue name="life_span" file="/home/openkey/public_html/prairieplant/character/xml/life_span.xml">life span</LegalValue>
  <LegalValue name="woodiness" file="/home/openkey/public_html/prairieplant/character/xml/woodiness.xml">woodiness</LegalValue>
  <LegalValue name="growth_habit" file="/home/openkey/public_html/prairieplant/character/xml/growth_habit.xml">growth habit</LegalValue>
  <LegalValue name="herbaceous_plant_growth_form" file="/home/openkey/public_html/prairieplant/character/xml/herbaceous_plant_growth_form.xml">herbaceous plant growth form</LegalValue>
  <LegalValue name="nutrition" file="/home/openkey/public_html/prairieplant/character/xml/nutrition.xml">nutrition</LegalValue>
  <LegalValue name="carnivory" file="/home/openkey/public_html/prairieplant/character/xml/carnivory.xml">carnivory</LegalValue>
  <Image>none yet</Image>
  <Definition>plant habit and lifestyle</Definition>
  <Synonym></Synonym>
  <BroaderTerm></BroaderTerm>
  <NarrowerTerm></NarrowerTerm>
  <RelatedTerm></RelatedTerm>
  <DisplayBefore></DisplayBefore>
  <DisplayFor><strong>Plant habit and lifestyle</strong>: </DisplayFor>
  <DisplayAfter>.<BR></DisplayAfter>
</CharacterGroup>
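The file names and `name` attributes in the example above appear to be the lower-cased name with spaces replaced by underscores. A small Python sketch of that convention follows; it is an illustration, not the code in split_definitions.pl, and the helper names `slug` and `legal_value` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def slug(name):
    """'Plant Habit and Lifestyle' -> 'plant_habit_and_lifestyle'."""
    return "_".join(name.lower().split())


def legal_value(base_dir, kind, name):
    """Build one <LegalValue> element pointing at the child's XML file."""
    path = f"{base_dir}/{kind}/xml/{slug(name)}.xml"
    return (f"<LegalValue name={quoteattr(slug(name))} "
            f"file={quoteattr(path)}>{escape(name.lower())}</LegalValue>")
```

Calling `legal_value("/home/openkey/public_html/prairieplant", "character", "Life Span")` builds an element equivalent to the first `LegalValue` entry in the character group file above.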
File Name: life_span.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Character SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/character.dtd">
<!-- created by split_definitions.pl 2003_9_29_0_12 -->
<Character>
  <CharacterName name="life_span" file="/home/openkey/public_html/prairieplant/character/xml/life_span.xml">life span</CharacterName>
  <LegalValue name="annual" file="/home/openkey/public_html/prairieplant/state/xml/annual.xml">annual</LegalValue>
  <LegalValue name="biennial" file="/home/openkey/public_html/prairieplant/state/xml/biennial.xml">biennial</LegalValue>
  <LegalValue name="perennial" file="/home/openkey/public_html/prairieplant/state/xml/perennial.xml">perennial</LegalValue>
  <Image>none yet</Image>
  <Definition>life span</Definition>
  <Synonym></Synonym>
  <BroaderTerm></BroaderTerm>
  <NarrowerTerm></NarrowerTerm>
  <RelatedTerm></RelatedTerm>
  <DisplayBefore></DisplayBefore>
  <DisplayFor></DisplayFor>
  <DisplayAfter>,</DisplayAfter>
</Character>
File Name: annual.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE State SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/state.dtd">
<!-- created by split_definitions.pl 2003_9_29_0_12 -->
<State>
  <StateName name="annual" file="/home/openkey/public_html/prairieplant/state/xml/annual.xml">annual</StateName>
  <Definition><em>(plant habit and lifestyle )</em> Normally living one year or less; growing, reproducing, and dying within one cycle of seasons. [K&P, p. 15] </Definition>
  <Image>none yet</Image>
  <Example></Example>
  <Synonym></Synonym>
  <Synonym></Synonym>
  <BroaderTerm></BroaderTerm>
  <NarrowerTerm></NarrowerTerm>
  <RelatedTerm> </RelatedTerm>
  <Prevelence></Prevelence>
  <Certainty></Certainty>
  <DisplayBefore></DisplayBefore>
  <DisplayFor>annual </DisplayFor>
  <DisplayAfter>,</DisplayAfter>
</State>
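Two details of the files above can be sketched in Python: the "created by" timestamp and the element text. This is a hedged illustration rather than the script's own code. It assumes the timestamp is year_month_day_hour_minute without zero padding (which matches the samples but is not confirmed), and it escapes characters such as the "&" in "[K&P, p. 15]", which an XML parser would otherwise reject. The helper names `stamp` and `element` are hypothetical.

```python
from datetime import datetime
from xml.sax.saxutils import escape


def stamp(now):
    """Format a time like the 'created by' comments, e.g. 2003_9_29_0_12 (assumed layout)."""
    return f"{now.year}_{now.month}_{now.day}_{now.hour}_{now.minute}"


def element(tag, text=""):
    """Build <tag>text</tag>, escaping &, < and > in the text."""
    return f"<{tag}>{escape(text)}</{tag}>"


def created_comment(now):
    """Build the comment line that records when the file was generated."""
    return f"<!-- created by split_definitions.pl {stamp(now)} -->"
```

Passing the time in as an argument, rather than calling `datetime.now()` inside `stamp`, keeps the output reproducible when testing.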
Notes, Comments, Hopes, Wishes, ...
Note 1: Because I frequently ran this script on my home computer, which
does not have the XML modules installed, I hard-coded the XML elements
and tags. In other words, the script must be altered if the XML elements
change; it does not read in the three types of XML files.
Someday we'll have an upload area where people can process their own
files on our server.
Please take and improve the script if you are so inclined. We would
love to put your better code up for others to use. GNU copyleft licensing
applies.
We encourage the "borrowing" of lists between projects.
For complicated character group relationships, we would like to support
numbering such as 1, 1.1, 1.2, ... 1.10.1; otherwise even the
humans get confused. A sample of my ideal input file would be something
like the ideal list.
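If such numbering were adopted, the numbers would need to be compared part by part rather than as text, so that 1.10 sorts after 1.2. A minimal Python sketch, with hypothetical helper names `outline_key` and `parent_of` (nothing like this exists in the current script):

```python
def outline_key(number):
    """'1.10.1' -> (1, 10, 1), so that 1.10 sorts after 1.2."""
    return tuple(int(part) for part in number.split("."))


def parent_of(number):
    """'1.10.1' -> '1.10'; a top-level number such as '1' has no parent."""
    head, sep, _ = number.rpartition(".")
    return head if sep else None
```

Sorting with `sorted(numbers, key=outline_key)` gives the outline order, and `parent_of` identifies which group or character an entry belongs under.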
Adding character groups currently means reprocessing the entire updated
list. (Otherwise we'd have to check whether the character group already
existed and update the legal values for each level.)
I'd like to replace all the repeated things in the XML files with the
minimum amount of information we need to access the related files. For
instance, file="/home/openkey/public_html/prairieplant/character/xml/
should only need to be the word "character". The only problem keeping us
from doing this is the XSL files that transform the XML to other XML
and/or HTML.
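One way to picture the shortened form: store only the kind ("character") and the name, and expand them to a full path when a file is needed. The Python sketch below is purely illustrative; the function `resolve` and its arguments are assumptions, and it does not address the XSL files mentioned above.

```python
# The three kinds of files used in the example paths.
KINDS = {"characterGroup", "character", "state"}


def resolve(project_root, kind, name):
    """Expand a short reference such as ("character", "life_span") to a full path."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{project_root.rstrip('/')}/{kind}/xml/{name}.xml"
```

With the project root kept in one place, moving a collection to another server would mean changing a single value instead of every `file` attribute.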