Government information publishing on the web encounters differing expectations concerning the permanence of documents. Because the web was introduced only about a decade ago, informal notions of web authoring and publishing from its earliest days persist in many circles. But terms like "government documents" convey, at least to the layman, an expectation of formality, official content, and permanence. If a permanent archive of electronic documents is to be constructed, the problems of locating not only the appropriate document but also a specific version of that document must be addressed. Metadata can contribute to the solution of these problems, but doing so raises issues of metadata quality and of consistency in metadata generation.
In the Preserving Electronic Publications project, we examined the complete web-accessible electronic document inventories of the US states of Illinois and Arizona. Statistics describing the profiles of the document inventories of these states are presented herein. Further analysis was done of all markup-language documents to determine the current extent of metadata incorporation. Generally speaking, metadata authoring at the individual state agency level, as expected, was verified to be "just beginning".
In addition to embedded HTML META tags, other useful descriptive information is often available in the header of the HTTP messages sent from the web server to a requesting client program (e.g., a user's web browser, or our "web spider" acquisition system). Classification software based on an analysis of included keywords and phrases could automatically contribute some metadata. And, extrapolation of the subject(s) of a poorly-tagged document from the subjects of its hypertext neighbors is another stop-gap measure. The author believes the use of automated methods to generate, infer, or extract such metadata to be the only economical alternative to the manual retrofitting of metadata into the very extensive document inventories of government agencies and other large organizations.
Government information on the web is unusual in that it simultaneously occupies two publishing niches with differing expectations concerning the permanence of documents. Because the web was introduced only about a decade ago, informal notions of web authoring and publishing from its earliest days persist in many circles. But terms like "government documents" convey, at least to the layman, an expectation of formality, official content, and permanence. If a permanent archive of electronic documents is to be constructed, the problems of locating not only the appropriate document but also a specific version of that document must be addressed. Metadata can contribute to the solution of these problems, but doing so raises issues of metadata quality and of consistency in metadata generation.
In our examination of the websites of two state governments we have found the contradictory expectations of permanence and formality exemplified in a variety of ways. For example, stylistic "look-and-feel" changes are routinely made to webpages, changes that would rarely have been made to materials in print media. Website host machines are renamed more often than an organization would change its toll-free telephone number. And documents are found stored in directories named for the author, rather than in a directory named to reflect the more permanent role of the document or the formal sub-group responsible for it. A parallel question concerns the level of formality and consistency that can be anticipated in the locally authored metadata embedded within these documents.
Under a National Leadership grant "Preserving Electronic Publications" (PEP) [PEP] from the U.S. Institute of Museum and Library Services, the Illinois State Library (ISL) and the Graduate School of Library and Information Science (GSLIS) at the University of Illinois, Urbana-Champaign, have constructed and operated an electronic document archive for 118 State of Illinois agency websites for the past year. The PEP archive was constructed using web "spider" software to traverse all embedded hyperlinks and to download every file so referenced that was found to reside on a web server belonging to the agency in question. Downloaded files were retained using a version control system so that any prior version of a document remains available.
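The retention policy just described can be illustrated with a small sketch. It is not the PEP implementation: the project used a version control system, whereas this example keeps versions in an in-memory dictionary. The class and method names (`VersionedArchive`, `store`, `versions`) and the use of a SHA-256 digest to detect change are assumptions made for illustration.

```python
import hashlib


class VersionedArchive:
    """Toy archive that keeps every distinct version of each downloaded file."""

    def __init__(self):
        # url -> list of (digest, content) tuples, oldest version first
        self._versions = {}

    def store(self, url, content):
        """Record content for url; return True only if a new version was added."""
        digest = hashlib.sha256(content).hexdigest()
        history = self._versions.setdefault(url, [])
        if history and history[-1][0] == digest:
            return False  # unchanged since the last acquisition run
        history.append((digest, content))
        return True

    def versions(self, url):
        """Return all retained versions of url, oldest first."""
        return [content for _, content in self._versions.get(url, [])]
```

Under these assumptions, repeated acquisition runs add a version only when the bytes of a file differ from the most recently stored copy, so earlier versions stay retrievable without storing unchanged files again.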
The software, statistical summaries, and various reports produced by the PEP project are available online at [Jackson], including a preliminary statistical summary covering only five Illinois agencies. To gather comparative statistics for the State of Arizona, the PEP software was also run, one time, for 84 of the Arizona agency websites. Both Illinois and Arizona utilize statewide search engines built on the collaboratively developed "Find-It!" architectural plan, as described below. High-level statistical summaries of the web materials of the compared states are provided in table 1.
Table 1. Comparison of overall website measures between Illinois and Arizona state government webs. These categorizations do not rely on file extensions provided by the web servers.

| Measure | Illinois | Arizona |
|---|---|---|
| Number of websites | 118 | 84 |
| Size of the download, in bytes | 27,671,171,072 | 10,895,245,312 |
| Number of files | 365,151 | 227,960 |
| Markup-language files | 232,408 | 156,188 |
| Plain-text files | 6,054 | 10,819 |
| Binary files | 126,689 | 60,953 |
| Websites without any META tags useful in search | 22 | 10 |
| Average number of META tags useful in search, per markup-language file | 3.06 | 0.89 |
| Average number of unique types of META tags useful in search, per website | 13.55 | 5.74 |
The architecture of Find-It! search facilities reflects work by several U.S. states to improve precision and recall in queries for state government information on the web. Because of the highly similar nature of many state agencies to their counterparts in other states, web-wide search engines generally exhibit considerably less precision in retrieval than might be hoped. The webpages of the counterpart agencies in other states are typically included in the search results, although a citizen is presumably only interested in the agency within his or her own state of residence. And, many web-wide search engines do not capitalize on classificatory information supplied inside webpages using META tags. Search facilities, such as Find-It!, that are locally operated and tuned to capitalize on locally specified metadata are being relatively widely used in an attempt to provide better state-specific searching.
Many U.S. states have met in annual conferences of an un-chartered consortium for a few years (e.g., see the last two State GILS conference webpages at [GILS-3] and [GILS-4]). In these conferences, and an associated e-mail list, state representatives and implementers share ideas, lessons learned, open source or locally developed software, and software configuration materials such as search engine rule sets. Several of these states use a variant of the "Find-It!" search engine system philosophy. The principal fixture of the Find-It! approach is the use of locally generated metadata, conformant with a state-specific variation on a state government classification thesaurus, as a principal input to a search engine. The primary motivations for decentralized metadata authorship seem to be both the budgetary limitations on the State Library or archival agency staff size, and a desire to enfranchise agencies concerning how their documents are represented to users. The search engines themselves come from various vendors, and may therefore incorporate other features, such as whole-text search, in developing their relevance rankings of retrieved documents. As a result, Find-It! implementations can differ from one another.
ISL operates the Illinois-customized version called "Find-It! Illinois" (see [ISL1]). As part of their efforts to facilitate agency adoption of the Find-It! system, ISL has created an online metadata generator at [ISL2] that produces syntactically correct HTML META tags in response to a user filling in a web form concerning the contents of a document. Other states have comparable software tools. ISL has also produced an online recommended practices document for Illinois webmasters at [ISL3].
Though first implemented at the Washington State Library, the classification thesaurus developed by member agencies of the State GILS consortium [GILS Topic Tree] is now maintained by ISL. Several states use derivatives of that thesaurus, making local customizations. Some states use other mechanisms, including variations on the provisions of the Dublin Core metadata conventions [Dublin_Core].
Overall, the Find-It! design philosophy relies heavily on the generation of high-quality, consistent metadata by the agencies authoring the individual documents. This paper describes the current extent to which agency authors have implemented metadata within their documents, and what this seems to imply concerning the viability of accomplishing this large editing task. Automated alternatives to manual metadata generation are discussed.
It was hoped in the PEP project that well-described, standard-format documents, employing standard metadata conventions, would be easily acquired, retained in an electronic archive, and subsequently searched for and retrieved. It turned out that multiple obstacles exist to the automated implementation of an electronic archive of web documents, and that a number of implementation prerequisites and "good practices", such as systematic metadata employment, are still in the process of being implemented statewide.
The PEP project began acquisition of the State of Illinois websites in mid-January 2002. Acquisition of state website materials continues on a roughly monthly cycle. An example side-by-side comparison of a substantially changed website is shown in figure 1, where the current version of the Illinois Department of Transportation homepage is compared with the mid-January version from the PEP archive. During the period of the IMLS grant, a number of previously unknown websites were found and added to the PEP acquisition mechanism. In attempts to locate official Illinois state government websites, we employed a number of measures including (1) starting with a list previously developed by ISL, (2) performing deliberate breadth-first searching of known websites looking for hyperlinks to other seemingly official websites, (3) searching with popular Internet search engines and a number of general keywords, and (4) searching with popular Internet search engines and specific terms taken from the hard-copy Illinois telephone directory [IL_DCMS]. Roughly twenty additional websites were so discovered. State of Illinois websites have been found with host computer names ending in ".state.il.us", ".net", ".org", and ".com".
Figure 1. Side-by-side comparison of the current version of the Illinois Department of Transportation homepage with the PEP archive version of January 2002.
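The breadth-first search for hyperlinks to other websites, measure (2) above, might be sketched as follows. This is an illustration under stated assumptions, not the PEP spider: `discover_hosts` and its `get_links` parameter are hypothetical names, and `get_links` stands in for whatever code fetches a page and returns the URLs it links to. Hosts found outside the known list are only candidates; deciding whether one is "seemingly official" remains a human judgment.

```python
from collections import deque
from urllib.parse import urlparse


def discover_hosts(seed_urls, get_links, max_pages=1000):
    """Breadth-first walk of known sites, collecting hosts not on the seed list.

    get_links is a caller-supplied (hypothetical) function returning the
    hyperlink URLs found on a page.
    """
    seed_hosts = {urlparse(url).netloc.lower() for url in seed_urls}
    queue = deque(seed_urls)
    visited = set(seed_urls)
    candidates = set()
    pages_processed = 0

    while queue and pages_processed < max_pages:
        page = queue.popleft()
        pages_processed += 1
        for link in get_links(page):
            host = urlparse(link).netloc.lower()
            if not host:
                continue  # relative or malformed link; ignored in this sketch
            if host not in seed_hosts:
                # Unknown host: record it for human review, do not crawl it.
                candidates.add(host)
            elif link not in visited:
                visited.add(link)
                queue.append(link)
    return candidates
```

A queue (rather than a stack) gives the breadth-first order, so pages near the known homepages are examined before deeply nested ones.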
Not all file types support the embedding of metadata within the file. Some file types have mechanisms for embedding forms of metadata defined by the vendors of the editor for that file type, but this set of metadata may not coincide with that of the search engine in use by a particular state. And proprietary file types might not be processable by software other than that of the vendor. So there is the possibility that embedded metadata will not be useful for some types of government documents, depending on the file types used to contain those documents or portions thereof. Examining those files of the states of Illinois and Arizona whose file extension indicates a file type, some important results emerged. First, and completely unexpectedly, the two states' distributions of file type usage, as percentages of the total, agreed to within one-half of one percentage point in every file type category. The distribution of file type usage is presented in table 2, and is summarized graphically for Illinois in figure 2 and for Arizona in figure 3. Second, a preponderance of the files in both states was found to be markup-language files. Third, of the numerous word-processor-like document file formats from which states might choose, these two states have almost exclusively adopted Adobe Portable Document Format (PDF).
Table 2. Comparison of file type distribution between Illinois and Arizona state government webs.

| File Type | Illinois | Arizona |
|---|---|---|
| Markup language | 67.0 % | 67.4 % |
| Word processor documents | 16.4 % | 16.3 % |
| ASCII text documents | 4.5 % | 4.5 % |
| Still images | 11.0 % | 10.8 % |
| Other applications | 0.2 % | 0.1 % |
| Executable code | 0.7 % | 0.6 % |
| Other (unknown) | 0.2 % | 0.2 % |
Figure 2. File types for Illinois files providing a file extension.
Figure 3. File types for Arizona files providing a file extension.
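A distribution of the kind shown in table 2 could be tallied along the following lines. This is only a sketch: the mapping from extensions to categories is an assumption made for illustration (the paper does not list the extensions it used), and `file_type_percentages` is a hypothetical name. Files without an extension are skipped, matching the restriction to files that provide one.

```python
from collections import Counter
import posixpath
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete mapping of extensions to categories.
CATEGORY_BY_EXTENSION = {
    ".html": "Markup language",
    ".htm": "Markup language",
    ".xml": "Markup language",
    ".pdf": "Word processor documents",
    ".doc": "Word processor documents",
    ".txt": "ASCII text documents",
    ".gif": "Still images",
    ".jpg": "Still images",
    ".exe": "Executable code",
}


def file_type_percentages(urls):
    """Return {category: percent of files}, counting only files with an extension."""
    counts = Counter()
    for url in urls:
        extension = posixpath.splitext(urlsplit(url).path)[1].lower()
        if not extension:
            continue  # no extension, so no file type can be inferred from the name
        counts[CATEGORY_BY_EXTENSION.get(extension, "Other (unknown)")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100.0 * n / total, 1) for category, n in counts.items()}
```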
Included within table 1 are two measures of the extent to which agencies in the two sampled states are currently providing metadata useful for the Find-It! search facilities. The state libraries of both states have been encouraging and facilitating the incorporation of metadata for over two years. However, the costs of retrofitting metadata into large state document inventories are not necessarily funded. Further, as the state libraries are not superordinate to most of the numerous other agencies within a state government, the priority those agencies associate with retrofitting metadata may be low. Both states have several agencies that have not yet begun to add even one line of metadata. Other agencies, possibly those employing document management systems to aid them in inventory control of thousands of web pages, have tens of thousands of HTML META tags already embedded in documents. Both state libraries make available tools and instructional materials to facilitate the process of metadata generation and application, but they are not performing the classification work for the individual agencies.
The average number of META tags useful in search, per markup-language file, excludes META tags that identify the authoring tool or templates used in the creation of the file. While it would be possible for an authoring tool vendor to program their web browser software to behave differently based on information contained within META tags (e.g., through the imposition of a set of style-like backgrounds and borders), inspection of several Illinois agency websites did not find a case where such processing was in use. The "generator", "template", and "progid" META tags, plus Microsoft tags related to style and border defaults, are ignored herein.
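The filtering described in this paragraph can be sketched with Python's standard `html.parser` module. This is an illustration, not the PEP analysis code. The exclusion set follows the three tag names given above; treating any name beginning with "microsoft" as a style or border tag, and skipping META tags that carry no `name` attribute (such as `http-equiv` tags), are assumptions. `MetaTagCollector` and `useful_meta_names` are hypothetical names.

```python
from html.parser import HTMLParser

# Tag names identifying the authoring tool or template, per the text above.
IGNORED_NAMES = {"generator", "template", "progid"}
# Assumption: Microsoft style/border tags share this name prefix.
IGNORED_PREFIX = "microsoft"


class MetaTagCollector(HTMLParser):
    """Collect the names of META tags that might be useful in search."""

    def __init__(self):
        super().__init__()
        self.useful_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").strip().lower()
        if not name:
            return  # assumption: tags without a name attribute are not counted
        if name in IGNORED_NAMES or name.startswith(IGNORED_PREFIX):
            return
        self.useful_names.append(name)


def useful_meta_names(html_text):
    """Return the names of search-relevant META tags, in document order."""
    parser = MetaTagCollector()
    parser.feed(html_text)
    parser.close()
    return parser.useful_names
```

Under these assumptions, the length of the returned list gives a per-file tag count, and the set of names pooled across the files of one website gives the number of unique tag types for that site.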
The results listed for the average number of META tags useful in search, per markup-language file, are disappointing in that so little author-generated metadata has been applied to files to date. Further, as the number of META tags sought, and supported by tools such as the Metadata Generator at ISL [ISL2], is typically over twenty per document, the current low averages indicate that only a small fraction of the current web document inventory has had classificatory metadata assigned to date.
At the fourth State GILS conference [GILS-4], there were expressions of interest in possibly minimizing metadata retrofitting costs by only tagging those webpages that are the introductions to major sections of an agency website. Such a process would facilitate the user finding the section starting point, via Find-It! style search engine, but users would then be "on their own" to browse through the hyperlinks they find there. While this approach sounds plausible, the statistics returned indicate it is not the cause of the very small numbers of META tags found in some websites. If so much as the section heading documents were being fully catalogued, the analyses would show a large number of unique types of META tags for that agency. That may be happening in many agencies, but many others have simply almost no metadata of any kind.
A number of assumptions made in the development of the freeware subassemblies used in PEP combined to be problematic for the automatic population of an electronic archive. In particular, the widespread assumption that the characters following the identifier of a website host computer in a Uniform Resource Locator (URL) somehow equate to a physical location is problematic. Multiple problems were encountered with the operation of our selected web spider where creation of directories and files, named per character strings taken from URLs, resulted in erroneous operation of other PEP components or standard UNIX commands. Differing metacharacters (characters with a special meaning to a particular host machine or program) between website host machines and the host machines of the PEP archive were also problematic. A very few websites had to be excluded from this analysis as their implementation technologies proved too dissimilar from the design basis of the PEP web spider and other components. Nevertheless, the statistics presented here represent our best efforts to date to detect and correct, or at least, to compensate for, this variety of errors.
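One way to avoid treating URL text as a directory path is to percent-encode the whole URL into a single flat file name. The sketch below is an assumption-laden illustration, not the approach taken in PEP; `url_to_archive_name` and `archive_name_to_url` are hypothetical names. It relies on `urllib.parse.quote` with `safe=""`, which encodes "/" and most characters that are special to a UNIX shell, leaving only letters, digits, and the characters `_`, `.`, `-`, and `~`.

```python
from urllib.parse import quote, unquote


def url_to_archive_name(url):
    """Encode a URL as one flat file name containing no '/' characters."""
    # safe="" means even "/" is percent-encoded, so the URL text never
    # implies a directory structure on the archive host.
    return quote(url, safe="")


def archive_name_to_url(name):
    """Recover the original URL from an encoded archive file name."""
    return unquote(name)
```

A practical limitation of this sketch is length: many file systems cap a file name at 255 bytes, so very long URLs would need a different scheme, such as a digest of the URL paired with a lookup table.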
The Illinois web grew much faster than expected in this period, particularly for its largest websites. The largest Illinois website in terms of bytes, that of the Illinois Pollution Control Board, increased in size by a factor of 50 in just over nine months. The growth rates of several websites exhibiting pronounced growth are illustrated in figure 4. Growth rates of such magnitude can be highly problematic for both the website administrative staff and for an associated electronic document archive. Other websites exhibited substantial reductions in size, generally followed by a return to something like their former size, as illustrated in figure 5. Most websites exhibited continued, manageable growth.
Figure 4. Sizes of several Illinois websites exhibiting pronounced growth.
Figure 5. Sizes of several Illinois websites exhibiting pronounced decrease, followed by a return to their former size.
It seems unlikely that the extreme rates of web authoring encountered for a very few agencies will be sustained. It seems more probable that they represent bursts of activity within the agencies, or capitalization upon an existing, easily reformatted, electronic document collection. Considering the size of the state government staff in Illinois, it seems highly unlikely that hundreds of megabytes of documentary materials are being produced as a routine daily event. The spider-produced size of the website collection we process increased from 5.6 gigabytes in January 2002 to 27.7 gigabytes in September 2002, an increase of 22.1 gigabytes in roughly 35 weeks. Assume that 72 percent of the current Illinois web size in bytes is made up of text files or markup-language files (which cannot contain embedded images that would contribute a disproportionate number of bytes), and that a typist works 40 hours per week of uninterrupted ("no breaks") typing at 50 words per minute, where 5 bytes define one word. Producing that increase by typing would then require the employment of 758 such abused typists. It is clear from the occupancy of the office buildings in Springfield that these battalions of employees do not exist.
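The typist estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated above; the function name `typists_required` is hypothetical, and gigabytes are taken as decimal (10^9 bytes), which is the reading that yields the stated result.

```python
import math


def typists_required(growth_bytes, text_fraction, weeks,
                     hours_per_week=40, words_per_minute=50, bytes_per_word=5):
    """Estimate the typists needed to key in the text share of the growth."""
    # Output of one typist over the whole period, in bytes.
    bytes_per_typist = (words_per_minute * bytes_per_word
                        * 60 * hours_per_week * weeks)
    text_bytes = growth_bytes * text_fraction
    return math.ceil(text_bytes / bytes_per_typist)


# 22.1 decimal gigabytes of growth, 72 percent text or markup, 35 weeks.
estimate = typists_required(22.1e9, 0.72, 35)
print(estimate)  # 758
```

Each typist produces 50 × 5 × 60 × 40 × 35 = 21,000,000 bytes over the period, and 72 percent of 22.1 gigabytes is about 15.9 billion bytes, giving roughly 757.7, which rounds up to 758.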
On examining the website that increased the most in the period, that of the Illinois Pollution Control Board, we find a very large number of documents that represent minutes of meetings or hearings, or rulings issued in any number of petitions for exceptions. These documents date back only a few years. It seems likely that these documents were "born digital" (i.e., created using a word processor), and that these materials have been translated into HTML or PDF using options available in the "Save As..." menu selection of the word processor(s). Such an effort, while large and tedious, would be far easier than that expended in the initial typing of years' worth of material. If this is indeed the case, we should expect such agencies to exhaust their supply of born digital materials, with an eventual tapering off of the rate of increase in their websites. The Illinois Pollution Control Board website has exhibited far less growth in size since mid-July, possibly for this reason. Between August 14, 2002 and November 4, 2002, the PEP CVS copy of this website increased in size by only 9.4 percent. Any number of other agencies may have sizeable inventories of born digital materials they will eventually post to the web, so electronic archive growth rates should not be presumed to be regular and predictable.
Websites are often very volatile, at the agency level, especially with regard to their size. This volatility means government information is being lost now, and measures to prevent such loss cannot wait. The urgency for construction of suitable electronic document archives is as great as the importance of the government documents being lost. While not all government documents are high quality materials such as would have been archived in print form, web publishing should not be assumed to imply unimportance. Further, archival in electronic form can be a means of cost reduction for archives in that individual handling of documents over the life of the archive might be eliminated, or much reduced, by program-controlled processing of whole classes of documents simultaneously.
If an electronic document archive is to be constructed, very much the same metadata that supports search through the current contents of the web can also contribute to the search of the archive. However, measures must be taken to acquire and retain metadata for document types that do not support the embedding of metadata.
Presumably due to a number of causes, metadata retrofitting has only begun for the vast majority of the state agency websites examined. While we can begin automated acquisition of web-published materials now, these materials are largely lacking in metadata. If this metadata is retroactively applied at some point in the future, we will need a means to identify that the edited document is to replace the earlier version acquired for the archive. Or, we need sufficiently capable information retrieval tools so as to be able to achieve reasonable usability of electronic document archives without dependence on consistently available, high-quality metadata. An automatic facility for high-quality analysis of whole-text materials might suffice.
Considering the total expense of retrofitting metadata into the large document inventories of organizations as large as state governments, it would be highly beneficial to know for certain that such metadata actually improves searching. Formally controlled testing needs to address the cost/benefit ratio of a variety of ways through which web document collections can be made searchable. For example, formal cataloging by trained staff at a central facility (such as a state library) might be the best controlled, but also the most expensive, metadata implementation option. But is agency-authored metadata reliable, even when authors are provided with a standardized thesaurus? This seems a reasonable question, particularly considering communications problems in dealing with the distributed authoring and administration of numerous websites within a state government. And if agency-authored metadata does not contribute sufficiently to improved information retrieval in comparison to whole-text search of the documents themselves, the expense of retrofitting metadata might be unjustified. If the value of such metadata cannot be substantiated, canceling plans for metadata retrofitting could save governments much money. Automated methods for topical metadata inference, applying data mining and clustering techniques to the text of documents, are another possible lower-cost solution, if provably effective.
ISL is taking steps to put in place a two-pronged approach to the retention of electronic documents. The current PEP spider-based duplication of whole agency websites is attractive for its relatively low cost of operation, but suffers from potentially becoming unaffordable if exponential web growth continues for even a few more years. It also suffers in that it cannot differentiate which of the webpages it processes are potentially the most important to history, and thus most deserving of permanent retention. Further, asking the numerous agencies to retrofit metadata into their complete document inventories calls on them to expend quite a lot of administrative labor.
A second approach to the problem of electronic document archival will create a web-accessible electronic document depository facility where state agencies will deliberately deposit (upload) their materials that are to be permanently retained. By employing the agency staffs in deciding which documents warrant long-term retention, it is hoped the quantity of data being stored will be very substantially reduced. Also, with hopefully much smaller numbers of documents intended for long-term retention, much more complete metadata coverage of those documents should be practicable.
The metadata generated at the time of document deposition will be used to produce a webpage very much like a card from a traditional card catalog. A prototype of such an "access card" is shown in figure 6. This card will provide a place wherein we can;
Figure 6. Sample electronic document "access card" webpage, wherein document metadata is displayed to users and made available to search engines.
In the interest of assisting agencies in making their decisions concerning which documents warrant deposition and long-term archival, we have been discussing the use of questions such as those listed in table 3 as a kind of checklist. A group of such questions could help non-librarians involved in web-based document production within the agencies identify documents of special interest.
Table 3. Questions Useful in Differentiating Official from Unofficial State Agency Publications

Note: E-Mail, memoranda, databases, and internal reports are not considered official agency publications of the Illinois state government.
The existing PEP mechanisms for acquisition of the Illinois web would continue, serving multiple roles. First, the PEP holdings would be a form of insurance against the possibility of loss of documents warranting archival that were not so deposited before being accidentally deleted. Second, the PEP holdings would be available to support the future creation of an online form of digital archive, if that is desired. Third, the PEP acquisitions and acquisition system can be used to feed large sets of highly structured documents (e.g., those produced by document management systems or formally regulated staff such as the Legislative Information System of the Illinois General Assembly) directly into the Depository, obviating the need for clerical and metadata generation action for each document.
[Dublin_Core] Dublin Core Metadata Initiative homepage
http://dublincore.org/
[GILS-3] Illinois State Library, editors. 3rd Annual State Government GILS Conference, Springfield, IL, March 27-30, 2001.
http://www.library.sos.state.il.us/library/isl/gils/gils_conf.html
[GILS-4] Arizona State Library, Archives and Public Records, editors. 4th State GILS Conference, Scottsdale, AZ, April 2002.
http://rpm.lib.az.us/4thGILS/index.html
[IL_DCMS] Illinois Department of Central Management Services. 2001-2002 State of Illinois Telephone Directory.
[ISL1] Illinois State Library. "Find-It! Illinois" homepage
http://finditillinois.org/
[ISL2] Illinois State Library. "Metadata Generator" web page
http://www.finditillinois.org/metadata/index.html
[ISL3] Illinois State Library. "Resources for Illinois State Agency Webmasters" web page
http://www.finditillinois.org/metadata/webmasters.htm
[Jackson] Larry S. Jackson. Preserving Electronic Publications project -- GSLIS materials webpage
http://www.isrl.uiuc.edu/pep/
[PEP] Joe Natale, Principal Investigator. Preserving Electronic Publications project homepage.
http://www.cyberdriveillinois.com/library/isl/lat/pep/pep.html