Government Publishing on the Web:
An example Management Tool for large Data Repositories


Bruce Mcleod, Spirit Networks, Email:brucemc@spirit.com.au
Keywords: Microsoft Word, Management Tools, Dynamic HTML, Data Repository, Text Retrieval

Introduction

This paper discusses some of the data management issues concerned with the storage of large Data Repositories of text.

Normally in Government there are large existing Repositories of data, held either as Legacy Data on Mainframes or as Word Processing Documents on File Servers.

Without having to go through a large HTML conversion process and data redundancy headache, the development documented in this paper describes how data can be -

Background:

In response to a specific Departmental Requirement we developed a Web Server that would deliver specific functionality. In creating our version of this required functionality we developed a solution that has great applicability to Government and Corporate requirements. As this conference is mainly Academic in nature and not a Commercial forum, we have kept our paper technically informative, and business references are confined to a brief section at the end. However, please bear in mind that this is a straight commercial development and not the result of any academic research project.

Requirements for the Management of Large Text Based Data Repositories

Web Publishing has virtually come out of nowhere. In a Computer Industry that has long been seeing only incremental Advances in Processor Speed, Operating Systems and Desktop Software, the Web is the one bright new shining star on the scene. This has caused a large amount of Movement in the form of Hype and Home Pages, the initial explosion of this Supernova. Now, in the less intense light of day, existing data systems need to be brought into the arena of the Web, and with that comes the need for integrated, managed data systems that support the Web as a feature.

In the current incremental world of text based data repositories, Document Management systems have swung into play to grade, sort through and control the data repositories of Commercial and Government entities. These typically work as overlay components to Word Processing products such as Word Perfect and Microsoft Word.

Some attempt has been made to address these formats with filters such as "RTF to HTML" and Microsoft's "Internet Assistant", but these miss the point: Document Management systems immediately lose control of the converted HTML, and the duplication of data becomes a headache. By serving the fully text indexed body of data through a CGI based application, as dynamic HTML read directly from Microsoft Word and similar formats, a large collection of databases can be left unchanged and still be made available on a Web Service.

It should be pointed out that these Web systems are still mainly useful to Government bodies as most Commercial entities protect their data and derive commercial gain from their databanks. It will take an Internet that considers charging for data on the Web as normal to make this feasible. This will only occur when the techniques for charging on the Internet are accepted. By this I mean accepted technically and philosophically.

The User Interface

The User Interface is straightforward. The issue was to provide a method of access to Gigabytes of data in multiple databases.

Selecting Databases

On accessing the Home Page the user is presented with a list of checkboxes, each corresponding to a text database. The user selects one or more databases and a search format screen is then shown.

Searching Data

From here a user may search for data using full boolean searching, with the search being built up interactively. This was done by offering Booleans such as And, Or and Except, together with Proximity Searches, as selections on the initial screen. When one of these is clicked, the interface displays the search text so far and allows further search terms to be added. This method attempts to skirt an obvious deficiency of forms based Web Access: the user interface is not dynamic and will not, for instance, change the fields on a form when an item from a list is selected.

The way around this is to allow Session Globals to persist between Web Screens by passing the partial parameters back to the server so that they can be served out again. Only in this way can a Web Client retain any concept of Session Globals. The same is true for Result Lists of Data.
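
The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the system described in this paper: the field names (q, op, term), the operator spellings, the /cgi-bin/search path and all function names are hypothetical. The partial search string travels in a hidden form field, so the server itself holds no session state between screens.

```python
from html import escape
from urllib.parse import parse_qs

# Hypothetical operator spellings; the paper names And, Or, Except and Proximity.
OPERATORS = ("AND", "OR", "EXCEPT", "NEAR")


def extend_query(query_so_far, operator, term):
    """Append one operator/term pair to the partial search string."""
    term = term.strip()
    if not term:
        return query_so_far          # nothing typed: leave the query unchanged
    if not query_so_far:
        return term                  # first term needs no operator
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f"{query_so_far} {operator} {term}"


def render_search_form(query_so_far):
    """Return an HTML form that carries the partial query in a hidden field."""
    options = "".join(f"<option>{op}</option>" for op in OPERATORS)
    return (
        '<form method="POST" action="/cgi-bin/search">\n'
        f"<p>Search so far: {escape(query_so_far)}</p>\n"
        f'<input type="hidden" name="q" value="{escape(query_so_far, quote=True)}">\n'
        f'<select name="op">{options}</select>\n'
        '<input type="text" name="term">\n'
        '<input type="submit" value="Add term">\n'
        "</form>"
    )


def handle_request(form_body):
    """Decode a submitted form body, extend the query, and serve the form again."""
    fields = parse_qs(form_body)
    query = fields.get("q", [""])[0]
    operator = fields.get("op", ["AND"])[0]
    term = fields.get("term", [""])[0]
    return render_search_form(extend_query(query, operator, term))
```

Calling handle_request("q=contract&op=EXCEPT&term=lease") returns a form whose hidden field holds "contract EXCEPT lease", ready to be extended again on the next submission.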

Displaying Database Results

For each Database selected a separate item in a formatted HTML list is returned. Each Database Name is active Hypertext that can be clicked to show Document Lists of Hits.

Displaying Document Lists

Each Document in a Document List supports these functions:

Each Document List provided to the user is the User's Control Point. If a Bookmark is stored to the Page, the results remain active until the search list is purged from the Server, after a configurable period.
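
One way to keep bookmarked result lists alive for a configurable period is sketched below in Python. It is a minimal sketch under assumptions of our own: the class name, the token scheme and the in-memory dictionary are all hypothetical, and the paper does not say how the server actually stores or purges its lists.

```python
import time
import uuid


class ResultListStore:
    """Keeps search result lists on the server until they are too old."""

    def __init__(self, max_age_seconds=3600, clock=time.time):
        self.max_age_seconds = max_age_seconds   # the configurable purge period
        self._clock = clock                      # injectable for testing
        self._lists = {}                         # token -> (created_at, document_ids)

    def save(self, document_ids):
        """Store a result list and return the token to embed in the page URL."""
        token = uuid.uuid4().hex
        self._lists[token] = (self._clock(), list(document_ids))
        return token

    def load(self, token):
        """Return the stored list, or None if it was purged or never existed."""
        self.purge()
        entry = self._lists.get(token)
        return entry[1] if entry else None

    def purge(self):
        """Drop every result list older than the configured maximum age."""
        cutoff = self._clock() - self.max_age_seconds
        expired = [t for t, (created, _) in self._lists.items() if created < cutoff]
        for token in expired:
            del self._lists[token]
```

A bookmark then only needs to contain the token; once the list has been purged, load returns None and the server can ask the user to repeat the search.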

Displaying Dynamic HTML

When a user asks to browse a document in a list, they are given back 100 words of data containing the first hit. This establishes a limit on the data that must be received for the important process of browsing to occur.
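
The 100 word extract can be illustrated with a short Python sketch. The paper only states that the extract contains the first hit; centring the window on the hit, splitting on whitespace and matching by case-insensitive substring are assumptions made here for illustration, and the function name is hypothetical.

```python
def hit_window(text, term, window=100):
    """Return (start_index, words) for a window of words around the first hit.

    Falls back to the first `window` words when the term does not occur.
    """
    words = text.split()
    target = term.lower()
    # Index of the first word that contains the search term, if any.
    hit = next((i for i, w in enumerate(words) if target in w.lower()), None)
    if hit is None:
        return 0, words[:window]
    # Centre the window on the hit, then clamp it to the document bounds.
    start = max(0, hit - window // 2)
    start = min(start, max(0, len(words) - window))
    return start, words[start:start + window]
```

The returned start index is what a "previous" or "next" style link would need in order to request the neighbouring window.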

Hypertext navigation tools appear on the screen above and below the text extracted for the Native WP Document. These tools provide for:

Much work was devoted to this, as well as to the formatting of the user interface for the Search String. In each case it is because Huge Data Repositories contain much Data of a similar nature, and the process of narrowing down a search is critical in assisting in the Location of Documents. (This does not take into account any Hypertext Access systems that may be constructed to access the data from outside the CGI application. This is certainly possible.)

Downloading the Documents

In most situations, Document Stores are controlled by custom File Servers such as Novell and Ethershare (on Apple LANs). These stores are also made accessible by Anonymous FTP, which allows direct links to the documents. Documents stored with .DOC extensions and the like can be launched directly by a Web Browser and viewed in Native format.
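
As an illustration of such a direct link, the Python sketch below builds an HTML anchor for a document on an anonymous FTP server. The host name and path in the usage note are placeholders, and the function name is hypothetical; the only point shown is that spaces and similar characters in file names must be percent-encoded in the URL.

```python
from html import escape
from urllib.parse import quote


def ftp_document_link(host, path):
    """Return an HTML anchor pointing at a document on an anonymous FTP server."""
    clean_path = path.lstrip("/")
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    url = f"ftp://{host}/{quote(clean_path)}"
    label = clean_path.rsplit("/", 1)[-1]        # show just the file name
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'
```

For example, ftp_document_link("ftp.example.gov.au", "/legal docs/act 1995.doc") yields a link to ftp://ftp.example.gov.au/legal%20docs/act%201995.doc labelled "act 1995.doc".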

Faxing the Documents

Faxing of Native Word Processing Documents requires a Desktop Workstation to open the Document in the appropriate Word Processor and print it to a Fax Modem. This was done using AppleScript on Apple but could be done just as easily in Visual Basic.

Emailing Documents

Mailing of Large Documents makes sense as it does not impact on a Web Session, and the User can continue to browse the Databases. It will make more sense if the Internet moves to different charging rates for On-Line bandwidth usage as opposed to background traffic such as Mail. In our system we allow Documents to be attached to a mail message and mailed back to the User. This is again done using AppleScript on Apple, together with Eudora.
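
The system described here drives Eudora through AppleScript. Purely as an illustration of the same idea in another setting, the sketch below uses Python's standard email library to build a message with a native Word document attached. The function name, addresses and subject wording are hypothetical, and the sketch stops short of sending, which would need a mail server.

```python
from email.message import EmailMessage


def build_document_mail(sender, recipient, filename, data):
    """Build a mail message carrying one native Word document as an attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Requested document: {filename}"
    msg.set_content("The document you requested is attached.")
    # application/msword is the registered media type for .DOC files.
    msg.add_attachment(
        data, maintype="application", subtype="msword", filename=filename
    )
    return msg
```

The resulting message could be handed to any mail transport; the Web Session itself only has to queue the request and return to the browsing screens.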

About the Engine

This system was achieved through combining Component Technologies. These are:

Text Retrieval API

No customisation of the Text Retrieval API was required; it handles all of the Features discussed here without Modification. Again this has the effect of a seamless (how often have you heard this?) interface to existing Text Databases.

The API features used were:

as well as a number of other minor calls allowing markup information to be formatted. Not implemented yet are Thesaurus entries allowing search terms to be entered from clickable lists of Narrower, Broader and Synonym entries.

Dynamic Hypertext

By using these calls, not only was it possible to provide the functionality described, but it also allowed standard Hypertext Markers to be inserted in Documents as Hidden Text and to be dynamically linked to from within the System as well as from outside. In this way direct pointers to Hypertext Markers in Native Word documents are fully supported.

Conclusion

We hope this paper has shown an example by which the Web can be applied to existing text systems without alteration. We are aware that many text retrieval systems offer direct indexing of native format documents and any number of these could be used to provide the above. Our point is that this method is preferable to hybrid translation systems requiring intermediate HTML formats. We predict that this will turn out to be the norm in the process of Web integration in the Computing Community.

Commercial References

Approved Systems
Canberra, ACT
Approved Server
61 6 2815344

Text Retrieval System -
TCR (The Corporate Retriever)
QCOM Pty Ltd, Qld
61 7 8393544.

Thanks to -
Commonwealth Attorney-General's Department
for allowing publicly available material to be used for demonstration


Originally published in AusWeb95 The First Australian WorldWideWeb Conference (ausweb95@scu.edu.au)