Government Publishing on the Web:
An example Management Tool for large Data Repositories
Bruce Mcleod,
Spirit Networks,
Email:brucemc@spirit.com.au
Keywords: Microsoft Word, Management Tools, Dynamic HTML,
Data Repository, Text Retrieval
This paper discusses some of the data management issues involved in the
storage of large text based Data Repositories.
Normally in Government there are large existing Repositories of data either
present as Legacy Data on Mainframes or as File Servers full of Word
Processing Documents.
Without requiring a large HTML conversion process and the data
redundancy headache that goes with it, the development documented in this
paper allows data to be -
- Free Text Searched
- Dynamically formatted as HTML
- Presented to the user so that Documents can be delivered by viewing,
downloading, Fax or Email
In response to a specific Departmental Requirement we developed a Web
Server to deliver this functionality. The resulting solution has wide
applicability to Government and Corporate requirements.
As this conference is mainly Academic in nature and not a Commercial forum,
we have kept our paper technically informative; business references are
contained in a brief section at the end.
However, please bear in mind that this is a straight commercial development
and not the result of any academic research project.
Requirements for the Management of Large Text Based Data Repositories
Web Publishing has virtually come out of nowhere. In a Computer Industry
that has been seeing incremental Advances in Processor Speed, Operating
Systems and Desktop Software for a long time, the Web is the only bright
new shining star on the scene.
This has caused a large amount of Movement in the form of Hype and Home
Pages, the initial explosion of this Supernova.
Now, in the less intense light of day, existing data systems need to be
brought into the arena of the Web. This creates a need for integrated,
managed data systems that support the Web as a feature.
In the current incremental world of text based data repositories, Document
Management systems have swung into play to grade, sort through and
control the data repositories of Commercial and Government entities.
These typically work as overlay components to Word Processing products such
as WordPerfect and Microsoft Word.
Some attempt has been made to bring these formats to the Web with filters
such as "RTF to HTML" and Microsoft's "Internet Assistant", but these miss
the point: the Document Management system immediately loses control of the
converted HTML, and the resulting duplication of data is a headache.
By serving the fully text indexed body of data through a CGI based
application as dynamic HTML read directly from Microsoft Word and the like,
a large collection of databases can be left unchanged and available on a
Web Service.
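The idea of formatting an extract as HTML at request time can be sketched
as follows. This is a minimal illustration in Python, not the code of the
system described here. It assumes the text has already been pulled out of
the native document by the retrieval layer, and the function names
highlight_hits and render_extract are hypothetical.

```python
import html
import re


def highlight_hits(text, terms):
    """Escape plain text for HTML and wrap each whole-word search hit in <b>."""
    if not terms:
        return html.escape(text)
    # One pattern for all terms; the capturing group makes re.split keep the hits.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    parts = pattern.split(text)
    out = []
    for i, part in enumerate(parts):
        # Odd positions are the matched hits, even positions the text between them.
        if i % 2:
            out.append("<b>" + html.escape(part) + "</b>")
        else:
            out.append(html.escape(part))
    return "".join(out)


def render_extract(title, text, terms):
    """Build a complete HTML page for one extract; nothing is stored on disk."""
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>\n"
        "<body><h1>" + html.escape(title) + "</h1>\n"
        "<p>" + highlight_hits(text, terms) + "</p></body></html>"
    )
```

Because the page is generated on each request, the native document remains
the only stored copy of the data.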
It should be pointed out that these Web systems are still mainly useful to
Government bodies, as most Commercial entities protect their data and derive
commercial gain from their databanks. It will take an Internet that
considers charging for data on the Web as normal to make such systems
feasible for Commercial entities.
This will only occur when the techniques for charging on the Internet are
accepted. By this I mean accepted technically and philosophically.
The User Interface is straightforward.
The issue was to provide a method of access to Gigabytes of data in
multiple databases.
On accessing the Home Page the user is presented with a list of
checkboxes, each corresponding to a text database.
One or more databases are selected and a search format screen is shown.
From here a user may search for data using full boolean searching, with
the search being built up interactively with the user.
This was done by offering Booleans such as And, Or and Except, together
with Proximity Searches, as selections in the initial screen.
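Building a search one term at a time might look like the sketch below. The
operator spellings (including NEAR for a Proximity Search) and the bracketed
query syntax are assumptions made for illustration; the syntax actually
accepted by the text retrieval system is not given in this paper.

```python
# Hypothetical operator names; NEAR stands in for a Proximity Search.
OPERATORS = ("AND", "OR", "EXCEPT", "NEAR")


def extend_query(query_so_far, operator, term):
    """Return the search text after the user adds one more term."""
    term = term.strip()
    if not term:
        raise ValueError("empty search term")
    if not query_so_far:
        # The first term needs no operator.
        return term
    op = operator.upper()
    if op not in OPERATORS:
        raise ValueError("unknown operator: " + operator)
    # Bracket the existing query so the order of evaluation stays explicit.
    return "(" + query_so_far + ") " + op + " " + term


query = extend_query("", "AND", "copyright")
query = extend_query(query, "and", "software")
query = extend_query(query, "except", "games")
```

Each call corresponds to one round trip to the server, with the search text
so far displayed back to the user before the next term is added.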
If one of these is clicked, the interface displays the search text so far
and allows further search terms to be added. This method attempts to skirt
an obvious deficiency of forms based Web Access: the user interface is not
dynamic and will not, for instance, change the fields on a form when an
item from a list is selected.
The way around this is to allow Session Globals to exist between Web
Screens by passing the partial parameters back to the server so that they
can be served out again. Only in this way can a Web Client retain any
concept of Session Globals. The same is true for Result Lists of Data.
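One common way to pass partial parameters back and forth is to write them
into hidden form fields on each screen. The sketch below assumes that
approach; the paper does not say which mechanism was actually used, and the
URL /cgi-bin/search and the field names are hypothetical.

```python
import html
from urllib.parse import parse_qs, urlencode


def search_form(state):
    """Serve the current session state back out as hidden fields in the next form."""
    lines = ['<form method="POST" action="/cgi-bin/search">']
    for name, value in state.items():
        lines.append(
            '<input type="hidden" name="%s" value="%s">'
            % (html.escape(name), html.escape(value))
        )
    lines.append('<input type="text" name="term">')
    lines.append('<input type="submit" value="Add term">')
    lines.append("</form>")
    return "\n".join(lines)


def read_state(form_body):
    """Recover the state from a URL-encoded form submission."""
    fields = parse_qs(form_body, keep_blank_values=True)
    return {name: values[0] for name, values in fields.items()}


# One round trip: the server sends the form, the browser posts it back.
state = {"query": "copyright AND software", "databases": "acts,regs"}
page = search_form(state)
posted = urlencode(dict(state, term="games"))
restored = read_state(posted)
```

The server keeps nothing between requests; everything it needs arrives with
the next form submission.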
For each Database selected a separate item in a formatted HTML list is
returned. Each Database Name is active Hypertext that can be clicked to
show Document Lists of Hits.
Each Document in a Document List supports these functions:
- Direct viewing of the hits in the Text Search as Dynamically Generated HTML
- Direct Downloading of the Native Format of the Document, so that full
WYSIWYG is offered
- Indirect forwarding of Documents as a Fax
- Indirect forwarding of Documents as Email
Each Document List provided to the user is the User's Control Point. If a
Bookmark is stored to the Page then the results remain active until the
search list is purged from the Server. The retention period is
configurable.
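A server-side store of result lists with a configurable lifetime could be
sketched like this. The class name ResultStore, the one hour default and the
in-memory dictionary are all assumptions for illustration; the paper does
not describe how the real server holds or purges its lists.

```python
import time


class ResultStore:
    """Keep result lists on the server until they are older than a set lifetime."""

    def __init__(self, lifetime_seconds=3600, clock=time.time):
        self.lifetime = lifetime_seconds
        self.clock = clock  # injectable so the purge can be exercised in tests
        self._lists = {}
        self._next_id = 1

    def save(self, hits):
        """Store a result list and return the id to embed in the page URL."""
        list_id = str(self._next_id)
        self._next_id += 1
        self._lists[list_id] = (self.clock(), list(hits))
        return list_id

    def load(self, list_id):
        """Return the stored list, or None if it has been purged or never existed."""
        self.purge()
        entry = self._lists.get(list_id)
        return entry[1] if entry else None

    def purge(self):
        """Drop every list saved longer ago than the configured lifetime."""
        cutoff = self.clock() - self.lifetime
        expired = [k for k, (saved_at, _) in self._lists.items() if saved_at < cutoff]
        for key in expired:
            del self._lists[key]
```

A bookmarked page carrying the list id keeps working for as long as load
still finds the list.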
When a user asks to browse a document in a list, they are given back 100
words of data containing the first hit. This puts a limit on the amount of
data that must be received before the important process of browsing can
begin.
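Selecting a 100 word extract around the first hit can be sketched as below.
Centring the hit in the extract and matching on whole words with trailing
punctuation stripped are assumptions; the paper only states that the extract
contains the first hit.

```python
def first_hit_window(text, term, size=100):
    """Return (start, hit, words) for a window of at most `size` words
    containing the first occurrence of `term`, or None if there is no hit."""
    words = text.split()
    target = term.lower()
    hit = next(
        (i for i, w in enumerate(words) if w.strip(".,;:!?()\"'").lower() == target),
        None,
    )
    if hit is None:
        return None
    # Aim to centre the hit, then clamp the window to the document bounds.
    start = max(0, hit - size // 2)
    end = min(len(words), start + size)
    start = max(0, end - size)
    return start, hit, words[start:end]
```

Only the words in the window need to be formatted and sent, however large
the underlying document is.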
Hypertext navigation tools appear on the screen above and below the text
extracted from the Native WP Document. These tools provide for:
- Next And Previous Hit
- Browse Up and Browse Down
- Next and Previous Document
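The targets of the hit and browse links can be worked out from word offsets,
as in the sketch below. The representation (hits as word offsets, a fixed
window size of 100 words) is an assumption, and Next and Previous Document
are left out because they depend only on the position in the Document List.

```python
def navigation_targets(hit_offsets, current_hit, window_start, total_words, size=100):
    """Work out where each navigation link should lead; None means no link."""
    earlier = [h for h in hit_offsets if h < current_hit]
    later = [h for h in hit_offsets if h > current_hit]
    return {
        # Nearest hit on either side of the one currently shown.
        "previous_hit": max(earlier) if earlier else None,
        "next_hit": min(later) if later else None,
        # Browsing moves the window by one screenful of words.
        "browse_up": max(0, window_start - size) if window_start > 0 else None,
        "browse_down": window_start + size if window_start + size < total_words else None,
    }
```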
Much work was devoted to these navigation tools, as well as to the
formatting of the user interface for the Search String.
In each case this is because Huge Data Repositories contain much Data of a
similar nature, and the process of narrowing down a search is critical
in assisting in the Location of Documents. (This does not take into account
any Hypertext Access systems that may be constructed to access the data
from outside the CGI application. This is certainly possible.)
In most situations, Document Stores are controlled by custom File Servers
such as Novell and Ethershare (on Apple LANs).
These stores are also made accessible by Anonymous FTP.
This allows direct links to the documents. Documents stored with .DOC
extensions and the like can be directly launched by a Web Browser and
viewed in Native format.
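A direct link to a native document on the Anonymous FTP server might be
generated as in this short sketch. The host name and path used in the
example are hypothetical.

```python
import html
from urllib.parse import quote


def ftp_link(host, path, label):
    """Return an HTML anchor pointing straight at a native document via FTP."""
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    url = "ftp://" + host + quote(path)
    return '<a href="%s">%s</a>' % (html.escape(url), html.escape(label))


link = ftp_link("ftp.example.org", "/pub/acts/Copyright Act.doc", "Copyright Act")
```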
Faxing of Native Word Processing Documents requires a Desktop Workstation
to open the Document in the appropriate Word Processor and print to a Fax
Modem. This was done using AppleScript on Apple but can be as easily done
in Visual Basic.
Mailing of Large Documents makes sense as it does not impact on a Web
Session, and the User can continue to browse the Databases.
It will make more sense if the Internet moves to different charging rates
for On-Line bandwidth usage as opposed to background traffic such as Mail.
In our system we allow Documents to be attached to a mail message and
mailed back to the User. This is again done using AppleScript on Apple,
together with Eudora.
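For readers without AppleScript and Eudora, the same step of attaching a
native document to a mail message can be sketched with Python's standard
email library. This is a substitute technique, not the method used in the
system described; the addresses are hypothetical and the message is only
built, not sent.

```python
from email.message import EmailMessage


def build_document_mail(sender, recipient, filename, data):
    """Build a mail message carrying a native Word document as an attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Requested document: " + filename
    msg.set_content("The document you requested is attached.")
    # The bytes are attached unchanged, so the user receives the native format.
    msg.add_attachment(
        data, maintype="application", subtype="msword", filename=filename
    )
    return msg


mail = build_document_mail(
    "server@example.org", "user@example.org", "act.doc", b"\xd0\xcf\x11\xe0"
)
```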
This system was achieved through combining Component Technologies. These are:
- Anonymous FTP Server
- CGI Application backending to a Text Retrieval API
- AppleScript integration tools for Microsoft Word, Eudora and Fax Express
There was no customisation required from the Text Retrieval API, which
handles all of the Features discussed here without Modification. Again this
has the effect of a seamless (how often have you heard this?) interface to
existing Text Databases.
The API features used were:
- Search
- List
- Show Document
as well as a number of other minor calls allowing markup information to be
formatted.
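The shape of these three calls can be illustrated with a toy stand-in. The
class below is not the Text Retrieval API used in this system, whose real
signatures are not given in this paper; the class name, method names and the
in-memory dictionary are all hypothetical.

```python
class ToyRetriever:
    """In-memory stand-in exposing Search, List and Show Document style calls."""

    def __init__(self, documents):
        self.documents = dict(documents)  # document name -> plain text

    def search(self, term):
        """Search: names of documents containing the term as a whole word."""
        wanted = term.lower()
        return sorted(
            name
            for name, text in self.documents.items()
            if wanted in text.lower().split()
        )

    def list_documents(self, names):
        """List: one (name, word count) entry per document in a result set."""
        return [(name, len(self.documents[name].split())) for name in names]

    def show_document(self, name):
        """Show Document: the text of one document."""
        return self.documents[name]


retriever = ToyRetriever(
    {"a.doc": "copyright in software", "b.doc": "fishing licences"}
)
hits = retriever.search("Copyright")
```

A CGI application sitting in front of such an interface only has to format
what the three calls return.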
Not implemented yet are Thesaurus entries allowing search terms to be
entered from clickable lists of Narrower, Broader and Synonym entries.
By using these calls it was not only possible to provide the functionality
described; it also allowed standard Hypertext Markers to be inserted in
Documents as Hidden Text and to be dynamically linked to from within the
System as well as from outside. In this way direct pointers to Hypertext
Markers in Native Word documents are fully supported.
We hope this paper has shown an example by which the Web can be applied to
existing text systems without alteration.
We are aware that many text retrieval systems offer direct indexing of
native format documents and any number of these could be used to provide
the above.
Our point is that this method is preferable to hybrid translation systems
requiring intermediate HTML formats.
We predict that this will turn out to be the norm in the process of Web
integration in the Computing Community.
Approved Systems
Canberra, ACT
Approved Server
61 6 2815344
Text Retrieval System -
TCR (The Corporate Retriever)
QCOM Pty Ltd , Qld
61 7 8393544.
Thanks to -
Commonwealth Attorney-General's Department
for allowing publicly available material to be used for demonstration
Originally published in AusWeb95 The First Australian WorldWideWeb Conference
(ausweb95@scu.edu.au)