DOCUMENT DELIVERY


Network Publishing on the Internet in Australia

Tony Barry

Head, Centre for Networked Access to Scholarly Information
Australian National University

Current discussions of document delivery technology and options operate under print-based assumptions: that there are physical documents to copy and that these will be delivered from a centralised service. As electronic publishing using the World Wide Web and its successor technologies becomes more prevalent, the model is likely to be very different. In this model documents or components of documents will be mounted on servers, which may also be the author's workstation. These documents can be accessed remotely over a network and the contents viewed on local machines and then manipulated.

This paper addresses the effect electronic publishing via networks is likely to have on the library profession and libraries. While other forms of electronic publishing, particularly CD-ROMs, are being heralded as the great growth industry of the future in the Prime Minister's Cultural Statement, they, like books, are artefacts. Their organisation and impact are unlikely to be as great as that flowing through from network developments.

We are leaving the period when communication was dominated by paper and moving to one which is electronic. In the dissemination of information we also appear to be at a watershed where the dominance of large central organisations, delivering information to relatively passive recipients, is being challenged by a new information model driven by the economics of silicon-based products, where individuals and small groups are empowered to generate information services that formerly were the domain of larger bodies.

Supporting structures

In the present climate of environmental awareness it is surprising that print on paper as a communication mechanism does not attract more criticism. A communication technology based on the wood chip mill and the effluent of paper making factories is environmentally flawed. Other than grumbles about packaging, newspapers that are too large and the need for paper recycling, few suggest that the whole concept of print on paper as a communication medium should be replaced. A technology has now arisen which potentially can do so for a wide range of print products.

Few question the efficiency of print yet the huge infrastructure required to make print work is all around us. Booksellers and libraries exist as institutions in the form that they do because books are artefacts and the bulk of the work of booksellers, librarians and publishers derives from the physical form of the communication medium. These three groups largely exist to eliminate the deficiencies in print-based communication and fill in the functions that the technology does not perform.

The discussion that follows does not address CD-ROMs, as these are static artefacts and do not introduce the range of new issues that networks do. While they are electronic in nature, they can be regarded as roadfill on the information superhighway: they are useful for static material and for remote areas lacking network connectivity, and they fulfil the same type of functions that floppy discs do now.

Networked communication also has its deficiencies but they are quite different to those of print. The supporting professions and industries required to make this form of communication work are likely to be quite different to those required for print on paper.

What is the nature of networked publication?

The capabilities that exemplify the challenge provided by networked publishing are those delivered by gopher and World Wide Web technology. They are `best practice' as far as present electronic publishing is concerned. They offer multimedia, are global in scope and have an ability to link information across multiple machines. The radical differences manifest themselves in many ways.

Distribution

Distribution mechanisms are built into the technology. In many ways such publishing is more like a community notice board or a library reserve collection, as only one copy is needed which everyone can see and copy. The act of publishing has effectively placed the document directly on the shelves of the network-wide library.

Convergence of function

In effect, the warehouse of the publisher, the stock of the bookseller, the shelves of the library and even the manuscripts of the author, become the same -- the document available on the network.

Dynamic nature

We are so used to print documents being static that it is difficult to consider a situation where this constraint of print is removed. Many of our procedures for producing a publication are based upon the achievement of quality and finality of content because the version, once printed, can no longer be improved.

Volatility

We use databases that are modified on a continuous basis, such as library catalogues. This capability can now be extended to any document that needs frequent updates, such as encyclopedias, loose-leaf services etc. But this also extends to textbook material that can be continuously updated rather than being produced in new editions with further print runs. We need to change our viewpoint to one of saying that everything should be continuously updated rather than thinking of only making changes by the creation of a new document. We need good reasons for material to be maintained in a dated form.

Librarians are therefore faced with providing control over a document that is dynamic and whose content can change over time. For instance an author may find that early conclusions on a subject were incorrect and reverse them. The implication of this is that the stable world in which an item can be catalogued once and that cataloguing shared via bibliographical utilities is gone. `Catalogue' entries will need to be checked against the original document from time to time to ensure that they are still accurate and the concept of an `edition' will become unstable.
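One way such periodic checking might be automated is sketched below. It is an illustration only, not a method described in this paper: it assumes that a digest of the document text is stored with the `catalogue' entry, and that a later mismatch simply flags the entry for human review. The names CatalogueEntry, make_entry and needs_review are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Return a SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class CatalogueEntry:
    # Hypothetical record: a real entry would carry far more description.
    title: str
    location: str
    digest: str  # fingerprint of the text at the time of cataloguing


def make_entry(title: str, location: str, text: str) -> CatalogueEntry:
    """Catalogue a document and remember what its text looked like."""
    return CatalogueEntry(title, location, fingerprint(text))


def needs_review(entry: CatalogueEntry, current_text: str) -> bool:
    """True if the document has changed since it was catalogued."""
    return fingerprint(current_text) != entry.digest
```

A digest only shows that something changed, not whether the change matters to the description; deciding that would still fall to a cataloguer.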

Link to databases and models

Print documents are passive. Not only do hypertext documents let you switch from place to place instead of following the linear sequence of a book; they can also link to dynamic data constructs such as maps with hot spots or interactive documents such as `live' models where the reader can enter their own test data. This is in addition to the usual unprintable media types.

Citing

The growth of knowledge and scholarship is based upon the acknowledgment of the work of others by being able to cite that work. This allows the reader to verify that the work has been used appropriately by using the citation to locate a copy of the source work, usually in a library, and verify its contents. In a dynamic situation the document may have moved or changed. However, if a hypertext link to the document is made instead of a citation then the actual text becomes available and this bypasses the library as the intermediary supplier.

Archiving

While most publishers will keep some back stocks of their output and archival copies, libraries have generally taken on the role of providing long-term central storage of publications, and conservation has been a central concern. The long-term storage of electronic material is more complex. Across the network, mirroring arrangements of copies at remote sites are being established, primarily to reduce access time and network traffic but also to provide security for the data mirrored. Almost exclusively this is taking place outside the formal library system, which, while expressing concern about the problem, has taken little action to solve it. This, however, is consistent with the approach taken by all but a handful of institutions towards the long-term health of acid-based paper.

Caching technology

Caching technology, driven by the need to preserve network bandwidth, is rapidly developing. Rather than electronic documents being collected on the basis of individual selection decisions, copies retrieved over the network are automatically held locally in a cache server, which is the automatic first port of call for the reading software used by individuals when a remote item is required. The local cache is checked first and a copy delivered from there if available. If not, the copy is retrieved from the remote location, delivered to the user and held in the cache for the next inquiry. Electronic collection development in a sense becomes a by-product of the network engineering.
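The lookup sequence just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any particular cache server: the class DocumentCache and the stand-in retrieval function demo_fetch are hypothetical, and a real cache would also need to expire or revalidate its copies.

```python
from typing import Callable, Dict


class DocumentCache:
    """Hold local copies of remotely retrieved documents, keyed by location."""

    def __init__(self, fetch_remote: Callable[[str], str]) -> None:
        self._fetch_remote = fetch_remote  # how to retrieve from the remote site
        self._store: Dict[str, str] = {}   # the local copies
        self.hits = 0
        self.misses = 0

    def get(self, location: str) -> str:
        # The local cache is checked first.
        if location in self._store:
            self.hits += 1
            return self._store[location]
        # Otherwise retrieve the remote copy, keep it for the next
        # inquiry, and deliver it to the user.
        self.misses += 1
        document = self._fetch_remote(location)
        self._store[location] = document
        return document


def demo_fetch(location: str) -> str:
    # Hypothetical stand-in for a real network retrieval.
    return f"contents of {location}"
```

Nothing in the sketch selects what is kept: the local `collection' is whatever readers have asked for.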

Lack of bounds

In a static print medium the concept of a document is well defined. It has a physical form and boundary. We are a little more troubled by journals but we accept a continuously expanding journal with individual issues that are the `real' items. But electronic documents on the net give a range of new problems. They do not have 500 years of convention to establish a set of agreed formats to simplify the description so the formats are not yet stable. Worse, through hypertext, a single `conceptual' document may be made up of many interlinked individual files that not only may be the work of many authors but may be mounted on many machines not even in the same country let alone the same institution. The boundaries of a document become imprecise, many distributed parts making up a `virtual' whole.
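To make the idea of a `virtual' whole concrete, the sketch below gathers the parts of one conceptual document by following its links outward from a starting file. It is illustrative only: the link table is supplied by hand rather than extracted from real files, and the function names and example addresses are hypothetical.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlparse


def collect_parts(start: str, links: Dict[str, List[str]]) -> List[str]:
    """Return every file reachable from `start`, in breadth-first order."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:  # links may loop back, so visit each file once
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


def hosts(parts: List[str]) -> List[str]:
    """List the distinct machines on which the parts are mounted."""
    return sorted({urlparse(part).netloc for part in parts})
```

Where to stop following links is exactly the boundary question raised above; the sketch follows every link it is given and so offers no answer to it.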

Cataloguing the network?

There have been debates on a number of mailing lists (go4lib, web4lib, pacs-l) about the desirability or otherwise of cataloguing network resources. OCLC in the US has done some work on this, as part of the USMARC Advisory Group and in the OCLC Internet Resources Cataloguing Experiment, as has the CATRIONA project in Britain. Because of the problems mentioned above it is questionable whether a traditional cataloguing approach will work. It will certainly have a great deal of trouble in scaling to the global network that is in effect one library. The prime problems are:

It is not completely facetious to suggest that the level of detail required in descriptive cataloguing exists because a user needs sufficient information about an item to decide whether to expend the effort to try to get access to a copy. In a networked environment, where this effort is small, the requirement for complex descriptive cataloguing codes is greatly reduced.

There has been a variety of attempts to provide subject access to network resources based upon a variety of automated techniques. Almost none of these have emanated from the library community, and they have come under frequent criticism. As most of these projects have been experiments, maintained often by a single enthusiast, or at best a small group, it is not surprising that they have been less than perfect. What is amazing, however, is that these techniques have been able to regularly regenerate keyword indexes to material housed in thousands of sites across the world, comparable in size to the contents of a major research library, in a period of days at most and at negligible cost.
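The core of such a keyword index is simple, which helps explain how it can be regenerated so cheaply. The following is a minimal sketch of the general technique, assuming the document texts have already been retrieved; it is not the implementation of any of the services referred to, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def words_of(text: str) -> List[str]:
    """Split text into lower-case keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keyword_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the locations of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for location, text in documents.items():
        for word in words_of(text):
            index[word].add(location)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the locations that contain every keyword in the query."""
    words = words_of(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Because the index is rebuilt from the documents themselves, rerunning build_keyword_index picks up whatever has changed, at the price of knowing nothing about quality or subject beyond the words used.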

Much of the failure of these indexes rests upon what they were indexing -- material obtained from the published source. By doing this the whole problem of trying to maintain access to a highly dynamic corpus of information is greatly reduced. These indexes would be far better if the publishers were able to add information that could be fed into these indexes.

Classification

There have been a variety of attempts to provide a classified approach to network resources. At the ANU the library's gopher has a section organised by the Library of Congress classification. While a number of other servers have attempted arrangements based upon library classifications, such approaches will not scale up to the full Internet. This is again because of the dynamic nature of the material, which shifts location, dies or changes in content and quality. Classifications in a limited subject domain are more common, where the corpuses to be dealt with are more restricted. These seem mostly to be arranged by home-grown classifications or arrangements. Within a restricted domain it is reasonably easy to devise a scheme of greater rationality than the traditional library classifications. The classification mechanisms that might scale involve delegation across institutions, and this is the pattern followed by CERN for its distributed WWW subject approach.

Filtering and quality control

Once an electronic document is written, the cost of electronic publishing corresponds only to the cost of the network traffic and some fraction of the overheads of supporting the server machine, although some would add the costs of external editing to achieve uniformity to a desired standard. As any modern networked desktop workstation is now capable of acting as a server, the overheads are slight. With AARNet's current and proposed charging, the cost of serving a document from universities is virtually nil. Already, because of the lack of cost and the power given by the hypertext format, we are seeing an explosion in electronic publishing using WWW. It does lead to the prospect of each author becoming a publisher -- an explosion in vanity presses, a rapid increase in stylistic variation between documents and experimentation and an overall drop in quality.

With print material, publishers, journal referees and libraries ensured that only quality information was readily available. On the network these constraints on publishing are released. Despite concerns expressed about retrieval, the easing of these constraints will make the filtering of quality information out of the material available the main and central problem in networked publishing and digital libraries. There are no easy automated solutions to assessment of quality.

What new support structures?

The environment of networked publishing has the following features:

What are the support organisations that will be needed to make this pattern work?

At least two groups, which may be integrated in one organisation, would seem to be required. The first is the publisher/cataloguer. For the reasons given, the only viable place for cataloguing information to be inserted is at the publishing stage by the publisher. In this model the cataloguer/indexer becomes the publicist for the publications -- as a result of the cataloguing work they become easily retrievable.

The second grouping is the gateway provider. This group selects material within a restricted domain on the basis of quality and provides a value-added service for information within that domain. This group needs to combine the skills of bibliographers and reference librarians with subject-based skills. This group would organise access to quality information. Two examples of interesting models follow.

The American Mathematical Society has adopted a model where an interdisciplinary group of mathematicians, publishers, librarians and computer specialists has mounted Mathematical Reviews and Current Mathematical Publications, as well as the full text of all their publications going back 50 years, by way of hypertext with a Mosaic front end. The traditional journal approach provides the filtering, and Mathematical Reviews provides the indexing information. They have taken control of the literature of their discipline and may be offering a commercial service shortly.

Firenet, hosted by ANU, is a cooperative set of World Wide Web and gopher servers for discipline specialists in the field of fire management. In this case librarians have not yet been involved. Publications in this area are locally mounted; the central server provides a view of the `quality' material and is managed by the professional group themselves.

What does this mean to libraries?

The library as a place exists because books are physical artefacts. In view of the low use of much material that libraries collect (research libraries in particular) there will be little incentive, other than conservation, to retrospectively convert all material. The Japanese Diet Library is reputed to be planning to convert 5 million items to machine-readable form, but this may be because it is in Tokyo, which is likely to suffer a major earthquake. The reconstruction of the destroyed Sarajevo library in virtual form is also a special case. We should see libraries continuing their present functions for many decades. In addition some formats, such as linear fiction, are very suited to a print format.

Libraries are also likely to continue their role as access points to information, especially to those who are disadvantaged in terms of network access.

Librarians however could see some new opportunities where they may be in competition with other groups.

In network publishing librarians may well have a role in designing information systems to ensure good retrieval of the material published. In particular they should have a valuable role in ensuring that various automated web robots can effectively retrieve the material.

Librarians will also have opportunities in the provision of value-added services on the network, aiding in the design of servers which point to and deliver quality information in a coherent manner. In this area librarians have already been very active. Many WWW and gopher servers have been produced by libraries and library consortia, and this trend is likely to continue. In doing this, however, libraries will be in conflict with individuals in the discipline who feel, correctly, that they are the best arbiters of quality. Librarians need to get back to their roots in information management and work with the specialists to make this approach a valuable one for the profession.


Information Online & On Disc 95