NIR is not enough


Paper presented to Questnet '95, Bond University, 6-8 September 1995
by Tony Barry - tony@info.anu.edu.au
<URL:http://snazzy.anu.edu.au/People/TonyB.html>
Center for Networked Access to Scholarly Information
Australian National University Library

Introduction

The internet has a reputation for being complex and disorganised, lacking finding tools. As a consequence it is seen by many as not being a "serious" information system because of the perceived anarchy. Certainly the freedom it gives to individuals and small groups to publish inexpensively has led to an explosion in the availability of very specialist material, collected by individuals, which often falls into exotic and hobbyist areas. Webspace now contains close to 6 million documents, up from virtually none two years ago. Had this volume of material suddenly become available in print, then using reasonable figures for the labour required to do full library cataloguing it would take about 3000 person years of effort to generate a traditional catalogue. It is questionable, however, whether such an approach would be useful even if the resources were available.
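
The figure of 3000 person years can be checked with simple arithmetic. The sketch below, in Python, takes the two numbers quoted above and assumes a working year of about 2000 hours (an assumption of this sketch, not a figure from any cataloguing study). On that assumption the estimate corresponds to roughly one hour of cataloguing per document.

```python
# Figures quoted in the text.
DOCUMENTS = 6_000_000   # documents in webspace
PERSON_YEARS = 3_000    # estimated effort for a traditional catalogue

# Assumption of this sketch: about 40 hours a week for 50 weeks.
HOURS_PER_PERSON_YEAR = 2_000

# Implied throughput and time per record.
docs_per_person_year = DOCUMENTS / PERSON_YEARS
hours_per_document = HOURS_PER_PERSON_YEAR / docs_per_person_year


def cataloguing_person_years(documents, hours_each,
                             hours_per_year=HOURS_PER_PERSON_YEAR):
    """Return the person years needed to catalogue `documents` items."""
    return documents * hours_each / hours_per_year


if __name__ == "__main__":
    print("documents per person year:", docs_per_person_year)
    print("hours per document:", hours_per_document)
    print("person years:", cataloguing_person_years(DOCUMENTS, hours_per_document))
```

Changing the assumed time per record scales the result linearly, so the order of magnitude holds for any plausible cataloguing rate.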

Judged by the standards of print material, with its formal catalogues and other indexing services, not only is the internet disorganised but any hope of providing organisation in the traditional manner is slight. However a variety of networked information retrieval tools have emerged which fulfil many of the functions of traditional finding tools in a library.

Comparing the print world with the net

Documents in both the print world and the network have indexes and a range of search tools to assist in access to them and provide organisation. Without going into great depth the section below makes some comparisons between these two domains.

Organisation of the print world

The world of print is provided with a huge range of tools to assist in finding material, but no central index is possible.

Most publishers issue their own catalogues of material they produce usually limited to material that is in print. The publishing trade issues various "books in print listings" usually limited to one country or language.

Libraries maintain catalogues. In this country alone there are about 2000 libraries, each with a catalogue of the material it holds. The National Library tries to maintain a union catalogue of all their holdings, certainly those of the major libraries, as do the national libraries of other countries.

Within specific subject disciplines, abstracting and indexing services endeavour to give access to the journal literature. There would be over a thousand of these services. Within each country there may be a national service of this kind which attempts to index local publications.

Hundreds of years of experience and effort have built up this complex array of overlapping services, which ensures that most material of worth can be identified and located. It should be noted that each of these is selective in some way in what it covers. For a particular question the choice of service used to locate relevant material is important.

Organisation of the net

Across the net successive waves of network information retrieval tools have been deployed with ever increasing scope - archie, gopher, veronica, jughead, lycos, infoseek, harvest etc. Those most heavily used can be categorised in a number of ways eg -

Comprehensive with automated collection
  Lycos and infoseek
Comprehensive with manual collection
  The W3 global electronic library, The CSU Australia page
Selective to a certain protocol
  archie
Selected on the basis of quality
  Many manually developed collections.

The analogues of many of the print based services are being developed. There is one area where the net provides a service which is impractical in the print world, and that is the provision of a fairly comprehensive index, through automated means, across all material.

Constraints and controls on access to paper publications

It is worth looking briefly at the organisation and controls placed on access to print material so that the possible changes that electronic publishing may bring about can be discerned. This description is not meant to have any great depth, merely to provide a point of contrast with the different approach that is being looked at for electronic material.

Economic constraints

The main constraints on paper publishing are economic.

Paper and printing are expensive, as is the distribution of printed products. The capital expenditure required to produce, sell and distribute a best seller or popular journal is such that only large organisations with significant expertise can afford to do so. It is also a business which has significant risks, as copies need to be printed and paid for before sales are assured. This has certain consequences.

The production and distribution of material is limited by economic factors. Publishers will not produce something that will not sell. This limits what is produced.

As production, like other media, goes through the bottleneck of a few organisations, governments are able to place controls on what may be distributed to ensure that community tastes are not offended.

The appetite of people to read exceeds their capacity to pay for the printed material. This simple fact, together with the physical size of books, goes a long way to explaining why libraries exist in the form that they do.

The economics of print make the production of larger volumes more cost effective than the very small. The journal exists to aggregate many small items on a similar topic to achieve an economic size.

In a networked world these things all come into question.

Library constraints

Another player in the game of controlling access is the library. With restricted funds libraries must choose what they buy and process. Through their collection development policies they determine what their clientele can access easily, although they are also able to acquire any other published material if required via the interlibrary loan system and other document delivery services. The collection development policies are often expressed in public form and may be shared. Libraries in Australia are currently developing joint policies for the acquisition of material through a concept, sponsored by the National Library of Australia, of the "distributed national collection".

Refereeing - quality constraints

Via the refereeing system the academic community has limited what can be published in the journal literature, both to ensure quality and to limit costs. Material which bypasses this system, eg internal reports, has lesser validity and a smaller distribution. Libraries call this latter material "grey" literature and its collection and organisation is usually avoided except by small specialised libraries, working mainly in the technical areas.

Consequence of the restraints

As a consequence of what is described earlier in this paper a whole series of filtering operations determine what material can be easily accessed.

  1. The publishers will only produce what is economic and low use material must be subsidised in some way.
  2. Refereeing processes filter out low quality material and authenticate the worth of the material published.
  3. Libraries provide a further filter, acquiring only the best material for their clientele.

A further filtering step also applies. The secondary services, catalogues and abstracting and indexing services, are also selective in what they cover and sometimes in the depth of indexing applied to the material.

Relaxation of controls on the net

While there may be disagreement about the precise costs of electronic publishing, once a basic network connection is obtained and the material to publish is available, the publishing and distribution costs compared with print are negligible. This opens up a huge set of opportunities and problems.

  1. The economic constraints on what can be published are greatly weakened.

  2. Self publishing becomes feasible as it is little constrained by cost.

  3. The ability of an organisation to control and authorise material issued from within its boundaries is considerably reduced.

  4. Individuals or small groups can publish material which before was only within the province of large organisations.

  5. The ability to modify and correct published material reduces the need for the extensive controls placed on print material to achieve accuracy before publication.

  6. As the net is also a communication medium its use for working drafts for comment is becoming prevalent.

  7. Five hundred years of convention which now dictate the format of print documents and simplify navigation within them are not available yet to electronic documents to provide guidance.

  8. The regulatory constraints that government can place on the dissemination of material judged to be inappropriate are greatly weakened as so many more groups can become publishers and the reach of the net is global. Control of publishing on the net will have greater similarity to the control of the content of telephone conversations than that of the traditional mass media of communications.

This will result in the availability of -

Closer scrutiny of an electronic document will be required to determine its status than would be the case for print.

With opportunity come problems

With the increased freedom to publish, more information will be available. Unlike the print world there will be unified indexes to locate this material. That is not to suggest that such indexes are perfect. The lack of a controlled indexing vocabulary, coupled with the deployment of natural language indexing across multiple subjects and languages, will reduce the precision of the systems for a specific level of recall (using precision and recall in their technical sense).
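
For readers unfamiliar with the technical sense of these two terms, a minimal sketch in Python follows. Precision is the fraction of retrieved items which are relevant; recall is the fraction of relevant items which are retrieved. The function name and the document identifiers are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the index returned
    relevant:  identifiers a human judge considers relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"d1", "d2", "d3", "d4"}
    retrieved = {"d1", "d2", "d5", "d6", "d7"}  # d5 to d7 are "false drops"
    print(precision_recall(retrieved, relevant))
```

In this example two of the five retrieved items are relevant and two of the four relevant items are found. Broadening a natural language query to find the missing two will usually bring in more false drops, which is the trade-off referred to above.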

It will be possible to find the material on a subject sought and winnow out the clearly irrelevant "false drops" from the system by hand. There will still remain a major problem, that of filtering what is retrieved for quality and suitability, which indexes do not address. This is not a problem which can be easily automated.

While the network has reduced the controls and filters which block the publication of material it has not yet provided comprehensive systems which will assist the viewer of the publications to screen chaff from wheat. The task of filtering quality material has been largely shifted from the publisher to the consumer.

Filtering

There are a variety of approaches to filtering being deployed on the network and many are discussed briefly below. There are basically two approaches to filtering, one positive and one negative -

The first uses exclusion and filters certain hosts or URLs. This is the approach used by services which seek to exclude objectionable material (however defined) from the end user. These services normally apply at the client.

The second is inclusive and provides information on "approved" sites rather than the reverse. Such services seem so far to be network based.
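
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only and does not describe how any particular product works; the host names, the substring list and the function names are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical lists, for illustration only.
BLOCKED_HOSTS = {"blocked.example"}
BLOCKED_SUBSTRINGS = ("/adult/",)
APPROVED_HOSTS = {"library.example"}


def _host(url):
    """Return the lower-cased host part of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def allowed_by_exclusion(url):
    """Negative filter: everything passes unless it is on a blocked list."""
    if _host(url) in BLOCKED_HOSTS:
        return False
    return not any(s in url.lower() for s in BLOCKED_SUBSTRINGS)


def allowed_by_inclusion(url):
    """Positive filter: nothing passes unless its host is approved."""
    return _host(url) in APPROVED_HOSTS


if __name__ == "__main__":
    for url in ("http://blocked.example/index.html",
                "http://other.example/adult/pic.gif",
                "http://other.example/paper.html",
                "http://library.example/catalogue.html"):
        print(url, allowed_by_exclusion(url), allowed_by_inclusion(url))
```

The exclusion filter fails open, so anything not yet listed gets through, while the inclusion filter fails closed and hides everything which nobody has yet approved.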

Filtering objectionable material

Raising the most controversy, and of most sensitivity to government, is a wish to filter objectionable material and limit access to such material, particularly for school children. The Commonwealth Government has just initiated a third inquiry into the regulation of electronic media, which I have documented at <URL:http://snazzy.anu.edu.au/CNASI/gov/augov/bibl.html>.

Some examples of how access can be filtered follow.

Surfwatch

The Surfwatch program appears to use a database to limit access to URLs and does string matches on the URL. This is an add on to the client. They state -

"SurfWatch is a new type of software which helps parents, educators and employers reduce the risk of children and others uncovering sexually explicit material on the Internet."
<URL:http://www.surfwatch.com/surfwatch/>

Censorman

Australia has its own equivalent to this software which appears to operate at a cache level and allows you to deny access to specific URLs. The product is produced by Schoolsnet in Victoria and they state -

"CensorMan is a series of Perl scripts which allows you to use a web browser to censor particular URLs."
<URL:http://www.schnet.edu.au/~lukeh/samples/cm-demo.html>

Internet week article

Of such concern is access to pornography that the problem has even entered the standards field -

"Three major players in the Internet software market are spearheading an industry-wide effort to "create and implement standards that will enable parents, educators, and other adults to 'lock out' access to inappropriate materials" on the Internet. The Information Highway Parental Empowerment Group (IHPEG) was formed last week by Microsoft Corporation, Netscape Communications, and Progressive Networks in an effort to show legislators that the Internet community can regulate itself --without help from Washington."

<URL:http://www.phillips.com:3200/sample.htm>

Cybersitter

Some of these programs appear to be somewhat more pernicious. Cybersitter, for instance, is described as follows -
"CYBERsitter gives parents the capability to block or be alerted to access of adult-oriented pictures and pornography on the Internet as well as all the popular on-line services. Additionally, CYBERsitter will block access to these types of files from the computer's own hard disk, floppy disks and CD-ROM drives.

CYBERsitter works by secretly monitoring all computer activity and when the child tries to download or view an adult-oriented picture, the process is automatically aborted, and/or an alert to the parent is generated for later viewing.

CYBERsitter can also block access to games, personal files or specific programs on the computer that the parents may want to keep children from accessing."

<URL:http://www.rain.org/~solidoak/cybersit.htm>

Circit research project

The problem has even become the subject of research in Australia at RMIT where -

"CIRCIT is conducting a research project for the Schools Council of the National Board of Employment, Education and Training. Our task is to bring together information from schools about their experiences with students' access to the Internet, whether the exposure to controversial materials has proved to be a problem, and what strategies schools are using to deal with the issue. We will be doing this in various ways, including communicating with interested parties using the Internet, via telephone interviews and by site visits to a small number of schools. The project is intended to run until about the end of October 1995."
<URL:http://teloz.latrobe.edu.au/circit/schome.html>

Selection for quality

Most web authors add links from their server to material which they think might be useful to those they are trying to reach. A number of sites are now going about this in a systematic way.

The Library community

The library community, which has been in the business of manually selecting material, is creating web servers with pointers to material judged to be of value and interest.

OCLC, one of the major players in the development of services for libraries, has set up its Internet Cataloging Project -

"to create, implement, test, and evaluate a searchable database of USMARC format bibliographic records, complete with electronic location and access information (USMARC field 856), for Internet-accessible materials."
<URL:http://www.oclc.org/oclc/man/catproj/catcall.htm>

Many vendors of library systems are now offering web server capability for their OPACs and the ability to add URLs to catalogue records instead of call numbers. With this capability libraries can treat material published on the network in much the same way as they do paper publications, but make it deliverable via the library catalogue rather than just serving up a citation and a location.

Entrepreneurs

A variety of private groups have set up electronic libraries which rival or exceed anything which the library community has yet done. A notable example is the Yahoo server. <URL:http://www.yahoo.com/>

It is unclear how well these manual approaches will hold up compared with some other approaches in the longer term.

Group efforts

Action by cooperative groups to filter and simplify access to better material is also an attractive approach.

Peskin's proposal
Michael E. Peskin has put forward a proposal to completely reorganise scholarly communication in the field of physics - "Reorganization of the APS Journals for the Era of Electronic Communication" <URL:http://publish.aps.org/EPRINT/peskin.html>

Briefly, his proposal is to open up physics publishing at a base, unrefereed level in the form of a preprint archive whose items can be modified, with a higher level to which items could be promoted at any time based upon community agreement rather than refereeing.

SOAPs
A proposal to establish a Global Encyclopaedia generated the concept of Seals of Approval (SOAPs) -

"In the future, we will add a system of seals of approval (SOAPs). This mechanism keeps readers from being at the mercy of editors. Any article submitted (that does not violate copyrights or present offensive material) will eventually be available to readers, but readers could tell the server which articles to send based on the presence of the seal of approval of some body or individual"
<URL:http://www.halcyon.com/jensen/encyclopedia/more/GlEnVolunteer.html> The newsgroup supporting this initiative now appears to be moribund. <URL:news:comp.infosystems.interpedia>

The SOAP concept however had the virtue that it could be extended to a form of refereeing service independent of any publishing server, which could be used to validate any URL. It may have been similar to some of the thinking behind the next two schemes.
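
The idea of a seal held independently of the publishing server can be sketched as follows. This is a hypothetical illustration in Python, not the Interpedia design; the class name, the seal names and the URLs are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """A published item plus the seals third parties have attached to it."""
    url: str
    seals: set = field(default_factory=set)


def select_by_seal(articles, trusted):
    """Return the articles carrying at least one seal the reader trusts."""
    trusted = set(trusted)
    return [a for a in articles if a.seals & trusted]


if __name__ == "__main__":
    # Hypothetical data: seal names stand for approving bodies or individuals.
    articles = [
        Article("http://a.example/physics.html", {"physics-society"}),
        Article("http://b.example/rant.html"),
        Article("http://c.example/review.html", {"a-librarian", "physics-society"}),
    ]
    for article in select_by_seal(articles, {"physics-society"}):
        print(article.url)
```

Every article remains available, as the proposal requires, and the reader rather than an editor chooses which seals to honour.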

Intelligent publishing environment
James E. Pitkow and his colleagues presented a paper "Towards an Intelligent Publishing Environment" to the WWW95 conference where they stated -

"We present a prototype environment that facilitates the publishing of documents on the Web by automatically generating meta-information about the document, communicating this to a local scalable architecture, e.g WHOIS++"

<URL:http://www.igd.fhg.de/www/www95/papers/72/publish/publishing.html>

ComMentor
Martin Röscheisen and colleagues in a paper "A Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples" presented -

"an architecture, called "ComMentor", which provides a platform for third-party providers of lightweight super-structures to material provided by conventional content providers. It enables people to share structured in-place annotations about arbitrary on-line documents."
<URL:http://www-diglib.stanford.edu/rmr/TR/TR.html>

In another paper, "Beyond Browsing: Shared Comments, SOAPs, Trails, and On-line Communities", they describe -

"a system we have implemented that enables people to share structured in-place annotations attached to material in arbitrary documents on the WWW. The basic conceptual decisions are laid out, and a prototypical example of the client-server interaction is given. We then explain the usage perspective, describe our experience with using the system, and discuss other experimental usages of our prototype implementation, such as collaborative filtering, seals of approval, and value-added trails. "
<URL:http://www-diglib.stanford.edu/diglib/pub/reports/brio_www95.html>

Special Interest Networks - SINs
In Australia, David Green at Charles Sturt University has been pushing his concept of Special Interest Networks, where a group sharing a common interest would cooperate to combine their information resources across many web sites while protecting quality. He has mounted a paper -

"A Web of SINs - the nature and organization of Special Interest Networks" which explores the idea more fully. <URL:http://www.csu.edu.au/links/sin/sin.html>

Speculation

We have seen the deployment of early generation global level indexing tools and have argued that for them to be more effective there is a need for some selectivity in the material which they return. Clearly such selectivity should be set by the user, who may want to filter what is returned by language, intellectual level, quality or other criteria. To do this, indexing systems will need to do more than just retrieve material on a given topic; they will also need meta level information of a higher order. Some of this information is already collected in cataloguing systems for the print world and is held in fields within the standard MARC record, but the requirements go well beyond what is there. One possible model is the Harvest system, which decouples the collection of indexing information in the "gathering" phase from its delivery in the "broker" phase.

<URL:http://harvest.cs.colorado.edu/harvest/>

The former potentially allows customisation of the meta information fed into the system. The latter envisages the construction of specialised indexes chosen from what is collected. If these could be coupled with the provision of reviewing information from the same or other sources it might be possible to combine selectivity based upon content from the publisher with quality indications provided from a trusted source.
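
A minimal sketch in Python of this decoupling follows. It is not the Harvest software or its record format; the names Summary, gather and Broker, the fields of the summary record and the ratings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Meta information produced by the gathering phase for one document."""
    url: str
    language: str
    keywords: frozenset


def gather(documents):
    """Gathering phase: reduce (url, language, text) triples to summaries."""
    return [Summary(url, language, frozenset(text.lower().split()))
            for url, language, text in documents]


class Broker:
    """Broker phase: answer queries from summaries, optionally using
    ratings supplied by a separate, trusted reviewing source."""

    def __init__(self, summaries, reviews=None):
        self.summaries = list(summaries)
        self.reviews = dict(reviews or {})  # url -> rating (hypothetical scale)

    def search(self, term, language=None, min_rating=None):
        results = []
        for s in self.summaries:
            if term.lower() not in s.keywords:
                continue  # not about the topic
            if language is not None and s.language != language:
                continue  # filtered by language
            if min_rating is not None and self.reviews.get(s.url, 0) < min_rating:
                continue  # unreviewed or poorly reviewed
            results.append(s.url)
        return results


if __name__ == "__main__":
    docs = [("http://a.example/1", "en", "Library cataloguing on the net"),
            ("http://b.example/2", "de", "Katalog library"),
            ("http://c.example/3", "en", "library gossip")]
    broker = Broker(gather(docs),
                    reviews={"http://a.example/1": 5, "http://c.example/3": 1})
    print(broker.search("library"))
    print(broker.search("library", language="en", min_rating=3))
```

Because the broker sees only summaries, the same gathered records could feed several specialised brokers, each pairing them with a different reviewing source.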

Is there a way to assist the transmission of judgments about material, so that those who seek information can not only find the items which contain the words relating to their topic of interest but also filter what is found by more general criteria of content? Can systems be built which might allow the reader of material to feed back into the network views as to the usefulness of the material read, such that future readers can benefit from those views?
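
One very simple form such a feedback system might take is sketched below in Python. It is purely speculative; the class name, the 1 to 5 scale and the ranking rule are assumptions of the sketch and do not describe any existing service.

```python
from collections import defaultdict
from statistics import mean


class FeedbackStore:
    """Collects reader ratings of URLs so later readers can benefit."""

    def __init__(self):
        self._ratings = defaultdict(list)

    def rate(self, url, score):
        """Record one reader's score (1 = useless, 5 = very useful)."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self._ratings[url].append(score)

    def average(self, url):
        """Return the mean score for a URL, or None if nobody has rated it."""
        scores = self._ratings.get(url)
        return mean(scores) if scores else None

    def rank(self, urls):
        """Order search results by average score, unrated items last."""
        return sorted(urls, key=lambda u: -(self.average(u) or 0))


if __name__ == "__main__":
    store = FeedbackStore()
    store.rate("http://a.example/1", 4)
    store.rate("http://a.example/1", 5)
    store.rate("http://b.example/2", 2)
    print(store.rank(["http://b.example/2", "http://c.example/3", "http://a.example/1"]))
```

A real system would also need to decide whose ratings to trust, which is the question the seal of approval schemes above try to answer.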

As an information publishing system, the central problem of the network will not be retrieval of information, but filtering what is retrieved to select that which is useful. This is a problem to which developers should address their skills as without filtering mechanisms network users will be swamped by the relevant but unuseful material that they retrieve.