Will indexers be redundant by the year 2005?

Paper presented to the Australian Society of Indexers Conference,
The Futureproof Indexer,
Katoomba, 27-28 September 1997

by

Tony Barry

Visiting Fellow,
Department of Computer Science, Faculty of Engineering and Information Technology,
Australian National University.

http://www.purl.org/NET/Tony.Barry
mailto:tonyb@netinfo.com.au


World of indexing is changing

The world of indexing is in transition. Regardless of where you look, be it the subject headings assigned by cataloguers to books in libraries, by indexers to articles in abstracting and indexing services, or in the compilation of indexes to the content of publications, the environment and tools used to create indexes are rapidly changing.

There is a transition from paper to electronic forms of the material being indexed.

Network publishing, with hypertext links between documents, alters the nature of documents compared with stand alone publications. It also changes indexes into delivery services, not just finding aids.

There are new indexing technologies which can produce an index completely automatically. While lacking the quality of manually applied indexes, such an index can be produced at much lower cost.

The industries which support the interface between author and reader (publishing, bookselling and libraries) are now faced with radically new delivery methods generated by networking technologies, and are realigning their responsibilities for the functions they undertake. One of these functions is indexing.

Digression on paper publication

In the world of paper, indexing activities tended to specialise and were aligned to the various forms of publication and supporting activities. There is a major distinction between those indexes which cover the content of publications, of which individual book indexes are the prime example, and those which index material at the bibliographical level. The latter are library catalogues, dealing with bibliographical items in a macro sense, and abstracting and indexing services, working at a more micro level. This rough categorisation has been stable for much of this century and reflects the stability in the types of material which are being indexed. While there were substantial improvements in the form of indexing vocabularies used, the only radical development was the extension of the citation index from the specialised field of legal material to the general periodical literature by Eugene Garfield's Institute for Scientific Information.

Five technological developments are now changing all this -

  1. Hardware developments in microelectronics and telecommunications
  2. The development of computers during and since WW2
  3. Development of the theoretical understanding of automatic indexing and databases typified by Gerard Salton's work 20 years ago.
  4. The invention, by Vint Cerf and others, of the packet switched networking technologies which became the internet.
  5. The development of the World Wide Web by Tim Berners-Lee, surprisingly enough originally for the purpose of scientific documentation.

The transition from paper

We are in a time of transition for publishing. Some publications, but not all, will migrate from paper to network publishing. In the latter, the technologies available to supply indexing services are radically different to those which are involved for paper and they are rapidly changing. Those involved in supplying indexing services will be forced to adapt to their use.

Let us look at where this has reached for various categories of publications.

Firstly there are those publications which have already almost completely handled the transition. While print versions may still survive, they do so in increasingly marginal form. This is principally the case for publications which change rapidly and require the support of a major database. We tend to forget that early library catalogues, the indexes to a library's collections, were printed as books, and that although the transition to cards was a very great advance they were still in printed form. The transition to the automated production of catalogues started twenty years ago, but the transition to the online forms common today began only a decade ago. The web transition is only in its infancy. The shift of the library catalogue and of major bibliographical services to online form is largely complete.

There is also a wide range of data services, mainly in the sciences, now available via network access. Many of these, however, were never available in printed form but started as electronic databases which are only now more universally available through the connectivity provided by the internet.

Secondly there are those services which again have rapidly changing data, although it may not be held in a rigid database structure. It is clear that these services have distinct advantages in electronic and network form and that the paper versions will be unlikely to survive. Examples are -

Thirdly there are those publications for which the jury is out and which may survive in both forms. Such publications are -

The "grey" academic literature. In some areas such as physics and astronomy the transition is almost complete. In many fields conference papers (such as those at this conference) are appearing in electronic form.

The academic journal. Most major academic publishers are now struggling with the problem of how to handle the transition and maintain their revenue stream. Developments in electronic commerce will be needed before a suitable economic model is available to publishers. This could see them shift to a delivery model based on the individual article rather than the serial title.

Newspapers. Most newspapers now have an electronic version available. With the advantage of a good economic model, advertising, to pay for the development, their online presence is likely to become increasingly sophisticated. It is hard, however, to conceive of the newspaper in printed form disappearing for a considerable period, if ever.

Fourthly there are forms of writing for which the book is uniquely suited and into which online variants are unlikely to make significant inroads. Examples are -

New forms of publishing

The internet introduces new forms of publishing in two ways of interest to us. Most obviously there are completely new forms such as email lists and their associated archives and search engines. However there are also new mechanisms which provide interlinking between those things which were formerly disparate. Hypertext blurs the boundary between publications, and between publications and those things that index them. For instance a web based library catalogue can have a hypertext link to the full text of the item described, as can an index. The former separation between those indexes which indexed a single entity such as a book and those that indexed at the bibliographical level becomes blurred as the boundaries between separate publications become less distinct. The role of the indexer moves from solely providing the address of the information indexed to providing a delivery mechanism. Some examples are -

Types of indexes on the web

To provide access to the rapidly increasing material on the internet, largely now dominated by the World Wide Web, there has been an explosion of different attempts to provide organised access via classification and indexing. The number of "pages" of information on the network increases so rapidly that any figure dates. At the time of writing about 100 million "pages" was being quoted by some commentators, based on data generated by indexing robots. This figure is very uncertain as even the number of connected computers cannot be measured with a high degree of accuracy. Because of the growth rate, the ephemeral and non standard nature of much of the material available, and the relative ease of technological "quick fix" solutions, most of the indexing on the net has been based on automated free text methods. An attempted categorisation of the types of indexes and classifications is listed below.

Single server indexes

Full text indexes of single servers, or of parts of them, are common. These are usually based on public domain or shareware software, typically swish in the case of Unix and eg.acgi on MacOS. With most of these the scope for augmenting the indexing manually is about nil.
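
As an illustration of what such automated free text indexing involves, the following is a minimal sketch of a full text index over a few pages. It does not reproduce how swish or any other named package works internally, and the function names and sample pages are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages on a single server.
documents = {
    "catalogue.html": "The library catalogue lists books and journals",
    "staff.html": "Library staff and opening hours",
    "index.html": "Welcome to the departmental web server",
}
index = build_index(documents)
```

Every word on every page is indexed mechanically, which is why there is so little room for an indexer to add judgment to an index of this kind.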

Multi host indexes

There is also software to create indexes drawn from the material on many other computers. A firm or university might want to index all the computers in their domain or in their area of interest. Harvest is popular software for this. There are rapid developments in commercial software in this area targeted at the rapidly growing intranet market. Many of the big internet search engines have been set up with the hope that their success will generate sales in this market. Such indexes have some ability to limit what is indexed, by means of indications on each host as to which parts of their server directory should be considered and which ignored.
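
One widely used convention for such indications is the robots exclusion file (robots.txt) placed on each host. The text above does not name a mechanism, so treating it as robots.txt is an assumption here. The sketch below uses Python's standard urllib.robotparser module; the file content, host name and robot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_index(url, robot_name="example-indexer"):
    """Return True if the exclusion rules allow this robot to fetch the URL."""
    return parser.can_fetch(robot_name, url)
```

With these rules, may_index("http://www.example.edu.au/papers/index.html") is allowed while anything under /drafts/ is refused. Note that this only lets a host exclude material; it does nothing to describe or rate what remains.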

Some of these search engines are adding the ability to look for specific indexing data embedded in the document, called metadata, and to use it as the basis of the generated index terms, so the scope for manual intervention in indexing is growing. An interesting experiment in this area is currently being performed by the National Library of Australia and a group of universities, funded by the Australian Vice-Chancellors' Committee. The DEETYA EdNA project is also working in this area.
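
A rough sketch of how an indexing program might pick up such embedded metadata follows. It assumes the metadata is carried in HTML meta elements with name and content attributes, which is one common arrangement but not one the projects above are confirmed to use. The class, function and sample page are hypothetical.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.metadata.setdefault(name.lower(), []).append(content)

def index_terms(html, field="keywords"):
    """Return the manually assigned terms held in one metadata field."""
    extractor = MetaExtractor()
    extractor.feed(html)
    terms = []
    for content in extractor.metadata.get(field.lower(), []):
        terms.extend(t.strip() for t in content.split(",") if t.strip())
    return terms

# Hypothetical page carrying embedded indexing data.
PAGE = """<html><head>
<title>Annual report</title>
<meta name="keywords" content="indexing, metadata, libraries">
<meta name="DC.creator" content="A. N. Author">
</head><body>...</body></html>"""
```

Here index_terms(PAGE) yields the terms an indexer placed in the keywords field, rather than words harvested from the body text.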

Global indexes

There is a growing literature on the big internet search engines and new ones seem to leap into existence on almost a monthly basis. They vary in their coverage, in the depth with which they reach down into the hosts they index, and in the cycle time they use to reindex sites. Though regarded by many users of the internet as prime access points, and used as such, the limitations of full text indexing are apparent in their performance, although surprisingly good results can be obtained with careful choice of search terms, taking account of their sophisticated statistical ranking algorithms.

But as the web grows they will be unable to scale and in my view will deteriorate in performance. Specialised, carefully crafted indexes will perform far better, as they do in the print world. There are two reasons for this: indexes which cross many subject domains suffer increased synonym problems and reduced precision, and they lack selectivity for quality material. The big robots index what is there regardless of utility or quality.

Selective Indexes

The internet is notorious for the low quality of much of the material. This is inevitable in an environment where the entry cost for publishing is so low. Drafts, idle correspondence, individual overheads for presentations and crank material are mounted as easily as material of depth and thought. I have held the view for three years that the central problem of the internet is not going to be indexing but filtering. The bulk of the material on the internet is, for serious purposes, inappropriate for retrieval. Indexes which include this material will of their nature include noise.

Selection of material is a task with which publishers, indexers and librarians are well familiar.

The filtering that they do is a task that cannot be automated as it involves judgments of utility and quality.

Because of this we increasingly see indexes being developed which cover only selected material. Many of these are being produced by individual institutions for their own purposes, based on web pages which point at selected sites and which are supported by keyword indexes. Some of these are commercial, such as Yahoo. I believe there will be a shift to the provision of quality manual indexing for such sites.

OPACs and the internet

The development of web enabled OPAC systems for libraries has led to moves to catalogue internet resources. This seems to be fairly slow to start up as -

Nevertheless this is potentially a growth area for indexed access to networked resources.

Cross linkage of indexes

The global nature of the internet seems to be leading to another trend which those designing indexes should keep to the forefront of their consideration when starting new internet projects. The ability to add links and to share information between sites creates new opportunities for indexing which are increasingly being utilised.

Some examples are -

In future, to index may not be enough. The ability to link to a service that the user of the index might want once the item sought has been determined will also be an important feature to consider including in the index design.

Selection of what to index

I have argued that the key problem for internet indexing is not so much the indexing but the choice of what is indexed. The problem of finding tools to assist in this is being addressed by a number of groups. The most important of these efforts is the work being done at the W3C on the "Platform for Internet Content Selection" (PICS). This was originally driven by domestic politics in the United States and was seen not as a mechanism for selecting quality material but as one for blocking access to unsuitable material, specifically by children, as figure 1 shows.


Figure 1.

While originally conceived as a mechanism for censorship this was broadened by the W3C group with the idea of having a general mechanism which could deliver-

The original principles were -

The political clout behind this scheme will ensure that it will happen. As the netparent site stated on 16 July this year -

"Companies in the Internet online industry joined today with organisations representing education, children, parents, consumers and law enforcement to support President Clinton's and Congress' call for an Internet online environment that's family-friendly and rewarding and safe for children.

"The groups cited the current widespread availability of tools that can empower parents to shape children's Internet online experience, tools that are 'effective, easy-to-use and 100 percent available to anyone whose child is on the Internet.'"

It is unusual to find a protocol which can be used to support indexing endorsed by a head of state! This protocol is not just for blocking, however. The W3C is now expanding it into the Resource Description Framework, which will be a more generalised mechanism for delivering information about a document. It will be able to deliver refereeing and indexing information. This could form the basis of a distributed service which might provide the electronic equivalent of a library collection: selected, catalogued material. Even without it a number of groups are already working at using PICS to support information services, including EdNA here in Australia and the IEEE. The key components in the development of this protocol are the ability to deliver third party judgments on quality, which is the essence of the refereeing system for publishers and of collection development for libraries, and the ability to deliver independent quality indexing to get round the problem of "spamming" indexes seen on the general search engines.

Manual versus automated indexing

I have mentioned some preliminary attempts to add internet pages to library catalogues and to cross link abstracting and indexing services to document delivery backup services. There are as well completely new services in the formative stage being put together by booksellers, who, like libraries and publishers, see their role changing in this new environment. Two examples are Blackwell's Navigator service and EBSCOhost, both of which are providing indexed access at article level to the content of a range of journals. A number of publishers, such as Academic Press and the American Mathematical Society, are also starting to provide such services.

A more general approach is the emerging Dublin Core standard, which is the subject of a paper at this conference, so I will not go into details on it here other than to make the points that -

There is still a lot of work to be done on this format which will be progressed at the current meeting of the international group in Helsinki.

It is likely that the ability of this protocol will be enhanced by the following developments -

The involvement of the big players in the internet, Microsoft and Netscape, with the W3C to standardise these formats. They apparently met in the last week of August to hammer out a specification based on a combination of the standards.

Confusion of roles?

I believe that the traditional roles and division of responsibilities between those groups who support communication between author and reader are in a state of flux. There is likely to be a transfer of some responsibilities between them, and a new pattern will emerge which may in the long term bear little relationship to that which holds in the printed world. We may in fact end up with fewer players than there are now, or more. To explore this further I have created a simplistic model. Initially I looked at some of the major functions currently carried out in the sectors and at what is happening as they enter the networked environment. This is summarised in the table below.

Function    Publisher          Abstracting and Indexing Service  Bookseller   Library                 PICS           Search engines
Publish     Yes                No                                No           No                      No             No
Filter      Editorial control  Yes                               Some         Collection development  Add to server  No
Index       CIP/Meta data      Yes                               Starting to  Catalog                 Meta data      No
Database    Starting to        Yes                               Starting to  OPAC                    No             Yes
Assistance  No                 No                                Some         Yes                     No             No

Function comparison table

Indexing is an activity which is starting to spread across all sectors, as it is an essential supporting function required to make any extensive information system work. The development of the protocols mentioned above creates the prospect of completely new methods of information selection, indexing and delivery. The key to these possibilities is the potential to decouple those things that are presently tied together because of the physical format of paper. A variety of sources of information, if all on the network, can be assembled as needed in different ways by different groups to deliver quite different information services. Initial conflicts are already arising in the courts where groups have used the network to enhance their services by linking into other commercial services across the network and delivering that information as if it were their own. Some of the interlinking services mentioned above provide further examples.

The possibility exists for the following services to be supplied on an independent basis which could be tapped into and used as required -

In this way the selection of which material is of value could be provided to multiple sites. For instance a physics site could rate physics information on the network. This could be collected by multiple indexing sites each of which could index the material using different and competing thesaurus schemes. Competing index database services could then collect the indexing information to generate different services providing value added cross linkages to additional services.
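
The decoupling described in the paragraph above can be sketched in a few lines. This is only an illustration under stated assumptions: the rating scale, URLs, thesaurus terms and function name are all hypothetical, and no real protocol such as PICS is being modelled.

```python
# Ratings published by a hypothetical physics rating service: URL -> score.
physics_ratings = {
    "http://example.org/quarks.html": 5,
    "http://example.org/crank-theory.html": 1,
    "http://example.org/optics.html": 4,
}

# Two competing (hypothetical) indexing services describe the same pages
# with terms from different thesaurus schemes.
indexing_service_a = {
    "http://example.org/quarks.html": ["Particle physics"],
    "http://example.org/optics.html": ["Optics"],
    "http://example.org/crank-theory.html": ["Particle physics"],
}
indexing_service_b = {
    "http://example.org/quarks.html": ["Elementary particles"],
    "http://example.org/optics.html": ["Light"],
}

def build_database(ratings, indexing, minimum_rating):
    """Combine independent rating and indexing data into term -> URLs."""
    database = {}
    for url, terms in indexing.items():
        if ratings.get(url, 0) < minimum_rating:
            continue  # filtered out: unrated or rated too low
        for term in terms:
            database.setdefault(term, []).append(url)
    return database

# Two database services built from the same ratings but different indexing.
service_one = build_database(physics_ratings, indexing_service_a, 3)
service_two = build_database(physics_ratings, indexing_service_b, 3)
```

The selection judgment is made once and reused by both services, while each service offers access through its own vocabulary; the low rated page appears in neither.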

This approach is set out in figure 2 below.


Figure 2

Conclusion

We are in the most profound period of change in human communication since the invention of printing. Those industries and services that have grown up to support communication via printing will need to find new modes of information service to continue in a networked world. The manner in which communication between author and reader will be supported is not yet clear. The model provided is just one of many which are possible.

Whatever organisational structures emerge, one thing remains clear. The ability to find high quality information will remain of central importance. This is not a task where human skills can be replaced by computers in the foreseeable future. The indexer will retain a central role.

Tony Barry
Monday, 22 September 1997
Last revised Tue, 7 Oct 1997