Will indexers be redundant by the year 2005?
Paper presented to the Australian Society of Indexers Conference,
The Futureproof Indexer,
Katoomba, 27-28 September 1997
by
Tony Barry
Visiting Fellow,
Department of Computer Science, Faculty of Engineering and Information
Technology,
Australian National University.
http://www.purl.org/NET/Tony.Barry
mailto:tonyb@netinfo.com.au
World of indexing is changing
The world of indexing is in transition. Regardless of where you look, be it the subject
headings assigned by cataloguers to books in libraries, by indexers to articles in
abstracting and indexing services, or in the compilation of indexes to the content
of publications, the environment and tools used to create indexes are rapidly changing.
There is a transition from paper to electronic forms of the material being indexed.
Network publishing with hypertext links between documents, compared with stand-alone
publications, alters the nature of documents. It also changes indexes into delivery
services, not just finding aids.
There are new indexing technologies which can produce an index completely
automatically. Such an index lacks the quality of a manually compiled one but can
be produced at much lower cost.
The industries which support the interface between author and reader - publishing,
bookselling, and libraries - are now faced with radically new delivery methods generated
by networking technologies and are realigning their responsibilities for the functions
they undertake. One of these functions is indexing.
Digression on paper publication
In the world of paper, indexing activities tended to specialise and were
aligned to the various forms of publication and supporting activities. There
is a major distinction between those indexes which cover the content of
publications, of which individual book indexes are the prime example, and
those which index material at the bibliographical level. The latter comprise library
catalogues, dealing with bibliographical items in a macro sense, and
abstracting and indexing services, working at a more micro level. This rough
categorisation has been stable for much of this century and reflects the
stability in the types of material which are being indexed. While there
were substantial improvements in the form of indexing vocabularies used, the
only radical development was the extension of the citation index from the
specialised field of legal material to the general periodical literature by
Eugene Garfield's Institute for Scientific Information.
Five technological developments are now changing all this -
- Hardware developments in microelectronics and telecommunications
- The development of computers during and since WW2
- Development of the theoretical understanding of automatic indexing and databases,
typified by Gerard Salton's work 20 years ago
- The invention, by Vint Cerf and others, of the packet switched networking
technologies which became the internet
- The development of the World Wide Web by Tim Berners-Lee, surprisingly enough
originally for the purpose of scientific documentation
The transition from paper
We are in a time of transition for publishing. Some publications, but not all, will migrate
from paper to network publishing. In the latter, the technologies available
to supply indexing services are radically different to those which are involved for paper
and they are rapidly changing. Those involved in supplying indexing services
will be forced to adapt to their use.
Let us look at where this has reached for various categories of publications.
Firstly there are those publications which have already almost completely made the
transition. While print versions may still survive, they do so in increasingly marginal
form. This is principally the case for publications which change rapidly and require the
support of a major database. We tend to forget that early library catalogues, indexes to
the library's collections, were printed as books, and although the transition to
cards was a very great advance they were still in printed form. The transition to
the automated production of catalogues started twenty years ago, but the transition to
the online forms common today began only a decade ago. The web transition is only in its
infancy. The shift of the library catalogue and of major bibliographical services
to online form is largely complete.
There is also a wide range of data services, mainly in the sciences, now available via
network access. Many of these, however, were never available in printed form but started
as electronic databases which are only now more universally available through the
connectivity provided by the internet.
Secondly there are those services which again have rapidly changing data, although
it may not be held in a rigid database structure. It is clear that these services
have distinct advantages in electronic and network form and that the paper versions will
be unlikely to survive. Examples are -
- Access to legislation and law reporting services
- Encyclopaedias - The Britannica's order of magnitude drop in price recently for
the CD form and the cross linking of the internet version into library catalogues
is a useful pointer here. The internet version is available on a monthly subscription
basis.
- Directories - the AGPS gold service replacing the Commonwealth Government Directory
and the White and Yellow Pages of Telstra are examples.
Thirdly there are those publications for which the jury is out and which may survive in
both forms. Such publications are -
- The "grey" academic literature. In some areas such as physics and astronomy the
transition is almost complete. In many fields conference papers (such as those at
this conference) are appearing in electronic form.
- The academic journal. Most major academic publishers are now struggling with the
problem of how to handle the transition and maintain their revenue stream. Developments
in electronic commerce will be needed before a suitable economic model is available to
publishers. This could see them shift to a delivery model based on the individual
article rather than the serial title.
- Newspapers. Most newspapers now have electronic versions available. With the advantage
of a good economic model, advertising, to pay for the development, their online presence
is likely to become increasingly sophisticated. It is hard, however, to conceive of the
newspaper in printed form disappearing for a considerable period, if ever.
Fourthly there are forms of writing for which the book is uniquely suited and into
which online variants are unlikely to make significant inroads. Examples are -
- The novel, so clearly suited to the linear form of the book
- The scholarly monograph - although it will face competition from continuously updated
online variants
- Teaching texts, which will face competition from online courseware.
New forms of publishing
The internet introduces new forms of publishing in two ways of interest to
us. Most obviously there are completely new forms such as email lists and
their associated archives and search engines. However there are also new
mechanisms which provide interlinking between those things which were formerly
disparate. Hypertext blurs the boundary between publications, and
between publications and those things that index them. For instance a web
based library catalogue can have a hypertext link to the full text of the
item described, as can an index. The former separation between those indexes
which indexed a single entity such as a book and those that indexed at the
bibliographical level becomes blurred as the boundaries between separate
publications become less distinct. The role of the indexer moves from
solely providing the address of the information indexed to providing a
delivery mechanism. Some examples are -
- Various kinds of threaded and indexed discussion forums. The Link list I maintain
provides an example.
- Interactive conferencing sites with an administrative role as well. The AusWeb
conferences provide an example.
- Interactive course material being generated for the many virtual
university developments.
- Organisational home pages designed to offer across the network a range of services
which would previously have been delivered at a physical site. The Amazon.com bookshop
provides an interesting example. The bookshop's index is the key to its success.
They encourage other sites to set up virtual bookshops pointing to items in their
database and will pay royalties for books sold on this basis.
- Distributed services relying on loose coordination between a range of participants.
The WWW Virtual Library is an example of this, providing as it does a classified
"index" to evaluated material on the web but with little coordination offered other
than that of delegation.
Types of indexes on the web
To provide access to the rapidly increasing material on the internet,
now largely dominated by the World Wide Web, there has been an explosion of
different attempts to provide organised access via classification and
indexing. The number of "pages" of information on the
network increases so rapidly that any figure dates quickly. At the time of writing
about 100 million "pages" was being quoted by some commentators,
based on data generated by indexing robots. This figure is very
uncertain, as even the number of connected computers cannot be measured with
a high degree of accuracy. Because of the growth rate, the ephemeral and
non-standard nature of much of the material available, and
the relative ease of technological "quick fix" solutions, most of
the indexing on the net has been based on automated free text methods. An
attempted categorisation of the types of indexes and classifications is
listed below.
Single server indexes
Full text indexes of single servers, or of parts of them, are common. Typically
these are based on public domain or shareware software, such as swish in the case
of Unix and eg.acgi on MacOS.
With most of these the scope for augmenting the indexing manually is about
nil.
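As a rough illustration of what such a single server full text index does, the following Python sketch builds an inverted index over a handful of pages and answers a simple "all words" query. It is not the method used by swish or any other named package; the page paths, the sample text and the function names are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page paths in which it occurs."""
    index = defaultdict(set)
    for path, text in pages.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def search(index, query):
    """Return the pages that contain every word of the query (boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages held on one server.
pages = {
    "/papers/indexing.html": "Automatic indexing of network publications",
    "/papers/catalogue.html": "Library catalogue records for network resources",
    "/drafts/notes.html": "Draft notes on indexing",
}
index = build_index(pages)
```

Every word in every page becomes an index term, which is why there is so little room for an indexer to intervene: nothing in the process chooses concepts or preferred terms.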
Multi host indexes
There is also software to create indexes drawn from the material on many
other computers. A firm or university might want to index all the computers
in their domain or in their area of interest. Harvest is
popular software for this. There are rapid developments in commercial
software in this area targeted at the rapidly growing intranet market.
Many of the big internet search engines have been set up with the hope that
their success will generate sales in this market. Such indexes have
some ability to limit what is indexed, using indications placed on each host as to which
parts of its server directory should be considered and which ignored.
Some of these search engines are adding the ability to look for specific
indexing data embedded in the document, called metadata, and to use it as the
basis of the generated index terms, so the scope for manual intervention
in indexing is growing. An interesting experiment in this area is
currently being performed by the National Library of Australia and a group
of universities, funded by the Australian Vice-Chancellors' Committee. The
DEETYA EdNA project is also working in this area.
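The two mechanisms described above can be sketched in Python. I assume here that the "indications on each host" take the form of a robots.txt exclusion file, which is the common convention, and that the embedded metadata is a keywords meta tag in the page header. The host name, robot name and sample page are invented; only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaKeywordParser(HTMLParser):
    """Collect terms from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            for term in content.split(","):
                if term.strip():
                    self.keywords.append(term.strip().lower())


def allowed(robots_txt, user_agent, url):
    """Check a URL against the host's exclusion rules before indexing it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def embedded_keywords(html):
    """Return the index terms the author embedded in the page, if any."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical exclusion file and page from a host being indexed.
ROBOTS_TXT = "User-agent: *\nDisallow: /drafts/\n"
PAGE = (
    "<html><head><title>Indexing</title>"
    '<meta name="keywords" content="Indexing, Metadata, Library catalogues">'
    "</head><body>Text of the page.</body></html>"
)
```

An indexing robot built this way would skip anything under /drafts/ on the sample host and, for pages it does visit, could prefer the embedded terms to terms derived from the full text. That preference is where a human indexer's work re-enters the process.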
Global indexes
There is a growing literature on
the big internet search engines and new ones seem to leap into existence on
almost a monthly basis. They vary in their coverage, in the depth with
which they reach down into the hosts they index, and in the cycle time they
use to reindex sites. Though they are regarded by many users of the internet as
prime access points, and used as such, the limitations of full text indexing
are apparent in their performance, although surprisingly good results can
be obtained with a careful choice of search terms which takes account of their
sophisticated statistical ranking algorithms.
But as the web grows they will be unable to scale and in my view will
deteriorate in performance. Specialised, carefully crafted indexes will
perform far better, as they do in the print world. There are two reasons for
this. Indexes which cross many subject domains suffer increased
synonym problems and reduced precision, and they lack selectivity for
quality material. The big robots index what is there regardless of utility or
quality.
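The search engines do not publish their ranking algorithms, so the following is only a sketch of the general idea of statistical ranking, using the classic term frequency and inverse document frequency weighting associated with Salton's work. The documents and their names are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Rank documents by the summed tf-idf weight of the query terms."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    total = len(documents)
    scores = {}
    for name, words in counts.items():
        length = sum(words.values())
        score = 0.0
        for term in tokenize(query):
            # Number of documents in the collection that contain the term.
            containing = sum(1 for other in counts.values() if term in other)
            if containing == 0 or length == 0:
                continue
            term_frequency = words[term] / length
            inverse_document_frequency = math.log(total / containing)
            score += term_frequency * inverse_document_frequency
        if score > 0:
            scores[name] = score
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical collection of three short documents.
documents = {
    "thesaurus-guide": "indexing indexing indexing thesaurus",
    "robot-notes": "indexing the web with robots",
    "recipes": "cooking recipes for the web",
}
```

A term that occurs in every document receives a weight of zero, so the choice of distinctive search terms matters a great deal. Note also that the ranking rewards repetition of a word, not the quality of the document, which is the weakness discussed above.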
Selective Indexes
The internet is notorious for the low quality of much of the material. This is
inevitable in an environment where the entry cost for publishing is so low. Drafts,
idle correspondence, individual overheads for presentations and crank material are
mounted as easily as material of depth and thought. I have held the view for three years
that the central problem of the internet is not going to be indexing but filtering.
The bulk of the material on the internet is, for serious purposes, inappropriate
for retrieval. Indexes which include this material will of their nature include noise.
Selection of material is a task with which publishers, indexers and librarians are well
familiar.
- The publisher only publishes material of quality that will ensure a return
- The indexer only indexes concepts that are useful in material which has some utility
- The librarian only selects for purchase material useful to the clientele served.
The filtering that they do is a task that cannot be automated as it involves judgments
of utility and quality.
Because of this we increasingly see indexes being developed which cover only selected
material. Many of these are being produced by individual institutions for their
own purposes, based on web pages which point at selected sites and which are supported
by keyword indexes. Some of these are commercial, such as Yahoo. I believe there will
be a shift to the provision of quality manual indexing for such sites.
OPACs and the internet
The development of web enabled OPAC systems for libraries has led to moves to catalogue
internet resources. This seems to be fairly slow to start up as -
- Library cataloguing as a basis for supplying index information is very labour intensive
- The instability of many web sites makes library managements reluctant to invest
time in cataloguing them
- Cataloguing information is not available for purchase from the main national cataloguing
agencies
- Libraries are short of funds to invest effort in areas outside their traditional
responsibilities
- Cataloguing practices are rigidified by complex rule sets and the practitioners
are therefore conservative in nature.
Nevertheless this is potentially a growth area for indexed access to networked resources.
Cross linkage of indexes
The global nature of the internet seems to be leading to another trend which those
designing indexes should keep to the forefront of their consideration when starting
new internet projects. The ability to add links and to share information
between sites creates new opportunities for indexing which are increasingly being utilised.
Some examples are -
- The development of standards to enable publishers to embed indexing
information in their publications will be linked with automated methods to
collect that information by whoever wishes to include that site in their
indexes. The first distributed indexing system on the internet, archie, did
this in 1990. Indexing can be decoupled from the generation of the index
database.
- Libraries are using ISBN and ISSN data in their catalogues to automatically provide
links to corresponding entries in online bookshops for monographs and document delivery
services for serials.
- Online bookshops are providing links from entries to pages where reviews can be
entered by users and viewed by other patrons.
- The Innovative Interfaces OPAC system provides online links to the Britannica to
expand a search. The Britannica in turn checks the OCLC database to see if the library is a
member and, if so, provides hypertext links back into the catalogue so that availability can
be checked for those items cited in the encyclopaedia entry which the library holds.
- Abstracting and indexing services are placing cross links to document supply vendors
for retrieved articles.
In future to index may not be enough. The ability to link to a service that the user
of the index might want once the items sought have been determined will also be an important
feature to consider including in the index design.
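The ISBN based linking mentioned in the examples above can be sketched as follows. The check digit rule is the standard one for ten digit ISBNs. The bookshop address and the shape of the catalogue record are invented for the example, and the two ISBNs in the checks below are used only to exercise the checksum.

```python
import re

# Hypothetical address pattern for an online bookshop.
BOOKSHOP_URL = "http://bookshop.example.com/isbn/{isbn}"


def normalise_isbn10(raw):
    """Strip hyphens and spaces; return the ISBN, or None if it is invalid."""
    isbn = raw.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return None
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        # Weights run from 10 for the first digit down to 1 for the check digit.
        total += (10 - position) * value
    return isbn if total % 11 == 0 else None


def bookshop_link(catalogue_record):
    """Build a link from a catalogue record to the matching bookshop entry."""
    isbn = normalise_isbn10(catalogue_record.get("isbn", ""))
    if isbn is None:
        return None
    return BOOKSHOP_URL.format(isbn=isbn)


# Hypothetical catalogue record.
record = {"title": "Example monograph", "isbn": "0-306-40615-2"}
link = bookshop_link(record)
```

Validating the number before generating the link matters because the link is created without human checking. A mistyped ISBN in the catalogue would otherwise lead the user to the wrong item or to none.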
Selection of what to index
I have argued that the key problem for internet indexing is not so much the
indexing but the choice of what is indexed. The problem of finding tools to
assist in this is being addressed by a number of groups. The most important
of these efforts is the work being done at the World Wide Web Consortium (W3C)
in relation to the "Platform for Internet Content Selection" (PICS). This was
originally driven by domestic
politics in the United States and was seen not as a mechanism for selecting
quality material but as one for blocking access to unsuitable material, specifically
by children, as figure 1 shows.

Figure 1.
While originally conceived as a mechanism for censorship, this was broadened by the
W3C group with the idea of having a general mechanism which could deliver -
- An infrastructure for associating labels (metadata) with internet content.
- A signing mechanism
- Delivery of privacy information
- Intellectual property rights
The original principles were -
- Self-rating to enable content providers to voluntarily label the content they create
and distribute.
- Third-party rating to enable multiple, independent labelling services to associate
additional labels with content created and distributed by others.
- Services may devise their own labelling systems, and the same content may receive
different labels from different services.
- Ease-of-use to enable parents and teachers to use ratings and labels from a diversity
of sources to control the information that children under their supervision receive.
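These principles can be modelled in a few lines of Python. This is a sketch of the idea only and does not reproduce the PICS label syntax. The service names, the rating category and the rule that unlabelled content is refused are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Label:
    service: str   # who issued the label (hypothetical service names below)
    url: str       # the content the label describes
    ratings: dict  # category -> value on that service's own scale


def permitted(labels, url, policy):
    """Decide whether content passes a policy of per-service limits.

    policy maps a trusted service to {category: maximum acceptable value}.
    Assumption: content with no label from a trusted service is refused,
    as is content whose trusted label omits a category the policy names.
    """
    relevant = [
        label for label in labels
        if label.url == url and label.service in policy
    ]
    if not relevant:
        return False
    for label in relevant:
        for category, maximum in policy[label.service].items():
            if category not in label.ratings:
                return False
            if label.ratings[category] > maximum:
                return False
    return True


STORY = "http://www.example.org/story.html"
GAME = "http://www.example.org/game.html"

# The same page carries a self-rating and a different third-party rating.
labels = [
    Label("self", STORY, {"violence": 0}),
    Label("school-board", STORY, {"violence": 2}),
    Label("self", GAME, {"violence": 1}),
]

trusting_authors = {"self": {"violence": 1}}
trusting_school_board = {"school-board": {"violence": 1}}
```

The same page is accepted under one policy and refused under another, depending on whose labels the parent or teacher chooses to trust. Replace the "violence" scale with a referee's quality score and the same machinery selects material instead of blocking it.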
The political clout behind this scheme will ensure that it will happen. As the netparent site stated on 16 July this year -
"Companies in the Internet online industry joined today with organisations representing
education, children, parents, consumers and law enforcement to support President
Clinton's and Congress' call for an Internet online environment that's family-friendly
and rewarding and safe for children.
The groups cited the current widespread availability of tools that can empower parents
to shape children's Internet online experience, tools that are "effective, easy-to-use
and 100 percent available to anyone whose child is on the Internet."
It is unusual to find a protocol which can be used to support indexing
endorsed by a head of state! This protocol is not just for blocking,
however. The W3C is now expanding it into the Resource Description Framework (RDF),
which will be a more generalised mechanism for delivering information about
a document. It will be able to deliver refereeing and indexing information.
This could form the basis of a distributed service which might be the
electronic equivalent of a library collection - selected,
catalogued material. Even without this format a number of groups are
already working at using PICS to support information services, including EdNA here in Australia and the IEEE. The key components in
the development of this protocol are the ability to deliver third party
judgments on quality, which is the essence of refereeing systems for
publishers and of collection development for libraries, and the ability to
deliver independent quality indexing to get round the problem of
"spamming" indexes seen on the general search engines.
Manual versus automated indexing
I have mentioned some preliminary attempts to add internet pages to library
catalogues and to cross link abstracting and indexing services to document
delivery backup services. There are as well completely new services in the
formative stage being put together by booksellers, who, like libraries and
publishers, see their role changing in this new environment. Two examples
are Blackwell's Navigator service and EBSCOhost,
both of which are providing indexed access at article level to the serial
content of a range of journals. A number of publishers, such as Academic Press and
the American Mathematical Society, are also starting to provide such services.
A more general approach is the emerging Dublin Core
standard, which is the subject of a paper at this conference, so I will not
go into details on it here other than to note the following points -
- Involvement of the library and computing communities
- Avoidance of complexity (so far)
- Australian involvement, with development work in Canberra and Brisbane
- Originally aimed at publishers but now generalised
- Provides a mechanism to deliver general cataloguing, indexing and meta data about
a document
There is still a lot of work to be done on this format which will be progressed at
the current meeting of the international group in Helsinki.
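To give a flavour of how simple the embedded form can be, the sketch below writes a description of this paper as HTML meta tags using the "DC." prefix convention. The element names follow the Dublin Core element set as later published and may differ from the draft under discussion. The subject terms are illustrative, not assigned by an indexer.

```python
from html import escape


def dublin_core_meta(record):
    """Render a record as one HTML meta tag per Dublin Core element value."""
    lines = []
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<meta name="DC.{}" content="{}">'.format(
                    escape(element, quote=True), escape(value, quote=True)
                )
            )
    return "\n".join(lines)


# Description of this paper; the subject terms are invented for illustration.
record = {
    "title": "Will indexers be redundant by the year 2005?",
    "creator": "Tony Barry",
    "date": "1997-09-22",
    "subject": ["indexing", "metadata"],
}
tags = dublin_core_meta(record)
```

Tags of this kind sit in the header of the document, where an indexing robot can collect them without any knowledge of cataloguing rules.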
It is likely that the capabilities of this standard will be enhanced by the following developments -
- A move from HTML to XML, which is better suited to carrying meta
information, as a standard for document markup
- The development of the Resource Description Framework (RDF), in which the Dublin Core elements
can be embedded
- The development of mechanisms to distribute RDF information from independent servers
- The possible integration of the Meta Content Format (MCF), developed by Apple and adopted by Netscape, into the suite of standards
- The involvement with W3C of the big players in the internet, Microsoft and Netscape,
who apparently met in the last week of August to
hammer out a specification based on a combination of the standards.
Confusion of roles?
I believe that the traditional roles and division of responsibilities between those
groups who support communication between author and reader are in a state of flux.
There is likely to be a transfer of some responsibilities between them,
and a new pattern will emerge which may in the long term bear little relationship to that
which holds in the printed world. We may in fact end up with fewer players than
there are now, or with more. To explore this further I have created a simplistic model.
Initially I looked at some of the major functions currently carried out in the sectors and
at what is happening as they enter the networked environment. This is summarised
in the table below.
| Function | Publisher | Abstracting and Indexing Service | Bookseller | Library | PICS | Search engines |
| Publish | Yes | No | No | No | No | No |
| Filter | Editorial control | Yes | Some | Collection development | Add to server | No |
| Index | CIP/Meta data | Yes | Starting to | Catalog | Meta data | No |
| Database | Starting to | Yes | Starting to | OPAC | No | Yes |
| Assistance | No | No | Some | Yes | No | No |
Function comparison table
Indexing is an activity which is starting to spread across all sectors, as it is an
essential supporting function required to make any extensive information system work.
The development of the protocols mentioned above creates the prospect of completely
new methods of information selection, indexing and delivery. The key to these possibilities
is the potential to decouple those things that are presently tied together because
of the physical format of paper. A variety of sources of information, if all on
the network, can be assembled as needed in different ways by different groups to deliver
quite different information services. Initial conflicts are already arising in the
courts where groups have used the network to enhance their services by linking into
other commercial services across the network and delivering that information as if it
were their own. Some of the interlinking services mentioned above provide further
examples.
The possibility exists for the following services to be supplied on an independent
basis which could be tapped into and used as required -
- Referee assessments could be supplied remotely and independently of the publisher
via a PICS/RDF server. This gets round the vanity press problem so prevalent on
the net (I do it myself!).
- Indexing information for individual documents could be supplied via a
PICS/RDF server and collected by multiple index databases as
required. In the same way that central agencies now supply cataloguing
information for books and this information is used by multiple libraries,
the same model could be used for other indexing information.
- Indexers could rely on a remote rating service to provide them with suggestions
on which documents to index, and could then sell the resulting indexing to search engines.
- Search engines could rely on selected indexing sites to collect their indexing terms,
based upon selection of material provided by other referee sites.
In this way the selection of which material is of value could be provided to multiple
sites. For instance a physics site could rate physics information on the network.
This could be collected by multiple indexing sites each of which could index the
material using different and competing thesaurus schemes. Competing index database services
could then collect the indexing information to generate different services providing
value added cross linkages to additional services.
This approach is set out in figure 2 below.

Figure 2
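The flow of information in this model can be sketched as three independent steps in Python: a referee site rates documents, competing indexing sites apply their own thesauri to the selected documents, and a database service merges the results. The site addresses, ratings, thesaurus entries and function names are invented for the example, and a real service would exchange this information across the network and not within a single program.

```python
def select(ratings, minimum):
    """Referee site: keep the documents rated at or above the minimum."""
    return sorted(url for url, score in ratings.items() if score >= minimum)


def index_with(thesaurus, documents, selected):
    """Indexing site: assign preferred thesaurus terms to selected documents."""
    entries = {}
    for url in selected:
        words = documents[url].lower().split()
        entries[url] = sorted({thesaurus[w] for w in words if w in thesaurus})
    return entries


def build_database(*indexes):
    """Index database service: merge the output of several indexing sites."""
    database = {}
    for entries in indexes:
        for url, terms in entries.items():
            for term in terms:
                database.setdefault(term, set()).add(url)
    return database


QUARKS = "http://physics.example.edu/quarks.html"
CRANK = "http://physics.example.edu/crank.html"

documents = {QUARKS: "quarks and leptons", CRANK: "perpetual motion"}
ratings = {QUARKS: 5, CRANK: 1}  # issued by a hypothetical physics referee site

# Two competing thesaurus schemes covering the same subject area.
thesaurus_a = {"quarks": "Elementary particles", "leptons": "Elementary particles",
               "motion": "Mechanics"}
thesaurus_b = {"quarks": "Quarks", "leptons": "Leptons"}

selected = select(ratings, minimum=3)
index_a = index_with(thesaurus_a, documents, selected)
index_b = index_with(thesaurus_b, documents, selected)
database = build_database(index_a, index_b)
```

Because each step consumes only the output of the one before, any of them could be carried out by a different organisation. The low rated document never reaches either indexing site, so no indexing effort is wasted on it.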
Conclusion
We are in the most profound period of change in human communication since
the invention of printing. Those industries and services that have grown
up to support communication via printing will need to find new modes of
information service to continue in a networked world. The manner in which
communication between author and reader will be supported is not yet clear.
The model provided is just one of many which are possible.
Whatever organisational structures emerge, one thing remains clear. The ability to
find high quality information will remain of central importance. This is not a
task in which human skills can be replaced by computers in the foreseeable future.
The indexer will retain a central role.
Tony Barry
Monday, 22 September 1997
Last revised Tue, 7 Oct 1997