The Seven Ages of Information Retrieval - UDT Occasional Paper #5
  
IN THIS DOCUMENT:
Abstract
Introduction
Childhood (1945-1955)
The Schoolboy (1960s)
Adulthood (1970s)
Maturity (1980s)
Mid-Life Crisis (1990s)
Fulfillment (2000s)
Retirement (2010)
What might go wrong?
What might go right?
References
UDT Occasional Paper # 5
The Seven Ages of Information Retrieval
Michael Lesk
Bellcore
Email: lesk@bellcore.com
URL: http://www.purl.net/NET/lesk
Abstract
Vannevar Bush's 1945 article set a goal of fast access to the
contents of the world's libraries which looks like it will
be achieved by 2010, sixty-five years later. Thus, its history
is comparable to that of a person. Information retrieval had
its schoolboy phase of research in the 1950s and early 1960s;
it then struggled for adoption in the 1970s but has, in the 1980s
and 1990s, reached acceptance as free-text search systems are
used routinely. The tension between statistical and intellectual
content analysis seemed to be moving towards purelyg statistical methods;
now, on the Web, manual linking is coming back. As we have learned
how to handle text, information retrieval is moving on, to projects
in sound and image retrieval, along with electronic provision of
much of what is now in libraries. We can look forward to
completion of Bush's dream, within a single lifespan.
Introduction
Shakespeare described seven ages of man, [Shakespeare 1599].
starting from infancy and leading to senility. The history of information retrieval parallels such a life.
The popularization of the idea of information retrieval started in 1945,
with Vannevar Bush's article (still cited 96 times in the 1988-1995 Science Citation Index).
[Bush 1945]. And, given the current rate of progress, it looks like it
will finish by 2015 or so, the standard life-span for someone born
in 1945. By that time, most research tasks will be performed on
a screen, not on paper. There has, however, been a tension throughout the life of information
retrieval between simple statistical methods and sophisticated information
analysis. This dates to a memo by Warren Weaver in 1949
[Weaver 1955].
thinking about the success of computers in cryptography during the war,
and suggesting that they could translate languages. He wrote
``it is very tempting to say that a book written
in Chinese is simply a book written in English which was coded into the
`Chinese code.' If we have useful methods for solving
almost any cryptographic problem, may it not be that with proper
interpretation we already have useful methods for translation?''
Just as Vannevar Bush's paper began hypertext, Warren Weaver's
paper began research in machine translation.
In some ways, information retrieval has two parents. Bush reacted
to the success of physicists in projects such as microwave radio (and
of course the atomic bomb, which he could not yet mention publicly).
He talked of intellectual analysis, both by people and by machines.
Weaver reacted to the success of mathematicians in cryptography.
And he thought in terms of simple exhaustive processing, not
in terms of high-powered intellectual analysis. This is the same
dichotomy familiar in computer chess between the ``all-legal-moves''
style and the ``predicting good moves'' researchers; artificial
intelligence against statistics. Throughout the history of
information retrieval, we have seen both techniques.
The analytical process, the
Bush approach, can either use manual indexing or try for
artificial intelligence programs that will achieve the same
accuracy of information identification. The accumulation of
statistical detail, Weaver's approach, can be done entirely
mechanically with probabilistic retrieval techniques.
Of course, manual indexing pre-dates computers and information
retrieval technology.
Childhood (1945-1955)
Rereading the original Bush paper, and looking at it from today's standpoint,
the hardware seems mostly out of date, but the software goals have not
been achieved. Bush, of course, did not realize the progress that would
be made in digital technology, or in semiconductors, and wrote about
bar coded microfilm and instant photography. We can do better today than
what he described; we have digital recording and Polaroid film. On the software
side, by contrast, he expected speech recognition; we don't have it yet.
Figure 1 shows some of the predictions Bush made in his paper, and where
we now stand.
[FIGURE 1]
Note that we have all the photographic inventions,
but they have been less important than he expected for information
processing. For example, he thought that ultramicrofiche (100X reduction)
would be particularly important. In practice, if a library can save
95% or move of its shelf space costs by going to 20X reduction, and that doesn't
justify microfilming, saving a few more percent isn't going to help.
For example, the physical cost of printing a typical book is about 15% of
its cover price. Even if it could be reduced to zero, that would not permit
a library to double its purchases. And the difference between reducing the
printing cost to 1% of the book cover price and half a percent of the
cover price is insignificant.
Bush also talked about automatic typing from dictation and OCR. Neither
of those has quite been achieved yet. By contrast, most of his
specific predictions for computers were achieved in the 1960s: computers
selecting their own data, performing
complex arithmetic, recording their results for later use, symbolic
mathematics, and computerized telephone switching systems.
Surprisingly, all these organizational goals were achieved before
his specific performance goal, machines doing 10M operations/second.
His storage predictions are not quite met yet. He wrote of a user
inserting 5000 pages per day into a personal repository and it taking
hundreds of years to fill it up. At 30 Kbytes/page (he was talking
photography, after all), with 250 working days per year, in 200 years
the user would have accumulated 7.5 terabytes. As I write, storage
in the software research area at Bellcore is about 2 GB per person;
a particularly profligate user (such as me) has 20 GB, but still well
short of a terabyte. Even if OCR worked, and the 30Kbytes per page were
reduced to 3 Kbytes, we're still not there in terms of storage capacity
per person.
Most important is the design of the user interface that Bush predicted.
He emphasized individual interfaces, with particular trails through
information space that would
be personalized to the user, and traded among people. Until
recently, the entire development of information retrieval was
aimed at uniform interfaces, that reacted the same way to each user,
and which did not depend on an individual's organization of personal
files. This has meant, for example, that most people have much better
searching tools available for the printed literature than for their
own personal notes. By contrast, when most scientists or engineers
want information, they look first in the closest sources. They consult
personal notes, ask the nearest colleague, ask their supervisors or other
local authorities, and only much less frequently do they look in journals.
Asking a librarian is a desperate measure.
[Schuchman 1981].
Bush envisaged a system that first of all, would provide the individual
notes and memories of the scientists. Only later did he suggest that
people would buy conventional publications and insert them into the
Memex. This would have been a better match to what people need than
the direction information services actually went, but it was so difficult
before the 1970s to get information into computers that the kind of system
Bush wanted was not practical.
In the 1950s, however, the Soviet Union sent up the first artificial
Earth satellite. The success of Sputnik I prompted widespread fears that
the United States was falling behind in science, and a realization that there
was relatively little knowledge of Russian science in the United States.
In addition to funding Russian language studies and machine translation
research, Sputnik produced a desire
to improve the efficiency of scientific information transfer.
Stories that some company or other had spent $100,000, or perhaps
$250,000, redoing some research because they could not find the
original circulated as urban legends but helped to justify
research into information retrieval.
The very first systems were built in the 1950s. These included the
invention of KWIC indexes, concordances as used for information
retrieval, by such researchers as H. P. Luhn. Figure 2, for example, is
a part of the list of occurrences of ``train'' in the Sherlock Holmes stories.
[FIGURE 2]
Concordances are still used where a small text is to be
studied intensively; they are too bulky to be printed for any kind of
large collection. However, the simple "every occurrence of every word"
kind of processing they represent survives today in many commercial
retrieval systems.
There were also innovative attempts at alternate kinds of machinery, such as the
use of overlapping codes on edge-notched cards by Calvin Mooers.
The most famous piece of equipment of this period was the WRU Searching
Selector, built by Allen Kent at Western Reserve University (now
Case Western Reserve). All of this special purpose technology was swept away by digital systems,
of course. I could no more find edge-notched cards today than I could
find IBM punched cards.
The Schoolboy (1960s)
Shakespeare's next time of life is the schoolboy, and in fact the late
1950s and 1960s were a time of great experimentation in information
retrieval systems. This period also saw the first large scale information
systems built. Many of the current commercial library systems such
as Dialog and BRS can be traced back to experiments done at this time.
The early 1960s also saw the definition of recall and precision and
the development of the technology for evaluating retrieval systems.
It also saw the separation of the field from the mainstream of computer
science. In fact, the 1960s were a boom time for information retrieval.
More people attended the IR conferences then, for example, than have attended
the SIGIR conferences in the last decade.
The first experiments that were done, with such systems as NASA's RECON,
used straightforward and simple technology. They largely used mechanical
searching of manual indexing, keying in the traditional indexes that had
been printed on paper, but were much easier to search by machine. In
particular, it is difficult with traditional indexes to do the ``and''
operation. Users wishing to find all articles in Chemical Abstracts
which mention two particular compounds must look up both and then try
to find the abstract numbers mentioned in both lists. This is something
computers can do very well compared with people. Computers can also
deal with quantity well. While the number of index terms assigned
to each document in a database which is manually indexed and printed on paper
must be limited to prevent the printed version becoming unwieldy, with
computers this does not matter. Thus, the advent of computerized databases
meant that indexing could become more detailed, although this promptly
meant that indexing could become too expensive to do properly.
Thus, the idea of free-text searching arose. There could be complete
retrieval of any document using a particular word, and there would be
no cost for manual indexing. This idea became and has remained very
popular, but immediately also attracted objections from those who pointed out
that indexing was not just a way of selecting words but was a way of
choosing the right words, either the words that were more important, or
the ones that had been selected as the correct label for a given subject.
The latter was enforced, in many indexing systems, by official vocabularies,
so that people would not be frustrated by finding one document indexed with
the word ``cancer'' while another was indexed ``neoplasms.'' Thesauri
such as MeSH (Medical Subject Headings) or the Inspec Thesaurus had attempted
to standardize nomenclature.
This created several questions. Was free-text searching of acceptable quality?
If controlled vocabularies were preferable, was there a way of translating
to them or creating them automatically? These problems stimulated the
development of evaluation techniques, led by Cyril Cleverdon of the Cranfield
College of Aeronautics (now Cranfield Institute of Technology). Cyril developed
the mathematics of `recall' (fraction of relevant documents retrieved) and
`precision' (fraction of retrieved documents that are relevant) as measures
of IR systems, and built the first test collections for measuring them.
There is a tradeoff between the two measures: if a system simply retrieves
more documents, it is likely to increase recall (with more retrieved documents,
a larger fraction of the relevant ones are likely to appear) but to decrease
precision (since there are also more opportunities in a longer list of
retrieved documents to retrieve non-relevant material). Figure 3 shows
one of the early recall-precision graphs, taken from the Aslib Cranfield
project report.
[Cleverdon 1966].
[FIGURE 3]
A series of experiments followed, run not only by Cyril and his
co-workers (notably E. Michael Keen, now at the College of Librarianship
Wales in Aberystwyth and F. Wilf Lancaster, now at the University of
Illinois) but also by the late Gerard Salton, at both Harvard and Cornell.
They showed that the value of manual indexing was low, if any, at least
on the small test collections being used. Free-text indexing was as
effective, and implicitly a great deal cheaper. Efforts to see whether
terms could be assigned to thesaurus categories automatically or whether
better thesauri could be made, following a variety of methods. A long
series of experiments was run, and in fact the test collections created at
that time are still in use, even though the largest was only 1400 abstracts
of aeronautical documents.
[Cleverdon 1970].
As these experiments were being done, new retrieval techniques were
invented. Key among these was relevance feedback, the idea
of augmenting the user's query by adding terms from known relevant
documents. This was one of the most effective techniques, since a
problem from that day to this in retrieval experiments is that users
often specify very short questions, and the systems can do better with
longer queries. Another early experiment was multi-lingual retrieval,
using a bilingual thesaurus and mapping words from both languages into
the same concepts.
[Salton 1968].
All these experiments started to establish information retrieval as
an unusual software discipline, in which evaluated experiments,
statistical methods, and the other paraphenalia of traditional
science played a major role. In contrast, the 1960s also was
the start of research into natural language question-answering.
Artificial intelligence researchers wondered why retrieval systems
had to be limited to retrieving documents which the user would still
have to read to find the answers to questions. Researchers such as
Terry Winograd, Daniel Bobrow, and others began building systems
in which actual answers were retrieved from databases, while others
looked at the problems of syntactically analyzing documents
to either detect data elements or do phrase matching for
retrieval. At this stage of development, these systems were
fragile and ran only on a few examples. The information retrieval
community had run its own experiments on the use of phrases, for
example, and tended to reject these devices.
Throughout this period of the 1960s, there was relatively little
actual computerized retrieval going on. Most of the work was
still research and learning, and ambitious though some of it was in
principle, there was no access to really large amounts of machine-readable
text with which to build large systems. This was, however, about to change.
Adulthood (1970s)
During the 1970s retrieval began to mature into real systems. Perhaps
the most important cause was the development of computer typesetting, and
later word processing, which meant that lots of text was now available
in machine-readable form. Computer typesetting started in the mid 1960s,
and even earlier than that there were paper-tape driven Monotype machines
whose input tapes could be converted for use by computers, but the large
scale adoption of cold type came in the 1970s. By the end of the decade
both hot lead and conventional typewriting were obsolete, and most material
that was printed was passing through a stage of computer input. This provided
piles of material for retrieval systems.
The other key technology motivator was the availability of time-sharing systems.
Instead of queries being processed in some background batch operation, it
was now possible to present them directly from a terminal and get an
answer immediately. This made retrieval much more practical, and
systems developed to offer such services to librarians. Time-sharing,
in turn, became possible as disk drives became cheaper. Figure 4 shows
the price of disk space by time; as much as anything, this chart has
driven the information retrieval industry.
[FIGURE 4]
Among the earliest large-scale systems were commercial systems such as
Dialog, Orbit and BRS. These fed off the computerization of abstracting
and indexing services, which were early adopters of computer typesetting.
At Chemical Abstracts, for example, the delay of combining many feet
of yearly indexes into a five-yearly cumulative index was becoming
intolerable; done manually, it looked like this might take longer than five
years to do. Computerization made it faster and easier to combine first
the weekly indexes into a yearly cumulation and then the yearly volumes into
the five-yearly Collective indexes. The tapes used to do this could then be
provided to services such as Dialog for access by professional librarians.
Another early system was OCLC, the Online Computer Library Center (although
at first it expanded its acronym into
Ohio College Library Center). Founded by Fred
Kilgour, it used the output of the Library of Congress MARC program for
machine-readable cataloging. The Library of Congress cataloged books;
OCLC acquired tapes of this cataloging; and it then printed catalog cards
for member libraries, saving them the work of preparing these cards
themselves. OCLC also introduced the idea of cooperative work. The Library
of Congress catalogs about 1/3 of the book titles bought by
the OCLC member libraries, and about 2/3 are not cataloged by LC. These
represent foreign publications and items (such as technical reports,
dissertations, and so on) which are not conventional book publications.
Member libraries cataloged these books themselves, provided the cataloging
to OCLC, and other libraries could then access that record. This is, as far as
I know, the first large-scale example of computerized cooperative intellectual
organization of information, as Bush had envisaged.
All of these systems used relatively simple and limited searching systems.
The OCLC search specification, for example, was the first four letters of
the author's name and the first four letters of the title. The online systems
were better, but they still limited themselves to Boolean searching on
text-word matching. Since they often had index terms as their data content,
they were in fact using manual content analysis, although often the abstracts
were searched more than the index terms.
Although most of these systems operated off abstracting or indexing,
this decade saw the start of full-text retrieval systems. Lexis, later
absorbed by Mead Data Central (and more recently by Reed-Elsevier), provided
the complete text of court decisions. Other early systems were derived from
newspapers, among the first non-reference publishers to go to computerized
typesetting: they included the Toronto Globe and Mail and the New York Times
Information Bank.
During this decade, the kind of research that had been done in the previous
decade went into a decline. The 1970s and 1980s were a period when both
databases and office automation research boomed; one might have thought that
information retrieval, situation between the two areas, would have done well,
but in fact it lost out. Part of this represented government politics; NSF,
for example, had funded quite a bit of information retrieval research in the
1960s, but now started to think about what it meant to be responsible for
the availability of
scientific information. Did that mean that NSF should be funding research
in the distribution of information, or the actual distribution of information?
In practice some money was shifted away from research, and in addition there
was some disillusion with promises that had been made in the 1960s and that
did not seem to be coming true fast enough. Arthur Samuel, for example,
wrote in 1964 that by 1984 libraries would have ceased to exist, except
at museums.
[Samuel 1964].
Other promises, such as those associated with machine translation, had
also perhaps encouraged funders to look elsewhere. The
rise of computer science departments also perhaps had negative effects on
IR research, as it often did not fit into the categories of highest
importance in those departments, and in the earlier anarchy it was perhaps
easier for oddball researchers to find a place.
There was some research progress, of course, and perhaps the most important
was the rise of probabilistic information retrieval, led by Keith van Rijsbergen
(now at the University of Glasgow). This involved measuring the frequency
of words in relevant and irrelevant documents, and using term frequency
measures to adjust the weight given to different words. Although these
algorithms had computational problems on the equipment of the day, a simpler
variant (term weighting) was one of the few techniques that seemed to improve
performance over the simple word matching that was prevalent.
What about the AI researchers who were trying to do intellectual analysis
automatically? The problems of overpromising results for machine
translation and computational
linguistics affected them as well, and the key subjects in the 1970s were
speech recognition and the beginning of expert systems. Expert systems,
in particular, occupied the same niche in buzzword space that ``intelligent
agents'' do today. Programs were going to go out and retrieve data
in the way that a librarian did. For example, one author wrote
``The 1980s are very probably going to be the era of the expert system, and
by the end of the decade it is possible that each of us will telephone an
expert system whenever we need to obtain advice and information on a range of
technical, legal, or medical topics.''
[Peat 1985].
In practice, there was a split between the AI community and the IR
community. The AI researchers felt that they were attacking more
fundamental and complex problems, and that there would be inherent
limits in the IR string-searching approach. For example, I have
an easy time looking for my name in retrieval systems; it is not
an English word nor is it a common name, and when I search for it
and find a hit that is not me it is most likely to be either my
brother or my second cousin. But I worked once with a man named
Jim Blue, and you can imagine the problems looking for his name in
a dumb database. Even worse, consider the chemists named Bond or Gold.
The AI researchers felt that without some kind of knowledge understanding
these problems were insoluble. Meanwhile, the IR camp felt the
AI researchers did not do evaluated experiments, and in fact built
only prototypes which were at grave risk of not generalizing.
The best-known demonstrations of the AI systems were from the
group at Yale led by Roger Shank (now at Northwestern University).
These programs mapped information into standard patterns. For example,
a typical medical drug evaluation experiment can be imagined as ``we took
some number of rats, suffering from disease X, and gave them the following
quantities of the drug being tested, and the following numbers of rats
at each dosage were cured.'' Shank's group constructed such schemas
for a number of common activities, e.g. ordering in restaurants, and
then took natural language descriptions of these activities, picked out
the information that appeared to fit slots in the frames, and thus
constructed a semi-formal representation of the information. They could
then take queries about such subjects, e.g. vehicle accidents on the
United Press newswire, and retrieve actual answers. These programs
ran on a restricted set of examples, and produced much argument about
whether they were in the end going to develop into practical retrieval
systems. Some of these systems, such as the LUNAR system of Bill Woods
or the Transformation Question Answerer (TQA) of Stan Petrick and Warren
Plath were evaluated, but they tended to operate off databases rather than
text files.
[Schank 1973].
Part of the work in AI was the elaboration of languages designed for the
precise representation of information, called knowledge representation
languages. These languages attempted to use logical notation to
represent general knowledge, so that it could be processed by programs.
Several such languages were defined, and their proponents suggested
that large amounts of information could be written in them and used
by expert systems. Unfortunately, as with earlier AI work, there
were great arguments as to the reality of these proposals.
Maturity (1980s)
During the 1980s, the steady increase in word processing and the steady
decrease in the price of disk space meant that more and more information
was available in machine-readable form and was kept that way. The
use of online information retrieval expanded in two main ways. One was
the availability of full-text instead of just abstracts and indexing;
the other was the spread of online retrieval into use by non-specialists,
as libraries replaced or supplemented their card catalogs with online public
access catalogs. Meanwhile, the IR research community began to come back,
with new work on term analysis and related areas. Artificial intelligence,
boomed and then crashed. And the development of the CD-ROM
provided an entirely new route for information delivery, which seemed for
a while to threaten the whole idea of cooperative online work.
There was an enormous increase in the number of databases available
on the online systems. Figure 5, from the Gale Research
Directory edited by Martha Williams, shows the
steady increase in the number of online databases. Not only did
the number increase, but new kinds of databases, involving full text
and numerical data, appeared on the online services.
[FIGURE 5]
OPACS also developed during the 1980s. Many libraries, thanks to OCLC
and similar cooperatives such as the Research Libraries Group (RLG),
had full machine-readable records from the mid 1970s. They began making
search systems available, and some even converted their historical catalogs
to machine-readable form and provided full online catalogs. This included
the Library of Congress with the REMARC project, providing an extremely
large and complete book file. By the end of the decade, commercial
vendors provided OPAC software (eg NOTIS and GEAC).
Full text online blossomed in this decade. Many current magazines and
newspapers were now online, albeit with text only. The full text of
American Chemistry Society journals, for example, was provided by
the online system originally named Chemical Journals Online (CJO) and
now called Science and Technology Network (STN).
In the research community, a resurgence of effort on information
retrieval techniques started. The new material which was available,
and the interest in smaller programs that would run on individual workstations,
produced new research in retrieval algorithms in places like the University
of Massachusetts (Bruce Croft) and Virginia Tech (Ed Fox). Although some
were still looking at the mathematics of the vector space model defined by
Salton in the 1960s, there was increasing interest in new kinds of retrieval
methods. For example, research on sense disambiguation using machine-readable
dictionaries, as a way of helping distinguish different meanings of the
same words (the familiar problem of distinguishing ``blind Venetian'' from
``Venetian blind''). There was work on bilingual retrieval, and on
part of speech assignment. In fact, after many years of frustration, it
began to appear that some computational linguistics might be useful in
retrieval, but it was going to be the statistical kind of retrieval,
rather than the AI approach.
Few of these techniques, however, were used in any of the commercial systems.
Those went along with the same kinds of retrieval algorithms that had been used
for decades now. In fact, in some ways the commercial systems appeared not to
care very much about retrieval performance. If one looked at their
advertisements, they boasted primarily about the number of databases they
had, secondarily about their user interface, and not at all about retrieval
performance. Part of the problem was that the whole retrieval evaluation
paradigm was based on small test collections, and there were almost no
evaluations of full-scale systems. In addition, the evaluation work had shown
that there was huge scatter across questions in the performance of retrieval
systems, so that it was not possible to offer any confidence, in talking about
the performance of a system, that any particular question would be close to
some average performance level. Whatever the reasons, the research community
felt frustrated that their techniques were not being employed in the industry
they had started.
The AI community continued expert
systems and knowledge representation languages. One of the most ambitious
plans was the CYC system of Doug Lenat at MCC; this was an attempt to
encode all ``common-sense'' knowledge in a formal structure, so that a
computer could do ordinary reasoning in any subject areas. This was
part of an effort to respond to the ``Fifth Generation'' computer being
studied by the Japanese, a machine that would specialize in logic and
symbolic processing. In the early part of the 1980s there was great
enthusiasm for expert systems and knowledge-based projects. Later in
the decade, however, the general trend away from support for basic
research plus the failure of expert systems to deliver on their initial
promises (or over-promises) caused a movement away from this area, sometimes
called ``AI winter.'' In fact, far from believing that any knowledge expressed
in language can be translated into a single formalism, it is probably now
respectable again to mention the Whorfian hypothesis which suggests that
the specific natural language used to express information constrains the
kind of information which can be described, and is really a part of that
information.
A key technology change was the widespread use of the CD-ROM. By the
end of the decade, most libraries had at least one CD-ROM drive, and
CD-ROMs were being used regularly to distributed information. They
fit the traditional publishing paradigm very well (stamped out, put
in the mail, and hard for the users to copy), so they could be economically
distributed and easily used. Their large size for the time (650 MBytes, with
a very low production cost) meant that big databases, appropriate even
for full-text files, could be distributed. Computer networking also
continued to develop in this decade, but the CD-ROM fit so well with
traditional information publishing economics that it would develop into
a real threat to the online systems that had been growing rapidly for
two decades; see Figure 6, with data collected by Martha Williams.
[FIGURE 6]
On balance, this was a decade in which online information became common,
even if still not used by most people most of the time; and in which
it appeared the Weaver followers, users of statistics, had basically routed
the Bush followers, urging detailed content analysis.
Mid-Life Crisis (1990s)
By 1990 information retrieval could be said to be 45 years old, and it
was time for a mid-life crisis. Things seemed to be progressing well:
more and more text was available online, it was retrieved by full-text
search algorithms, and end-users were using OPACs. Figure 7 shows
the sales of online vendors in 1993.
[FIGURE 7]
Nevertheless, in
some ways the subject was still unfulfilled; it was still an area primarily
of interest to specialists in libraries. A few efforts to sell online
services to ordinary people (BRS After Dark, Dialog Knowledge Index) had
not been big successes. And there was little use of advanced retrieval
algorithms in the commercial systems. Word processing was everywhere,
and yet most of the material keyed in was printed out and the machine-readable
text not used any further.
This decade, however, has produced another technological revolution
in the Internet. Even one year ago, in 1994, people were saying to
me ``only about 15% of home computers have a modem, who cares about
online information?'' Now we see forecasts that within ten years
everyone will be on the net; one million people a month are signing up.
What is remarkable is not that everyone is accessing information, but
that everyone is providing information. For decades, information
distribution had been a few large publishers sending information to
many consumers. Now the consumers are generating their own information
and classifying it themselves, as all sorts of people generate their
own home page and link it to all kinds of resources they care about.
Comparing this with Bush's forecast, we see that his model of information
storage is coming true. Each user is organizing information of personal
interest, and trading this with others. What is also remarkable is that
it is happening entirely on a free basis, without much support from
industry. Admittedly much of the information is of low quality,
but that does not seem to make it less attractive. In addition to the
individual hyperlinks, we also have some classifications (e.g. Yaho,
`yet another hierarchical organization'). There are now
over ten million pages on the net (a guess from the Lycos group). In the
late 1980s it became normal to expect everyone to have a fax number; now
everyone is expected to have a home page. Figure 8 shows the growth of
the Web.
[FIGURE 8]
Although there are many free pages, there are also many groups planning to
sell information on the net (and some already doing it). The Internet
is now a standard medium for publishing, thanks to the development of
Mosaic by Marc Andriessen of NCSA (now at Netscape). The availability
of pictures has provided an attractiveness, and thus a growth rate that
neither ``gopher'' nor WAIS was able to achieve. Pictures do require
a great deal more space, but as the curve of disk prices shown before
in Figure 4
suggests, it is now affordable. This year, in fact, the world's hard
disk industry will ship over 15 petabytes of disk; the equivalent of
2 MB for every person on earth (I can remember when I, as a researcher
at one of the prime industrial labs, was hassled to keep my disk usage
under 200 Kbytes).
Another technology that is lifting off in the 1990s is scanning.
Until Mosaic, most online information was text-only. Publishers
attempting to deal with the graphical content of their publications did not
know what to do, until the advent of cheap disks and still cheaper CD-ROMs
made it possible to simply scan all the pages into images and sell these
images. Many publisher projects and library preservation projects rely
on imaging, since scanning a page costs between a few cents and 25 cents per
page, depending on paper and print quality, while keying it will cost
$2-$3 per page.
Today we see publishers doing both kinds of information sales. Some put
full text into an online system, and some sell page images via CD-ROM.
Sometimes the same journal is available both ways. Even more interesting is
the phenomenon of the electronic which is only published online. There are
several hundred such today, most free but some charged for; OCLC, for example,
is publishing scientific journals for AAAS and other publishers in online
format. The economic advantages of this format may well mean that many
smaller journals either move to this style of net distribution or disappear,
as soon as the universities figure out how to deal with tenure cases based
on online publication.
In the research world, the IR community was suddenly delighted to find
the technologies with which they had experimented for decades actually
in use. Bruce Croft's NSF Center for Intelligent Information
Retrieval at the University
of Massachusetts, for example, provides the software for the
Thomas system at the Library of Congress (the online Congressional
Record). Gerry Salton's software was used to make cross references
for a CD-ROM encyclopedia. And relevance feedback is a key ingredient
in the WAIS software of Brewster Kahle (now bought out by America Online).
There is now an active industry in text retrieval software, as shown
in Figure 9.
[FIGURE 9]
Politically, the opinion of information retrieval boomed as the Clinton
administration started. Vice President Gore had been a long-time supporter
of computer networking, and the new Administration looked at digital
libraries as a way of coping with the needs of America for better
education. Computer networks, the Vice President suggested, would
bring information to every student; he described, for example,
his ``vision of a schoolchild in my home town of Carthage, Tennessee, being
able to come home, turn on her computer and plug into the Library of
Congress.'' In order to produce this result, the Federal government started
a Digital Library research initiative
from NSF, NASA and ARPA.
[Fox 1995].
Under it, six universities are leading consortia to study different
aspects of digital library construction and use.
Figure 10 shows the projects funded by the digital library initiative,
and several other major digital library projects in the United States.
[FIGURE 10]
On the side of retrieval evaluation, there was a sudden jump, after
thirty years, to realistic collections for evaluation. The TREC
conferences, run by Donna Harmon of NIST, distributed a gigabyte
of text with hundreds of queries, as a way of judging the performance
of retrieval systems. TREC results confirmed that there is still
very large scatter in the performance of retrieval systems, not only
by question but even over individual relevant documents within
answer lists. A ``best-hindsight'' system (pick the right answers
from different systems for each query) would outperform any of the
actual systems by a wide margin.
Thus, we still need research to try to understand
how people phrase questions and how we can get questions composed
so that we can provide good answers to them.
Fulfillment (2000s)
What should happen in the next decade, when IR will be 55 to 65?
I have labeled this, optimistically, as `fulfillment;' Shakespeare
had it as an age of a comic character. Which will it be? I believe
that in this decade we will see not just Bush's goal of a 1M book
library, so that most ordinary questions can be answered by reference
to online materials rather than paper materials, but also the routine
offering of new books online, and the routine retrospective conversion of
library collections. We will also have enough guidance companies on
the Web to satisfy anyone, so that the lack of any fundamental advances
in knowledge organization will not matter.
Where we will need research is in the handling of images, sounds and
video. Nearly the search algorithms we have today are based on
text and words; we are all Whorfians with a vengeance, after the
decline of controlled vocabularies and AI research into computational
linguistics. So what will we do with the very large volumes of
pictures and videos that need to be retrieved? Again, there is the
Bush choice in which individuals will make indexes and pointers; and
the Weaver choice, in which projects like the IBM QBIC system will be
expanded and make it possible for us to do content retrieval on
images.
[Flickner 1995].
Looking at Bush's goal of a 1M volume library, we should note that
Mead Data Central today has the equivalent amount of material, if not
in the form of a general book collection. At 1 MB of ascii per book,
the 2.4 TB at Mead already are the equivalent of more than 1 M books.
But for a general collection we need the economic arrangements that
will let publishers and authors provide each new book and journal in
electronic form, perhaps in addition to the paper form for many years.
For academic publishing, this may not be a problem (since the authors
are not paid, and many university presses run at a loss, we may find that
a transition to self-publishing is not very painful). But for commercial
publishing, we need a way for all to be treated fairly: authors, publishers,
illustrators, photographers, and readers. Many companies are working
actively on this; see Figure 11. We desperately need some leadership
in these economics areas; given the number of people working on it,
we'll probably get some.
[FIGURE 11]
In addition to the provision of new material, we are seeing libraries
converting their old material. The Cornell CLASS project set a cost
figure of $30-$40 per book for scanning;
[Kenney 1992].
in addition it costs perhaps
$10 for the disk space to store the converted book. This is now
comparable to the cost per book of new bookstacks built on campus
(the new stack at Cornell is $20/book, and at Berkeley the stack
under construction is $30/book). If several libraries could agree
to scan an out-of-print book and move it to a warehouse, they would
all save money over building stack space for it. Of course, they
would save even more money just by moving the book to an off-campus
warehouse. The online book, however, is likely to be perceived as
much more usable. Experience with online catalogs, for example, is that
when a university has about 1/3 of its books in the OPAC, and 2/3 still
in an old card file, the students start ignoring the old card file,
and some form of retrospective conversion becomes desirable. The Mellon
Foundation is now supporting project JSTOR, which is scanning ten major
economics and history journals back to the beginning as a replacement for
shelf space.
So, we can see how, in the next few years, we'll have a large book
library. What about pictures, sounds and video? All pose problems
in indexing and searching, which may be overcome simply by people
creating their own trails. Video poses, in addition, a storage problem.
Going back thirty years, 2000 80-column punched cards held
160 Kbytes, weighed 10 lbs, and occupied 0.2 cubic foot, a storage
density of 0.016 MB/lb or .8 MB/cu ft.
Exabyte 8mm cartridges hold 5 GB, weigh 3 ounces (0.19 lb) and have a size
of 5.5 cubic inches or 0.003 cubic ft; a storage density of 1100 GB/cu ft
or 27 GB/lb. This means that there is an improvement in bytes/lb of a
factor of 1.6 million and in bytes/cu ft of 1.3; comparing
punched cards with CD-ROMs
instead of Exabyte cartridges,
the improvement is somewhat less, perhaps 300,000 to 500,000.
But look at how much space video takes: a page which might contain 2000
bytes will take about 3 minutes to read aloud. A video of that reading,
even compressed by MPEG-1 will be over 30 Mbytes, and for the quality
imagined for decent video will be perhaps 100 Mbytes, or 50,000 times as
much as the text. So we have given back much of the improvement in
storage space. It will take another generation of storage space cost
decreases before we can move from each researcher having a single, precious
digitized video item to large, useful libraries of digitized video.
When we do have video libraries, we will need to build better systems for
retrieving material from them. In research terms, we will need to move
into more serious image recognition and sound recognition research, areas
which have been more promising than computational linguistics (as one might
expect, given that pigeons can be trained to do image recognition, but
not language understanding). We also can look to some new techniques
in cooperative browsing, in which the opinions of people about videos
or sounds can be used to guide others.
[Hill 1995].
Retirement (2010)
Shakespeare gives the last age as senility, but we'll be more
optimistic. At this point, 65 years after Bush's paper, we can
imagine that the basic job of conversion to machine-readable form
is done. It is not likely to be complete: there are still many
manuscripts in European libraries that have never been printed, and
there are likely to be many books that are never converted to
computerized form. But these will not be the key items, but the
obscure ones. We can imagine, for example, that central library
buildings on campus have been reclaimed for other uses, as students
access all the works they need from dormitory room computer.
A few scholars, as with those who now consult manuscript and archive
collections, go off to read the obscure books.
Most of the students are probably not just reading, anyway. By
this time, multimedia information will be as easy to deal with as
text. People will have been through the picture files, even the
9 million Library of Congress photographs, and noted which are
relevant to which subjects, linking them in to pages of
directory information. Most students, faced with a choice between
reading a book and watching a TV program on a subject, will watch the
TV program. Since there are likely to be fewer TV programs than
books on any given subject, this limits their choice.
The US has about 60,000 academic-type books published each year, while even if
we allow ten primary TV channels (4 network and six cable) broadcasting
4 hours of prime time shows 365 days per year, that is only 20,000 or
so TV shows.
Educators will probably bemoan this process. They will suggest that
students are not getting enough diversity, not only because there is
too much emphasis on multimedia shows, but because universities have been
combining and getting more of their courses by remote learning. None the
less, there will be no stopping it, any more than we have been able to
keep the vaudeville theatres alive in small towns. The economics of university
libraries show that there is a large potential savings if we can deliver
material more directly from authors and publishers to readers; see Figure 12.
[FIGURE 12]
Internationalism will be a major issue by now, if not before. Although
the Internet has been heavily US-based and English-language up through
1995, it will now include the entire world and there will be many multilingual
trails. Methods for dealing with this, improving language teaching, and
otherwise dealing with the fears of countries who think that their culture
will be too peripheral in the Web, will be political issues in
the next century.
As for the researchers, there will be engineering work in improving the
systems, and there will be applications research as we learn new ways to
use our new systems. Perhaps there will even be thriving schools giving
PhDs in the arcane details of probabilistic retrieval. But many of the
scientific researchers, I suspect, will have moved on to biotechnology.
And only a few of us old fogies will remember that we were once promised
fully automatic language analysis and question answering.
What might go wrong?
This is all very optimistic, and suggests a long and happy life for
our imaginary IR person. What are the alternatives? Is there an
equivalent of an automobile accident that would leave IR stuck at
its present state? We all know about quad sound and CB radio; what
might happen to stop progress in online information availability?
The most obvious problems relate to the need to publish current
information with being overrun by illegal copying. This problem
destroyed the computer games industry in the late 1970s, and is
extremely serious for the software industry right now in many countries.
However, there is a great deal of work to sort this out, and it is likely
that acceptable solutions will be found.
The analogy of CB radio, however, is
unfortunately relevant. Many people already complain
that the Internet is so full of trash that it can no longer be used.
Some of this is the same as suburbanites wishing that they had been the
last people allowed to move into their town. But part of it
is indeed a problem with abuse by commercial operators (such as the
infamous Canter and Siegel `Green Card' postings), by the pornographers,
and by people who simply do not think before they clutter a bulletin
board going to many tens of thousands of people. Perhaps the easiest
way to deal with some of these problems would be an ethical position that
anonymous posting should not be allowed. This might remove a great deal of
the junk. A more likely outcome is that moderated bulletin boards become
more popular and unmoderated, unedited newgroups lose their readers.
Another, very serious problem, is that of the copyright law. At the moment
there are no procedures that involve low administrative overhead for clearing
copyright. Proposed revisions to the copyright statutes would extend its
scope, aggravating the issues. It is said, for example, that in preparing
a commemorative CD-ROM for the 500th anniversary of the first Columbus voyage
to America, IBM spent over $1M clearing rights, of which only about $10K went
to the rights holders; everything else went into administrative and legal
fees. It is possible that we could find ourselves in a strange world in which
everything being published now was available, since we imagine that the
economics of current publications is sorted out; and that material printed
before 1920 was available, since it was out of copyright; but that the
material printed between 1920 and 1995 is hard to get. We have to hope
for either administrative or legal progress in these areas.
We might also find that the Net worked, and that people could get commercially
published material, but that the freely contributed work of today was hard
to obtain. For example, suppose we wind up adopting technology, such as that
proposed by some cable companies, which provides a great deal of bandwidth
to residences, but relatively little in the reverse direction. This might
make it difficult for ordinary citizens to put forward their views. The
Electronic Frontier Foundation has been arguing for some years against this
danger and urging symmetric network designs for the NII.
Finally, there are some unfortunate analogies with activities like nuclear
power, childhood vaccines, and birth control technologies. In these areas,
various kinds of legal liability and public policy debates have hamstrung
technological development and availability.
There are undoubtedly readers of this article who
think that for some of these topics, the naysayers are right and that
technology should be stopped. We need to be aware that there are
groups which would like to limit the Internet for a variety of reasons,
whether the fear of net pornography, uncontrolled distribution of
cryptographic technology, or the gains to be made from libel or strict
liability lawsuits. We need to try to stave off the worst of these
attacks on the Web and try to defend freedom of expression and openness
as democratic values. And we must remember that there is no technological
certainty, and that perfectly good technologies can be stopped by legal
and political forces.
What might go right?
Not to end on an unhappy note, remember that it looks like this is all
going to work. Already in 1995, university audiences asked where they
got the answer to the last question they looked up usually say on a screen,
not on paper. Given the number and rates of progress of conversion projects,
it certainly looks like Bush's dream, as far
as the availability of information on screens, will be achieved in one lifetime.
What about his view of the new profession of trailblazer? There were
of course librarians in his day, although they had low status even then.
Remember that many people, with no idea of the job classifications
that exist in libraries, still think that a librarian is somebody who
puts books on shelves in order. Will, in a future world of online information,
the job of organizing information have higher status, whatever it is called?
I am optimistic about this, by analogy with accountancy. Once upon a time
accountants were thought of as people who were good at arithmetic. Nowadays
calculators and computers have made arithmetical skill irrelevant; does this
mean that accountants are unimportant? As we all know, the answer is the
reverse and financial types are more likely to run corporations than before.
So if computers make alphabetizing an irrelevant skill, this may well make
librarians or their successors more important than before. If we think of
information as a sea, the job of the librarian in the future will no longer
be to provide the water, but to navigate the ship.
References
References
[Bush 1945].
Vannevar Bush;
"As We May Think,"
Atlantic Monthly 176 (1) pp. 101-108 (1945).
On the Internet in http://ebbs.english.vt.edu/hthl/As_We_May_Think.htmla and several other sites.
[Cleverdon 1966].
C. W. Cleverdon, J. Mills, and E. M. Keen Factors Determining the Performance of Indexing Systems
ASLIB Cranfield Research Project.
[Cleverdon 1970].
Cyril Cleverdon;
"Progress in documentation, evaluation tests of information retrieval systems,"
J. Documentation 26 (1) pp. 55-67 (March 1970).
[Flickner 1995].
M. Flickner, H. Sawhneyi, W. Niblack, J. Ashley, Qian Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker;
"Query by image and video content: the QBIC system,"
IEEE Computer 28 (9) pp. 23-32 (Sept. 1995).
[Fox 1995].
Ed Fox;
"Special issue on digital libraries.,"
Communications of the ACM, New York (April 1995).
[Hill 1995].
Will Hill, Larry Stead, Mark Rosenstein, and George Furnas;
"Recommending and Evaluating Choices in a Virtual Community of Use,"
CHI 95 pp. 194-201, Denver, Colorado (1995).
[Kenney 1992].
Anne Kenney, and Lynne Personius Joint Study in Digital Preservation
Commission on Preservation and Access
(1992).
ISBN 1-887334-17-3. .
[Peat 1985].
F. David Peat Artificial Intelligence - How Machines Think
Baen Enterprises
(1985).
[Salton 1968].
G. Salton Automatic Information Organization and Retrieval
McGraw-Hill
(1968).
[Samuel 1964].
A. L. Samuel;
"The banishment of paperwork,"
New Scientist 21 (380) pp. 529-530 ((27 February 1964).
[Schank 1973].
R. C. Schank, and K. Colby Computer Models of Thought and Language
W. H. Freeman
(1973).
[Schuchman 1981].
Hedvah L. Schuchman Information transfer in engineeringReport 461-46-27
The Futures Group
(1981).
ISBN 0-9605196-0-2.
[Shakespeare 1599].
William Shakespeare, As You Like It,
Act 2, Scene 7, lines 143-166.
[Weaver 1955].
Warren Weaver; "Translation,"
pages 15-27 in Machine Translation of Languages, eds. W. N. Locke and A. D. Booth, John Wiley, New York (1955). Reprint of 1949 memo.
Latest Revision: June 17, 1995
Copyright © 1995-2000
International Federation of Library Associations and Institutions
www.ifla.org
|
|