Indexing Digital (Electronic) Documents -- It's Not an Option; Pay Now or Pay (More) Later

Abstract

Conversion from paper-based filing to an electronic document management system (EDMS) requires significant planning. Indexing digital documents is not optional. This paper distinguishes between field-based and full-text indexing and recommends a combination of the two. Tangible and intangible organizational benefits of indexing digital documents are outlined. The various costs associated with indexing are detailed, and specific price information from service bureaus is presented. Recommendations for choosing an EDMS are included, as well as a model for assessing the organization's indexing needs.

Introduction

Organizations have traditionally relied on paper filing systems for document storage and retrieval. However, paper records are extremely difficult to manage because they have to be stored in and retrieved from only one place. Electronic document management systems (EDMSs) solve many of the storage and retrieval problems inherent in paper filing systems while simultaneously reducing business costs. EDMSs manage storage and retrieval of many different types of digital documents, including word processing files, spreadsheets, database files, e-mail, voice mail, scanned images, and Internet/intranet HTML documents. While EDMSs provide much faster access to and retrieval of documents (which is a financial benefit in itself), the mere availability of a new technology does not justify its acquisition. The real measure of value "should not be how much faster you are able to respond to a situation with new technology, but rather what value is added to the business process through faster response" (Koulopoulos, 1995). Effective indexing can add value to the organization far beyond mere speed of retrieval by enabling users to retrieve documents in many different ways.
Think of business records as part of a hierarchy of "containers" which include Folder, Section, Document, and Page. A folder can have many sections, sections can contain many documents, and documents can consist of many pages. Yet traditional paper-based filing systems require users to retrieve all information at the "Folder" level of the hierarchy. By contrast, EDMSs allow information to be retrieved at many levels. This retrieval is built on indexing, the bedrock of EDMSs. The accurate and consistent indexing of digital records is absolutely critical to the success of the organization. So what do you need to know about indexing to increase your document retrieval efficiency and save money? There are many factors which affect indexing needs. First, an understanding of the two basic types of indexing is needed.

Types of Indexing

Indexing can be field-based, full-text, or a combination of the two. Index field data make unique identification of documents possible. For example, the United States Department of Defense is considering the use of a pair of index fields as a unique identifier: creation date/time and creator ID. Adding other indexing fields provides additional, controlled ways to access individual records or groups of similar records. Retrieval from index fields is consistent and accurate because it is based on a controlled search vocabulary. Ideally, field indexing is performed at the point when business documents are created. Some field indexing can be done automatically (more on this in the costs analysis), but human indexers are also required.

Full-text indexes are created automatically. Computer software reads every word of every document in a database and creates an inverted index of words and their locations in the database. End-users can search the database using any words they want to (this is called "natural language"); the computer will find every match between the search term(s) and the text of the documents.
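The full-text indexing described above can be illustrated with a minimal sketch in Python. It is not taken from any particular EDMS: the stop-word list, the function names, and the three sample documents are all assumptions made for illustration, and a production system would handle stemming, phrases, and far larger collections.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each non-stop word to {document ID: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every query word."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


# Illustrative sample documents (invented for this sketch).
documents = {
    "doc1": "Invoice for the scanning of paper records",
    "doc2": "Bill for scanning services",
    "doc3": "Retention schedule for invoice records",
}
index = build_inverted_index(documents)
```

With this sample data, a search for "scanning" finds doc1 and doc2, while a search for "invoice" finds doc1 and doc3 and misses doc2, which calls the same kind of document a "bill".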
Full-text searching makes it easy to locate documents when users are not exactly sure what they need, but it also finds a high number of irrelevant items (for example, Internet search engines are based on full-text indexes). The organization pays for time employees spend browsing through irrelevant documents (or "misses") to find the relevant ones ("hits"). In the interest of quick and accurate retrieval, some field-based indexing is recommended. Indexing digital documents exclusively with full-text indexes is not recommended. All organizations benefit from some combination of field-based and full-text indexing, but determining what particular combination is most beneficial to a given organization is a very complicated process.

Before you choose an EDMS to manage your digital documents, your indexing needs should be weighed against the benefits and costs of indexing. Indexing is not an option with EDMSs: the documents have to be indexed in some way. Different EDMSs offer different types of indexing, and the organization should be aware of their capabilities. Organizations have different indexing needs because their documents and their users vary. This article details the benefits and costs of indexing digital documents and includes a model for assessing the indexing needs of the organization.

Organizational Benefits of Indexing

Indexing digital documents produces both tangible and intangible benefits to the organization. Tangible benefits include financial, legal, employee, and value-added benefits. Intangible benefits include less concrete measures of success, such as improved perception of the organization by both employees and customers. Combined tangible and intangible benefits result in financial gain for the organization through increased employee productivity, customer service, and competitive advantage in the marketplace.

Financial Benefits:

Increased production.
The speed of many routine office procedures (such as production of statistical reports, records management tasks, access to and retrieval of digital documents, etc.) is increased.
Decreased future staff requirements. Increases in production can be handled by current staff.
Increased access to current information. Quick and accurate updates of indexes throughout the organization decrease information retrieval time and increase the accuracy of information.
Improved customer service. Prompt, accurate information retrieval increases repeat revenue for the organization.
Decreases in human filing mistakes. Large legal practices often spend 8 or more hours to locate misfiled documents (Socha, 1996).

CASE STUDY

A legal firm using an image management system found that their cases could be handled by 2.5 fewer temporary full-time clerks than before they implemented the system. With the previous paper-based system, clerks spent large amounts of time retrieving documents identified in database searches, photocopying the documents, delivering the copies to attorneys and legal assistants, and refiling the originals. The clerks also spent considerable time searching for misfiled originals. 2.5 clerks earning $14/hour for 160 hours/month over 14 months would have cost the firm $78,400 (Socha, 1996).

Legal Benefits:

Litigation protection. In a lawsuit, records need to be produced very quickly. An indexing system that can identify and retrieve documents needed for litigation can pay for itself if a single multi-million dollar lawsuit is avoided.
Response to Rule 26. A new law requires parties involved in a federal lawsuit to identify and produce relevant records within 85 days of the beginning of the litigation (Skupsky, 1995). Quick and accurate retrieval of records is required.
Records retention compliance. Federal, state and local governments regulate record retention periods for organizations. There are over 10,000 federal recordkeeping laws alone (Skupsky, 1989).
Good indexing systems include indexing fields related to retention (such as creation date, retention period, and disposition date).

CASE STUDY

As a result of Rule 26, courts will probably require each party involved in a lawsuit to make a full disclosure of their records in the early stages of the case. Sanctions will follow for parties which fail to produce relevant information. Disorganization of records will not excuse parties from compliance. For instance, in United States v. ABC Sales & Service, the court concluded that "a business that generates millions of files cannot frustrate discovery by creating an inadequate filing system so that individual files cannot readily be located" (Skupsky, 1995).

Employee Benefits:

Currency of business information. New documents can be added to the indexing system quickly, and if documents are indexed when they are created, all users can access them immediately. Employees can do their jobs better.
Document version control. Indexing digital documents makes it possible to control which version of a document users can access. Employees don't waste time working on outdated documents, or updating a version that's already been revised.
Remote access. An organization-wide standard indexing language allows authorized users to retrieve documents from anywhere in the world. Employees don't have to take their whole office with them when they travel.
Simultaneous access. Employees can share a document if it is indexed properly and retrieved from a computer network. The "file folder" is never missing from the file cabinet. Hard copy production and distribution are also eliminated.
Decreased training time. New employees become quickly and fully productive in the organization.

CASE STUDY

When the U.S. Patent and Trademark Office (PTO) implemented a new imaging system, its most noticeable benefits involved customer service and employee training.
The PTO Commissioner said that new patent examiners learned the business much faster because of the indexing system. The old manual indexing system required about 12 years to master; new examiners trained on the imaging system were up to speed in just a few months (Koulopoulos, 1995).

Value-Added Benefits:

Customer service improvements. Organizations that provide high levels of service will gain customer loyalty and increase business.
Competitive advantage. Organizations that can retrieve information quickly and accurately will be able to accomplish more during the work week. Time is money, and indexing saves time.
Perceived excellence. Companies that project an image of excellence will attract more clients and better employees.

CASE STUDY

Pharmaceutical giant Glaxo implemented an EDMS and saved over $1 million per year associated with search and retrieval time. However, financial benefits were not the most valuable benefits realized. Each New Drug Application process requires about 50,000 pages of data preparation and documentation; the EDMS and its indexing system allowed Glaxo to prepare this documentation and receive clearance from the Food and Drug Administration much more quickly than before. Thus, EDMS implementation enabled Glaxo to collapse their business cycle and get their product to market sooner than their competitors (Perkins, 19??).

Costs of Indexing Digital Documents

How much will it cost to index your digital documents? One vendor quickly replied, "How much do you have?" But that answer is neither realistic nor helpful. Companies contemplating development of an indexing system for digital documents want to spend as little as possible to obtain the retrieval system they need to conduct business. More specifically, they want a system that provides quick and accurate access to frequently-retrieved information and reliable (but not necessarily fast) access to infrequently-retrieved information.
Because the types of business documents which meet these criteria in different organizations vary so widely, it is obvious that there is no one "best" indexing scheme. One size will never fit all. Therefore, indexing costs will be detailed in two ways: 1) factors that affect the cost of indexing, and 2) cost information reported in published studies (see Table 1).

Factors That Affect the Cost of Indexing:

One of the first decisions which must be made is whether documents not currently in digital form will be converted. A paper or microfilm document is converted to digital format by scanning it into a computer; OCR/ICR (optical character recognition/intelligent character recognition) software may then be used to convert the document to ASCII text (Thiel, 1992). Documents can be indexed before or after they are scanned. Spencer (1996) estimates that the true cost of batch scanning 10,000 documents is about $.09/page before indexing costs are included. Thus, undertaking a large document conversion project can be costly. DocuCon, a full-service document conversion firm, comments that at least 20% of the documents to be scanned will require special handling (because of size or condition) and that rated equipment speeds are not reliable guides to how long jobs will actually take; special conditions like these further increase the cost of document conversion (Cullen, 1991).

Other factors which affect the costs of indexing include the cost of keying index field data, technological costs, retrieval costs, and costs of updating. Manual field indexing of digital documents can be performed when the documents are created or when they are stored. For example, electronic document processing systems often require that employees who produce letters and reports using word processing/spreadsheet software fill some index fields when the document is saved.
Although the time required to index a single word-processed document is small, the individuals who do this indexing may be highly paid, which increases the overall cost of indexing digital documents. The most variable (and often the highest) cost associated with indexing is labor.

Indexing cost can be minimized by searching for ways to fill index fields from information already contained in existing corporate databases. If manual entry of a customer number allows the system to automatically access name, address, or zip code, a great deal of manual keying time may be eliminated (Devlin, 1996). Barcoding is a new and cost-effective way to quickly and accurately identify batches of document types or individual documents (Spencer, 1994). For example, if a type of business form is preprinted with a bar code that identifies what type of document it is, the EDMS can automatically populate the "document type" indexing field when the document is scanned and OCRed. No one has to key the document type, which decreases cost.

The number of index fields used to identify a particular document is a significant cost factor, especially when indexing is performed manually. A study of indexing projects showed that the average number of index fields is 8-12 (Cisco, 1993). However, an ANSI Technical Report prepared by the Association for Information and Image Management International suggests 50 possible index fields which might be used with electronic image management systems (AIIM, 1995). If the average field contains 12-20 characters, the cost of manually keying each additional field must be considered.

Sometimes the cost of indexing documents can be reduced or eliminated by using full-text retrieval systems which create an additional file (usually called an inverted file) in which each non-trivial word is listed with a locator key (Thiel, 1992).
Full-text retrieval systems also allow users to construct search queries in their own words, rather than having to conform to the constraints of pre-selected terms (Fidel, 1994). However, full-text systems often return an unacceptably low number of relevant documents, fewer than 20% in one study (Blair & Maron, 1985). Some organizations will be unable to afford the cost of not finding relevant documents every time they look for them.

Technological Costs

Although most organizations are already computerized and the cost of adding computer capability and memory storage is becoming increasingly economical, there still remain technological cost implications in choosing indexing systems. The size of the index itself must be considered. Inverted files (used by full-text retrieval systems) are often very large, sometimes requiring more storage space than the documents which they index (Thiel, 1992). Timely document retrieval may require faster processing speeds than the organization presently supports. And if documents are being shared by many users, local area networks may have to be installed.

The cost of data migration (which includes index migration) must also be considered. Organizations should appoint an information management professional to administer data migration and indexing so that documents remain accessible as technological change occurs. Many organizations already own systems that contain non-standard or proprietary software which makes integration and migration difficult. Planning for future technological change now will save costs later.

Retrieval Costs

If minimizing the costs of indexing documents ultimately increases the cost of retrieval, it may be false economy. Kind and Eppendahl (1992) suggest a number of questions which must be asked about document retrieval, including who performs searches, how frequently items are needed, how long each search takes, how quickly the information must be made available, and how often a needed document cannot be found.
Answers to such questions have cost implications which must be considered when designing an indexing system. For example, an inexpensive indexing system will require more search and retrieval time than a more expensive one. Can you afford to have your highly-paid employees spend time searching for and retrieving documents? If you don't invest in the indexing system, you will pay for it (and pay more for it) in retrieval.

Another retrieval cost involves training employees to use the system. The more complicated the indexing scheme, the more time and training will be required before users feel comfortable and confident about their ability to access the information they need.

Cost of Updating

Two different kinds of updating costs must be considered. First is updating the documents in the system. If most documents exist in only one version, it may be economically feasible to simply start indexing over each time a document is revised, essentially giving it a new identity. However, if documents are frequently revised or modified, the organization may need to identify the most recent or official version of a document. Additional indexing fields may be needed to ensure that multiple users all have access to the latest version.

Second, the index itself must be kept current and updated. Griffiths and King (1993) surveyed 16 organizations and suggest that direct costs of an "index maintenance" project average $.29 per document (the project included creation and addition of new terms, removal of obsolete terms, and authority and location control work). Index maintenance may cost more than the original cost of indexing documents. Time and effort spent on initial index design may eliminate costly projects to correct or update after the system is in place.

Cost of Indexing

Table 1 shows examples of costs and ranges found in published studies of indexing projects.
Koulopoulos (1995) reports that the time spent designing a typical system is divided among field identification and data standardization (20%), data entry (20%), and system correction and fine-tuning (60%). Initial purchase of digital imaging systems with capacity to process and store 300,000 to 3 million pages per year costs $.15 to $.25 per page, depending on use. Costs reported by companies indexing their documents in-house range from $.12 to $.20 per page (Cisco, 1993). Typical service bureau charges currently range from $.15 to $.30 per page for scanning and indexing (it is not clear how many index fields would be included).

Conclusion

So how few index fields can your organization get by on? You need at least two fields to ensure data retrieval: one uniquely identifies each document, and another provides an alternate pathway in case the first one fails. W. Wiggins of DocuCon recommends indexing a unique identifier and the document type for each document (personal communication, August 3, 1996). You need additional fields to manage records retention and disposal. You also need to index processing information about the software and hardware used to create each document so that data can be properly migrated when necessary. The Association for Information and Image Management (AIIM) identifies 30 possible processing information fields and 20 possible retrieval information fields (1995). The United States Department of Defense uses 22 records management fields to index their documents (Prescott, Underwood, & Kindl, 1995). Answering the questions in "Taking Stock of Your Company's Indexing Needs: Full-Text, Field or a Combination?" will help you identify what sort of data needs to be stored in index fields. Obviously, we cannot recommend a minimum number of indexing fields needed to effectively retrieve business documents. Each organization has unique requirements that should be thoroughly studied before implementing an indexing system.
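As a rough illustration of the field groups named in the conclusion (a unique identifier, the document type as an alternate pathway, retention fields, and processing information), here is a minimal Python sketch. The class name, field names, the whole-year retention period, and the lookup function are all hypothetical choices made for this example; they are not drawn from AIIM, the Department of Defense, or any vendor's schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IndexRecord:
    # Minimum pair: a unique identifier plus an alternate pathway.
    document_id: str
    document_type: str
    # Retention and disposal fields (retention in whole years is an assumption).
    creation_date: date
    retention_years: int
    # Processing information kept so the data can be migrated later.
    creating_software: str = "unknown"
    creating_hardware: str = "unknown"

    def disposition_date(self):
        """Return the creation date plus the retention period."""
        target_year = self.creation_date.year + self.retention_years
        try:
            return self.creation_date.replace(year=target_year)
        except ValueError:
            # Created on 29 February and the target year is not a leap year.
            return self.creation_date.replace(year=target_year, day=28)


def find(records, document_id=None, document_type=None):
    """Look up by unique ID first; fall back to the document type."""
    if document_id is not None:
        hits = [r for r in records if r.document_id == document_id]
        if hits:
            return hits
    if document_type is not None:
        return [r for r in records if r.document_type == document_type]
    return []


# Invented sample records for illustration only.
records = [
    IndexRecord("A-0001", "invoice", date(1996, 8, 3), 7, "word processor"),
    IndexRecord("A-0002", "contract", date(1996, 2, 29), 7),
]
```

If a user mistypes the identifier, `find(records, document_id="A-9999", document_type="contract")` still reaches the contract through the second field, which is the "alternate pathway" the conclusion asks for.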
____________________________________________________________________________

Document Demographics

What requirements do your documents fulfill?
Business purposes (to make payroll, pay bills, write reports, serve customers)
Legal purposes (to prepare for litigation, audits, regulatory reporting)
Records management purposes (to manage retention, disposition, vital records protection)
Archival purposes (to conduct longitudinal studies, genealogical research)
What is the condition of the documents and the information contained on the documents? (Are the documents legible enough for more than 90% to be OCRed? Are the documents brittle, torn, stained, or skewed?)
Do you have documents that are created electronically? (Example: word-processed documents, IRS income tax returns submitted electronically)
What indexing data are already available in existing corporate databases? How accurate, complete, and consistent are the available data?

____________________________________________________________________________

User Demographics

Who are the primary users? How do they ask for the documents?
Who are the secondary users (such as outside auditors, strategic business partners, customers)? How do they ask for the documents?

____________________________________________________________________________

Benchmarking

How do other organizations in your industry index documents?
How do other organizations outside your industry index documents?
Are there industry standards for indexing?

____________________________________________________________________________

RECOMMENDATIONS

The purpose of indexing is retrieval. If documents cannot be found, they may as well not exist.
Think about ways people in different parts of the organization might want to retrieve the same information. You may have secondary users with different indexing requirements than the primary users.
It is better to purchase and implement an indexing system that fits your users than to try to change your users to fit the system.
Make sure that your users are trained to use the indexing system that you have implemented. Ask your vendor whether the price of their system includes the cost of training your employees.
More is not necessarily better. Six to ten index fields may be as useful to your users as fifty, at much lower cost. However, one or two extra fields may be worth the investment if they help avoid litigation or simplify compliance with regulations. Every organization has different and unique indexing needs.
Think about how to use information you already have in digital format to avoid additional indexing and re-keying. You may be able to automatically populate some index fields.
Use a full-text index in addition to indexing fields to index documents that are created electronically (such as word-processed documents, online application forms, etc.). To further reduce costs, consider introducing more digital document creation in the future.
Include controlled vocabulary indexing fields to standardize indexing terminology. You don't want different departments calling the same kind of document different things (such as "bill," "voucher," and "invoice").
Work with a reliable vendor who uses a non-proprietary programming language.

COMPLEXITY OF INDEXING NEEDS

How many indexing fields do you need?

                                  ... each document fulfills       ... each document fulfills many
                                  few organizational               different organizational
                                  requirements, then:              requirements, then:

If users ask for documents        Few indexing fields              Medium number of indexing
in similar ways, and ...          are needed                       fields is needed

If users ask for documents        Medium number of indexing        Many indexing fields
in different ways, and ...        fields is needed                 are needed

Glossary

ANSI (American National Standards Institute): an institution which develops and publishes standards for use within the United States.
Automated Indexing: computerized indexing which doesn't require human decision-making or data entry. Automatic indexing software populates index fields by reading information from bar codes or scanning digital documents which have undergone OCR conversion.
Barcode: a sequence of machine-readable lines of varying widths which contain data. Barcodes can be used to facilitate automatic indexing. For example, if standard business forms (such as invoices) are preprinted with a barcode which indicates that the form is an invoice, an indexing system can automatically populate the "document type" field after the paper form is scanned and OCRed. Barcodes also survive fax transmission intact.
Batch Processing: a technique by which items to be processed are collected into groups prior to processing.
Coding: See Indexing
Controlled Vocabulary: set of rules for choosing words and phrases to be used in an indexing system, along with the list of approved or allowed words to be used in the system.
Data Dictionary: organized collection of information about data. The data dictionary compiles data about data, or metadata. A data dictionary is an automatic component of most database management systems.
Data Element: a unit of data that is considered to be indivisible. Data elements are the building blocks for all data processing systems.
Examples: document type, creation date, disposition date, Social Security Number, etc.
Descriptor: See Field Data
Digital Document: document which exists in electronic form inside a computer system.
Distillation: the process of eliminating, summarizing, or in some other way reducing a body of information to its essential components.
Document: 1. any format which contains information. Documents may be word-processing files, e-mail messages, spreadsheets, database tables, voice mail or other audio recordings, faxes, business forms, images, information captured from the Internet, and so forth. Documents are sometimes called "records." 2. According to ANSI/AIIM TR40-1995, a collection of zero or more pages that are related, linked, or bound to each other in some way appropriate to the application. In an electronic image management system, the provision of a zero-page document allows the creation of a document entity prior to capturing and linking its page(s).
Document Classes: types of documents which require similar indexing fields. Examples of document classes: invoices, contracts, timesheets, e-mail messages, and so forth. Often called "document types."
Document Life Cycle: the period which includes creation, maintenance, use, and ultimate disposition (destruction) of a document. The records manager needs to know the life cycle of every document in the organization.
EDM (Electronic Data Management): application of technology to save paper, speed up communications, and increase the productivity of business processes.
EIM (Electronic Image Management): system which organizes information in all formats for use throughout its life cycle.
Field Data: the retrievable information which follows the field name. Example: for the field name document type, the field data might be invoice or a code which represents invoice.
The field data concept is associated with many terms, including indexing value, term, and structured or unstructured data.
Field Name: the name of the field where a specific kind of information is to be entered. Think of "field name" as a prompt for what kind of information is stored in the field. Field names must be decided on before any documents are indexed. The information stored in the field is called field data. Example: for the field name document type, the field data might be invoice. The "field name" concept is associated with many terms, including "index key," "key field," "fixed field," and "indexing field."
Fixed Field: See Field Name
Free Text Searching: See Full-Text Retrieval
Full-Text Indexing: indexing method in which the computer creates an alphabetical inverted index consisting of all words (except stop words) in the document along with pointers (locations) to locate the words in the document. Full-text indexes are inexpensive to create since humans are not needed to define field names or enter indexing values into those fields.
Full-Text Retrieval: a type of retrieval process that uses an inverted index to retrieve every document that contains the word or words in the search parameter. This type of searching requires a powerful search engine and is much slower than retrieval processes based on indexing values. It is also much less accurate because it is not based on standardized search terms. For instance, a search that retrieves all documents containing the word "invoice" will miss those which are designated as "bill" or "voucher." However, full-text retrieval systems initially are cheaper to implement because indexing costs are eliminated. Full-text retrieval is sometimes called "free text searching" or "fuzzy searching." Contrast with keyword retrieval.
Fuzzy Searching: See Full-Text Retrieval
Homonyms: words that are spelled the same but have different meanings. Computers don't recognize homonyms.
ICR (Intelligent Character Recognition): a form of OCR (optical character recognition) which uses sophisticated lexical tools. ICR is typically used to convert handwritten material to ASCII text.
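The glossary's contrast between full-text retrieval and a controlled vocabulary (the "invoice" / "bill" / "voucher" example) can be made concrete with a small Python sketch. The vocabulary table, the function names, and the three sample documents are invented for illustration and do not describe any specific product.

```python
# Hypothetical controlled vocabulary: each approved term lists the
# variants that indexers must replace with it.
CONTROLLED_VOCABULARY = {
    "invoice": {"invoice", "bill", "voucher"},
}


def to_approved_term(term):
    """Translate a free-form term into its approved indexing term."""
    lowered = term.strip().lower()
    for approved, variants in CONTROLLED_VOCABULARY.items():
        if lowered in variants:
            return approved
    raise ValueError(f"{term!r} is not in the controlled vocabulary")


# Invented sample documents; the document type is normalized at indexing time.
sample_documents = [
    {"id": "d1", "text": "Invoice 1001 for scanning",
     "document_type": to_approved_term("Invoice")},
    {"id": "d2", "text": "Bill for March services",
     "document_type": to_approved_term("bill")},
    {"id": "d3", "text": "Voucher 77 approved",
     "document_type": to_approved_term("Voucher")},
]


def full_text_search(docs, word):
    """Return IDs of documents whose text contains the exact word."""
    return [d["id"] for d in docs if word.lower() in d["text"].lower().split()]


def field_search(docs, document_type):
    """Return IDs of documents whose document type field matches."""
    approved = to_approved_term(document_type)
    return [d["id"] for d in docs if d["document_type"] == approved]
```

On this sample, a full-text search for "invoice" returns only d1, while a field search for "invoice" (or for "bill") returns all three documents, because every variant was mapped to one approved term when the documents were indexed.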