
Ten Things You Should Know About Document Indexing

Document indexing is what makes fast document retrieval possible. As you may have noticed, Internet search engines retrieve documents relevant to your specific query from among billions of documents on the Web in less than a second. This would be impossible if they had to scan all those billions of documents in response to each query.

1. Search engines use what is called an inverted index (or inverted list), which lists the documents that contain each word rather than the words in each document. In response to a query, the engine looks up the query words in its index and returns the documents listed against those words.

2. Typically there will be hundreds of documents, if not thousands, against each word. It then becomes necessary to rank the documents in order of relevance to the query. Relevance is determined by using certain rules set by the engine, and typically involves more than the density of the particular query words in each document.

3. The major search engines do what is known as full-text indexing, i.e. they process every word in the document’s content and list the document against each of these words (except perhaps very common words such as ‘the’, often called stop words).

4. Not all indexing is full-text indexing. Full-text indexes tend to be huge and require a lot of storage space of their own. Indexing by document meta tags takes up much less space. The meta tags provide information about the document that helps retrieve it. For example, a brief note about the content of the document, its date of creation/modification and the author name might be attached as meta tags with each document.

5. Meta tag indexing requires that users have an idea of what the tags contain so that they can query using these values. This is typically achieved by having standard practices for describing document contents and naming documents. Often, drop-down selection boxes of such descriptions and names are used for manually tagging the document so that different users will use the same terms for similar documents.

6. Indexing is mainly used with unstructured documents, such as correspondence, reports, articles and so on. Structured documents such as transaction records are typically stored in databases, and have unique IDs for each document. Database queries can then bring up the right document in little time (instead of the many documents brought up by search queries).

7. Computer systems typically add certain meta information automatically to each document they create or modify. The date of creation and document author name are examples of such automatically added data. Other data such as document content description can be manually added by the user, or added using such devices as standard-description barcode cards.

8. Indexing can be specialized, as when scientific documents are indexed using scientific notation rather than standard words. The key issue is ease of subsequent retrieval. People searching for scientific documents, for example, will typically find it easier to retrieve them using the specialized notation.

9. When paper documents are scanned into digital images, they cannot be indexed as they are, because an image contains no searchable text. Instead, the images need to be processed further with tools such as OCR (Optical Character Recognition) software, which converts the images of text characters into standard, machine-readable ASCII or Unicode characters.

10. Indexing is not the only way to make documents easy to retrieve later. A hierarchical directory structure with meaningfully named folders and subfolders, and proper classification of documents and their storage in relevant subfolders, can enable quick browsing to the correct folder and retrieval. Where necessary, this can be combined with folder-level indexing and search.
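
To make points 1 to 3 concrete, here is a minimal Python sketch of a full-text inverted index with a naive ranking step. It is only an illustration under stated assumptions: the stop-word list, the sample documents and the function names (tokenize, build_index, search) are all hypothetical, and the ranking uses nothing but query-word density, which, as point 2 notes, is only one of the signals a real engine would use.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real engines choose their own.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: word -> {doc_id: occurrences}.

    Also returns the number of indexed words per document,
    which the ranking step uses to compute density.
    """
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        words = tokenize(text)
        lengths[doc_id] = len(words)
        for word in words:
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index, lengths


def search(index, lengths, query):
    """Return ids of documents containing every query word,
    ordered by query-word density (highest first)."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    # Only documents listed against all query words are matches.
    matches = set(postings[0]).intersection(*postings[1:])

    def density(doc_id):
        return sum(p[doc_id] for p in postings) / lengths[doc_id]

    return sorted(matches, key=lambda d: (-density(d), d))


# Hypothetical sample documents.
DOCS = {
    "memo": "The invoice for the printer order is attached",
    "report": "Quarterly report on printer sales and printer repairs",
    "letter": "Thank you for the order",
}

index, lengths = build_index(DOCS)
print(search(index, lengths, "printer"))        # ['report', 'memo']
print(search(index, lengths, "printer order"))  # ['memo']
```

Note that the query never touches the document text itself: each lookup reads only the entries for the query words, which is why the approach scales to very large collections.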

Without the ability to index their thousands of documents using, say, a desktop search facility, businesses might find that retrieving unstructured documents is a tough, and often simply impossible, task. Indexing, whether full-text or meta tag based, changes the situation dramatically, making it possible to retrieve even a particular e-mail comparatively quickly. Indexing is thus a powerful business tool.
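
The meta tag based alternative can be sketched in a similar way. The Python example below is a rough illustration, not a description of any particular product: the MetaIndex class, the field names (type, author, created) and the DOCUMENT_TYPES vocabulary are assumptions made for the example. The fixed vocabulary plays the role of the drop-down selection box, so that different users tag similar documents with the same term.

```python
from collections import defaultdict
from datetime import date

# Hypothetical controlled vocabulary, standing in for a drop-down list.
DOCUMENT_TYPES = {"invoice", "contract", "sales letter"}


class MetaIndex:
    """Index documents by their meta tags instead of their full text."""

    def __init__(self):
        # field name -> tag value -> set of document ids
        self._by_field = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, doc_type, author, created):
        """Register a document under each of its meta tag values."""
        if doc_type not in DOCUMENT_TYPES:
            raise ValueError(f"unknown document type: {doc_type!r}")
        tags = {"type": doc_type, "author": author, "created": created}
        for field, value in tags.items():
            self._by_field[field][value].add(doc_id)

    def find(self, **criteria):
        """Return sorted ids of documents matching every field=value pair."""
        results = None
        for field, value in criteria.items():
            ids = self._by_field.get(field, {}).get(value, set())
            results = set(ids) if results is None else results & ids
        return sorted(results or [])


# Hypothetical sample data.
meta = MetaIndex()
meta.add("doc-001", "invoice", "A. Clerk", date(2009, 11, 2))
meta.add("doc-002", "invoice", "B. Manager", date(2009, 11, 10))
meta.add("doc-003", "contract", "A. Clerk", date(2009, 11, 10))

print(meta.find(type="invoice"))                     # ['doc-001', 'doc-002']
print(meta.find(type="invoice", author="A. Clerk"))  # ['doc-001']
```

Such an index stores only a handful of values per document, which is why it takes far less space than a full-text index, but it can only answer queries phrased in terms of the tags that were recorded.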

By: Manuel J. Montesino | 17/11/2009

Ademero, Inc. develops document archiving software. Based largely on user experience, the company's flagship product, Content Central™, is a browser-based document management software system created to provide businesses and other organizations with a convenient way to capture, retrieve, and manage information originating in hard copy or digital form. Access a live preview of this document management solution by visiting the Ademero web site.

Use of this web site constitutes acceptance of the Terms Of Use and Privacy Policy | User published content is licensed under a Creative Commons License.
Copyright © 2005-2008 Free Articles by ArticlesBase.com, All rights reserved.