Last Reviewed: February 6, 2009
Article: DTS0142
Applies to: dtSearch (all versions)
Contents
How can the size of an index be minimized?
What is the effect of index size on searching performance?
What is the effect of index size on indexing performance?
What is the effect of document type on index size?
dtSearch works by building an index of your documents. You can create as many indexes as you want, and you can search any or all of them at once by selecting the ones you want to include in the search.
An index is generally about 1/8 to 1/3 the size of the original documents. The ratio of the index size to the size of the original documents depends on the type of documents, the size of the documents, and certain indexing options (discussed below). The ratio is lower for word processing documents (which contain less text per kilobyte than plain text files) and for large files (which reduce the effect of the per-document overhead).
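As a rough planning aid, the 1/8 to 1/3 rule of thumb above can be turned into a quick estimate. The following Python sketch is illustrative only: the function name is hypothetical, and the real ratio for a given collection depends on the document types and indexing options discussed in this article.

```python
GIB = 1024 ** 3

def estimate_index_size(source_bytes):
    """Return (low, high) index size estimates in bytes.

    Uses the 1/8 to 1/3 rule of thumb; actual results vary with
    document type, document size, and indexing options.
    """
    if source_bytes < 0:
        raise ValueError("source_bytes must be non-negative")
    return source_bytes / 8, source_bytes / 3

# 24 GiB of documents -> roughly 3.0 to 8.0 GiB of index
low, high = estimate_index_size(24 * GIB)
print(f"{low / GIB:.1f} GiB to {high / GIB:.1f} GiB")
```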
A dtSearch version 7 index can hold over 1 terabyte of text. A dtSearch version 6 index can hold up to 4-8 gigabytes of documents. If you fill up an index, dtSearch will display an "Index is full" message and stop adding documents to the index. For more information on the version 7 index format, and on upgrading version 6 indexes, see http://www.dtsearch.com/index7.html.
Apart from the 1 terabyte limit on the total amount of text in an index, the maximum number of documents in a single index is 2 billion.
How can the size of an index be minimized?
The following are some option settings that can be used to reduce the size of a dtSearch index.
Binary Files
The binary files option setting controls the way dtSearch treats files in a format that it does not recognize as a document. There are three options: (1) index the files completely, (2) filter out only the text of the files, and (3) skip the files entirely.
Filtering or skipping binary files can greatly reduce index size and improve indexing speed. Indexing binary files completely can have a large effect on index size because binary files do not contain a normal mix of words. If treated as text, they produce a large number of unique, random text sequences, which bloat the word list portion of the index disproportionately.
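The word list effect can be illustrated with a small Python experiment. This is a sketch under stated assumptions, not dtSearch's tokenizer: a "word" is assumed to be any run of ASCII letters and digits, the "binary file" is seeded pseudo-random bytes, and the prose sample is a single repeated sentence of similar total size.

```python
import random
import re

# Assumption: a "word" is any run of ASCII letters or digits.
TOKEN = re.compile(rb"[A-Za-z0-9]+")

def distinct_tokens(data: bytes) -> int:
    """Count the distinct words a naive tokenizer would add to a word list."""
    return len(set(TOKEN.findall(data)))

rng = random.Random(0)  # fixed seed so the run is repeatable
binary_like = bytes(rng.randrange(256) for _ in range(50_000))
prose = b"the index stores each word once and lists the documents that contain it " * 700

print("distinct words in prose: ", distinct_tokens(prose))
print("distinct words in binary:", distinct_tokens(binary_like))
```

The prose sample contributes only a handful of distinct words no matter how often it repeats, while the random bytes keep producing sequences that have not been seen before.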
Filtering is also more effective at extracting text than simply indexing binary files entirely. This is because text may be present in a variety of formats (for example, blocks of Unicode text are often mixed with blocks of single-byte text), and the filtering algorithm can identify and decode these segments.
Exclude filters are another good way to minimize the number of binary files included in an index.
Title Size
If the documents being indexed are small, the per-document overhead in an index may be a relatively large portion of the total index size. For each document, dtSearch stores the filename and location, the modification date, size, and other properties, as well as a "title" which is usually the first 80 characters of text from the file. To reduce the size of the index, the title size can be changed to a smaller value. Additionally, reductions in the size of the filenames in the index, including the folder name, will save space.
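To get a feel for how much a shorter title can save, the arithmetic can be sketched in Python. The function below is hypothetical and gives an upper bound only: it assumes every document has at least 80 characters of text and that each stored title character occupies one byte, neither of which is stated by dtSearch.

```python
def title_bytes_saved(num_docs, old_title_chars=80, new_title_chars=40,
                      bytes_per_char=1):
    """Upper bound on bytes saved by shortening the stored title.

    Assumptions (not from dtSearch documentation): every document fills
    the full title length, and each character takes bytes_per_char bytes.
    """
    if new_title_chars > old_title_chars:
        raise ValueError("new title size must not exceed the old one")
    return num_docs * (old_title_chars - new_title_chars) * bytes_per_char

# One million small documents, title size cut from 80 to 40 characters
saved = title_bytes_saved(1_000_000)
print(f"up to {saved / 1_000_000:.0f} MB saved")
```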
Noise Words
For text files, a noise word list can reduce the size of an index by eliminating common words like "the" or "if". By default, dtSearch will index documents using a noise word list for the English language. The noise word list is a plain text file named noise.dat that can be edited using Notepad to add additional words.
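The effect of a noise word list can be seen in a toy inverted index. The Python sketch below is not how dtSearch stores its index; the `NOISE` set is an illustrative subset rather than the contents of noise.dat, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative subset only; not the contents of noise.dat.
NOISE = {"the", "if", "a", "of", "and", "is", "to", "in"}

def build_index(docs, noise=frozenset()):
    """Map each word to a list of (document number, word position) entries."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            if word not in noise:
                index[word].append((doc_id, pos))
    return index

def posting_count(index):
    """Total number of word occurrences stored in the index."""
    return sum(len(entries) for entries in index.values())

docs = ["the size of the index is small", "if the index is full stop"]
full = build_index(docs)
trimmed = build_index(docs, NOISE)
print(posting_count(full), "entries without a noise list")
print(posting_count(trimmed), "entries with a noise list")
```

Because noise words are the most frequent words in ordinary text, dropping them removes a large share of the stored occurrences while removing very few distinct words.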
Numbers
dtSearch has an indexing option to skip indexing numbers. If the documents being indexed contain many numbers, and if these numbers do not have to be searchable, this setting can reduce index size considerably.
Hyphens
Of the available settings for handling hyphens, treating hyphens as spaces produces the smallest index.
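One plausible reason is that splitting a hyphenated term yields words that are usually already in the word list, so no new entries are created. The Python sketch below illustrates this with a naive tokenizer. The mode names ("spaces", "keep", "ignore") are illustrative and are not dtSearch option names, and the tokenizer is an assumption, not dtSearch's.

```python
import re

def tokenize(text, hyphen_mode):
    """Split text into words; hyphen_mode names are illustrative only."""
    if hyphen_mode == "spaces":
        text = text.replace("-", " ")   # "first-class" -> "first", "class"
    elif hyphen_mode == "ignore":
        text = text.replace("-", "")    # "first-class" -> "firstclass"
    elif hyphen_mode != "keep":         # "keep": "first-class" stays whole
        raise ValueError(f"unknown hyphen_mode: {hyphen_mode!r}")
    return re.findall(r"[a-z0-9-]+", text.lower())

text = "first-class index; second-class index; first class"
distinct = {mode: len(set(tokenize(text, mode)))
            for mode in ("spaces", "keep", "ignore")}
print(distinct)
```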
What is the effect of index size on searching performance?
A dtSearch search essentially consists of two steps: (1) looking up the words in the search request, and (2) enumerating the documents that match that request.
The word lookup step is usually very quick and takes a small fraction of the total time required for the search. A wildcard at the beginning of a search term (for example, a search for "*abc") can be slow because dtSearch uses letters at the start of a word to implement fast word searches, and dtSearch cannot do this if the start of the search term is unknown.
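The difference between the two cases can be sketched with a sorted word list in Python. This is a stand-in for illustration only: the sorted list and binary search are assumptions, not dtSearch's actual word lookup structure, and the function names are hypothetical.

```python
import bisect

def prefix_matches(sorted_words, prefix):
    """Find words starting with prefix by binary search on the sorted list.

    Only the matching range is touched, so the cost grows slowly with
    the size of the word list.
    """
    lo = bisect.bisect_left(sorted_words, prefix)
    # "\uffff" sorts after ordinary characters, marking the end of the range.
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]

def suffix_matches(sorted_words, suffix):
    """Find words ending with suffix; sort order does not help, so every
    word in the list has to be examined."""
    return [w for w in sorted_words if w.endswith(suffix)]

words = sorted(["abc", "abcde", "cab", "drabc", "index", "xabc"])
print(prefix_matches(words, "abc"))   # like searching for abc*
print(suffix_matches(words, "abc"))   # like searching for *abc
```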
The time required for the second step, enumerating the documents, depends on the number of documents found rather than the size of the index. (For developers, using the dtsSearchDelayDocInfo flag can minimize the time required for this step, making searches that retrieve many files much faster. For more information, please see "Optimizing search performance with the dtSearch Engine".)
What is the effect of index size on indexing performance?
The dtSearch indexer is designed to operate best when indexing large volumes of text at once. Therefore, it is preferable to index data in batches that are as large as possible. Indexing in small batches makes each update slower and also results in a much more fragmented index structure.
What is the effect of document type on index size?
The size of the index as a fraction of the original document size depends on how much text the document contains per kilobyte of data. The more text the document contains, the larger the index. For example, if you index a 20k Microsoft Word document and a 20k text file, the 20k text file will add much more data to the index than the Word document. A 20k Word document will consist mostly of formatting information, leaving only 10k or less of text, so it adds less than half as much data to the index as the text file.
In some cases, it is even possible for the index to be larger than the original documents. This can happen if the documents are in a compressed format, such as ZIP archives or PDF files (PDF files store text in a compressed stream).
Indexes of database files, such as MDB (Microsoft Access) or DBF (XBase) files, are also usually a large fraction of the original document size. dtSearch indexes each record of a database file as a separate document. As a result, the per-document overhead in the index becomes a much larger factor in the index size.