Last Reviewed: May 7, 2007

Article: DTS0111

 

Applies to: dtSearch Engine versions 6.x or 7.x

1.  Indexing SQL Data and other data sources

a.  Sample code and documentation

b.  Incremental updates

c.  Binary documents in fields (BLOB data)

d.  Selecting which columns and fields to index

e.  Highlighting hits

f.  Third-party database indexing tools

2.  Indexing desktop databases

1.  Indexing SQL Data and other data sources

Sample code and documentation

This dtSearch Engine has an API that can index any data you can access from .NET, Visual Basic, Java, or C++. A sample database indexing application included with the dtSearch Engine uses Active Data Objects (ADO.NET) to index any database that can be accessed through ADO.  The same approach will work with other types of non-file data.Source code for the database indexing sample application is installed to these locations:

C:\Program Files\dtSearch Developer\examples\vb.net2\ado_demo (VB.NET 2.0)
C:\Program Files\dtSearch Developer\examples\cs2\ado.net (C# 2.0)
C:\Program Files\dtSearch Developer\examples\vb.net\ado_demo (VB.NET)
C:\Program Files\dtSearch Developer\examples\cs\ado.net (C#)
C:\Program Files\dtSearch Developer\examples\vbasic\ado_demo (Visual Basic 6)

The API's used to implement this feature are documented here:

.NET 2.0: DataSource class, documented in dtSearchNetApi2.chm

.NET 1.0: DataSource class, documented in dtSearchNetApi.chm

C/C++: dtsDataSource in the dtSearch Engine help file.

COM: Indexing COM Data Sources in the dtSearch Engine help file.

Java: DataSource2 class

The ado_demo sample application uses the database schema to determine the tables and fields that are present in the database, and then iterates over every row of every table, indexing the entire database. The source code can be modified to exclude certain tables or fields from the index.

The basic mechanism used to implement data source indexing is the same in all of the APIs:

1.  A data source object is created that iterates over the content to be indexed.  The object implements a "GetNextDoc" method that returns the next document to be indexed.

2.  This object is attached to the IndexJob, and when the IndexJob is executed, the dtSearch Engine calls the GetNextDoc method repeatedly to get data to be added to the index.

3.  For each item returned by GetNextDoc, the data source can return plain text, fielded data, and/or a BLOB containing binary document data (such as a PDF file).  The data source can choose whatever name it wants for the row, as long as the name has the form of a valid Win32 filename. The name will be returned as the document filename in search results.  

Usually the name includes a row ID for the row, making it easier to retrieve a row from the database using the filename returned in search results.

 Incremental Updates

An index of a database that was created using the data source API can be updated incrementally.  To do this,

1.  Set up the IndexJob exactly as it was set up when the index was initially created, but set all of the Action flags (ActionCreate, ActionRemoveDeleted, etc.) to false), except the ActionAdd flag, which must be set to True.  If ActionCreate is true, the index will be cleared at the start of the index update.

2.  Set up your data source so it will only iterate over database rows that are either (a) new, or (b) have been modified since they were last indexed.  If the data source iterates over the whole database, dtSearch will still only index new or modified rows (it checks the modification date and size of each item to see if it has changed before reindexing).   However, incremental updates will be much faster if you just pass dtSearch the rows that need to be indexed.

To support incremental updates, it is often helpful to have a boolean "NeedsIndexing" field in the database indicating that a row requires reindexing.  The field can be set to true when a row is added or modified, and false when a row is indexed.  To perform an incremental update, the data source would select only those rows where NeedsIndexing=true.

Binary documents in fields (BLOB data)

If data in a column consists of binary files such as Word documents or Excel spreadsheets, you can index the contents of these columns along with other fields.  The data source API provides a way to combine (field, value) pairs from other columns of the same row with a binary document such as a Word file into a single logical document for indexing and searching purposes.  dtSearch will index the document just as it would if found on disk, including any fields in the document itself (such as Document Summary Information fields).  For more information on how this is done, see the article, "How to add fields to documents during indexing."

Highlighting hits

To highlight hits in a retrieved document, use the FileConverter object to convert the document to HTML, RTF, or text, with hit highlight markings around the hits.  For documentation on FileConverter, see:

Highlighting Hits - Overview in the dtSearch Engine help file, dtSearchApiRef.chm.

.NET: FileConverter class, documented in dtSearchNetApi.chm

C/C++: DFileConvertJob in dtSearchApiRef.chm.

COM: FileConverter in dtSearchApiRef.chm.

Java:  FileConverter in dtSearchApiRef.chm.

When data has been indexed using a data source, there is often no disk file to use as input so the original document must be reconstructed.  There are two ways this can be done:
(1) Set up the FileConverter with the same data that was supplied to the IndexJob in the DataSource object.  For example, in the .NET API, set FileConverter.InputText, InputFields, and InputBytes to the values returned through the data source as DocText, DocFields, and DocBytes.  

(2) Using the version 7 index format, you can set up the index so it will cache each document as it is indexed.  Then instead of reconstructing the document to highlight hits, you can set FileConverter.Flags = dtsConvertGetFromCache, so the cached copy of the document will be used to highlight hits.

Selecting which tables and columns to index

The ado_demo samples all implement a SampleDataSource class that traverses the entire database, using schema information to get the list of tables and columns in the database.   To modify the sample code so it indexes only certain tables or only certain fields, modify the SampleDataSource class, changing the code that gets the list of tables and columns from the schema to instead use a specific list of tables and columns.

When indexing very large tables, your indexing application may run out of memory if you attempt to select the entire table at once.  If this occurs, you can modify your data source implementation to select batches of rows (i.e., 1-100,000, 100,001-200,000, etc.) to index.

Third-party database-indexing tools

Cybergroup's dbConnector is available from Cybergroup and can be used to index any ODBC-compliant database using a GUI.  For more information, see:
 http://www.dtsearch.com/CS_CybergroupdbConnec.html
 http://www.cybergroup.com/

 2. Indexing desktop databases with the dtSearch Engine and ODBC

dtSearch indexes, searches and displays Access databases using Microsoft ODBC drivers. If the ODBC drivers are installed, dtSearch will index a database file as a series of records. The name of each record includes the filename of the database, the name of the table, and a record identifier. If instead dtSearch indexes a database as a single text file, you probably do not have the ODBC drivers installed.

Database indexing works best when the ODBC drivers are able to use a primary key to access records in the database. A primary key is a field that is unique for each record and that is designated as a primary key in the database.

ODBC indexing of databases is implemented in an external file parser DLL, dtv_odbc.dll, that is included with dtSearch and the dtSearch Engine.  The dtv_odbc.dll file must be located in a subfolder named "viewers" under the dtSearch Engine "Home" directory.  For more information, see the dtSearch Engine help topic, Options and the Home and Private Directories.

dtSearch can also index some database formats directly, without the need for OBDC drivers.   These include:  XML, CSV (comma-separated values), and XBase (FoxPro, dBASE, and other .DBF-compatible formats).

Additional Information

For more information about indexing databases with dtSearch, please see the following topics:

Field Searching in the dtSearch Text Retrieval Engine Programmer's Reference (dtSearchApiRef.chm)

Indexing Databases in the dtSearch Text Retrieval Engine Programmer's Reference (dtSearchApiRef.chm)

How to get field data in search results

How to add fields to documents

Troubleshooting database indexing

How to use dtSearch Web with dynamically-generated content