
Speed Tests of Applied Software Based on SoftInform Search Technology

Below are the results of SearchInform indexing speed tests, in particular for information stored in files of various formats.

Introduction

To cover all aspects of full-text search, several tests were run on data of various types and sizes. The most widely used data formats are TXT, HTML, DOC, RTF and PDF.

From the standpoint of search technology, tests on simple formats are the most representative: as a rule, data is stored in a DBMS or data warehouse and reaches the search system as plain text.

The measured results are indexing duration and index size. Note that any index produced by SearchInform takes about 50 MB, so for small amounts of text the index is still fairly large. With large data volumes the extra 50 MB is negligible.

The tests were run on a mid-range computer. Test configuration: AMD Athlon 2.2 GHz CPU; 2 GB DDR400 RAM; two 160 GB IDE hard drives (the data was stored on one drive and the index was created on the other).

Description of Volumes to be Indexed

To test indexing and search speed, we selected several data volumes of various sizes containing documents in various formats. The volumes are listed from smallest to largest, and each larger volume includes the smaller ones. For example, volume "21.85" includes volume "11.1", and so on.

Note: each volume is named after its size in gigabytes.

Volumes "11.1", "21.85", "41.17" and "83.22" consist of English-language patents in HTML format. The documents are physically stored in ZIP archives of 5,000 to 10,000 files each.

In addition to the HTML patents (83.22 GB), volume "132.26" contains the test volumes in DOC, RTF and PDF formats, as well as the texts of volume "10.7".

Indexing Speed Tests

The tests showed that SearchInform indexes about 3 to 4 times faster than comparable products. This document does not include our competitors' results; if you would like to see them, send a request to support@searchinform.com, and our experts will provide the relevant information.

Table 1

| Test volume | "11.1" | "21.85" | "41.17" | "83.22" | "132.26" |
| --- | --- | --- | --- | --- | --- |
| Size of documents | 11.1 GB | 21.85 GB | 41.17 GB | 83.22 GB | 132.26 GB |
| Documents total | 319,695 | 619,018 | 1,118,513 | 1,993,149 | 2,888,202 |
| Unique words | 2,527,473 | 4,016,495 | 6,157,339 | 11,276,270 | 18,912,257 |
| Plain text size | 7.92 GB | 15.5 GB | 28.97 GB | 59.42 GB | 77.57 GB |
| Index size | 1.76 GB | 3.29 GB | 6.03 GB | 12.12 GB | 16.29 GB |
| Indexing duration | 30 min 36 sec | 59 min 30 sec | 1 h 53 min | 3 h 56 min 15 sec | 6 h 06 min |
| Average indexing speed (GB per hour) | 21.76 | 21.99 | 21.72 | 21.14 | 21.68 |
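The average speed row can be checked by dividing the size of the documents by the indexing duration. The following is a minimal Python sketch of that arithmetic, not part of SearchInform; the helper names `duration_hours` and `throughput_gb_per_hour` are made up for this illustration, and the input figures are copied from Table 1.

```python
def duration_hours(hours=0, minutes=0, seconds=0):
    """Convert a duration given as h/min/sec to fractional hours."""
    return hours + minutes / 60 + seconds / 3600


def throughput_gb_per_hour(size_gb, hours):
    """Amount of source data indexed per hour."""
    return size_gb / hours


# (volume name, size of documents in GB, indexing duration in hours), from Table 1
TABLE_1 = [
    ("11.1", 11.1, duration_hours(minutes=30, seconds=36)),
    ("21.85", 21.85, duration_hours(minutes=59, seconds=30)),
    ("41.17", 41.17, duration_hours(hours=1, minutes=53)),
    ("83.22", 83.22, duration_hours(hours=3, minutes=56, seconds=15)),
    ("132.26", 132.26, duration_hours(hours=6, minutes=6)),
]

for name, size_gb, hours in TABLE_1:
    speed = throughput_gb_per_hour(size_gb, hours)
    print(f"{name}: {speed:.2f} GB/hour")
```

For volumes "11.1", "83.22" and "132.26" this reproduces the published 21.76, 21.14 and 21.68 exactly; for the other two volumes the result differs from the table by less than 1%, presumably because the published durations are rounded.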

Table 2

 

| Test volume | "10.7" | DOC | RTF | PDF |
| --- | --- | --- | --- | --- |
| Size of documents | 10.7 GB | 1.9 GB | 325 MB | 5.39 GB |
| Documents total | 48,222 | 7,791 | 769 | 526 |
| Unique words | 4,408,347 | 439,354 | 220,262 | 942,295 |
| Plain text size | 9.88 GB | 179 MB | 33.27 MB | 126 MB |
| Index size | 2.06 GB | 118 MB | 86.91 | 160 |
| Indexing duration | 32 min | 1 min 34 sec | 29 sec | 12 min 05 sec |
| Average indexing speed (GB per hour) | 20.06 | 72.7 | 39.4 | 26.8 |

Search Speed Tests

Testing Methods

A special program (PhraseGen) generated a query file from the documents of a volume on disk (HTML and DOC). Each line of the file has the following format:

N = A B C D etc,
where A, B, C, D are the words of a phrase randomly selected from the documents;
N is the number of "garbage" words between the words of the phrase.
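To make the line format concrete, here is a minimal Python sketch of how such a line could be parsed. It is an illustration only: the actual PhraseGen output and the SearchInform test module are not shown in this document, so the names `PhraseQuery` and `parse_phrase_line`, the whitespace handling, and the sample line are all assumptions.

```python
from typing import List, NamedTuple


class PhraseQuery(NamedTuple):
    gap: int          # N: number of "garbage" words between phrase words
    words: List[str]  # A, B, C, D, ...: the words of the phrase


def parse_phrase_line(line: str) -> PhraseQuery:
    """Parse one 'N = A B C D' line into a PhraseQuery."""
    left, sep, right = line.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in line: {line!r}")
    gap = int(left.strip())   # raises ValueError if N is not an integer
    words = right.split()     # assumption: words are whitespace-separated
    if not words:
        raise ValueError(f"no phrase words in line: {line!r}")
    return PhraseQuery(gap, words)


# Hypothetical sample line: a four-word phrase with 2 garbage words allowed
# between neighbouring words.
query = parse_phrase_line("2 = method for producing polymer")
print(query.gap, query.words)
```

Keeping the gap and the word list together in one record mirrors the two test modes described below: a word test can use only `words`, while a phrase test needs both fields.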

This format is read by a special SearchInform test module, which becomes available when the program is started with the /debug key. The test conditions were then set through the corresponding menu (Debug), and two types of tests were performed: search by words and search by phrase.

The tests were run with morphology taken into account and the number of results limited to 20,000. Each test consisted of 1,000 queries. Each test type was run with two kinds of queries: high-frequency words and low-frequency words.

Search by Words Test Results

In a production system the index is already in use, so its initial preparation costs no extra time. To bring the test as close to real conditions as possible, a search by low-frequency and high-frequency words was first run without generating a report, and only then was the actual test performed.

The search speed results (time spent processing 1,000 queries) are presented below:

| Volume | Low-frequency words | High-frequency words |
| --- | --- | --- |
| "11.1" | 97.875 seconds | 99.484 seconds |
| "21.85" | 149.516 seconds | 147.828 seconds |
| "41.17" | 238.844 seconds | 246.922 seconds |
| "83.22" | 365.5 seconds | 313.687 seconds |
| "132.26" | 508.062 seconds | 341.797 seconds |



The results show that in a search by words only, high-frequency and low-frequency words take roughly the same time on the smaller volumes, while on the two largest volumes the search by high-frequency words is noticeably faster. Search time also grows with the size of the volume, but less than proportionally.
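The less-than-proportional growth can be seen by converting the totals to time per query and comparing growth factors against the smallest volume. The sketch below is an illustration only; the dictionary `WORD_SEARCH` and the helper `ms_per_query` are names chosen for this example, and the figures are the search by words results reported above.

```python
QUERIES = 1000  # number of queries in each test run

# volume size in GB -> (low-frequency, high-frequency) total seconds
WORD_SEARCH = {
    11.1: (97.875, 99.484),
    21.85: (149.516, 147.828),
    41.17: (238.844, 246.922),
    83.22: (365.5, 313.687),
    132.26: (508.062, 341.797),
}


def ms_per_query(total_seconds, queries=QUERIES):
    """Average time per query in milliseconds."""
    return total_seconds / queries * 1000


base_size = min(WORD_SEARCH)
base_low, base_high = WORD_SEARCH[base_size]

for size, (low, high) in sorted(WORD_SEARCH.items()):
    print(
        f"{size:>7} GB  data x{size / base_size:5.2f}  "
        f"low {ms_per_query(low):6.1f} ms (x{low / base_low:.2f})  "
        f"high {ms_per_query(high):6.1f} ms (x{high / base_high:.2f})"
    )
```

On these figures the largest volume holds about 11.9 times more data than the smallest, while the total time grows about 5.2 times for low-frequency words and about 3.4 times for high-frequency words.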

Search by Phrases with Gaps Test Results

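| Volume | Low-frequency words | High-frequency words |
| --- | --- | --- |
| "11.1" | 444.734 seconds | 591.297 seconds |
| "21.85" | 765.515 seconds | 1,028.406 seconds |
| "41.17" | 1,282.219 seconds | 1,847.375 seconds |
| "83.22" | 2,270.047 seconds | 3,627.172 seconds |
| "132.26" | 2,697.906 seconds | 3,865.531 seconds |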



In this case (search by phrase), searching by high-frequency words is roughly 1.3 to 1.6 times slower than searching by low-frequency words. Search time grows with the size of the volume, but more slowly than proportionally.
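The slowdown factor for high-frequency words can be computed per volume directly from the phrase search results. This is a small illustrative Python sketch; `PHRASE_SEARCH` is a name chosen for this example, and the values are the totals reported above, read as seconds per 1,000 queries.

```python
# volume name -> (low-frequency, high-frequency) total seconds per 1,000 queries
PHRASE_SEARCH = {
    "11.1": (444.734, 591.297),
    "21.85": (765.515, 1028.406),
    "41.17": (1282.219, 1847.375),
    "83.22": (2270.047, 3627.172),
    "132.26": (2697.906, 3865.531),
}

# How many times slower the high-frequency run is than the low-frequency run
ratios = {volume: high / low for volume, (low, high) in PHRASE_SEARCH.items()}

for volume, ratio in ratios.items():
    print(f"{volume}: high/low = {ratio:.2f}")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean high/low = {mean_ratio:.2f}")
```

The same dictionary can be reused to compare growth in search time against growth in volume size, in the same way as for the search by words.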

  