Speed Tests of Applied Software Based on SoftInform Search Technology
Below are the results of the SearchInform indexing speed tests, in particular for indexing information stored in files of various formats.
Introduction
To cover all aspects of full-text search, several tests were run on data of various types and sizes. The most widely used data formats are TXT, HTML, DOC, RTF and PDF.
From the point of view of search technology, tests on simple formats are the most representative: as a rule, data is stored in a DBMS or data warehouse and reaches the search system as plain text.
The test results are the indexing duration and the index size. Note that any index produced by SearchInform takes up about 50 MB to begin with, so with little textual data the index will still be fairly large. With large data volumes the extra 50 MB is negligible.
The tests were run on a mid-range computer. Test configuration: CPU AMD Athlon 2.2 GHz; RAM 2 GB DDR400; two 160 GB IDE hard drives (the data was stored on one drive and the index was created on the other).
Description of Volumes to be Indexed
To test indexing and search speed, we selected several volumes of information of various sizes containing documents in various formats. The volumes are listed from smallest to largest, and each larger volume includes the smaller ones. For example, volume "21.85" includes volume "11.1", and so on.
Note: the volumes are named after the size of information in gigabytes.
Volumes "11.1", "21.85", "41.17" and "83.22" consist of English-language patents in HTML format. The documents are physically stored in ZIP archives, 5,000 to 10,000 files per archive.
In addition to the patents in HTML (83.22 GB), volume "132.26" contains the DOC, RTF and PDF test volumes as well as the "10.7" volume of texts.
Indexing Speed Tests
The tests showed that, in terms of indexing speed, SearchInform is about 3 to 4 times faster than competing products. This document does not include our competitors' results; if you wish to see them, send a request to support@searchinform.com and our experts will provide the relevant information.
Table 1
| Test volume | «11.1» | «21.85» | «41.17» | «83.22» | «132.26» |
|---|---|---|---|---|---|
| Size of documents | 11.1 GB | 21.85 GB | 41.17 GB | 83.22 GB | 132.26 GB |
| Documents total | 319,695 | 619,018 | 1,118,513 | 1,993,149 | 2,888,202 |
| Unique words | 2,527,473 | 4,016,495 | 6,157,339 | 11,276,270 | 18,912,257 |
| Pure text size | 7.92 GB | 15.5 GB | 28.97 GB | 59.42 GB | 77.57 GB |
| Index size | 1.76 GB | 3.29 GB | 6.03 GB | 12.12 GB | 16.29 GB |
| Indexing duration | 30 min 36 sec | 59 min 30 sec | 1 h 53 min | 3 h 56 min 15 sec | 6 h 06 min |
| Average speed, GB per hour | 21.76 | 21.99 | 21.72 | 21.14 | 21.68 |
Table 2
| Test volume | «10.7» | DOC | RTF | PDF |
|---|---|---|---|---|
| Size of documents | 10.7 GB | 1.9 GB | 325 MB | 5.39 GB |
| Documents total | 48,222 | 7,791 | 769 | 526 |
| Unique words | 4,408,347 | 439,354 | 220,262 | 942,295 |
| Pure text size | 9.88 GB | 179 MB | 33.27 MB | 126 MB |
| Index size | 2.06 GB | 118 MB | 86.91 MB | 160 MB |
| Indexing duration | 32 min | 1 min 34 sec | 29 sec | 12 min 05 sec |
| Average speed, GB per hour | 20.06 | 72.7 | 39.4 | 26.8 |
Search Speed Tests
Testing Methods
A special program (PhraseGen) generated a query file from the documents of a volume on the disk (HTML and DOC). The file has the following format:
N = A B C D etc,
where A, B, C, D are the words that form a phrase randomly selected from the documents;
N is the number of "garbage" words between the words in the phrase.
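To make the format concrete, here is a minimal Python sketch of how such a line could be parsed. It is not the code of PhraseGen or of the SearchInform test module; it assumes one query per line, whitespace-separated words, and a single `=` separating N from the phrase. The names `PhraseQuery` and `parse_phrase_line` are hypothetical.

```python
from typing import List, NamedTuple


class PhraseQuery(NamedTuple):
    gap: int          # N: number of "garbage" words between the phrase words
    words: List[str]  # A, B, C, D, ...: the words of the phrase


def parse_phrase_line(line: str) -> PhraseQuery:
    """Parse one line of the assumed form 'N = A B C D'."""
    left, _, right = line.partition("=")
    words = right.split()
    if not words:
        raise ValueError(f"no phrase words found in line: {line!r}")
    # int() raises ValueError if the part before '=' is not a number.
    return PhraseQuery(gap=int(left.strip()), words=words)


query = parse_phrase_line("2 = full text search")
print(query.gap, query.words)
```

A line such as `2 = full text search` would then describe a query for the words "full", "text" and "search" with two "garbage" words between them.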
This format is recognized by a special test module of the SearchInform system, which becomes available when the program is started with the /debug key. The test conditions were then specified through the corresponding menu (Debug), and two types of test were performed: search by words and search by phrase.
The tests were run with morphology taken into account and with the number of results limited to 20,000. Each test consisted of 1,000 queries. Each type of test was run for both high-frequency and low-frequency words.
Search by Words Test Results
In a production system the index is already in use, so no extra time is spent on preparing it for the first queries. To bring the test as close to real conditions as possible, a search by low-frequency and high-frequency words was first performed without generating a report, and only then was the actual test run.
The search times (time spent on processing 1,000 queries) are presented below:
| Volume | Low-frequency words, seconds | High-frequency words, seconds |
|---|---|---|
| «11.1» | 97.875 | 99.484 |
| «21.85» | 149.516 | 147.828 |
| «41.17» | 238.844 | 246.922 |
| «83.22» | 365.5 | 313.687 |
| «132.26» | 508.062 | 341.797 |
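The results show that, when searching by words only, high-frequency and low-frequency words take roughly the same time on the smaller volumes, while on the two largest volumes ("83.22" and "132.26") searching by high-frequency words is noticeably faster. Search time also grows as the volume increases, but less than proportionally: between the smallest and the largest volume the data grows about 12 times, while the search time grows about 5.2 times for low-frequency words and about 3.4 times for high-frequency words.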
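The scaling behaviour can be checked directly from the figures in the table above, reading the commas in the original timings as decimal separators. The following Python sketch computes the average time per query and compares the growth of the data with the growth of the search time; the names `WORD_SEARCH_SECONDS`, `ms_per_query` and `growth_factors` are introduced here only for illustration.

```python
# Seconds per 1,000 queries, copied from the search-by-words table:
# volume size in GB -> (low-frequency words, high-frequency words)
WORD_SEARCH_SECONDS = {
    11.1: (97.875, 99.484),
    21.85: (149.516, 147.828),
    41.17: (238.844, 246.922),
    83.22: (365.5, 313.687),
    132.26: (508.062, 341.797),
}
QUERIES_PER_TEST = 1000


def ms_per_query(total_seconds):
    """Average time of a single query in milliseconds."""
    return total_seconds * 1000 / QUERIES_PER_TEST


def growth_factors(results):
    """Compare the largest volume with the smallest one.

    Returns (data growth, low-frequency time growth, high-frequency time growth).
    """
    smallest, largest = min(results), max(results)
    low = results[largest][0] / results[smallest][0]
    high = results[largest][1] / results[smallest][1]
    return largest / smallest, low, high


data_x, low_x, high_x = growth_factors(WORD_SEARCH_SECONDS)
print(f"data: {data_x:.1f}x, low-frequency time: {low_x:.1f}x, "
      f"high-frequency time: {high_x:.1f}x")
print(f'average query on "11.1": {ms_per_query(97.875):.1f} ms')
```

On the smallest volume a single query by low-frequency words therefore takes a little under 100 ms on average.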
Search by Phrases with Gaps Test Results
| Volume | Low-frequency words, seconds | High-frequency words, seconds |
|---|---|---|
| «11.1» | 444.734 | 591.297 |
| «21.85» | 765.515 | 1,028.406 |
| «41.17» | 1,282.219 | 1,847.375 |
| «83.22» | 2,270.047 | 3,627.172 |
| «132.26» | 2,697.906 | 3,865.531 |
In this case (search by phrase), searching by high-frequency words is about 1.3 to 1.6 times slower than searching by low-frequency words. Search time again grows more slowly than the volume itself: between the smallest and the largest volume the data grows about 12 times, while the search time grows about 6 to 6.5 times.
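The ratio between the two columns of the phrase search table can be recomputed per volume with a few lines of Python. The sketch assumes that the figures are seconds per 1,000 queries, as in the search-by-words test, and that the commas in the original figures are decimal separators; `PHRASE_SEARCH_SECONDS` and `slowdown` are illustrative names.

```python
# Copied from the search-by-phrases table:
# volume name -> (low-frequency words, high-frequency words)
PHRASE_SEARCH_SECONDS = {
    "11.1": (444.734, 591.297),
    "21.85": (765.515, 1028.406),
    "41.17": (1282.219, 1847.375),
    "83.22": (2270.047, 3627.172),
    "132.26": (2697.906, 3865.531),
}


def slowdown(low_seconds, high_seconds):
    """How many times slower high-frequency phrases are than low-frequency ones."""
    return high_seconds / low_seconds


for volume, (low, high) in PHRASE_SEARCH_SECONDS.items():
    print(f'"{volume}": high-frequency phrases are {slowdown(low, high):.2f}x slower')
```

The per-volume ratios fall between roughly 1.3 and 1.6, with the largest gap on the "83.22" volume.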