Bug 11017 - db load performance
db load performance
Status: RESOLVED WONTFIX
Product: ClamAV
Classification: ClamAV
Component: libclamav
stable
x86_64 GNU/Linux
: P3 normal
: old_bug_triage
Assigned To: ClamAV team
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-05-24 15:12 EDT by Steve Basford
Modified: 2019-06-28 12:21 EDT (History)
4 users (show)

See Also:
QA Contact:


Attachments
database files for re-creating the problem (953.53 KB, application/zip)
2014-05-27 15:26 EDT, Steven Morgan
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Steve Basford 2014-05-24 15:12:55 EDT
On of the databases I mirror jumped in size, due to links from t.co re-director
being added to a database, the size increase was as follows:

24/05/2014  18:16         1,709,810 bofh1.ndb
24/05/2014  19:44         6,393,370 bofh2.ndb

Download both databases here:

https://www.dropbox.com/s/jwhlepzbchchl48/dbtest.zip

As you can see... on the two scans below, there is a huge difference in dbload speed between the two databases...

test 1:

clamscan --database=bofh1.ndb whois.exe
whois.exe: OK

----------- SCAN SUMMARY -----------
Known viruses: 17194
Engine version: 0.98.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 0.02 MB
Data read: 0.02 MB (ratio 1.00:1)
Time: 0.130 sec (0 m 0 s)

test 2:

clamscan --database=bofh2.ndb whois.exe 
whois.exe: OK

----------- SCAN SUMMARY -----------
Known viruses: 84102
Engine version: 0.98.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 0.02 MB
Data read: 0.02 MB (ratio 1.00:1)
Time: 57.730 sec (0 m 57 s)


using --debug...

LibClamAV debug: Initialising AC pattern matcher of root[10]
LibClamAV debug: Initializing engine->root[11]
LibClamAV debug: Initialising AC pattern matcher of root[11]
LibClamAV debug: Initializing engine->root[12]
LibClamAV debug: Initialising AC pattern matcher of root[12]

--- Pause happens here ---

LibClamAV debug: bofh2.ndb loaded
LibClamAV debug: Loaded 136 filetype definitions

The issue seems to be partly related to these parts of the signatures,
in the bofh2.ndb database...

Bofhland.Cracked.url.1089596:3:*:(B)742E636F2F68546267664A6F5A334E:48
Bofhland.Cracked.url.1089597:4:*:(B)742E636F2F68546267664A6F5A334E:48
Bofhland.Cracked.url.1089598:3:*:(B)742E636F2F746449454D736B717A6D:48

The repeated sections being:

t.co = 742E636F2F

Just thought I'd report, for comments/thoughts and if the dbload
can be improved?

Cheers,

Steve
Sanesecurity
Comment 1 Steven Morgan 2014-05-27 12:04:45 EDT
Thanks for the report. We'll investigate for the 0.98.6 release.
Comment 2 Steven Morgan 2014-05-27 15:26:28 EDT
Created attachment 6644 [details]
database files for re-creating the problem
Comment 3 Steve Basford 2014-05-28 05:58:29 EDT
Just to add to the mix... 

If you take the very slow loading bofh2.ndb database, replace 2E's with ??... it loads fast... 

sed  "s/742E636F2F/74??636F??/g" bofh2.ndb > bofh3.ndb


clamscan --database=bofh3.ndb testfile
testfile: Bofhland.Cracked.url.1034179.UNOFFICIAL FOUND

----------- SCAN SUMMARY -----------
Known viruses: 84102
Engine version: 0.98.4-rc1
Scanned directories: 0
Scanned files: 1
Infected files: 1
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 0.283 sec (0 m 0 s)
Comment 4 Steven Morgan 2014-05-28 16:33:11 EDT
I've duplicated the behavior, with a quick analysis. You are correct, it is related to the fact that there are so many sigs with the same beginning character sequence. It is searching and adding to a linked list for each of these signatures. I observed one of the linked lists growing to 33454. It is an O(n**2) algorithm under these conditions resulting in the degraded performance. The underlying issue is that the default Aho-Corasick pattern tree depth in ClamAV is 3 (4 characters) and signatures that are not unique in the first 4 characters are added to the linked list. The A-C tree is optimized for search speed, but memory utilization grows exponentially with depth, so increasing the depth is not a viable option (in libclamav/default.h #define CLI_DEFAULT_AC_MAXDEPTH 3, for experimentation). Hopefully a wild card solution will work for now.
Comment 5 Steven Morgan 2014-05-29 11:12:23 EDT
Another thought is to write a bytecode program for this set of signatures. The bytecode logical trigger would be the beginning character sequence and then the suffixes would be represented as an array or list of strings to search in the bytecode entrypoint function. This would at least improve the load time and also provides the opportunity to optimize the run time when the trigger matches by using binary search or some other optimization based on data pattern.
Comment 6 Steve Basford 2014-05-29 11:26:41 EDT
Thanks for the update... 

Just for info really, here a few more tests,
with various impacts on memory, which was quite interesting to see...

normal/very slow: bofh2.ndb: 42E636F2 pattern
LibClamAV debug: pool memory used: 26.230 MB

Different and faster test combination patterns:

test 1/fast:
sed  "s/742E636F2F/74??636F??/g" bofh2.ndb > bofh3.ndb
LibClamAV debug: pool memory used: 35.003 MB

test 2/fast:
sed  "s/742E636F2F/74??636F(2f|2f)/g" bofh2.ndb > bofh3.ndb
LibClamAV debug: pool memory used: 37.777 MB

test 3/fast:
sed  "s/742E636F2F/74??636F/g" bofh2.ndb > bofh3.ndb
LibClamAV debug: pool memory used: 25.851 MB

test 4/fast:
sed  "s/742E636F2F/74(2e|2e)636F(2f|2f)/g" bofh2.ndb > bofh3.ndb
LibClamAV debug: pool memory used: 40.894 MB
Comment 7 Steven Morgan 2015-06-23 15:35:15 EDT
For evaluation for 0.99.1.
Comment 8 Joe McGrath 2018-10-30 15:01:16 EDT
Have tested with the latest release of ClamAV 0.101-beta and found that the bofh2.hdb file takes longer to load than the bofh1.hdb. 

# clamscan -d bofh1.ndb file.txt 
file.txt: Empty file

----------- SCAN SUMMARY -----------
Known viruses: 17194
Engine version: 0.101.0-beta
Scanned directories: 0
Scanned files: 0
Infected files: 0
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 0.055 sec (0 m 0 s)

# clamscan -d bofh2.ndb file.txt 
file.txt: Empty file

----------- SCAN SUMMARY -----------
Known viruses: 84102
Engine version: 0.101.0-beta
Scanned directories: 0
Scanned files: 0
Infected files: 0
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 18.905 sec (0 m 18 s)

Not sure if any improvements were made as my test of 'bofh2.ndb' was quite a bit quicker than the originally reported time. 

Will need to have the developers review.
Comment 9 Micah Snyder 2019-02-26 18:01:01 EST
(In reply to Joe McGrath from comment #8)
> ...
> Not sure if any improvements were made as my test of 'bofh2.ndb' was quite a
> bit quicker than the originally reported time. 
> 
> Will need to have the developers review.

I think test machine is just faster than Steve Basford's was back in 2014.

It looks to me like Steve B's machine loaded bofh2.ndb 
(57.730 sec / 0.130 sec) * 100% = 44407.69% slower than it loaded bofh1.ndb

In your test, bofh.ndb loaded at slightly faster rate faster (57.730 / 0.130) * 100 = (18.905 sec / 0.055 sec) * 100% = 34372.73% slower than it loaded bofh1.ndb, which is still pretty atrocious.

It looks to me like Steve Morgan didn't have any plan to improve the performance for databases that have a massive number of signatures that have the same prefix.  He had suggested that Steve Basford (or the original database owners) rewrite the database as a bytecode signature so it will have better load performance.

I don't know why Steve M indicated that we would reevaluate the ticket for 0.99.1. 
 It clearly didn't happen, and I don't see any nor do I have any recommendations on ways to improve this load performance outside of rewriting the signatures a different way.

I am curious to know what Steve Basford or the original signature authors ended up doing, and if this is still a concern at all.
Comment 10 Micah Snyder 2019-06-28 12:21:07 EDT
Closing this ticket due to lack of activity, and no clear path for improvement.