Bugzilla – Bug 11017
db load performance
Last modified: 2019-06-28 12:21:07 EDT
On of the databases I mirror jumped in size, due to links from t.co re-director being added to a database, the size increase was as follows: 24/05/2014 18:16 1,709,810 bofh1.ndb 24/05/2014 19:44 6,393,370 bofh2.ndb Download both databases here: https://www.dropbox.com/s/jwhlepzbchchl48/dbtest.zip As you can see... on the two scans below, there is a huge difference in dbload speed between the two databases... test 1: clamscan --database=bofh1.ndb whois.exe whois.exe: OK ----------- SCAN SUMMARY ----------- Known viruses: 17194 Engine version: 0.98.1 Scanned directories: 0 Scanned files: 1 Infected files: 0 Data scanned: 0.02 MB Data read: 0.02 MB (ratio 1.00:1) Time: 0.130 sec (0 m 0 s) test 2: clamscan --database=bofh2.ndb whois.exe whois.exe: OK ----------- SCAN SUMMARY ----------- Known viruses: 84102 Engine version: 0.98.1 Scanned directories: 0 Scanned files: 1 Infected files: 0 Data scanned: 0.02 MB Data read: 0.02 MB (ratio 1.00:1) Time: 57.730 sec (0 m 57 s) using --debug... LibClamAV debug: Initialising AC pattern matcher of root[10] LibClamAV debug: Initializing engine->root[11] LibClamAV debug: Initialising AC pattern matcher of root[11] LibClamAV debug: Initializing engine->root[12] LibClamAV debug: Initialising AC pattern matcher of root[12] --- Pause happens here --- LibClamAV debug: bofh2.ndb loaded LibClamAV debug: Loaded 136 filetype definitions The issue seems to be partly related to these parts of the signatures, in the bofh2.ndb database... Bofhland.Cracked.url.1089596:3:*:(B)742E636F2F68546267664A6F5A334E:48 Bofhland.Cracked.url.1089597:4:*:(B)742E636F2F68546267664A6F5A334E:48 Bofhland.Cracked.url.1089598:3:*:(B)742E636F2F746449454D736B717A6D:48 The repeated sections being: t.co = 742E636F2F Just thought I'd report, for comments/thoughts and if the dbload can be improved? Cheers, Steve Sanesecurity
Thanks for the report. We'll investigate for the 0.98.6 release.
Created attachment 6644 [details] database files for re-creating the problem
Just to add to the mix... If you take the very slow loading bofh2.ndb database, replace 2E's with ??... it loads fast... sed "s/742E636F2F/74??636F??/g" bofh2.ndb > bofh3.ndb clamscan --database=bofh3.ndb testfile testfile: Bofhland.Cracked.url.1034179.UNOFFICIAL FOUND ----------- SCAN SUMMARY ----------- Known viruses: 84102 Engine version: 0.98.4-rc1 Scanned directories: 0 Scanned files: 1 Infected files: 1 Data scanned: 0.00 MB Data read: 0.00 MB (ratio 0.00:1) Time: 0.283 sec (0 m 0 s)
I've duplicated the behavior, with a quick analysis. You are correct, it is related to the fact that there are so many sigs with the same beginning character sequence. It is searching and adding to a linked list for each of these signatures. I observed one of the linked lists growing to 33454. It is an O(n**2) algorithm under these conditions resulting in the degraded performance. The underlying issue is that the default Aho-Corasick pattern tree depth in ClamAV is 3 (4 characters) and signatures that are not unique in the first 4 characters are added to the linked list. The A-C tree is optimized for search speed, but memory utilization grows exponentially with depth, so increasing the depth is not a viable option (in libclamav/default.h #define CLI_DEFAULT_AC_MAXDEPTH 3, for experimentation). Hopefully a wild card solution will work for now.
Another thought is to write a bytecode program for this set of signatures. The bytecode logical trigger would be the beginning character sequence and then the suffixes would be represented as an array or list of strings to search in the bytecode entrypoint function. This would at least improve the load time and also provides the opportunity to optimize the run time when the trigger matches by using binary search or some other optimization based on data pattern.
Thanks for the update... Just for info really, here a few more tests, with various impacts on memory, which was quite interesting to see... normal/very slow: bofh2.ndb: 42E636F2 pattern LibClamAV debug: pool memory used: 26.230 MB Different and faster test combination patterns: test 1/fast: sed "s/742E636F2F/74??636F??/g" bofh2.ndb > bofh3.ndb LibClamAV debug: pool memory used: 35.003 MB test 2/fast: sed "s/742E636F2F/74??636F(2f|2f)/g" bofh2.ndb > bofh3.ndb LibClamAV debug: pool memory used: 37.777 MB test 3/fast: sed "s/742E636F2F/74??636F/g" bofh2.ndb > bofh3.ndb LibClamAV debug: pool memory used: 25.851 MB test 4/fast: sed "s/742E636F2F/74(2e|2e)636F(2f|2f)/g" bofh2.ndb > bofh3.ndb LibClamAV debug: pool memory used: 40.894 MB
For evaluation for 0.99.1.
Have tested with the latest release of ClamAV 0.101-beta and found that the bofh2.hdb file takes longer to load than the bofh1.hdb. # clamscan -d bofh1.ndb file.txt file.txt: Empty file ----------- SCAN SUMMARY ----------- Known viruses: 17194 Engine version: 0.101.0-beta Scanned directories: 0 Scanned files: 0 Infected files: 0 Data scanned: 0.00 MB Data read: 0.00 MB (ratio 0.00:1) Time: 0.055 sec (0 m 0 s) # clamscan -d bofh2.ndb file.txt file.txt: Empty file ----------- SCAN SUMMARY ----------- Known viruses: 84102 Engine version: 0.101.0-beta Scanned directories: 0 Scanned files: 0 Infected files: 0 Data scanned: 0.00 MB Data read: 0.00 MB (ratio 0.00:1) Time: 18.905 sec (0 m 18 s) Not sure if any improvements were made as my test of 'bofh2.ndb' was quite a bit quicker than the originally reported time. Will need to have the developers review.
(In reply to Joe McGrath from comment #8) > ... > Not sure if any improvements were made as my test of 'bofh2.ndb' was quite a > bit quicker than the originally reported time. > > Will need to have the developers review. I think test machine is just faster than Steve Basford's was back in 2014. It looks to me like Steve B's machine loaded bofh2.ndb (57.730 sec / 0.130 sec) * 100% = 44407.69% slower than it loaded bofh1.ndb In your test, bofh.ndb loaded at slightly faster rate faster (57.730 / 0.130) * 100 = (18.905 sec / 0.055 sec) * 100% = 34372.73% slower than it loaded bofh1.ndb, which is still pretty atrocious. It looks to me like Steve Morgan didn't have any plan to improve the performance for databases that have a massive number of signatures that have the same prefix. He had suggested that Steve Basford (or the original database owners) rewrite the database as a bytecode signature so it will have better load performance. I don't know why Steve M indicated that we would reevaluate the ticket for 0.99.1. It clearly didn't happen, and I don't see any nor do I have any recommendations on ways to improve this load performance outside of rewriting the signatures a different way. I am curious to know what Steve Basford or the original signature authors ended up doing, and if this is still a concern at all.
Closing this ticket due to lack of activity, and no clear path for improvement.