Bug 12389 - Speedup Aho-Corasick for DB loading
Status: RESOLVED FIXED
Product: ClamAV
Classification: ClamAV
Component: libclamav
Hardware: All
OS: All
Importance: P3 enhancement
Version: 0.101.0
Assigned To: ClamAV team
Reported: 2019-09-13 10:13 EDT by Markus Kolb
Modified: 2019-11-08 17:15 EST (History)
CC: 3 users



Attachments
fast AC sig load (10.07 KB, patch); 2019-10-16 11:36 EDT, aCaB
flame graph showing long function calls (92.67 KB, image/svg+xml); 2019-10-16 16:02 EDT, Micah Snyder
proper AC load patch (11.09 KB, patch); 2019-10-23 05:42 EDT, aCaB
tree dumping (5.61 KB, patch); 2019-10-23 05:45 EDT, aCaB
0.102 aCaB's fast loading patch (11.64 KB, patch); 2019-11-06 11:56 EST, Micah Snyder

Description Markus Kolb 2019-09-13 10:13:11 EDT
Hi,

I'm currently developing a file cache for the startup phase.
It is not easy to implement.
The cache also needs to be invalidated if there is a DB update or some relevant config change.
Config changes could be handled by multiple file caches.
Maybe later some kind of patching of compiled rules could be introduced to circumvent the cache invalidation. But at the moment I don't have a complete overview of the existing code. All the pointer linking in the data structs. OMG.

The basic idea is to write the data structs to a file after DB parsing, using some memory-mapping scheme for the pointers. On startup the file content is loaded back into the data structs and the pointers are fixed up.
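A minimal sketch of the offset-based fixup idea (hypothetical names; this is not existing ClamAV code): pointers into a serialized arena are stored on disk as offsets from the arena base, then converted back to pointers after the arena is read (or mmap'd) in, so the file stays valid regardless of where it is loaded.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only. Offsets are biased by 1 so that 0 can
 * represent a NULL pointer in the serialized form. */

/* Convert a pointer inside the arena into a relocatable offset. */
static uint64_t ptr_to_off(const void *base, const void *p)
{
    return p ? (uint64_t)((const char *)p - (const char *)base) + 1 : 0;
}

/* Convert a stored offset back into a pointer after loading. */
static void *off_to_ptr(void *base, uint64_t off)
{
    return off ? (char *)base + (off - 1) : NULL;
}
```

On save, every pointer field is rewritten via `ptr_to_off` before the arena is dumped; on load, a single pass over the structs applies `off_to_ptr` to restore them.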

I would appreciate it if someone with C development experience could jump in to speed up the process.
There are some nice parts to split up.
If there is interest I'll create a GitHub repo.
Comment 1 Micah Snyder 2019-09-13 10:37:12 EDT
This is definitely one idea we've discussed in the office as well. You're quite right that it is a hard problem because of all the pointers linking various structures.  It absolutely could speed up start time for clamd and clamscan if done right.
Comment 2 Markus Kolb 2019-09-17 07:44:54 EDT
Is it possible that
cli_matcher->soff
is not used currently?

I don't find code for it...
Comment 3 Micah Snyder 2019-09-17 18:38:07 EDT
It appears that neither of these are used:

    uint32_t *soff, soff_len;
Comment 4 aCaB 2019-10-16 11:36:26 EDT
Created attachment 7594 [details]
fast AC sig load

Hi folks,
over the past three months we've observed load time increase by 300% on our typical scanning servers.
I had just started toying around when I stumbled upon the ML thread and hence this ticket.
The root of the issue is quadratic time in the insert_list function, which is part of the AC sigs loader.
The purpose of the function is to place patterns into two distinct lists departing from each AC node. This is achieved by comparing each newly loaded pattern against the previously inserted patterns until the proper spot is found, which has O(N^2) complexity.
With the recent surge in released LDB signatures, in particular those with common subparts, the LDB load time has become a serious problem.

The attached patch addresses the issue in the following steps:
1. Loaded list items are allocated but left unattached; only a node reference is set on them for further processing. This is done with no increase of memory usage. See changes in insert_list and matcher-ac.h
2. Before the tries are built, the whole list of entries is sorted by node, then by pattern, then by partno. This requires O(N log(N)) time.
3. The list is processed linearly, one node at a time and the `next_same` chain is built. Each next_same chain head is also extracted. This requires O(N) time.
4. The list of heads is sorted by partno. This requires O(M log(M)) time on average with M<=N.
5. The list of heads is processed linearly and the `next` chain is built. This has O(M) complexity.
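The steps above can be sketched in simplified form (the struct and function names here are hypothetical, not the actual patch code): one global qsort replaces the per-insert list walks, and a single linear pass then links consecutive entries that share a node into a chain.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified sketch of the approach: instead of inserting each pattern
 * into a per-node sorted list (O(N^2) total), collect all entries,
 * sort them once by (node, partno), then chain them in one linear pass. */
struct entry {
    int node;           /* trie node the pattern hangs off */
    int partno;         /* pattern part number */
    struct entry *next; /* chain of entries on the same node */
};

static int cmp_entry(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->node != y->node)
        return x->node - y->node;
    return x->partno - y->partno;
}

/* Sort entries, link consecutive ones sharing a node, and return the
 * number of distinct nodes (i.e. chain heads). O(N log N) + O(N). */
static size_t link_chains(struct entry *e, size_t n)
{
    size_t i, heads = 0;
    qsort(e, n, sizeof(*e), cmp_entry);
    for (i = 0; i < n; i++) {
        e[i].next = (i + 1 < n && e[i + 1].node == e[i].node) ? &e[i + 1] : NULL;
        if (i == 0 || e[i].node != e[i - 1].node)
            heads++;
    }
    return heads;
}
```

The real patch does this twice (once for the `next_same` chains, once for the heads by partno), but the complexity argument is the same: sorting dominates at O(N log N).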


Important notes:
0. I haven't yet fully tested the patch. I intend to do so over the course of the following weeks, but I thought I'd post it here anyway in case someone else was looking at it or is willing to test it further. I don't expect any change in the number of detections; however, certain files might be detected under a different name (based on the sig load order) in case two or more sigs hit on them.
1. The patch is against 0.101; I can port it to 0.102 if there's interest.
2. While working on the patch I think I've encountered a bug where `->len` is compared: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.101/libclamav/matcher-ac.c#L152
   Based on the fact that len is an array, and based on the memcmp five lines below, I believe the intended check was on `->len[0]`.
   That's what's used in my patch.
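For illustration, the class of bug described above looks like this (a minimal stand-in struct, not the real cli_ac_patt): because `len` is an array, `a->len == b->len` compares the array addresses, which can only be equal when both operands are the same struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the pattern struct; the real cli_ac_patt is far
 * more complex. */
struct patt {
    uint32_t len[2];
    const unsigned char *pattern;
};

/* Buggy form: `len` decays to a pointer, so this compares addresses.
 * It is true only when a and b are the very same struct. */
static int same_len_buggy(const struct patt *a, const struct patt *b)
{
    return a->len == b->len;
}

/* Intended form: compare the first stored length. */
static int same_len_fixed(const struct patt *a, const struct patt *b)
{
    return a->len[0] == b->len[0];
}
```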
Comment 5 Micah Snyder 2019-10-16 13:07:11 EDT
aCaB,

This is awesome!  Thanks so much for working on it.  Mickey and I will both investigate and test this right away.  

I have been working on ways to refactor the signature database for a few weeks off and on.  I had written up an email for the mailing list to describe my findings, but was holding off on sending it until I had a more concrete solution. 

I found 2 causes for slow signature load & scan time.  The issue you're addressing in the patch, and another due to ignored signatures. The routines to ignore signatures (the .ign and .ign2 database files) are also extraordinarily inefficient.  As the ignore list has grown, so has load time.  Currently we ignore 1343 signatures.  Pruning the ignored signatures from main.cvd, and trimming the .ign/ign2 ignore files will have an immediate load time benefit for most users.

These databases will be updated very soon. 

I'd also like to optimize the code around ignoring signatures, but I also see no reason why we can't use cdiff's to update main.cvd going forward, so optimizing the ignore feature may be a waste of time when we can just drop signatures from main.cvd instead.
 
To add more detail to your description of the load time issue for signatures with common subparts (what I refer to as the "common pattern prefix" issue)...

We found that a large number of signature patterns in .ldb and .ndb database files share the same first 3 or more bytes even though they have different content later.  Specifically, about half of daily.ldb is affected by this issue.  This “common pattern prefix” isn’t very observable when it’s only a few signatures.  In higher quantities the effect is pronounced and is the primary cause of higher load times this past year.  

I had hoped that I could "optimize" the signatures, by separating the common pattern prefixes into unique patterns with a wildcard, the {#} and * special characters.  I did some extensive testing on about 10% of the affected signatures using the malware samples those signatures were designed to detect. In so doing, I realized that my "optimization" in fact defers the problem to the other list (the ->same_next pointer), which results in terrible scan times for files containing a large number of such patterns.  Load time would go down a tonne, but scan time of files like Windows DLL's containing a large number of functions goes way up.

I have also come to the conclusion that there is no way to "optimize" the signature database without losing efficacy, and without a patch to libclamav. 

This morning I had an idea to pre-allocate branches of the trie used to store known common patterns, which would be a band-aid on the issue we would likely have to adjust from time to time.  I was just about to start coding that solution when you posted the above patch.  Your patch sounds like it is a much better option. 
 
Regards,
Micah
Comment 6 Micah Snyder 2019-10-16 16:02:15 EDT
Created attachment 7595 [details]
flame graph showing long function calls

I made the attached flame graph to show function run time during load.  The flamegraph file should render in Chrome/chromium. 

Note that about half is attributed to insert_list (the issue addressed by aCaB's patch), and about half to cli_bm_scanbuff, called when checking if signatures should be ignored.

I did some initial testing on my laptop to compare load & scan time for:
A) 0.101.4
B) 0.101.4 after removing daily.ign, daily.ign2
C) 0.101.4 after the patch
D) 0.101.4 after the patch + after removing daily.ign, daily.ign2

The scan target for my test was libclamav.dll (from 0.102.0, x64), chosen because it has many functions (many occurrences of the most common pattern prefix in daily.ldb).

The results are:

A) 0.101.4:
load time: 67.14786195755005
scan time: 2.4904208183288574

B) 0.101.4 after removing daily.ign, daily.ign2
load time: 42.72711801528931
scan time: 2.5286037921905518

C) 0.101.4 after the patch
load time: 39.68687582015991
scan time: 2.8117897510528564

D) 0.101.4 after the patch + after removing daily.ign, daily.ign2
load time: 15.633795738220215
scan time: 2.5904250144958496

In total, load time with both changes goes from 67.1 seconds down to 15.6 seconds. This is roughly identical to what I was seeing when I tried separating the common pattern prefixes with wildcards, which is excellent news. 
Scan time is not affected (for this file), which makes sense and is also excellent news.

Basic testing of an LDB database load under address sanitization (ASAN) went well.  `make check VG=1`, and Valgrind testing of freshclam and a basic clamscan also went well.

I'm currently setting up a larger regression test (for efficacy) that will take a day or two to run, and will update the ticket when we have results.
Comment 7 aCaB 2019-10-17 12:17:31 EDT
(In reply to Micah Snyder from comment #5)
> I found 2 causes for slow signature load & scan time.  The issue you're
> addressing in the patch, and another due to ignored signatures. The routines
> to ignore signatures (the .ign and .ign2 database files) are also
> extraordinarily inefficient.

Hi Micah,
I didn't notice the ign issue as we repack all the sigs into a single cld with the ignored sigs already stripped off.
From a quick look chkign is also O(N^2); despite being efficient in that it uses BM for the lookup, it'll always suffer from the quadratic complexity.
I'd suggest using a cli_hashtable (IIRC cli_hashset only allows for integer keys) for storing the ignored sigs, which should have O(1) access time.
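As a rough illustration of the suggestion (this is a generic chained hash set, not cli_hashtable's actual API), a one-time hash of the ignore list replaces a per-signature scan of the whole list with an O(1) average lookup:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only. Fixed-size chained hash set over signature
 * names; the real code would use libclamav's own hash table. */
#define IGN_BUCKETS 1024

struct ign_entry {
    char *name;
    struct ign_entry *next;
};

/* djb2-style string hash, reduced to a bucket index. */
static unsigned long ign_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % IGN_BUCKETS;
}

/* Insert a signature name; returns 0 on success, -1 on OOM. */
static int ign_add(struct ign_entry **tbl, const char *name)
{
    unsigned long h = ign_hash(name);
    struct ign_entry *e = malloc(sizeof(*e));
    if (!e)
        return -1;
    e->name = malloc(strlen(name) + 1);
    if (!e->name) {
        free(e);
        return -1;
    }
    strcpy(e->name, name);
    e->next = tbl[h];
    tbl[h] = e;
    return 0;
}

/* O(1) average lookup, versus walking the full ignore list. */
static int ign_contains(struct ign_entry **tbl, const char *name)
{
    const struct ign_entry *e;
    for (e = tbl[ign_hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return 1;
    return 0;
}
```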

> We found that large number of signature patterns in .ldb and .ndb database
> files share the same first 3 or more bytes though they have different
> content later.  Specifically, about half of daily.ldb is affected by this
> issue.  This “common pattern prefix” isn’t very observable when it’s only a
> few signatures.  In higher quantities the effect is pronounced and is the
> primary cause of higher load times this past year.  

Looking at the scan time performance (i.e. disregarding the load time issue, which should be addressed by the patch), as a general rule the AC tries do a fairly good job at arranging entries.
That said, if you are looking for nasty patterns, you should probably start from the many cases of AC patterns starting with "0000". AFAIR these tend to cause bad scan-time slowdowns. Last time I checked I counted about 500 of these.

> The results are:
> 
> A) 0.101.4:
> load time: 67.14786195755005
> scan time: 2.4904208183288574
[...]
> C) 0.101.4 after the patch
> load time: 39.68687582015991
> scan time: 2.8117897510528564

Weird, I get a much better improvement from the patch...
FWIW we don't load pua, ign, safebrowsing, phishing, nor re-based sigs, and at ~6.5M sigs the load time is reduced to about 15% of what it was (39 vs 6 seconds).


> Basic testing of an LDB database load under address sanitization (ASAN) went
> well.  `make check VG=1`, and Valgrind testing of freshclam and a basic
> clamscan also went well.
> 
> I'm currently setting up a larger regression test (for efficacy) that will
> take a day or two to run, and will update the ticket when we have results.

This is great news. Can't wait to see the results.

BTW any comment on that len[0] thing?
Comment 8 Micah Snyder 2019-10-21 13:27:34 EDT
(In reply to aCaB from comment #7)

> i didn't notice the ign issue as we repack all the sigs into a single cld
> with the ignored sigs already stripped off.
> From a quick look chkign is also O(N^2) and despite being efficient in that
> it uses BM for the lookup, it'll always suffer from the exponential
> complexity.
> I'd suggest using a cli_hashtable (IIRC cli_hashset only allows for integer
> keys) for storing the ignored sigs which should have O(1) access time.

cli_hashtable might be a good option, but it's hard to dedicate time to try it out if we can stop ignoring sigs and start dropping sigs instead.  Of course, I also don't know how many community members curate large ignore lists.  Hopefully that's not a thing.

> Looking at the scan time performance (i.e. disregarding the load time issue
> which should be addressed by the patch), as a general rule the AC tries do a
> fairly good job at arranging entries.
> That said, if you are looking at nasty patterns, you should probably start
> from the many cases of AC patterns starting with "0000". AFAIR these tend to
> cause bad scantime slowdowns. Last time i've checked i think i've counted
> about 500 of these.

"0000" anywhere in a signature has been a culprit in the past.  At present, the two main offenders are:

Pattern prefix: 558bec - bytes for a Windows PE file function prolog, exceeding 25,000 patterns.

and:

Pattern prefix: 56423521 - bytes relating to identifying VBA scripts (afaik), exceeding 10,000 patterns.


> > The results are:
> > 
> > A) 0.101.4:
> > load time: 67.14786195755005
> > scan time: 2.4904208183288574
> [...]
> > C) 0.101.4 after the patch
> > load time: 39.68687582015991
> > scan time: 2.8117897510528564
> 
> Weird, I get a much better improvement from the patch...
> FWIW we don't load pua, ign, safebrowsing, phishing, nor re-based sigs and
> at ~6.5M sigs the load time is reduced to about 15% (39 vs 6 seconds).

I would expect that if you pre-drop the ignored signatures, your time (on my laptop) would go from 42.7 seconds to 15.3 seconds, reduced to about 36% - not 15%.  I'm definitely not seeing quite that much improvement - though it's still pretty fantastic.

> > Basic testing of an LDB database load under address sanitization (ASAN) went
> > well.  `make check VG=1`, and Valgrind testing of freshclam and a basic
> > clamscan also went well.
> > 
> > I'm currently setting up a larger regression test (for efficacy) that will
> > take a day or two to run, and will update the ticket when we have results.
> 
> This is great news. Can't wait to see the results.

I ran a regression test against 5 million samples from our malware zoo.  I'm seeing a lot (over 5000) of detection "changes", "false negatives", and "errors".  I've downloaded a few of the affected samples to try to reproduce the issues.  It seems to be fewer actual errors and more timeouts caused by extremely slow scan times for certain files.  I suspect the slow scan times are related to the common pattern prefixes above.

I previously suspected that the slow load time would also result in slow scan time because of really long linked lists attached to the A-C trie.  It seems like that is the case with some of these files, and that after the patch the scan time has about doubled for some. 

Take for example this (signed, non-malicious) "outlook.exe" file.  Scanning it after the patch is slower than before, but if you replace every instance of `;558bec` with `;` in daily.ldb, the scan time becomes a fraction of what it was:

https://www.virustotal.com/gui/file/3f569882de1db39db8be06077737ba6828af44eac22e4aa1bccfcaaa4cf08e7d/details

Before patch, using just the subset of signatures containing `;558bec`:
 Time: 87.151 sec (1 m 27 s)


After patch, using just the subset of signatures containing `;558bec`:
 Time: 553.293 sec (9 m 13 s)

Before patch, using that subset of signatures with `;558bec` removed:
 Time: 5.415 sec (0 m 5 s)

After patch, using that subset of signatures with `;558bec` removed:
 Time: 5.081 sec (0 m 5 s)

As you can see, there's definitely something different about the structure of the A-C trie or lists causing fairly significantly slower scan times for some files.  aCaB, do you have any idea what might be going on?  I can use Google Drive to share samples with you for testing. 

> BTW any comment on that len[0] thing?

I agree it looks like a bug that's been around for at least 12 years.  Changing it to [0] seems like the right thing.  Comparing len would just check whether the two pointers were the same, not the actual lengths.  

Micah
Comment 9 aCaB 2019-10-21 15:31:25 EDT
Hi Micah,

the original patch wasn't much tested and in fact contains a subtle bug which results in extra "next" branches being created in place of "next_same" entries, and hence extra scan time.

I did notice the slowdown too and wrote a tree dumper and comparison script.

I'll do some perf tests tonight and provide you with a final patch tomorrow, as well as the trie-dumping patch and the small Python script to compare the dumps.

Sorry for the trouble and thanks again for the tests.
Comment 10 Micah Snyder 2019-10-21 18:12:24 EDT
(In reply to aCaB from comment #9)
> Hi Micah,
> 
> the original patch wasn't much tested and in fact it contains a subtle bug
> which results in extra "next" branches being created in place of "next_same"
> entries, which results in extra scan time.
> 
> i did notice the slowdown too and wrote a tree dumper and comparison script.
> 
> i'll do some perf test tonight and provide you with a final patch tomorrow -
> as well as the trie dumping patch and the small python thing to compare them.
> 
> sorry for the trouble and thanks again for the tests.

No problem!  That's really cool that you've found the source of the issue so fast.  Your work is much appreciated. 

Once I have the update, I'll re-run testing over here.
Comment 11 aCaB 2019-10-23 05:42:24 EDT
Created attachment 7596 [details]
proper AC load patch

Hi Micah,
sorry it took longer than expected, but after fixing the bug in the previous patch I was still left with a huge performance loss.

Figuring that out required a long investigation and writing code to dump and compare the tries.
The frustrating part was that with the bug fixed the tries looked functionally equivalent (i.e. they both satisfied the matcher requirements in terms of anchoring and order), yet the one built without the patch was a lot more performant.
In the end the culprit was memory layout, in particular the effects of caching and page faults. The solution was to simply avoid scattering close patterns apart.
By doing that I can achieve better-than-before scan times (in a more scientific/deterministic way too) while maintaining linear load times.
Please consider this patch semi-final. I'll be testing it further but I'm relatively confident now.
Also please let me know of any issue you may encounter.
Comment 12 aCaB 2019-10-23 05:45:15 EDT
Created attachment 7597 [details]
tree dumping

Hi, this is the patch and Python script I've used to check the tries.
After compiling with the patch applied, run `clamscan --no-summary /nosuchfile > trie.dump`
Comment 13 Micah Snyder 2019-10-23 15:31:14 EDT
(In reply to aCaB from comment #11)
> Created attachment 7596 [details]
> proper AC load patch
> 
> Hi Micah,
> sorry it took longer than expected but after fixing the bug in the previous
> patch i was still left with a huge performance loss.
> 
> Figuring that out required a long investigation and writing code to dump and
> compare the tries.
> The frustrating part was that with the bug fixed the tries looked
> functionally equivalent (i.e. they both satisfied the matcher requirements
> in terms of anchoring and order), yet the one without the patch was a lot
> more performant.
> In the end the culprit was the memory, in particular the effect of caching
> and the result of faults. The solution was to simply avoid scattering close
> patterns apart.
> By doing that i can achieve better-than-before scan times (in a more
> scientific/deterministic way too) while maintaining linear load times.
> Please consider this patch semi-final. I'll be testing it further but i'm
> relatively confident now.
> Also please let me know of any issue you may encounter.

That sounds outstanding.  Thanks for the in depth investigation for the fix.  I'll begin testing your updated patch immediately. 

Micah
Comment 14 Micah Snyder 2019-11-05 15:24:00 EST
aCaB,

Sorry about the long delay getting back to you about test results. 

There was a significant reduction in scan timeout issues after the 2nd patch but for reasons unknown to me, engine B (post patch) still had a handful of samples where it exceeded the scan timeout in our regression test system while engine A (before the patch) did not.   

I downloaded and tested a handful of these on macOS, Ubuntu, and FreeBSD (said regression tests run on FreeBSD) to see what load and scan time differences were like before and after the patch. I found no significant or consistent difference in scan time between pre- and post-patch scans on any of the samples on any of the above operating systems. 

I also ran the valgrind and address sanitizer tests as before with no issue.  Load time as before is reduced about the same amount (which is awesome). 

I'm looking forward to merging this patch.  I have already ported this to branches for 0.102.1 and 0.103 (in addition to 0.101.5) and placed internal pull requests to be reviewed and merged by a team member.
Comment 15 Markus Kolb 2019-11-06 04:33:07 EST
Hi Micah,
hi aCaB,

I'd like to rename the bug title to "Speedup Aho-Corasick for DB parsing" or some similar title. OK?

I think, it is great that you are trying to fix it that way.

My effort on pure memory caching got stuck a few weeks ago.
I found that I would also need to serialize the memory of compiled regex (re-lib) structures, which adds some difficulty.
I've also seen during testing that my PoC doesn't speed things up as much as I had assumed, so I'm not sure it would be a big step in the end.
And overall I've lost time because I needed to work on something else.

Micah, can you please upload a complete patch for the 0.102-branch?
I would like to give it a try.

Thanks for your work.
Markus
Comment 16 Micah Snyder 2019-11-06 11:56:59 EST
Created attachment 7610 [details]
0.102 aCaB's fast loading patch

Sure, makes sense to me.  Here's the 0.102 variant of aCaB's patch.
Comment 17 Micah Snyder 2019-11-08 17:15:12 EST
We just merged this patch into dev/0.103, here: 
https://github.com/Cisco-Talos/clamav-devel/commit/3dd506a7eecf07d5642072bbdb088fb06b8097ee

It has also been merged into the private branches where we're prepping the 0.102.1 and 0.101.5 security patch releases to be published in the next week or two.