Bugzilla – Bug 12389
Speedup Aho-Corasick for DB loading
Last modified: 2019-11-08 17:15:12 EST
Hi, I'm currently developing some file caching for the startup. It is not easy to implement: the cache also needs to be invalidated whenever there is a DB update or a relevant config change. Config changes could be handled with multiple file caches, and maybe later some kind of patching of compiled rules could be introduced to circumvent the cache invalidation. At the moment, though, I don't have a complete overview of the existing code. All the pointer linking in the data structs. OMG.

The basic idea is to write the data structs to a file after DB parsing, with some memory mapping for the pointers. On startup the file content is loaded back into the data structs and the pointers get fixed up. I would appreciate it if someone with C development experience could jump in to speed up the process; there are some nice parts to split up. If there is interest I'll create a GitHub repo.
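The pointer-fixup idea described above can be sketched in miniature: serialize a linked structure by replacing pointers with file offsets, then rebuild the pointers on load. Everything below (the `node`/`disk_node` types and the `freeze`/`thaw` names) is hypothetical illustration, not libclamav code; the real data structs are far more heavily cross-linked, so a general solution needs an address-to-offset map rather than the sequential-chain assumption used here.

```c
#include <stdlib.h>

/* Hypothetical in-memory node; the real structures are far more complex. */
struct node {
    int value;
    struct node *next;
};

/* On-disk representation: pointers replaced by array offsets (-1 = NULL). */
struct disk_node {
    int value;
    long next_off;
};

/* "Serialize": walk the chain in order and encode next as an index.
 * For a simple list walked front-to-back, the next node is always at
 * index n + 1; arbitrary graphs would need an address-to-offset map. */
size_t freeze(const struct node *head, struct disk_node *out, size_t max) {
    size_t n = 0;
    for (const struct node *p = head; p && n < max; p = p->next, n++) {
        out[n].value = p->value;
        out[n].next_off = p->next ? (long)(n + 1) : -1;
    }
    return n;
}

/* "Load": rebuild in-memory nodes and fix the pointers up from offsets. */
struct node *thaw(const struct disk_node *in, size_t n) {
    struct node *nodes = calloc(n, sizeof(*nodes));
    if (!nodes)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        nodes[i].value = in[i].value;
        nodes[i].next = (in[i].next_off >= 0) ? &nodes[in[i].next_off] : NULL;
    }
    return nodes;
}
```

The same idea extends to an mmap'd file: store offsets relative to the mapping base and add the base address back on load.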
This is definitely one idea we've discussed in the office as well. You're quite right that it is a hard problem because of all the pointers linking various structures. It absolutely could speed up start time for clamd and clamscan if done right.
Is it possible that cli_matcher->soff is currently unused? I can't find any code for it...
It appears that neither of these are used: uint32_t *soff, soff_len;
Created attachment 7594 [details] fast AC sig load

Hi folks, over the past three months we've observed the load time increase by 300% on our typical scanning servers. I had just started toying around when I stumbled upon the ML thread and hence this ticket.

The root of the issue is quadratic time in the insert_list function, which is part of the AC sigs loader. The purpose of the function is to place patterns into two distinct lists departing from each AC node. This is achieved by comparing each newly loaded pattern against the previously inserted patterns until the proper spot is found, which has O(N^2) complexity. With the recent surge in released LDB signatures, in particular those with common subparts, the LDB load time has become a serious problem.

The attached patch addresses the issue in the following steps:
1. Loaded list items are allocated but left unattached; only a node reference is set on them for further processing. This is done with no increase in memory usage. See the changes in insert_list and matcher-ac.h.
2. Before the tries are built, the whole list of entries is sorted by node, then by pattern, then by partno. This requires O(N log N) time.
3. The list is processed linearly, one node at a time, and the `next_same` chain is built. Each next_same chain head is also extracted. This requires O(N) time.
4. The list of heads is sorted by partno. This requires O(M log M) time on average, with M <= N.
5. The list of heads is processed linearly and the `next` chain is built. This has O(M) complexity.

Important notes:
0. I haven't yet fully tested the patch. I intend to do so over the course of the following weeks, but I thought I'd post it here anyway in case someone else was looking at it or is willing to test it further. I don't expect any change in the number of detections; however, certain files might be detected under a different name (based on the sig load order) in case two or more sigs hit on it.
1. The patch is against 101; I can port it to 102 if there's interest.
2. While working on the patch I think I've encountered a bug where `->len` is compared: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.101/libclamav/matcher-ac.c#L152 Given that len is an array, and based on the memcmp 5 lines below, I believe the intended check was on `->len[0]`. That's what's used in my patch.
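For illustration, the sort-then-link strategy in steps 2 and 3 above can be sketched with a simplified, hypothetical pattern struct (the real AC pattern struct in libclamav carries far more state). This is not the patch itself, just the shape of the algorithm: sort once by (node, pattern, partno), then build the `next_same` chains in a single linear pass.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the real AC pattern struct. */
struct pat {
    int node;              /* trie node this pattern hangs off */
    const char *key;       /* pattern bytes (a C string here, for brevity) */
    int partno;
    struct pat *next_same; /* chain of identical patterns on one node */
};

/* Step 2: sort by node, then pattern, then partno -- O(N log N). */
static int pat_cmp(const void *a, const void *b) {
    const struct pat *x = *(const struct pat *const *)a;
    const struct pat *y = *(const struct pat *const *)b;
    if (x->node != y->node)
        return x->node - y->node;
    int c = strcmp(x->key, y->key);
    if (c)
        return c;
    return x->partno - y->partno;
}

/* Step 3: one linear pass builds the next_same chains -- O(N).
 * After sorting, equal (node, pattern) entries are adjacent. */
static void link_same(struct pat **v, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        if (v[i]->node == v[i + 1]->node && !strcmp(v[i]->key, v[i + 1]->key))
            v[i]->next_same = v[i + 1];
        else
            v[i]->next_same = NULL;
    }
    if (n)
        v[n - 1]->next_same = NULL;
}

void build_chains(struct pat **v, size_t n) {
    qsort(v, n, sizeof(*v), pat_cmp);
    link_same(v, n);
}
```

The key property is that after the sort, every pattern that belongs on the same chain sits next to its neighbors, so no per-insert search is needed anymore.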
aCaB, This is awesome! Thanks so much for working on it. Mickey and I will both investigate and test this right away.

I have been working on ways to refactor the signature database for a few weeks, off and on. I had written up an email for the mailing list to describe my findings, but was holding off on sending it until I had a more concrete solution. I found 2 causes for slow signature load & scan time: the issue you're addressing in the patch, and another due to ignored signatures. The routines to ignore signatures (the .ign and .ign2 database files) are also extraordinarily inefficient. As the ignore list has grown, so has load time. Currently we ignore 1343 signatures. Pruning the ignored signatures from main.cvd and trimming the .ign/.ign2 ignore files will have an immediate load time benefit for most users. This will be updated very soon. I'd also like to optimize the code around ignoring signatures, but I see no reason why we can't use cdiffs to update main.cvd going forward, so optimizing the ignore feature may be a waste of time when we can just drop signatures from main.cvd instead.

To add more detail to your description of the load time issue for signatures with common subparts (what I refer to as the "common pattern prefix" issue): we found that a large number of signature patterns in .ldb and .ndb database files share the same first 3 or more bytes even though they have different content later. Specifically, about half of daily.ldb is affected by this issue. This common pattern prefix isn't very observable when it's only a few signatures. In higher quantities the effect is pronounced and is the primary cause of higher load times this past year.

I had hoped that I could "optimize" the signatures by separating the common pattern prefixes into unique patterns with a wildcard (the {#} and * special characters). I did some extensive testing on about 10% of the affected signatures using the malware samples those signatures were designed to detect.
In so doing, I realized that my "optimization" in fact defers the problem to the other list (the ->same_next pointer), which results in terrible scan times for files containing a large number of such patterns. Load time would go down a tonne, but scan time of files like Windows DLLs containing a large number of functions goes way up. I have come to the conclusion that there is no way to "optimize" the signature database without losing efficacy, short of a patch to libclamav.

This morning I had an idea to pre-allocate branches of the trie used to store known common patterns, which would be a band-aid we would likely have to adjust from time to time. I was just about to start coding that solution when you posted the above patch. Your patch sounds like a much better option. Regards, Micah
Created attachment 7595 [details] flame graph showing long function calls

I made the attached flame graph to show function run time during load. The flamegraph file should render in Chrome/Chromium. Note that about half of the time is attributed to insert_list (the issue addressed by aCaB's patch), and about half to cli_bm_scanbuff, called when checking whether signatures should be ignored.

I did some initial testing on my laptop to compare load & scan time for:
A) 0.101.4
B) 0.101.4 after removing daily.ign, daily.ign2
C) 0.101.4 after the patch
D) 0.101.4 after the patch + after removing daily.ign, daily.ign2

The scan target for my test was libclamav.dll (from 0.102.0, x64), chosen because it has many functions (many occurrences of the most common pattern prefix in daily.ldb). The results are:

A) 0.101.4:
load time: 67.14786195755005
scan time: 2.4904208183288574

B) 0.101.4 after removing daily.ign, daily.ign2:
load time: 42.72711801528931
scan time: 2.5286037921905518

C) 0.101.4 after the patch:
load time: 39.68687582015991
scan time: 2.8117897510528564

D) 0.101.4 after the patch + after removing daily.ign, daily.ign2:
load time: 15.633795738220215
scan time: 2.5904250144958496

In total, load time with both changes goes from 67.1 seconds down to 15.6 seconds. This is roughly identical to what I was seeing when I tried separating the common pattern prefixes with wildcards, which is excellent news. Scan time is not affected (for this file), which makes sense and is also excellent news.

Basic testing of an LDB database load under address sanitization (ASAN) went well. `make check VG=1` and Valgrind testing of freshclam and a basic clamscan also went well. I'm currently setting up a larger regression test (for efficacy) that will take a day or two to run, and will update the ticket when we have results.
(In reply to Micah Snyder from comment #5)
> I found 2 causes for slow signature load & scan time. The issue you're
> addressing in the patch, and another due to ignored signatures. The routines
> to ignore signatures (the .ign and .ign2 database files) are also
> extraordinarily inefficient.

Hi Micah, I didn't notice the ign issue as we repack all the sigs into a single cld with the ignored sigs already stripped off. From a quick look, chkign is also O(N^2): despite being efficient in that it uses BM for the lookup, it will always suffer from the quadratic complexity. I'd suggest using a cli_hashtable (IIRC cli_hashset only allows integer keys) for storing the ignored sigs, which should have O(1) access time.

> We found that large number of signature patterns in .ldb and .ndb database
> files share the same first 3 or more bytes though they have different
> content later. Specifically, about half of daily.ldb is affected by this
> issue. This "common pattern prefix" isn't very observable when it's only a
> few signatures. In higher quantities the effect is pronounced and is the
> primary cause of higher load times this past year.

Looking at scan time performance (i.e. disregarding the load time issue, which should be addressed by the patch), as a general rule the AC tries do a fairly good job at arranging entries. That said, if you are looking for nasty patterns, you should probably start from the many cases of AC patterns starting with "0000". AFAIR these tend to cause bad scantime slowdowns. Last time I checked, I think I counted about 500 of these.

> The results are:
>
> A) 0.101.4:
> load time: 67.14786195755005
> scan time: 2.4904208183288574
[...]
> C) 0.101.4 after the patch
> load time: 39.68687582015991
> scan time: 2.8117897510528564

Weird, I get a much better improvement from the patch... FWIW we don't load pua, ign, safebrowsing, phishing, nor re-based sigs, and at ~6.5M sigs the load time is reduced to about 15% (39 vs 6 seconds).
> Basic testing of an LDB database load under address sanitization (ASAN) went > well. `make check VG=1`, and Valgrind testing of freshclam and a basic > clamscan also went well. > > I'm currently setting up a larger regression test (for efficacy) that will > take a day or two to run, and will update the ticket when we have results. This is great news. Can't wait to see the results. BTW any comment on that len[0] thing?
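As a rough sketch of the hashtable suggestion above: replacing the per-signature BM scan with a hash lookup makes each ignore check O(1) on average. The code below is a toy fixed-size open-addressing string set, for illustration only; the real cli_hashtable in libclamav has a different API and handles growth and collisions more carefully.

```c
#include <string.h>

/* Toy fixed-size open-addressing string set. NBUCKETS is assumed to be a
 * power of two comfortably larger than the number of ignored sigs; a real
 * table would grow and handle a full table gracefully. */
#define NBUCKETS 1024

struct sigset {
    const char *slot[NBUCKETS];
};

static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void sigset_add(struct sigset *t, const char *name) {
    unsigned long i = djb2(name) & (NBUCKETS - 1);
    while (t->slot[i]) /* linear probing; assumes the table never fills */
        i = (i + 1) & (NBUCKETS - 1);
    t->slot[i] = name;
}

/* O(1) average lookup, vs. one BM scan over all loaded signatures. */
int sigset_has(const struct sigset *t, const char *name) {
    unsigned long i = djb2(name) & (NBUCKETS - 1);
    while (t->slot[i]) {
        if (!strcmp(t->slot[i], name))
            return 1;
        i = (i + 1) & (NBUCKETS - 1);
    }
    return 0;
}
```

With this, loading N signatures against an ignore list costs O(N) lookups total instead of N scans over the ignore set.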
(In reply to aCaB from comment #7)
> i didn't notice the ign issue as we repack all the sigs into a single cld
> with the ignored sigs already stripped off.
> From a quick look chkign is also O(N^2) and despite being efficient in that
> it uses BM for the lookup, it'll always suffer from the exponential
> complexity.
> I'd suggest using a cli_hashtable (IIRC cli_hashset only allows for integer
> keys) for storing the ignored sigs which should have O(1) access time.

cli_hashtable might be a good option, but it's hard to dedicate time to try it out if we can stop ignoring sigs and start dropping sigs instead. Of course, I also don't know how many community members curate large ignore lists. Hopefully that's not a thing.

> Looking at the scan time performance (i.e. disregarding the load time issue
> which should be addressed by the patch), as a general rule the AC tries do a
> fairly good job at arranging entries.
> That said, if you are looking at nasty patterns, you should probably start
> from the many cases of AC patterns starting with "0000". AFAIR these tend to
> cause bad scantime slowdowns. Last time i've checked i think i've counted
> about 500 of these.

"0000" anywhere in a signature has been a culprit in the past. At present, the two main offenders are:

Pattern prefix: 558bec - bytes for a Windows PE file function prolog, exceeding 25,000 patterns.

Pattern prefix: 56423521 - bytes relating to identifying VBA scripts (AFAIK), exceeding 10,000 patterns.

> > The results are:
> >
> > A) 0.101.4:
> > load time: 67.14786195755005
> > scan time: 2.4904208183288574
> [...]
> > C) 0.101.4 after the patch
> > load time: 39.68687582015991
> > scan time: 2.8117897510528564
>
> Weird, I get a much better improvement from the patch...
> FWIW we don't load pua, ign, safebrowsing, phishing, nor re-based sigs and
> at ~6.5M sigs the load time is reduced to about 15% (39 vs 6 seconds).
I would expect that if you pre-drop the ignored signatures, your time (on my laptop) would go from 42.7 seconds to 15.3 seconds, reduced to about 36%, not 15%. I'm definitely not seeing quite that much improvement, though it's still pretty fantastic.

> > Basic testing of an LDB database load under address sanitization (ASAN) went
> > well. `make check VG=1`, and Valgrind testing of freshclam and a basic
> > clamscan also went well.
> >
> > I'm currently setting up a larger regression test (for efficacy) that will
> > take a day or two to run, and will update the ticket when we have results.
>
> This is great news. Can't wait to see the results.

I ran a regression test against 5 million samples from our malware zoo. I'm seeing a lot of (over 5000) detection "changes", "false negatives", and "errors". I've downloaded a few of the affected samples to try to reproduce the issues. It seems to be fewer actual errors and more timeouts caused by extremely slow scan times for certain files. I suspect the slow scan times are related to the common pattern prefixes above. I previously suspected that the slow load time would also result in slow scan time because of really long linked lists attached to the A-C trie. It seems like that is the case with some of these files, and that after the patch the scan time has about doubled for some. Take for example this (signed, non-malicious) "outlook.exe" file.
Scanning this file before and after the patch results in a slower scan, but if you replace every instance of `;558bec` with `;` in daily.ldb, the scan time becomes a fraction of what it was: https://www.virustotal.com/gui/file/3f569882de1db39db8be06077737ba6828af44eac22e4aa1bccfcaaa4cf08e7d/details

Before patch, using just the subset of signatures containing `;558bec`: Time: 87.151 sec (1 m 27 s)
After patch, using just the subset of signatures containing `;558bec`: Time: 553.293 sec (9 m 13 s)
Before patch, using that subset of signatures with `;558bec` removed: Time: 5.415 sec (0 m 5 s)
After patch, using that subset of signatures with `;558bec` removed: Time: 5.081 sec (0 m 5 s)

As you can see, there's definitely something different about the structure of the A-C trie or lists causing significantly slower scan times for some files. aCaB, do you have any idea what might be going on? I can use Google Drive to share samples with you for testing.

> BTW any comment on that len[0] thing?

I agree it looks like it was a bug that's been around for at least 12 years. Changing it to [0] seems like the right thing. Comparing len would just check whether the two pointers were the same, not the actual length. Micah
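A minimal illustration of the `len` vs `len[0]` point above (the struct here is a hypothetical stand-in for the relevant field of the real pattern struct): since `len` is a `uint32_t *`, comparing `a->len == b->len` compares the pointers, which only match when both structs share the very same array.

```c
#include <stdint.h>

/* Minimal stand-in for the relevant field of the real pattern struct. */
struct pattern {
    uint32_t *len; /* array of lengths, one per pattern part */
};

/* Buggy form: compares the *pointers*, which are only equal when both
 * structs point at the very same array, not when the lengths match. */
int same_len_buggy(const struct pattern *a, const struct pattern *b) {
    return a->len == b->len;
}

/* Intended form: compares the first length value. */
int same_len_fixed(const struct pattern *a, const struct pattern *b) {
    return a->len[0] == b->len[0];
}
```

Two patterns with identical lengths but separately allocated arrays make the difference visible: the buggy check reports them as different.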
Hi Micah, the original patch wasn't well tested and in fact it contains a subtle bug which results in extra "next" branches being created in place of "next_same" entries, which in turn results in extra scan time. I did notice the slowdown too and wrote a tree dumper and comparison script. I'll run some perf tests tonight and provide you with a final patch tomorrow, as well as the trie dumping patch and the small Python thing to compare them. Sorry for the trouble and thanks again for the tests.
(In reply to aCaB from comment #9) > Hi Micah, > > the original patch wasn't much tested and in fact it contains a subtle bug > which results in extra "next" branches being created in place of "next_same" > entries, which results in extra scan time. > > i did notice the slowdown too and wrote a tree dumper and comparison script. > > i'll do some perf test tonight and provide you with a final patch tomorrow - > as well as the trie dumping patch and the small python thing to compare them. > > sorry for the trouble and thanks again for the tests. No problem! That's really cool that you've found the source of the issue so fast. Your work is much appreciated. Once I have the update, I'll re-run testing over here.
Created attachment 7596 [details] proper AC load patch

Hi Micah, sorry it took longer than expected, but after fixing the bug in the previous patch I was still left with a huge performance loss. Figuring that out required a long investigation and writing code to dump and compare the tries. The frustrating part was that with the bug fixed the tries looked functionally equivalent (i.e. they both satisfied the matcher requirements in terms of anchoring and order), yet the one without the patch was a lot more performant.

In the end the culprit was memory, in particular the effect of caching and the cost of faults. The solution was to simply avoid scattering close patterns apart. By doing that I can achieve better-than-before scan times (in a more scientific/deterministic way, too) while maintaining linear load times.

Please consider this patch semi-final. I'll be testing it further, but I'm relatively confident now. Please let me know of any issues you may encounter.
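The locality point above can be illustrated with a toy example (nothing below is libclamav code): two chains built from the same logical sequence, one laid out contiguously in a pool and one scattered across it, are functionally identical; only their memory locality, and hence cache and fault behavior during traversal, differs.

```c
#include <stddef.h>

struct entry {
    int value;
    struct entry *next;
};

/* Build an n-entry chain where logical node i lives at pool[order[i]].
 * An identity order gives a contiguous, cache-friendly layout; a permuted
 * order scatters logically adjacent entries across memory. Both chains
 * traverse identically -- only locality differs. */
struct entry *build_chain(struct entry *pool, const size_t *order, size_t n) {
    for (size_t i = 0; i < n; i++) {
        pool[order[i]].value = (int)i;
        pool[order[i]].next = (i + 1 < n) ? &pool[order[i + 1]] : NULL;
    }
    return n ? &pool[order[0]] : NULL;
}

long chain_sum(const struct entry *e) {
    long s = 0;
    for (; e; e = e->next)
        s += e->value;
    return s;
}
```

On lists the size of the AC pattern lists, the scattered layout touches far more cache lines and pages per traversal, which is consistent with the slowdown described above even though the tries compared as functionally equivalent.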
Created attachment 7597 [details] tree dumping

Hi, this is the patch and Python script I've used to check the tries. After compiling with the patch applied, run `clamscan --no-summary /nosuchfile > trie.dump`
(In reply to aCaB from comment #11) > Created attachment 7596 [details] > proper AC load patch > > Hi Micah, > sorry it took longer than expected but after fixing the bug in the previous > patch i was still left with a huge performance loss. > > Figuring that out required a long investigation and writing code to dump and > compare the tries. > The frustrating part was that with the bug fixed the tries looked > functionally equivalent (i.e. they both satisfied the matcher requirements > in terms of anchoring and order), yet the one without the patch was a lot > more performant. > In the end the culprit was the memory, in particular the effect of caching > and the result of faults. The solution was to simply avoid scattering close > patterns apart. > By doing that i can achieve better-than-before scan times (in a more > scientific/deterministic way too) while maintaining linear load times. > Please consider this patch semi-final. I'll be testing it further but i'm > relatively confident now. > Also please let me know of any issue you may encounter. That sounds outstanding. Thanks for the in depth investigation for the fix. I'll begin testing your updated patch immediately. Micah
aCaB, Sorry about the long delay getting back to you with test results. There was a significant reduction in scan timeout issues after the second patch, but for reasons unknown to me, engine B (post-patch) still had a handful of samples where it exceeded the scan timeout in our regression test system while engine A (pre-patch) did not. I downloaded and tested a handful of these on macOS, Ubuntu, and FreeBSD (said regression tests run on FreeBSD) to see what the load and scan time differences were like before and after the patch. I found no significant or consistent difference in scan time between pre- and post-patch scans on any of the samples on any of the above operating systems. I also ran the Valgrind and address sanitizer tests as before with no issues. Load time as before is reduced by about the same amount (which is awesome). I'm looking forward to merging this patch. I have already ported it to branches for 0.102.1 and 0.103 (in addition to 0.101.5) and opened internal pull requests to be reviewed and merged by a team member.
Hi Micah, hi aCaB, I'd like to rename the bug title to "Speedup Aho-Corasick for DB parsing" or some similar title. OK? I think it's great that you are fixing it this way. My effort on pure in-memory caching got stuck a few weeks ago: I found that I would need to write out the memory of the compiled regex (re-lib) structures, and that introduced some difficulty. I've also seen during testing that my PoC doesn't speed things up as much as I had assumed, so I'm not sure it would be a big step in the end. And overall I've lost time by needing to work on something else. Micah, can you please upload a complete patch for the 0.102 branch? I would like to give it a try. Thanks for your work. Markus
Created attachment 7610 [details] 0.102 aCaB's fast loading patch Sure makes sense to me. Here's the 0.102 variant of aCaB's patch.
We just merged this patch into dev/0.103, here: https://github.com/Cisco-Talos/clamav-devel/commit/3dd506a7eecf07d5642072bbdb088fb06b8097ee It has also been merged into the private branches where we're prepping the 0.102.1 and 0.101.5 security patch releases to be published in the next week or two.