Swift: Improve SensitiveExprs.qll Heuristics #13354
Conversation
DCA shows a bit of a slowdown, but it doesn't appear to be focused on the …

The new run appears to be a bit clearer, showing a genuine (?) slowdown of the …

The performance issue is a bit of a mystery, as the only thing I've changed is some regular expressions, and the change doesn't look structurally very significant. We will find more matches, and that is expected to slow things down a little. The other thing that stood out was the … expression term. I will merge main and do another DCA run...

Still a 19% analysis slowdown, and still the …

Further investigation:
I'm running a DCA experiment with some …

On the latest DCA run the slowdown has reduced to 15.3%, and the … I have just added a regex term for an alternative spelling of "licence" ("license"), so I'll kick off another DCA run, both to check that and to build confidence in where we are right now. I'm ready to hear someone else's opinion on the slowdown at this point.

Latest DCA run shows a 16.7% slowdown, with no particularly obvious blame.
I think the slowdown is coming from the cartesian product in the characteristic predicate for `SensitiveVarDecl`:

```
[2023-07-11 17:24:19] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e in 30877ms (size: 7106).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e with tuple counts:
8030520  ~3% {4} r1 = JOIN VarDecl#914e0d1e::Generated::VarDecl::getName#0#dispred#ff WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
   7106  ~0% {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.1,Lhs.3
   7106  ~0% {2} r3 = SCAN r2 OUTPUT In.0, In.2
return r3
```

and on the HEAD of this branch it takes ~50 seconds:

```
[2023-07-11 15:15:39] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 in 52220ms (size: 7806).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 with tuple counts:
1546962 ~17% {2} r1 = SCAN var_decls OUTPUT In.0, In.1
1338420  ~1% {2} r2 = STREAM DEDUP r1
1338420  ~4% {2} r3 = JOIN r2 WITH Synth#5f134a93::Synth::convertVarDeclToRaw#1#ff_10#join_rhs ON FIRST 1 OUTPUT Lhs.1, Rhs.1
8030520  ~0% {4} r4 = JOIN r3 WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
   7814  ~2% {4} r5 = JOIN r4 WITH PRIMITIVE regexpMatch#bb ON Lhs.0,Lhs.3
   7814  ~0% {2} r6 = SCAN r5 OUTPUT In.1, In.2
return r6
```

Since this problem is already present on …
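For intuition about those tuple counts, here is a hypothetical Python analogue (the names and patterns are made up, and QL's evaluator is of course not Python): the `CARTESIAN PRODUCT` followed by `regexpMatch` applies every regexp to every name, so the work is proportional to names × regexps even though the final result is tiny.

```python
import re

# Hypothetical stand-ins for the real inputs: declared variable names
# and the sensitive-data regexps (not the actual patterns from the PR).
names = ["password", "widget_age", "image", "user_account"]
regexps = [r".*pass(wd|word|code|phrase).*", r".*account.*"]

# Cross product: every name is tested against every regexp, mirroring
# the CARTESIAN PRODUCT followed by regexpMatch in the RA above.
matches = [(n, r) for n in names for r in regexps if re.match(r, n)]

# len(names) * len(regexps) regexp applications, for a small result.
print(matches)
```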
Hmm. In principle this (what we do on …) looks like what we want: to test each variable name against each of the (currently exactly 5) sensitive data regexps. I do have plans to combine some of them, but since they're shared with other languages it will have to be done carefully. This PR doesn't actually change that count, only the complexity of two of those regexps, so I'm quite surprised evaluation has changed. I will look deeper...
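The "combine some of them" idea could, for instance, fold several patterns into one alternation so that each name is scanned once rather than once per pattern. A minimal Python sketch with made-up fragments (the real shared regexps are not shown in this thread):

```python
import re

# Made-up fragments standing in for a few of the shared sensitive-data regexps.
fragments = [
    r"pass(?:wd|word|code|phrase)",
    r"secret",
    r"credit.?card",
]

# One combined regexp: each name is tested once, not once per fragment.
combined = re.compile("|".join(f"(?:{f})" for f in fragments))

names = ["passphrase", "widget_age", "my_secret_token"]
sensitive = [n for n in names if combined.search(n)]
print(sensitive)
```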
I've just pushed a fix that substantially improves performance by forcing the sensitive expression checks to occur in an organized fashion. I'll merge main; a new DCA run to follow...
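The actual QL fix isn't shown in this excerpt. As a general illustration of "organizing" such checks, one common shape is to pay the regexp cost once per *distinct* name and then join cheaply back to the many declarations sharing each name (a hypothetical Python sketch, not the PR's code):

```python
import re

# Hypothetical declarations: (location, name) pairs, with repeated names.
decls = [("A.swift", "password"), ("B.swift", "password"), ("C.swift", "count")]
pattern = re.compile(r"pass(?:wd|word|code|phrase)")

# Stage 1: expensive regexp work over the small set of distinct names only.
distinct_names = {name for _, name in decls}
sensitive_names = {n for n in distinct_names if pattern.search(n)}

# Stage 2: cheap membership test (effectively a hash join) back to declarations.
sensitive_decls = [(loc, name) for loc, name in decls if name in sensitive_names]
print(sensitive_decls)
```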
First DCA run after the fix: analysis was 36% faster (than baseline, that is; it's an even bigger improvement over the earlier version of this PR). Second DCA run after the fix (with 2 x …) … Ready to merge, I think.
Improvements to the `SensitiveExprs.qll` private information heuristics. I've been collecting a list of ideas for this for a while, though I had to filter them down considerably to just the ones that will be reliable (short strings and general terms tend to have too many false matches, e.g. "age" would match "widget_age" and "image", neither of which is likely to be private information). This has been tested fairly extensively on MRVA alongside #13190 and (less extensively) on its own. And I'll do a DCA run...
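To see why short, general terms were filtered out of the heuristics: a plain substring match for "age" fires on names that are clearly not private data. A small Python check with made-up names:

```python
import re

# Names containing "age" as a substring; none of these is private information,
# which is why a bare "age" term was dropped from the heuristic list.
candidates = ["widget_age", "image", "message_count"]
hits = [n for n in candidates if re.search("age", n)]

# All three match, so the term is far too noisy to keep.
print(hits)
```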