Skip to content

Swift: Improve SensitiveExprs.qll Heuristics#13354

Merged
geoffw0 merged 13 commits intogithub:mainfrom
geoffw0:sharedsensitive2
Jul 17, 2023
Merged

Swift: Improve SensitiveExprs.qll Heuristics#13354
geoffw0 merged 13 commits intogithub:mainfrom
geoffw0:sharedsensitive2

Conversation

@geoffw0
Copy link
Copy Markdown
Contributor

@geoffw0 geoffw0 commented Jun 2, 2023

Improvements to the SensitiveExprs.qll private information heuristics. I've been collecting a list of ideas for this for a while, though I had to filter them down considerably to just ones that will be reliable (short strings and general terms tend to have too many false matches, e.g. "age" would match "widget_age" and "image", neither of which are likely to be private information). This has been tested on MRVA fairly extensively alongside #13190 and (less extensively) on it's own. And I'll do a DCA run...

@geoffw0 geoffw0 added the Swift label Jun 2, 2023
@geoffw0 geoffw0 requested a review from a team as a code owner June 2, 2023 09:53
@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 5, 2023

DCA shows a bit of a slowdown but it doesn't appear to be focussed on the SensitiveExprs.qll library or queries that use it. I think it's just wobble, but I'll run it again to be safe.

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 6, 2023

The new run appears to be a bit clearer - showing a genuine (?) slowdown of the SensitiveExprs.qll using queries on DanielZSY__RxCommonKit, signalapp__Signal-iOS and Thomvis__Construct (see the "Stage timings" table), fitting with an overall 27% increase in analysis time. Even if we're finding more sensitive expressions on those projects this is a bigger slowdown than I'd expect. I need to investigate...

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 19, 2023

The performance issue is a bit of a mystery as the only thing I've changed is some regular expressions, and the change doesn't look structurally very significant. We will find more matches, and that is expected to slow things down a little. The other thing that stood out was the expression term e.?mail, which might cause excessive work every time an e is found with any other character after it. I've updated this to email|e_mail (I believe _ is the only punctuation mark that's allowed in Swift variable names).

I will merge main and do another DCA run...

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 21, 2023

Still a 19% analysis slowdown and still the SensitiveExprs.qll queries (generally named swift/cleartext-*) are showing up on stage timings. And I can reproduce the slowdown locally, even in projects with no new sensitive expressions...

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 21, 2023

Further investigation:

  • the slowness is caused by the sheer number of terms in the regexp alternation * the large number of potential sensitive entities in the database. There is no especially expensive term. I am surprised regex alternation scales poorly TBH.
  • the high % slowdown is in part because the rest of Swift analysis is still really quite fast at the moment.

I'm running a DCA experiment with some cached annotations added, which has improved matters locally.

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 22, 2023

On the latest DCA run the slowdown has reduced to 15.3%, and the swift/cleartext-* queries are now associated with a speedup, rather than a slowdown, in stage timing. There are some new stage timing rows for a few parts of the SensitiveExprs library (that are now cached), but they don't appear to add up to much. I'm not sure what the remaing 15.3% really is.

I have just added a regex term for an alternative spelling of "licence" ("license"). so I'll kick off another DCA run both to check that and to build confidence in where we are right now.

I'm ready to hear someone else's opinion on the slowdown at this point.

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jun 22, 2023

Latest DCA run shows 16.7% slowdown, no particularly obvious blame.

@MathiasVP
Copy link
Copy Markdown
Contributor

MathiasVP commented Jul 11, 2023

I think the slowdown is coming from the cartesian product in the characteristic predicate for SensitiveVarDecl. On the base branch this takes ~30 seconds on my machine:

[2023-07-11 17:24:19] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e in 30877ms (size: 7106).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e with tuple counts:
  8030520  ~3%    {4} r1 = JOIN VarDecl#914e0d1e::Generated::VarDecl::getName#0#dispred#ff WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
     7106  ~0%    {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.1,Lhs.3
     7106  ~0%    {2} r3 = SCAN r2 OUTPUT In.0, In.2
                  return r3

and on the HEAD of this branch it takes ~50 seconds:

[2023-07-11 15:15:39] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 in 52220ms (size: 7806).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 with tuple counts:
  1546962  ~17%    {2} r1 = SCAN var_decls OUTPUT In.0, In.1
  1338420   ~1%    {2} r2 = STREAM DEDUP r1
  1338420   ~4%    {2} r3 = JOIN r2 WITH Synth#5f134a93::Synth::convertVarDeclToRaw#1#ff_10#join_rhs ON FIRST 1 OUTPUT Lhs.1, Rhs.1
  8030520   ~0%    {4} r4 = JOIN r3 WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
     7814   ~2%    {4} r5 = JOIN r4 WITH PRIMITIVE regexpMatch#bb ON Lhs.0,Lhs.3
     7814   ~0%    {2} r6 = SCAN r5 OUTPUT In.1, In.2
                   return r6

Since this problem is already present on main we probably don't need to let that block this PR. But we should fix it before coming up with more sensitive expression heuristics as this is a ticking bomb.

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jul 12, 2023

Hmm. In principle this (what we do on main):

JOIN VarDecl#914e0d1e::Generated::VarDecl::getName#0#dispred#ff WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1

looks like what we want - to test each variable name with each of the (currently exactly 5) sensitive data regexps. I do have plans to combine some of them, but since they're shared with other languages it will have to be done carefully. This PR doesn't actually change that count, only the complexity of two of those regexps, so I'm quite surprised evaluation has changed. I will look deeper...

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jul 12, 2023

I've just pushed a fix that substantially improves performance, by forcing the sensitive expression checks to occur in an organized fashion. Merge main and new DCA run to follow...

Comment thread swift/ql/lib/codeql/swift/security/SensitiveExprs.qll Outdated
@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented Jul 14, 2023

First DCA run after the fix - analysis was 36% faster (than baseline that is; it's an even bigger improvement from the earlier version of this PR).

Second DCA run after the fix (with 2 x cached removed) - analysis was 34% faster. One build failed, as they sometimes do. Locally I didn't observe any difference in performance with cached removed, so I interpret the difference between 36% and 34% as just wobble.

Ready to merge, I think.

Copy link
Copy Markdown
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@geoffw0 geoffw0 merged commit 69b98c7 into github:main Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants