Swift: Improve SensitiveExprs.qll Heuristics by geoffw0 · Pull Request #13354 · github/codeql

geoffw0 · 2023-06-02T09:52:59Z

Improvements to the SensitiveExprs.qll private information heuristics. I've been collecting a list of ideas for this for a while, though I had to filter them down considerably to just ones that will be reliable (short strings and general terms tend to have too many false matches, e.g. "age" would match "widget_age" and "image", neither of which are likely to be private information). This has been tested on MRVA fairly extensively alongside #13190 and (less extensively) on it's own. And I'll do a DCA run...

geoffw0 · 2023-06-05T12:19:19Z

DCA shows a bit of a slowdown but it doesn't appear to be focussed on the SensitiveExprs.qll library or queries that use it. I think it's just wobble, but I'll run it again to be safe.

geoffw0 · 2023-06-06T17:03:01Z

The new run appears to be a bit clearer - showing a genuine (?) slowdown of the SensitiveExprs.qll using queries on DanielZSY__RxCommonKit, signalapp__Signal-iOS and Thomvis__Construct (see the "Stage timings" table), fitting with an overall 27% increase in analysis time. Even if we're finding more sensitive expressions on those projects this is a bigger slowdown than I'd expect. I need to investigate...

geoffw0 · 2023-06-19T15:23:00Z

The performance issue is a bit of a mystery as the only thing I've changed is some regular expressions, and the change doesn't look structurally very significant. We will find more matches, and that is expected to slow things down a little. The other thing that stood out was the expression term e.?mail, which might cause excessive work every time an e is found with any other character after it. I've updated this to email|e_mail (I believe _ is the only punctuation mark that's allowed in Swift variable names).

I will merge main and do another DCA run...

geoffw0 · 2023-06-21T08:27:37Z

Still a 19% analysis slowdown and still the SensitiveExprs.qll queries (generally named swift/cleartext-*) are showing up on stage timings. And I can reproduce the slowdown locally, even in projects with no new sensitive expressions...

geoffw0 · 2023-06-21T16:59:32Z

Further investigation:

the slowness is caused by the sheer number of terms in the regexp alternation * the large number of potential sensitive entities in the database. There is no especially expensive term. I am surprised regex alternation scales poorly TBH.
the high % slowdown is in part because the rest of Swift analysis is still really quite fast at the moment.

I'm running a DCA experiment with some cached annotations added, which has improved matters locally.

geoffw0 · 2023-06-22T07:57:38Z

On the latest DCA run the slowdown has reduced to 15.3%, and the swift/cleartext-* queries are now associated with a speedup, rather than a slowdown, in stage timing. There are some new stage timing rows for a few parts of the SensitiveExprs library (that are now cached), but they don't appear to add up to much. I'm not sure what the remaing 15.3% really is.

I have just added a regex term for an alternative spelling of "licence" ("license"). so I'll kick off another DCA run both to check that and to build confidence in where we are right now.

I'm ready to hear someone else's opinion on the slowdown at this point.

geoffw0 · 2023-06-22T15:48:58Z

Latest DCA run shows 16.7% slowdown, no particularly obvious blame.

MathiasVP · 2023-07-11T16:32:41Z

I think the slowdown is coming from the cartesian product in the characteristic predicate for SensitiveVarDecl. On the base branch this takes ~30 seconds on my machine:

[2023-07-11 17:24:19] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e in 30877ms (size: 7106).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@76ec8b1e with tuple counts:
  8030520  ~3%    {4} r1 = JOIN VarDecl#914e0d1e::Generated::VarDecl::getName#0#dispred#ff WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
     7106  ~0%    {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.1,Lhs.3
     7106  ~0%    {2} r3 = SCAN r2 OUTPUT In.0, In.2
                  return r3

and on the HEAD of this branch it takes ~50 seconds:

[2023-07-11 15:15:39] Evaluated non-recursive predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 in 52220ms (size: 7806).
Evaluated relational algebra for predicate SensitiveExprs#7562cd45::SensitiveVarDecl#ff@3934bbu8 with tuple counts:
  1546962  ~17%    {2} r1 = SCAN var_decls OUTPUT In.0, In.1
  1338420   ~1%    {2} r2 = STREAM DEDUP r1
  1338420   ~4%    {2} r3 = JOIN r2 WITH Synth#5f134a93::Synth::convertVarDeclToRaw#1#ff_10#join_rhs ON FIRST 1 OUTPUT Lhs.1, Rhs.1
  8030520   ~0%    {4} r4 = JOIN r3 WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
     7814   ~2%    {4} r5 = JOIN r4 WITH PRIMITIVE regexpMatch#bb ON Lhs.0,Lhs.3
     7814   ~0%    {2} r6 = SCAN r5 OUTPUT In.1, In.2
                   return r6

Since this problem is already present on main we probably don't need to let that block this PR. But we should fix it before coming up with more sensitive expression heuristics as this is a ticking bomb.

geoffw0 · 2023-07-12T17:48:39Z

Hmm. In principle this (what we do on main):

JOIN VarDecl#914e0d1e::Generated::VarDecl::getName#0#dispred#ff WITH SensitiveExprs#7562cd45::SensitiveDataType::getRegexp#0#dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1

looks like what we want - to test each variable name with each of the (currently exactly 5) sensitive data regexps. I do have plans to combine some of them, but since they're shared with other languages it will have to be done carefully. This PR doesn't actually change that count, only the complexity of two of those regexps, so I'm quite surprised evaluation has changed. I will look deeper...

…ation).

geoffw0 · 2023-07-12T19:46:03Z

I've just pushed a fix that substantially improves performance, by forcing the sensitive expression checks to occur in an organized fashion. Merge main and new DCA run to follow...

geoffw0 · 2023-07-14T15:12:27Z

First DCA run after the fix - analysis was 36% faster (than baseline that is; it's an even bigger improvement from the earlier version of this PR).

Second DCA run after the fix (with 2 x cached removed) - analysis was 34% faster. One build failed, as they sometimes do. Locally I didn't observe any difference in performance with cached removed, so I interpret the difference between 36% and 34% as just wobble.

Ready to merge, I think.

MathiasVP

LGTM!

geoffw0 added 3 commits June 2, 2023 10:42

Swift: Add some more test cases.

bc57e46

Swift: Add more heuristic expressions.

00cd347

Swift: Change note.

d7e4c5f

geoffw0 added the Swift label Jun 2, 2023

geoffw0 requested a review from a team as a code owner June 2, 2023 09:53

geoffw0 added 2 commits June 2, 2023 12:02

Swift: Autoformat.

8e8696a

Swift: Put the change note in the right place.

14d1933

github-actions Bot added the documentation label Jun 2, 2023

Swift: Attempt to improve performance.

b50801b

Merge branch 'main' into sharedsensitive2

2ba2315

geoffw0 added 2 commits June 21, 2023 16:17

Swift: Small possible improvements.

bc16fb5

Swift: Add some cached annotations for sensitive exprs.

2b99344

Swift: License/licence.

f52f450

Swift: Compute sensitive strings centrally (much more efficient evalu…

6d4c831

…ation).

Merge branch 'main' into sharedsensitive2

0bd522a

MathiasVP reviewed Jul 13, 2023

View reviewed changes

Comment thread swift/ql/lib/codeql/swift/security/SensitiveExprs.qll Outdated

Swift: Remove 'cached' annotations.

1274393

MathiasVP approved these changes Jul 17, 2023

View reviewed changes

geoffw0 merged commit 69b98c7 into github:main Jul 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Swift: Improve SensitiveExprs.qll Heuristics#13354

Swift: Improve SensitiveExprs.qll Heuristics#13354
geoffw0 merged 13 commits intogithub:mainfrom
geoffw0:sharedsensitive2

geoffw0 commented Jun 2, 2023

Uh oh!

geoffw0 commented Jun 5, 2023

Uh oh!

geoffw0 commented Jun 6, 2023

Uh oh!

geoffw0 commented Jun 19, 2023

Uh oh!

geoffw0 commented Jun 21, 2023

Uh oh!

geoffw0 commented Jun 21, 2023

Uh oh!

geoffw0 commented Jun 22, 2023

Uh oh!

geoffw0 commented Jun 22, 2023

Uh oh!

MathiasVP commented Jul 11, 2023 •

edited

Loading

Uh oh!

geoffw0 commented Jul 12, 2023

Uh oh!

geoffw0 commented Jul 12, 2023 •

edited

Loading

Uh oh!

Uh oh!

geoffw0 commented Jul 14, 2023

Uh oh!

MathiasVP left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

geoffw0 commented Jun 2, 2023

Uh oh!

geoffw0 commented Jun 5, 2023

Uh oh!

geoffw0 commented Jun 6, 2023

Uh oh!

geoffw0 commented Jun 19, 2023

Uh oh!

geoffw0 commented Jun 21, 2023

Uh oh!

geoffw0 commented Jun 21, 2023

Uh oh!

geoffw0 commented Jun 22, 2023

Uh oh!

geoffw0 commented Jun 22, 2023

Uh oh!

MathiasVP commented Jul 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geoffw0 commented Jul 12, 2023

Uh oh!

geoffw0 commented Jul 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

geoffw0 commented Jul 14, 2023

Uh oh!

MathiasVP left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MathiasVP commented Jul 11, 2023 •

edited

Loading

geoffw0 commented Jul 12, 2023 •

edited

Loading