Python: Clean up LocalSourceNode charpred#6204
Conversation
This gets rid of a large number of nodes that seemingly have no impact.
This results in the same set of nodes, but is a bit more clear about the reasons why. For instance, `ModuleVariableNode`s are included directly, and not in a roundabout way by virtue of not having flow to them. This should hopefully be a bit more robust as well.
|
Unsurprisingly, this seems to be a bigger improvement on bigger databases. Detailed results here |
RasmusWL
left a comment
There was a problem hiding this comment.
Nice 👍
There is a little thing around ModuleVariableNode that I would like to understand better though.
| // Module variable nodes must be local source nodes, otherwise type trackers cannot step through | ||
| // them. | ||
| this instanceof ModuleVariableNode | ||
| or | ||
| // We explicitly include any read of a global variable, as some of these may have local flow going | ||
| // into them. | ||
| this = any(ModuleVariableNode mvn).getARead() |
There was a problem hiding this comment.
Thanks for adding these comments, makes it much easier 👍
Are there any instances where having ModuleVariableNode provides any value instead of just having any(ModuleVariableNode mvn).getARead()? That is, could we write this part as:
// Module variable nodes must be local source nodes, otherwise type trackers cannot
// step through them. We explicitly include any read of a global variable, as some
// of these may have local flow going into them.
this = any(ModuleVariableNode mvn).getARead()I tried running the following query on a few databases I had lying around (salt, nova, peewee), but it did not give any results.
import python
import semmle.python.dataflow.new.internal.LocalSources
import semmle.python.dataflow.new.internal.DataFlowPublic
private import semmle.python.dataflow.new.internal.DataFlowPrivate
from ModuleVariableNode mvn, Node flowsToNode
where
not mvn = flowsToNode and
mvn.(LocalSourceNode).flowsTo(flowsToNode) and
not mvn.getARead().(LocalSourceNode).flowsTo(flowsToNode)
select mvn, flowsToNodeI guess I could just try to run our tests with that change, so I've just started doing that 😛
There was a problem hiding this comment.
I just tried out running the tests, and some of them failed. For example
and the query above doesn't give any results either, so clearly that query is not good enough 😓
There was a problem hiding this comment.
The trouble arises when you want to type track from some_global = tracked to sink(some_global). In practice this happens in two steps: first we go from tracked to ModuleVariableNode for some_global and then from this node to the read.
However, since type trackers (in their usual formulation) step from LocalSourceNode to LocalSourceNode, we'll lose the first step unless the ModuleVariableNode is also a LocalSourceNode.
On the other hand, we should eventually be able to get rid of the getARead bit. It's only there right now because we have local flow into global variables, which I think we should get rid of (but we do want it in the global scope, so this is slightly tricky).
For instance we currently have local flow from source to sink in the following bit of code:
def make_safe():
global g
g = NON_SOURCE
def foo():
global g
g = SOURCE
make_safe()
sink(g)Now, we'll actually have this regardless (unless we put some effort into tracking what globals are modified by what functions, which we might), since there will be the aforementioned step through the module variable node. But we shouldn't have local flow here.
I think based on this discussion that I should maybe expand the comment for ModuleVariableNode a bit. 😅
There was a problem hiding this comment.
#6218 (which is based on this branch) clarifies this situation a bit by creating a specific TypeTrackingNode class that encompasses both LocalSourceNode and ModuleVariableNode. That way, we don't have the somewhat confusing inclusion of ModuleVariableNode as a LocalSourceNode.
There was a problem hiding this comment.
Should this PR be closed then, and we focus on the other one instead?
There was a problem hiding this comment.
I think it makes more sense to merge this one as is (assuming you're happy with it), as the changes it makes have been tested (performance and otherwise), whereas the performance tests for the other PR are still ongoing. Also, the other changes will affect the shared type tracking library, so are perhaps a bit more controversial.
This should be a straight-up improvement where performance is concerned, though in practice many of the removed nodes did not have any flow to begin with.
I'll be running a detailed evaluation to see if it actually improves performance noticeably. (Overall it seems promising: https://github.com/dsp-testing/tausbn-dca/issues/42)