Why Data Flow Analysis Is the Gold Standard for Vulnerability Detection
Most static analysis tools start with pattern matching. They search the AST for known dangerous patterns: a call to Runtime.exec() with a string concatenation argument, or an SQL query built with + instead of parameterized statements. This approach works for the trivial cases. The problem is that trivial cases represent a fraction of real-world vulnerabilities. The bugs that survive code review, that persist through multiple release cycles, and that end up in CVE databases are the ones where the danger is not locally visible.
The Limits of Pattern Matching
Pattern-based detection operates on syntactic structure. It sees tokens and tree shapes. Consider a straightforward SQL injection pattern:
```java
String query = "SELECT * FROM users WHERE id = " + request.getParameter("id");
Statement stmt = connection.createStatement();
stmt.executeQuery(query);
```

A regex or AST pattern can flag this easily. But now refactor the code slightly:
```java
public class UserRepository {
    private final QueryBuilder builder;

    public User findById(String identifier) {
        return builder.execute(
            builder.select("users").where("id", identifier)
        );
    }
}

// In the controller, three files away:
String userId = request.getParameter("id");
User user = userRepository.findById(userId);
```

The vulnerability is identical -- unsanitized user input reaches a database query. But now the tainted data crosses a method boundary, passes through a constructor-injected dependency, and enters a builder pattern. No single line of code looks dangerous. A pattern matcher sees builder.select() and has no idea whether identifier came from user input or a hardcoded constant.
What Data Flow Analysis Actually Does
Data flow analysis -- specifically taint propagation analysis -- solves this by tracking the lifecycle of data values through the entire program. The process involves three core concepts:
- Sources: Entry points where untrusted data enters the application. HTTP request parameters, file reads, database results from external systems, deserialized objects, environment variables.
- Sinks: Operations where tainted data becomes dangerous. SQL execution, OS command execution, file path construction, HTML rendering, LDAP queries, XML parsing.
- Propagators: Operations that transfer taint from one variable to another. String concatenation, collection operations, method return values, field assignments, type conversions.
The analysis engine builds a graph of all possible data flows from every source to every sink. When a path exists from a source to a sink without passing through a sanitizer, that path represents a potential vulnerability. This is the fundamental approach behind CWE-89 (SQL Injection), CWE-78 (OS Command Injection), CWE-79 (Cross-Site Scripting), and the entire family of injection vulnerabilities cataloged by MITRE.
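The source/propagator/sink mechanics can be sketched in a few lines. This is a toy: real engines operate on an intermediate representation with control flow, not a flat statement list, and the "source"/"assign"/"sink" encoding below is an assumption made for brevity.

```java
import java.util.*;

// Minimal taint propagation over straight-line code in a toy three-address form.
public class TaintSketch {
    // Statement kinds: {"source", x}  -> x receives untrusted input
    //                  {"assign", x, y} -> x := f(y), taint propagates y -> x
    //                  {"sink", x}    -> x reaches a dangerous operation
    public static List<String> analyze(List<String[]> stmts) {
        Set<String> tainted = new HashSet<>();
        List<String> findings = new ArrayList<>();
        for (String[] s : stmts) {
            switch (s[0]) {
                case "source": tainted.add(s[1]); break;              // taint enters
                case "assign": if (tainted.contains(s[2])) tainted.add(s[1]); break;
                case "sink":   if (tainted.contains(s[1]))
                                   findings.add("tainted value reaches sink: " + s[1]);
                               break;
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        List<String[]> program = List.of(
            new String[]{"source", "id"},          // id = request.getParameter("id")
            new String[]{"assign", "query", "id"}, // query = "SELECT ..." + id
            new String[]{"sink", "query"});        // stmt.executeQuery(query)
        System.out.println(analyze(program));      // [tainted value reaches sink: query]
    }
}
```

Everything in the rest of this article -- interprocedural analysis, sanitizer modeling, summaries -- is about scaling this core idea from a flat statement list to a real program.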
Interprocedural Analysis: Following Data Across Boundaries
The critical capability that separates serious static analysis from grep-with-extra-steps is interprocedural analysis. Real applications do not contain vulnerabilities within a single function. Data flows through:
- Method calls: Arguments become parameters in the callee. Return values flow back to the caller.
- Object fields: A tainted value stored in this.name in one method is still tainted when read in another method of the same class.
- Inheritance hierarchies: A method override in a subclass may introduce a sink that the base class never anticipated.
- Callback patterns: Data passed to a lambda or functional interface flows into whatever implementation is bound at runtime.
- Framework conventions: Spring MVC maps @RequestParam annotations to HTTP parameters. The analysis must understand that a method parameter annotated this way is a source.
Building an accurate call graph is one of the hardest problems in static analysis. For languages with dynamic dispatch (virtual methods in Java, duck typing in Python, prototype chains in JavaScript), the analysis must approximate which concrete method will be invoked at each call site. Techniques like Class Hierarchy Analysis (CHA), Rapid Type Analysis (RTA), and points-to analysis each offer different precision/performance tradeoffs.
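Class Hierarchy Analysis, the simplest of these techniques, can be shown in miniature: a virtual call may dispatch to any override in the subtree of the receiver's declared type. The tiny Repository hierarchy below is a made-up example, not a real framework's structure.

```java
import java.util.*;

// Class Hierarchy Analysis (CHA) in miniature: resolve a virtual call
// to every override reachable from the receiver's declared type.
public class ChaSketch {
    static Map<String, List<String>> subclasses = Map.of(
        "Repository",      List.of("SqlRepository", "CacheRepository"),
        "SqlRepository",   List.of(),
        "CacheRepository", List.of());
    // Which classes provide an implementation of the called method, find().
    static Set<String> overridesFind = Set.of("SqlRepository", "CacheRepository");

    public static Set<String> resolve(String declaredType) {
        Set<String> targets = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(declaredType));
        while (!work.isEmpty()) {
            String t = work.pop();
            if (overridesFind.contains(t)) targets.add(t);   // possible call target
            work.addAll(subclasses.getOrDefault(t, List.of()));
        }
        return targets;
    }

    public static void main(String[] args) {
        // repo.find() on a Repository-typed receiver may dispatch to either override.
        System.out.println(resolve("Repository")); // [SqlRepository, CacheRepository]
    }
}
```

CHA is cheap but coarse: it keeps both targets here even if only SqlRepository is ever instantiated. RTA prunes targets whose classes are never constructed, and points-to analysis narrows further by tracking which objects each variable can actually hold.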
Sanitizers and the Completeness Problem
Detecting that tainted data reaches a sink is only half the problem. Equally important is recognizing when data has been properly sanitized. Consider:
```java
String userInput = request.getParameter("search");
String sanitized = ESAPI.encoder().encodeForSQL(
    new OracleCodec(), userInput
);
String query = "SELECT * FROM products WHERE name LIKE '%" + sanitized + "%'";
```

This code is safe. The ESAPI encoder neutralizes SQL metacharacters before the value enters the query. A data flow engine that does not model sanitizers will report this as a vulnerability -- a false positive. The analysis must maintain a catalog of known sanitization functions and understand their semantics: encodeForSQL neutralizes SQL injection but not XSS. encodeForHTML neutralizes XSS but not SQL injection. Context matters.
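One way an engine can model this is to track taint per vulnerability category and let each known sanitizer clear only the categories it neutralizes. The method names below echo ESAPI, but the catalog itself is an illustrative assumption about how such a mapping might be stored.

```java
import java.util.*;

// Context-aware sanitizer modeling: a sanitizer clears taint only for
// the vulnerability categories it actually neutralizes.
public class SanitizerCatalog {
    enum Category { SQL_INJECTION, XSS }

    static Map<String, Set<Category>> sanitizers = Map.of(
        "encodeForSQL",  Set.of(Category.SQL_INJECTION),
        "encodeForHTML", Set.of(Category.XSS));

    // Taint surviving a call = incoming taint minus what this sanitizer clears.
    static Set<Category> afterCall(Set<Category> taint, String method) {
        Set<Category> remaining = EnumSet.copyOf(taint);
        remaining.removeAll(sanitizers.getOrDefault(method, Set.of()));
        return remaining;
    }

    public static void main(String[] args) {
        Set<Category> taint = EnumSet.allOf(Category.class);
        // encodeForSQL makes the value safe for a SQL sink -- but it is still XSS-tainted.
        System.out.println(afterCall(taint, "encodeForSQL")); // [XSS]
    }
}
```

The payoff of per-category tracking: the ESAPI snippet above produces no SQL injection finding, while the same value rendered into HTML would still correctly trigger an XSS finding.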
Custom sanitizers present an additional challenge. Many organizations implement their own validation libraries. The analysis engine needs a mechanism -- whether through configuration, annotations, or heuristic detection -- to recognize that a function named validateAndEscapeInput() is acting as a sanitizer for specific vulnerability categories.
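An annotation-based mechanism might look like the following. The @Sanitizes annotation is hypothetical -- it is not part of any real analysis tool -- but it shows the shape of the configuration: the team declares which categories their validator neutralizes, and the engine reads the declaration instead of trying to prove the escaping logic correct.

```java
import java.lang.annotation.*;

// Hypothetical marker annotation: tells the analysis engine that the
// annotated method acts as a sanitizer for the listed categories.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Sanitizes {
    String[] categories(); // e.g. {"SQL_INJECTION"}
}

class InputValidation {
    @Sanitizes(categories = {"SQL_INJECTION"})
    static String validateAndEscapeInput(String raw) {
        // Stand-in for the organization's real escaping logic; the engine
        // trusts the annotation rather than analyzing this body.
        return raw.replace("'", "''");
    }
}
```

The tradeoff is trust: an annotation (or config entry) is an assertion by the team, so a wrongly annotated method silently suppresses real findings. Some engines therefore also apply heuristics, or require review of the sanitizer catalog itself.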
Why This Matters: Real-World Impact
The OWASP Top 10 has listed injection at or near the top of its risk categories for over a decade. The reason injection persists is not that developers are unaware of it. It persists because the vulnerable patterns are often non-obvious. Consider CVE-2022-22965 (Spring4Shell): the vulnerability existed in the data binding mechanism of Spring Framework, where user-supplied HTTP parameters could manipulate class-level properties through nested property paths. The tainted data flowed through the framework's own reflection-based property accessor, across multiple abstraction layers, before reaching a dangerous sink.
No pattern matcher would have caught this. Understanding the vulnerability required knowing how Spring's BeanWrapper resolves nested property names, and how ClassLoader access through class.module.classLoader traversal could lead to arbitrary file writes via Tomcat's logging configuration. This is exactly the kind of multi-hop, cross-boundary data flow that taint analysis is designed to find.
Performance Considerations
Full interprocedural data flow analysis is computationally expensive. For a codebase with millions of lines, a naive implementation would need to explore an exponential number of paths. Production-grade engines use several techniques to make this tractable:
- Demand-driven analysis: Instead of computing all data flows upfront, start from known sinks and trace backward to find reachable sources.
- Summary-based analysis: Pre-compute summaries of how data flows through each method (inputs to outputs, inputs to fields, etc.) and compose these summaries during interprocedural analysis.
- Incremental analysis: When a file changes, reanalyze only the affected methods and the transitive closure of their callers and callees, rather than the entire codebase.
- Bounded context sensitivity: Limit how many levels of calling context the analysis tracks. A 2-CFA analysis (from the k-CFA family, distinguishing two levels of call sites) provides good precision without the cost of fully context-sensitive analysis.
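Summary-based analysis, the second technique above, deserves a concrete sketch. The idea: analyze each method once, record which parameter positions flow to its return value, and have every call site consume that summary instead of re-walking the callee's body. The summary format below (parameter indices only, ignoring fields and side effects) is a simplifying assumption, and unknown methods are treated as non-propagating purely for brevity -- a real engine would assume the opposite.

```java
import java.util.*;

// Summary-based interprocedural taint analysis in miniature.
public class SummarySketch {
    // Summary per method: parameter indices whose taint reaches the return value.
    static Map<String, Set<Integer>> summaries = Map.of(
        "concat", Set.of(0, 1),  // both operands flow into the result string
        "length", Set.of());     // an int length carries no string taint

    // At a call site, compose the callee's summary with the arguments' taint.
    static boolean returnsTainted(String method, List<Boolean> argTaint) {
        for (int i : summaries.getOrDefault(method, Set.of()))
            if (argTaint.get(i)) return true;
        return false;
    }

    public static void main(String[] args) {
        // query = concat("SELECT ...", userInput) -- the second argument is tainted.
        System.out.println(returnsTainted("concat", List.of(false, true))); // true
        System.out.println(returnsTainted("length", List.of(true)));        // false
    }
}
```

Because a summary is computed once and reused at every call site, this composes naturally with the incremental approach: when a method changes, only its summary and the summaries of its transitive callers need recomputation.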
The Bottom Line
Pattern matching is a starting point, not a destination. For any organization serious about finding vulnerabilities before they reach production, data flow analysis is not optional -- it is the minimum standard. The technique directly maps to how injection vulnerabilities actually work: untrusted data flowing through a program to reach a sensitive operation. When your analysis models that flow faithfully, including across method boundaries, through framework abstractions, and past sanitization points, you get results that correspond to real exploitable conditions rather than syntactic coincidences.
The investment in understanding and deploying data flow analysis pays for itself the first time it catches a vulnerability that a pattern matcher would have missed -- which, in any non-trivial codebase, happens on the very first scan.