Spark 4.1: bloom filter result mismatch (might_contain returns wrong answers) #4193

@andygrove

Sub-issue of #4098.

Description

Two tests fail in `Spark 4.1, JDK 17/auto [exec]`:

  • `test BloomFilterMightContain from random input`
  • `bloom_filter_agg`

Comet and Spark produce different `might_contain` results for the same input.

Suspected root cause

Spark 4.1 likely changed the bloom filter binary layout, hash seed, or default false-positive probability. Diff `BloomFilterImpl` / `BloomFilterAggregate` between 4.0 and 4.1, then mirror any change in Comet's bloom filter code in `native/spark-expr`.
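
A quick way to test the binary-layout hypothesis is to dump the serialized filter bytes that Spark hands over and check the header before touching any hashing code. The sketch below assumes the V1 serialization written by Spark's `BloomFilterImpl` (big-endian `int` version, `int` number of hash functions, `int` word count, then that many `long` words). `BloomHeader`, `read_i32_be`, and `parse_header` are hypothetical names for illustration, not existing Comet functions, and whether 4.1 still emits this layout is exactly what needs confirming.

```rust
/// Header of a serialized Spark bloom filter, ASSUMING the V1 layout:
/// i32 version, i32 numHashFunctions, i32 numWords (all big-endian),
/// followed by numWords big-endian i64 words.
#[derive(Debug, PartialEq)]
struct BloomHeader {
    version: i32,
    num_hash_functions: i32,
    num_words: i32,
}

/// Read a big-endian i32 at `offset`, failing if the buffer is too short.
fn read_i32_be(bytes: &[u8], offset: usize) -> Result<i32, String> {
    let slice = bytes
        .get(offset..offset + 4)
        .ok_or_else(|| format!("buffer too short at offset {}", offset))?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(slice);
    Ok(i32::from_be_bytes(buf))
}

/// Parse and sanity-check the header. Any version other than 1 is reported
/// as an error so that a layout change shows up immediately.
fn parse_header(bytes: &[u8]) -> Result<BloomHeader, String> {
    let version = read_i32_be(bytes, 0)?;
    if version != 1 {
        return Err(format!(
            "unexpected bloom filter version {}; layout may differ from V1",
            version
        ));
    }
    let num_hash_functions = read_i32_be(bytes, 4)?;
    let num_words = read_i32_be(bytes, 8)?;
    if num_words < 0 {
        return Err(format!("negative word count {}", num_words));
    }
    // 12 header bytes plus 8 bytes per word.
    let expected_len = 12 + 8 * num_words as usize;
    if bytes.len() != expected_len {
        return Err(format!(
            "length mismatch: got {} bytes, expected {}",
            bytes.len(),
            expected_len
        ));
    }
    Ok(BloomHeader {
        version,
        num_hash_functions,
        num_words,
    })
}
```

If `parse_header` rejects bytes produced under 4.1 but accepts those from 4.0, the layout changed; if both parse with identical headers, the hash seed or false-positive default is the more likely culprit.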
