Named Polars DataFrames with deeply nested struct columns become extremely slow when registered as datasets #9378

@wirhabenzeit

Summary

A Polars DataFrame with a deeply nested Struct/List column renders fine when displayed directly in a marimo cell. However, if the same dataframe is assigned to a normal top-level variable like x, marimo becomes progressively slower as nesting depth increases. If the dataframe is assigned to an underscore-prefixed variable like _x, the slowdown does not occur.

This suggests the issue is in marimo's handling of exported variables / dataset registration rather than Polars display itself.
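The distinction between `x` and `_x` can be pictured with a small sketch. marimo treats underscore-prefixed names as local to their cell, so they are never exported to other cells. Assuming dataset registration only looks at exported names (the inference drawn above), a filter like the following would explain why `_x` never reaches the slow path. `exported_names` is a hypothetical helper for illustration, not a marimo function:

```python
def exported_names(defined_names: list[str]) -> list[str]:
    """Hypothetical helper: keep only names that a cell would export.

    Underscore-prefixed names are cell-local in marimo, so under the
    assumption above they are never candidates for dataset registration.
    """
    return [name for name in defined_names if not name.startswith("_")]


# `x` is exported (and would be registered); `_x` stays local.
candidates = exported_names(["x", "_x", "pl"])
```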

Related warning / preview behavior

Once the dataframe is registered as a dataset, previewing the nested payload column also attempts chart generation and logs warnings like:

[W ... preview_column:113] Failed to get chart for column payload in table x
ValueError: Unexpected DtypeKind: Struct(...)

This looks like a second issue:

  • nested / unknown column types should probably skip chart generation early
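One way to read that suggestion is as an allowlist check that runs before any chart spec is built. The sketch below assumes the column dtype is available as a string such as `Struct(...)`, as in the warning above. `is_chartable_dtype`, `build_preview_sketch`, and the prefix list are hypothetical names chosen for illustration and do not reflect marimo's actual code:

```python
# Hypothetical allowlist of dtype-name prefixes that are worth charting.
# Anything else (Struct, List, Array, Object, unknown) is skipped early.
CHARTABLE_DTYPE_PREFIXES = (
    "Int", "UInt", "Float", "Decimal", "Boolean",
    "String", "Utf8", "Categorical", "Enum",
    "Date", "Time", "Duration",  # "Date" also matches "Datetime"
)


def is_chartable_dtype(dtype_repr: str) -> bool:
    """Return True only for dtypes on the allowlist."""
    return dtype_repr.strip().startswith(CHARTABLE_DTYPE_PREFIXES)


def build_preview_sketch(column_name: str, dtype_repr: str) -> dict:
    """Build a preview payload, leaving `chart` empty for nested/unknown dtypes."""
    preview = {"column": column_name, "dtype": dtype_repr, "chart": None}
    if not is_chartable_dtype(dtype_repr):
        # Return quietly instead of attempting a chart and logging a warning.
        return preview
    preview["chart"] = f"<chart spec for {column_name}>"  # placeholder value
    return preview
```

An allowlist is used rather than a blocklist so that dtypes nobody anticipated also fall through to the "no chart" branch.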

Suspected root cause

The expensive path appears to be recursive sample-value serialization for dataset metadata.

get_sample_values() recursively stringifies nested Python list/dict values without a depth cap, which becomes pathological for recursive struct/list payloads.
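To make the idea of a depth cap concrete, here is a minimal sketch of a depth-limited serializer for nested list/dict values. It is not the code in `get_sample_values()`; the function name `summarize_sample`, the default cap of 3, and the `"{...}"` / `"[...]"` placeholders are all assumptions made for illustration:

```python
from typing import Any

MAX_DEPTH = 3  # assumed cap; the right value is a design decision


def summarize_sample(value: Any, max_depth: int = MAX_DEPTH, _depth: int = 0) -> Any:
    """Convert a nested value to a JSON-friendly form, truncating at max_depth."""
    if isinstance(value, dict):
        if _depth >= max_depth:
            return "{...}"  # placeholder instead of recursing further
        return {
            str(key): summarize_sample(item, max_depth, _depth + 1)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        if _depth >= max_depth:
            return "[...]"
        return [summarize_sample(item, max_depth, _depth + 1) for item in value]
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return str(value)  # fall back to a string for anything else


example = summarize_sample({"a": {"b": {"c": {"d": 1}}}}, max_depth=2)
# → {'a': {'b': '{...}'}}
```

With a cap in place, the work per sample value is bounded by the size of the top `max_depth` levels rather than by the full nested payload.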

Notes

I validated a local patch that:

  • caps nested sample serialization depth
  • skips chart generation for unknown nested column types

That removes the pathological slowdown in my local tests, but I’m filing this issue first because the performance regression itself seems worth discussing independently of the exact fix.

Will you submit a PR?

  • Yes

Environment

{
"marimo": "0.23.3",
"editable": true,
"location": "/Users/.../marimo",
"OS": "Darwin",
"OS Version": "23.5.0",
"Processor": "arm",
"Python Version": "3.13.12",
"Locale": "C/en_US",
"Binaries": {
"Browser": "147.0.7727.116",
"Node": "v24.13.1",
"uv": "0.10.3 (c75a0c625 2026-02-16)"
},
"Dependencies": {
"click": "8.2.1",
"docutils": "0.22.4",
"itsdangerous": "2.2.0",
"jedi": "0.19.2",
"markdown": "3.10.2",
"narwhals": "2.20.0",
"packaging": "26.2",
"psutil": "7.2.2",
"pygments": "2.20.0",
"pymdown-extensions": "10.21.2",
"pyyaml": "6.0.3",
"starlette": "1.0.0",
"tomlkit": "0.14.0",
"typing-extensions": "4.15.0",
"uvicorn": "0.46.0",
"websockets": "16.0"
},
"Optional Dependencies": {
"altair": "6.1.0",
"anywidget": "0.9.21",
"basedpyright": "1.39.3",
"duckdb": "1.5.2",
"ibis-framework": "12.0.0",
"loro": "1.10.3",
"mcp": "1.27.0",
"nbformat": "5.10.4",
"openai": "2.32.0",
"pandas": "3.0.2",
"polars": "1.40.1",
"pyarrow": "24.0.0",
"pytest": "9.0.3",
"python-lsp-server": "1.14.0",
"ruff": "0.15.12",
"sqlglot": "30.6.0",
"vegafusion": "2.0.3"
},
"Experimental Flags": {
"multi_column": true,
"cache_panel": true,
"isolate_apps": true
}
}

Code to reproduce

import marimo
__generated_with = "0.23.1"
app = marimo.App(width="medium")

@app.cell
def _():
    import polars as pl
    return (pl,)

@app.cell
def _(pl):
    def make_dummy_df(nesting_depth: int = 1, rows: int = 3) -> pl.DataFrame:
        def build_payload(row_idx: int, depth: int):
            base = {
                "kind": chr(65 + (row_idx % 26)),
                "scores": [row_idx + 1, row_idx + 2, row_idx + 3],
                "meta": {
                    "city": ["Zurich", "Bern", "Geneva", "Basel"][row_idx % 4],
                    "active": row_idx % 2 == 0,
                },
            }
            if depth == 0:
                return base
            return {
                "level": depth,
                "items": [
                    base,
                    {"branch": row_idx, "child": build_payload(row_idx, depth - 1)},
                ],
                "summary": {"row": row_idx, "depth": depth},
            }
        return pl.DataFrame(
            {
                "row_id": list(range(1, rows + 1)),
                "name": [f"row_{i}" for i in range(1, rows + 1)],
                "value": [round(10.0 + i * 1.25, 2) for i in range(rows)],
                "payload": [build_payload(i, nesting_depth) for i in range(rows)],
            }
        )
    return (make_dummy_df,)

@app.cell
def _(make_dummy_df):
    x = make_dummy_df(nesting_depth=7, rows=3)
    x
    return

if __name__ == "__main__":
    app.run()

Control cases

These are fast:

make_dummy_df(nesting_depth=7, rows=3)
_x = make_dummy_df(nesting_depth=7, rows=3)
_x

This becomes very slow:

x = make_dummy_df(nesting_depth=7, rows=3)
x

Observed behavior

With my repro:

  • depth 1-5: fine
  • depth 6: noticeably slow
  • depth 7: very slow
  • depth 8+: effectively hangs / UI becomes unusable

Internal timings

I benchmarked the relevant internal paths directly. create_variable_value("x", df) stays flat. The slowdown is in dataset registration, specifically get_datasets_from_variables() and then NarwhalsTableManager.get_sample_values(). Observed timings for get_datasets_from_variables([("x", df)]):

  • depth 5: 0.0105s
  • depth 6: 0.0818s
  • depth 7: 0.6633s
  • depth 8: 5.5272s
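For anyone who wants to reproduce measurements like these, a small stdlib-only timing helper is enough. This is a generic sketch, not the harness used for the numbers above; `time_call` is a hypothetical name, and the import path of the marimo function being timed is deliberately not shown because it is not given in this report:

```python
import time
from typing import Any, Callable


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace with the function under test, for example
# time_call(get_datasets_from_variables, [("x", df)]) once it is imported.
elapsed = time_call(sorted, list(range(1000)))
```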

Breaking that down further, the hotspot is get_sample_values() on the nested payload column:

  • depth 5: 0.0107s
  • depth 6: 0.0791s
  • depth 7: 0.6322s
  • depth 8: 5.3794s
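The growth rate is easier to see as a ratio between consecutive depths. The snippet below only does arithmetic on the timings listed above (the variable and function names are mine); each additional nesting level costs roughly 7x to 8.5x more time, which is consistent with exponential growth in depth:

```python
# Timings in seconds, copied from the measurements above, keyed by nesting depth.
datasets_timings = {5: 0.0105, 6: 0.0818, 7: 0.6633, 8: 5.5272}
sample_values_timings = {5: 0.0107, 6: 0.0791, 7: 0.6322, 8: 5.3794}


def growth_ratios(timings: dict[int, float]) -> dict[int, float]:
    """Map each depth d to timings[d] / timings[d - 1], where both exist."""
    return {
        depth: timings[depth] / timings[depth - 1]
        for depth in sorted(timings)
        if depth - 1 in timings
    }


for label, timings in (
    ("get_datasets_from_variables", datasets_timings),
    ("get_sample_values", sample_values_timings),
):
    ratios = {d: round(r, 2) for d, r in growth_ratios(timings).items()}
    print(label, ratios)
```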

Labels: bug