
Cytomining readiness #8

Merged
d33bs merged 6 commits into main from cytomining
Apr 8, 2026

Conversation

d33bs (Member) commented Apr 5, 2026

Summary by CodeRabbit

  • New Features

    • Ingest multiple image stores into Iceberg catalogs; export Cytomining‑compatible Parquet warehouses with manifests, append/overwrite modes, profile export, and programmatic APIs/CLI.
  • Documentation

    • New Cytomining guide, API docs, and runnable example notebooks demonstrating end‑to‑end ingestion/export and joining workflows.
  • Tests

    • Added unit and integration tests covering ingestion, export, joins, manifest validation, and CLI flows.
  • Chores

    • Project status bumped Alpha → Beta; CI and local checks now build/validate distributions and sync/validate example notebooks; workflow trigger narrowed for pushes.

coderabbitai Bot commented Apr 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Cytomining-compatible Parquet warehouse export and ingestion across API, CLI, and integrations; implements profile-column alias resolution and optional profile_dataset_id propagation; adds namespace-fallback/legacy handling, manifest management, notebook build checks, models, examples, and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CI & Build**<br>`.github/workflows/run-tests.yml`, `CONTRIBUTING.md`, `pyproject.toml` | Narrowed workflow push triggers to `main`; added Poe tasks to build (`uv build`), validate distributions (`uvx twine check dist/*`), and enforce notebook sync (`jupytext --sync` + `git diff --exit-code`). Bumped the Trove classifier and added Ruff per-file ignores. |
| **Docs & Examples**<br>`README.md`, `docs/src/cytomining.md`, `docs/src/workflow.md`, `docs/src/duckdb.md`, `docs/src/python-api.md`, `docs/src/index.md`, `docs/src/conf.py`, `docs/src/examples/*` | Added Cytomining documentation and examples (notebooks plus jupytext `.py` files), updated the TOC and Sphinx exclude patterns, and documented the new CLI/API ingest/export flows and DuckDB alias behavior. |
| **Top-level API Exports**<br>`src/iceberg_bioimage/__init__.py` | Re-exports the new ingestion/export functions, manifest/validation helpers, and warehouse datamodels on the package public API. |
| **Core API**<br>`src/iceberg_bioimage/api.py` | Adds `ingest_scan_results_to_warehouse` and `ingest_stores_to_warehouse`; normalizes namespace inputs; adds an optional `profile_dataset_id` parameter to join functions and propagates it into join logic. |
| **CLI**<br>`src/iceberg_bioimage/cli.py` | Adds `ingest` and `export-cytomining*` subcommands and handlers; adds a `--profile-dataset-id` option; handlers print JSON results. |
| **Integrations — Cytomining**<br>`src/iceberg_bioimage/integrations/cytomining.py`, `src/iceberg_bioimage/integrations/__init__.py` | New module implementing Parquet-backed Cytomining warehouse exports (scan/store/catalog/profiles → parquet), a table writer with overwrite/append semantics, a manifest loader/updater, profile normalization, and public exports. |
| **Integrations — Catalog & DuckDB**<br>`src/iceberg_bioimage/integrations/catalog.py`, `src/iceberg_bioimage/integrations/duckdb.py` | Catalog load/list use namespace-fallback helpers; the DuckDB join builds an alias-aware profiles projection, can inject `dataset_id` from `profile_dataset_id`, and ensures owned connections are closed on errors. |
| **Publishing & Namespace Handling**<br>`src/iceberg_bioimage/publishing/image_assets.py` | Adds namespace-candidate resolution, a legacy cytotable fallback for load/create, warnings when the resolved namespace differs, and robust namespace-creation helpers. |
| **Validation & Aliases**<br>`src/iceberg_bioimage/validation/contracts.py` | Adds a built-in microscopy alias map, alias-resolution utilities, a TOML alias loader; extends validation to accept alias maps, emits alias-normalization warnings, and adds `validate_warehouse_manifest`. |
| **Models**<br>`src/iceberg_bioimage/models/scan_result.py` | Adds dataclasses for warehouse flows: `WarehouseIngestResult`, `CytominingWarehouseResult`, `WarehouseTableManifestEntry`, `WarehouseManifest`, `WarehouseValidationResult`, with serialization helpers. |
| **Tests & Fakes**<br>`tests/fakes.py`, `tests/*` | Extends test fakes/catalogs for multi-part namespaces and the cytotable layout; adds unit and integration tests for ingestion, cytomining exports, alias handling, manifest validation, DuckDB joins, and the CLI; updates existing tests to expect the new namespace/warning behavior. |
| **Other config**<br>`.pre-commit-config.yaml` | Bumps pre-commit hook revisions. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as iceberg_bioimage.api
    participant Publishing as Publishing Layer
    participant Catalog as Iceberg Catalog
    participant Warehouse as Parquet Warehouse

    rect rgba(100,200,100,0.5)
    Note over Client,Warehouse: Warehouse Ingestion Flow
    Client->>API: ingest_stores_to_warehouse(uris, catalog, namespace)
    API->>API: scan_store(uri)
    API->>Publishing: publish_image_assets(scan_result,...)
    Publishing->>Catalog: load_table / create_table (namespace candidates)
    Catalog->>Warehouse: append image_assets rows
    API->>Publishing: publish_chunk_index(scan_result,...)
    Publishing->>Catalog: load_table / create_table
    Catalog->>Warehouse: append chunk_index rows
    API-->>Client: WarehouseIngestResult
    end
```
```mermaid
sequenceDiagram
    participant Client
    participant Cytom as integrations.cytomining
    participant Profiles as Profile Resolver
    participant Parquet as Parquet Writer
    participant Manifest as Manifest Manager

    rect rgba(100,150,200,0.5)
    Note over Client,Parquet: Cytomining Export Flow
    Client->>Cytom: export_store_to_cytomining_warehouse(uri, warehouse_root,...)
    Cytom->>Cytom: scan_store(uri)
    Cytom->>Parquet: export_table_to_cytomining_warehouse(image_assets)
    Parquet->>Parquet: write parquet files under warehouse_root/image_assets
    alt profiles provided
        Cytom->>Profiles: _normalize_profiles_table(profiles, profile_dataset_id)
        Profiles->>Profiles: resolve_microscopy_profile_columns(alias_map)
        Profiles-->>Cytom: normalized profiles table
        Cytom->>Cytom: join profiles with image_assets
        Cytom->>Parquet: export_table_to_cytomining_warehouse(joined_profiles)
    end
    Cytom->>Manifest: _update_manifest(warehouse_manifest.json)
    Manifest-->>Client: CytominingWarehouseResult(manifest_path, tables_written, row_counts)
    end
```
```mermaid
sequenceDiagram
    participant Client
    participant DuckDB as join_image_assets_with_profiles
    participant Resolver as resolve_microscopy_profile_columns
    participant SQL as SQL Engine

    rect rgba(200,150,100,0.5)
    Note over Client,SQL: Profile Join with Alias Resolution
    Client->>DuckDB: join_image_assets_with_profiles(profiles, join_keys, profile_dataset_id?)
    DuckDB->>Resolver: resolve_microscopy_profile_columns(profile_columns, alias_map)
    Resolver-->>DuckDB: mapping dict (canonical->source)
    DuckDB->>DuckDB: build profiles projection (aliasing, inject dataset_id if needed)
    DuckDB->>SQL: CREATE VIEW profiles AS (projection)
    DuckDB->>SQL: INNER JOIN image_assets USING (join_keys)
    SQL-->>DuckDB: joined result
    DuckDB-->>Client: normalized joined table
    end
```
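The alias-resolution flow in the last diagram can be sketched in plain Python. This is an illustrative sketch only: the helper names (`resolve_profile_columns`, `build_projection`) and the small alias map are hypothetical stand-ins, not the package's actual `resolve_microscopy_profile_columns` API.

```python
# Hypothetical stand-in for the package's built-in microscopy alias map.
DEFAULT_ALIASES = {
    "dataset_id": ["Metadata_dataset_id", "dataset"],
    "well": ["Metadata_Well", "well_id"],
}

def resolve_profile_columns(columns, alias_map=DEFAULT_ALIASES):
    """Map each canonical column name to whichever source column is present."""
    resolved = {}
    for canonical, aliases in alias_map.items():
        for candidate in (canonical, *aliases):
            if candidate in columns:
                resolved[canonical] = candidate
                break
    return resolved

def build_projection(columns, *, profile_dataset_id=None):
    """Build a SELECT list that renames aliased columns and, when the
    profiles source has no dataset_id column at all, injects a constant one."""
    mapping = resolve_profile_columns(columns)
    parts = []
    for canonical, source in mapping.items():
        parts.append(source if source == canonical else f"{source} AS {canonical}")
    if "dataset_id" not in mapping and profile_dataset_id is not None:
        parts.append(f"'{profile_dataset_id}' AS dataset_id")
    return "SELECT " + ", ".join(parts) + " FROM profiles"

print(build_projection(["Metadata_Well"], profile_dataset_id="plate1"))
# → SELECT Metadata_Well AS well, 'plate1' AS dataset_id FROM profiles
```

The projection string would back the `CREATE VIEW profiles AS (projection)` step before the join.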

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • d33bs/iceberg-bioimage#7 — Related changes to join/profile APIs and DuckDB join helpers; overlaps with alias handling and profile_dataset_id propagation.
  • d33bs/iceberg-bioimage#1 — Foundational PR introducing earlier package surfaces and CI/docs scaffolding extended by this work.

Poem

🐰 In Parquet rows my whiskers write,

Namespaces tumble then settle right,
Aliases hum, dataset ids bloom,
From store to warehouse tables zoom—
I hop, I sync, and tuck them tight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.70%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Cytomining readiness' clearly summarizes the main change: adding Cytomining interoperability as a first-class feature with warehouse export, ingestion APIs, CLI commands, documentation, and supporting infrastructure. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai Bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/iceberg_bioimage/integrations/duckdb.py (1)

87-124: ⚠️ Potential issue | 🟡 Minor

Close owned DuckDB connections on every exit path.

_register_profiles_source() adds new failure points before Line 123, so an internally created connection can stay open on bad profile inputs. Wrap the body after _get_connection() in try/finally and only close when owns_connection is true.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 87 - 124, The code
may leak an internally created DuckDB connection returned by _get_connection
when later calls like _register_profiles_source or _register_source fail; wrap
the logic after calling _get_connection(…) in a try/finally and in the finally
block close duckdb_connection only if owns_connection is True so the connection
is always closed on every exit path; ensure this covers all calls including
_register_source, _register_profiles_source, building/executing the query and
calling _as_arrow_table so any exception still triggers the conditional close.
🧹 Nitpick comments (4)
pyproject.toml (1)

142-144: Consider making tasks.notebooks fail when sync changes files.

Right now it always syncs but doesn’t enforce a clean post-sync state, which weakens drift detection for contributors running this task locally.

♻️ Suggested enhancement

```diff
 tasks.notebooks.shell = """
 python -m jupytext --sync docs/src/examples/*.ipynb
+git diff --exit-code -- docs/src/examples
 """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 142 - 144, Update the tasks.notebooks.shell
entry so that after running the jupytext sync it verifies the repository is
unchanged and fails if any files were modified: run the existing jupytext --sync
step (the tasks.notebooks shell command), then perform a git check for
uncommitted/changed files and if any are found print a clear message and exit
with a non-zero status so the task fails; this enforces drift detection for
contributors running tasks.notebooks locally.
src/iceberg_bioimage/integrations/catalog.py (1)

128-128: Document the new profile_dataset_id parameter.

Line 128 adds a public keyword, but the docstring still stops at chunk_index_scan_options. Please add it so help() and generated docs stay complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/catalog.py` at line 128, The public
parameter profile_dataset_id was added to the signature (profile_dataset_id: str
| None = None) but its description is missing from the function/class docstring;
update the docstring where parameters are listed (near chunk_index_scan_options)
to add a brief entry for profile_dataset_id describing its type (str | None),
default (None), and purpose (which dataset id to use for profiling), so help()
and generated docs include it; ensure formatting matches the existing docstring
style and ordering.
src/iceberg_bioimage/integrations/cytomining.py (2)

303-317: Consider using Literal type for the mode parameter.

Using Literal["overwrite", "append"] instead of str would provide better IDE autocompletion and type checking support across all functions accepting this parameter.

💡 Suggested type improvement

```diff
+from typing import Literal
+
+WriteMode = Literal["overwrite", "append"]
+
 def _write_parquet_dataset(
     table: pa.Table,
     dataset_path: Path,
     *,
-    mode: str,
+    mode: WriteMode,
 ) -> None:
```

Then update the type hint for mode in the public functions as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 303 - 317,
Change the mode parameter's type from plain str to a Literal union to get better
type-checking and IDE help: update the signature of _write_parquet_dataset to
use typing.Literal["overwrite", "append"] for mode and import Literal, and
propagate the same Literal["overwrite","append"] annotation to all public
functions that accept the mode parameter so callers and static checkers see the
narrowed type.

398-402: Use pa.repeat() for memory-efficient array construction with scalar values.

For large tables, creating an intermediate Python list with [profile_dataset_id] * normalized.num_rows allocates unnecessary memory. PyArrow provides pa.repeat() which directly repeats a scalar value:

```python
pa.repeat(pa.scalar(profile_dataset_id), normalized.num_rows)
```

This avoids the intermediate list allocation and is more efficient for large tables. PyArrow 18+ (your project's minimum version) supports this function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 398 - 402,
Replace the intermediate Python list used when appending the dataset_id column
with a PyArrow repeated scalar to avoid large-list allocation: in the block that
checks `if canonical == "dataset_id" and profile_dataset_id is not None`, change
the value passed to `normalized.append_column` so it uses
`pa.repeat(pa.scalar(profile_dataset_id), normalized.num_rows)` instead of
`[profile_dataset_id] * normalized.num_rows`, keeping the same call to
`normalized.append_column`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/examples/basic-workflow.py`:
- Line 42: The long import line containing ingest_stores_to_warehouse,
join_profiles_with_store, scan_store, summarize_store (and other
over-88-character import lines in this file) must be wrapped so no line exceeds
88 characters to satisfy Ruff; break the import across multiple lines (one
symbol per line or grouped) in the Python file
(docs/src/examples/basic-workflow.py) and then regenerate or sync the
corresponding .ipynb cell so both sources match; locate the import statements by
the function names ingest_stores_to_warehouse, join_profiles_with_store,
scan_store, and summarize_store to apply the change consistently.

In `@pyproject.toml`:
- Around line 110-115: The lint.per-file-ignores block in pyproject.toml is not
ordered as expected by pyproject-fmt; reorder the entries so the keys are sorted
lexicographically (e.g. ensure "src/iceberg_bioimage/cli.py" comes in correct
alphabetical position relative to "src/iceberg_bioimage/integrations/..." and
"src/iceberg_bioimage/publishing/..."), preserve the exact key quoting and value
arrays (PLC0415, PLR0915) and formatting style used elsewhere in the file, then
run the formatter/hook to confirm the section is stable.

In `@README.md`:
- Around line 79-80: Update the examples that pass the namespace string to use
the canonical "bioimage.cytotable" instead of the legacy "bioimage" (e.g.,
replace occurrences of "bioimage" in the arrays shown and in the surrounding
register examples), or explicitly mark those examples as using the legacy
fallback; ensure all examples that call register use "bioimage.cytotable" so the
new resolver won't emit warnings (apply the same change to the occurrences noted
around lines 79-80 and 97-98).
- Around line 77-82: The README example calls warehouse.to_dict() but the result
type WarehouseIngestResult only exposes to_json(); update the quickstart to call
warehouse.to_json() (or, alternatively, add a to_dict method on
WarehouseIngestResult that returns the same dict as to_json()). Locate the
example using ingest_stores_to_warehouse and replace the to_dict() call with
to_json() to match the existing WarehouseIngestResult API (or implement a
to_dict wrapper in class ScanResult/WarehouseIngestResult to forward to_json()).

In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 72-77: The public function load_catalog_table declares it returns
SupportsIcebergTable but currently returns _load_table_with_namespace_fallback
which is typed to return SupportsAppend; fix by making
_load_table_with_namespace_fallback generic over the table type (e.g., TypeVar T
bounded to table protocol) and accept a catalog type parameter bounded to the
matching SupportsCatalog/SupportsScanCatalog so its load_table() returns the
same T, or alternatively add a thin type-correct wrapper that calls the existing
helper and casts/validates the result to SupportsIcebergTable before returning;
update the annotations for _load_table_with_namespace_fallback,
load_catalog_table, and any intermediate variables so load_table() signatures
(SupportsCatalog vs SupportsScanCatalog) and methods (append vs scan) line up
with the declared return type.
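The generic-helper option from the comment above can be sketched with a parameterized Protocol, so the fallback helper's return type follows whatever the catalog's `load_table()` returns. Everything here is an illustrative stand-in; only the exception name mirrors pyiceberg's `NoSuchTableError`.

```python
from typing import Protocol, TypeVar

T = TypeVar("T")

# Stand-in for pyiceberg.exceptions.NoSuchTableError.
class NoSuchTableError(Exception):
    pass

# A catalog is generic over the table type its load_table returns, so an
# append-capable catalog yields an append-capable table and a scan-capable
# catalog yields a scan-capable one.
class SupportsLoadTable(Protocol[T]):
    def load_table(self, identifier: tuple[str, ...]) -> T: ...

def load_with_namespace_fallback(
    catalog: SupportsLoadTable[T],
    candidates: list[tuple[str, ...]],
    table_name: str,
) -> T:
    """Try each candidate namespace in turn; re-raise if none matches."""
    for namespace in candidates:
        try:
            return catalog.load_table((*namespace, table_name))
        except NoSuchTableError:
            continue
    raise NoSuchTableError(table_name)
```

With this shape, `load_catalog_table` can declare its richer table protocol and the annotation lines up without a cast.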

In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 301-306: The current code around catalog.create_namespace()
catches all exceptions and only re-raises NoSuchNamespaceError, which hides real
errors; change the try/except to catch specifically NamespaceAlreadyExistsError
(from pyiceberg) and suppress it (no-op), remove the broad "except Exception"
block and any special-case re-raising of NoSuchNamespaceError so that any other
exceptions (auth/permission/backend failures) propagate normally; update imports
or references to ensure NamespaceAlreadyExistsError is available where
create_namespace() is called.
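The narrowed exception handling described above maps naturally onto `contextlib.suppress`. A sketch with a stand-in catalog and exception class (pyiceberg's real `NamespaceAlreadyExistsError` lives in `pyiceberg.exceptions`):

```python
from contextlib import suppress

# Stand-in for pyiceberg.exceptions.NamespaceAlreadyExistsError.
class NamespaceAlreadyExistsError(Exception):
    pass

class FakeCatalog:
    def __init__(self):
        self.namespaces = set()

    def create_namespace(self, namespace):
        if namespace in self.namespaces:
            raise NamespaceAlreadyExistsError(namespace)
        self.namespaces.add(namespace)

def ensure_namespace_exists(catalog, namespace):
    # Suppress only "already exists"; auth, permission, and backend
    # failures propagate to the caller instead of being swallowed.
    with suppress(NamespaceAlreadyExistsError):
        catalog.create_namespace(namespace)
```

Calling `ensure_namespace_exists` twice with the same namespace is a no-op the second time, while any other exception type still surfaces.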

In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 157-175: The code currently replaces
MICROSCOPY_PROFILE_COLUMN_ALIASES when alias_map is provided, losing default
aliases; instead merge them: start from a copy of
MICROSCOPY_PROFILE_COLUMN_ALIASES and for each key in alias_map append/extend
its aliases (preserving order) so both built-in and custom aliases are
considered; then use that merged "aliases" map in the existing resolution loop
(symbols to update: alias_map, MICROSCOPY_PROFILE_COLUMN_ALIASES, aliases,
resolved, MICROSCOPY_REQUIRED_JOIN_KEYS, MICROSCOPY_RECOMMENDED_JOIN_KEYS).
- Around line 203-208: The code currently calls load_warehouse_manifest(root)
without handling exceptions, so a malformed or invalid JSON/schema will raise
and crash instead of returning a WarehouseValidationResult; wrap the call to
load_warehouse_manifest(root) in a try/except, catch the specific exceptions it
can raise (e.g., JSONDecodeError, validation exceptions) and append an
appropriate error to result.errors (or populate result with failure info), then
return result; ensure you still return the validated manifest when no exception
occurs so callers of manifest_path, load_warehouse_manifest, and the enclosing
function get a proper WarehouseValidationResult on both success and failure.

In `@tests/test_cytomining_integration.py`:
- Around line 72-74: Tests use zarr.Group.create_array which is not available in
zarr 2.18.4; replace calls to root.create_array("0", data=data, chunks=(2, 2))
with root.create_dataset("0", data=data, chunks=(2, 2)) (and do the same for
other occurrences around lines 101-113) so the test fixture uses the documented
zarr 2.x API; update all instances where create_array is used (e.g., the
variables root, store_path, data) to call create_dataset with the same
arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e122b7c1-ff8d-417b-823a-39824120bb0f

📥 Commits

Reviewing files that changed from the base of the PR and between 0f36373 and 430ed65.

⛔ Files ignored due to path filters (2)
  • tests/data/profiles_cosmicqc.parquet is excluded by !**/*.parquet
  • tests/data/profiles_pycytominer.parquet is excluded by !**/*.parquet
📒 Files selected for processing (33)
  • .github/workflows/run-tests.yml
  • CONTRIBUTING.md
  • README.md
  • docs/src/conf.py
  • docs/src/cytomining.md
  • docs/src/duckdb.md
  • docs/src/examples/basic-workflow.ipynb
  • docs/src/examples/basic-workflow.py
  • docs/src/examples/metadata-workflow.ipynb
  • docs/src/examples/metadata-workflow.py
  • docs/src/index.md
  • docs/src/python-api.md
  • docs/src/workflow.md
  • pyproject.toml
  • src/iceberg_bioimage/__init__.py
  • src/iceberg_bioimage/api.py
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/__init__.py
  • src/iceberg_bioimage/integrations/catalog.py
  • src/iceberg_bioimage/integrations/cytomining.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/models/scan_result.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • src/iceberg_bioimage/validation/contracts.py
  • tests/fakes.py
  • tests/test_api.py
  • tests/test_catalog_integration.py
  • tests/test_chunk_index.py
  • tests/test_cli.py
  • tests/test_cytomining_integration.py
  • tests/test_duckdb.py
  • tests/test_publishing.py
  • tests/test_validation.py


@coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/iceberg_bioimage/integrations/duckdb.py (1)

83-85: Consider documenting the new profile_dataset_id parameter.

The new parameter enables profile joining when profiles lack a dataset_id column, but the docstring doesn't describe its purpose or behavior.

📝 Suggested docstring update

```diff
 def join_image_assets_with_profiles(  # noqa: PLR0913
     image_assets: MetadataSource,
     profiles: MetadataSource,
     *,
     join_keys: Sequence[str] = DEFAULT_JOIN_KEYS,
     chunk_index: MetadataSource | None = None,
     connection: DuckDBPyConnection | None = None,
     profile_dataset_id: str | None = None,
 ) -> pa.Table:
-    """Join image metadata to a profile table using the canonical join keys."""
+    """Join image metadata to a profile table using the canonical join keys.
+
+    Parameters
+    ----------
+    profile_dataset_id : str | None
+        Fallback dataset identifier to use when the profiles source lacks a
+        ``dataset_id`` column and no alias can be resolved.
+    """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 83 - 85, The
docstring for the function that "Join image metadata to a profile table using
the canonical join keys" is missing documentation for the new profile_dataset_id
parameter; update the function's docstring to describe profile_dataset_id (type:
str | None, default None), explain that when provided it will be used to
populate or override the profile's dataset_id for performing the join when the
profile table lacks a dataset_id column, and state that if None the function
expects dataset_id to exist on the profile; reference the parameter name
profile_dataset_id and keep the description concise and include any effects on
join behavior or preconditions.
src/iceberg_bioimage/publishing/image_assets.py (2)

114-145: Make _load_or_create_table reuse the load-fallback helper.

Lines 117-130 duplicate the candidate iteration and warning path already implemented in _load_table_with_namespace_fallback(). Keeping both copies aligned will be easy to miss the next time namespace resolution changes.

♻️ Suggested simplification

```diff
 def _load_or_create_table(
     catalog_or_name: str | SupportsCatalog,
     namespace: str | Iterable[str],
     table_name: str,
@@
 ) -> SupportsAppend:
     catalog = _resolve_catalog(catalog_or_name)
     requested_namespace = _normalize_namespace(namespace)
     candidate_namespaces = _namespace_candidates(requested_namespace)

-    for resolved_namespace in candidate_namespaces:
-        identifier = (*resolved_namespace, table_name)
-        try:
-            table = catalog.load_table(identifier)
-        except NoSuchTableError:  # pragma: no cover - depends on active catalog backend
-            continue
-
-        _warn_for_namespace_resolution(
-            requested_namespace,
-            resolved_namespace,
-            table_name,
-            operation="publishing",
-        )
-        return table
+    try:
+        return _load_table_with_namespace_fallback(
+            catalog,
+            requested_namespace,
+            table_name,
+            operation="publishing",
+        )
+    except NoSuchTableError:
+        pass

     build_schema = (
         _build_image_assets_schema if schema_builder is None else schema_builder
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 114 - 145,
Replace the duplicated candidate-iteration and warning logic in
_load_or_create_table with a call to the existing helper
_load_table_with_namespace_fallback: call
_load_table_with_namespace_fallback(catalog, requested_namespace, table_name)
(or the appropriate signature) to attempt to load the table and capture its
returned table and resolved_namespace when present; if it returns a table,
return it immediately, otherwise proceed to use candidate_namespaces[0] as
preferred_namespace, call _warn_for_namespace_resolution with creating=True,
ensure the namespace with _ensure_namespace_exists, and create the table with
catalog.create_table(identifier, schema=build_schema()) (where build_schema is
_build_image_assets_schema or schema_builder) so all namespace resolution logic
is centralized in _load_table_with_namespace_fallback.

307-319: Cover the create_namespace fallback with a dedicated fake.

tests/fakes.py:19-51 only implements create_namespace_if_not_exists, so the fallback branch here and the NamespaceAlreadyExistsError suppression can regress without test feedback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 307 - 319, The
fallback branch in _ensure_namespace_exists that calls catalog.create_namespace
and catches NamespaceAlreadyExistsError isn’t covered by tests because
tests/fakes.py only implements create_namespace_if_not_exists; update
tests/fakes.py to add a fake method named create_namespace (and storage to track
namespaces) that creates the namespace or raises the same
NamespaceAlreadyExistsError used in
src/iceberg_bioimage/publishing/image_assets.py when the namespace already
exists, and add a test case that exercises the create_namespace fallback path to
ensure the exception suppression behavior is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 339-348: The warning message can produce a leading dot when
_normalize_namespace() collapses to an empty tuple; update the message
construction so it conditionally joins namespace and table without a leading '.'
when the namespace is empty. Specifically, change how expected_namespace (from
_namespace_candidates(requested_namespace)[0]) and resolved_namespace_name are
combined with table_name: compute expected_full and resolved_full that use
"namespace + '.' + table_name" only if namespace is non-empty, otherwise just
use table_name, and then use those variables in the warnings.warn call
(referencing expected_namespace_parts, expected_namespace,
resolved_namespace_name, requested_namespace, table_name, and the existing
action/operation variables).
- Around line 286-300: The loop over _namespace_candidates currently calls
_warn_for_namespace_resolution each time identifiers are found, producing
multiple UserWarnings for a single catalog.list_tables call; introduce a flag
(e.g., warned_for_list = False) before the for loop and, inside the loop where
identifiers is truthy, call _warn_for_namespace_resolution only if
warned_for_list is False, then set the flag to True so subsequent matching
resolved_namespace iterations for the same list_tables call do not emit
additional warnings; reference the existing variables/functions: discovered,
_namespace_candidates(requested_namespace), catalog.list_tables,
_warn_for_namespace_resolution, requested_namespace, resolved_namespace.

---

Nitpick comments:
In `@src/iceberg_bioimage/integrations/duckdb.py`:
- Around line 83-85: The docstring for the function that "Join image metadata to
a profile table using the canonical join keys" is missing documentation for the
new profile_dataset_id parameter; update the function's docstring to describe
profile_dataset_id (type: str | None, default None), explain that when provided
it will be used to populate or override the profile's dataset_id for performing
the join when the profile table lacks a dataset_id column, and state that if
None the function expects dataset_id to exist on the profile; reference the
parameter name profile_dataset_id and keep the description concise and include
any effects on join behavior or preconditions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 161a9419-87fb-4e2f-b9d0-406a41f475bc

📥 Commits

Reviewing files that changed from the base of the PR and between 430ed65 and b341887.

📒 Files selected for processing (10)
  • README.md
  • docs/src/examples/basic-workflow.ipynb
  • docs/src/examples/basic-workflow.py
  • pyproject.toml
  • src/iceberg_bioimage/integrations/catalog.py
  • src/iceberg_bioimage/integrations/cytomining.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • src/iceberg_bioimage/validation/contracts.py
  • tests/test_cytomining_integration.py
✅ Files skipped from review due to trivial changes (4)
  • docs/src/examples/basic-workflow.ipynb
  • docs/src/examples/basic-workflow.py
  • tests/test_cytomining_integration.py
  • src/iceberg_bioimage/validation/contracts.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/iceberg_bioimage/integrations/catalog.py
  • README.md
  • src/iceberg_bioimage/integrations/cytomining.py

Comment thread src/iceberg_bioimage/publishing/image_assets.py
Comment thread src/iceberg_bioimage/publishing/image_assets.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/iceberg_bioimage/integrations/duckdb.py (1)

172-184: Correct approach, with a minor inefficiency.

The pattern of registering the raw source then creating a normalized view is clean. However, for file-path sources, the schema is read twice: once by _register_source (via from_parquet) and again by _columns_for_source (via ds.dataset). This is a minor inefficiency but not a correctness issue.

💡 Optional optimization to read schema once

Consider caching the relation or extracting columns from the already-registered view:

 def _register_profiles_source(
     connection: DuckDBPyConnection,
     source: MetadataSource,
     *,
     dataset_id: str | None = None,
 ) -> None:
     _register_source(connection, "profiles_source", source)
-    columns = _columns_for_source(source)
+    columns = connection.execute(
+        "SELECT * FROM profiles_source LIMIT 0"
+    ).columns
     projection = _profile_projection(columns, dataset_id=dataset_id)
     connection.execute(
         "CREATE OR REPLACE VIEW profiles AS "
         f"SELECT {projection} FROM profiles_source"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 172 - 184, The
current implementation reads file-path schemas twice: once in _register_source
and again in _columns_for_source; to avoid this, change _register_source to
return the registered relation identifier or a DuckDB relation object (instead
of only registering) and update _register_profiles_source to pass that relation
into _columns_for_source (or use it directly to build the projection) so
_columns_for_source/_profile_projection can reuse the already-loaded schema;
update function signatures for _register_source and _columns_for_source
accordingly and ensure _register_profiles_source calls the new return value when
building projection.
tests/fakes.py (1)

54-92: Consider extracting a base class to reduce duplication.

FakeCreateNamespaceCatalog duplicates load_table, create_table, and list_tables implementations verbatim from FakeCatalog. While acceptable in test code, a base class could reduce maintenance burden if these stubs evolve.

Sketch of potential refactor
class _BaseFakeCatalog:
    """Shared catalog stub behavior."""

    def __init__(
        self,
        table: FakeTable | None = None,
        tables: dict[tuple[str, ...], FakeTable] | None = None,
    ) -> None:
        self.table = table
        self.tables = {} if tables is None else dict(tables)
        self.created_identifiers: list[tuple[str, ...]] = []
        self.created_namespaces: list[tuple[str, ...]] = []

    def load_table(self, identifier: tuple[str, ...]) -> FakeTable:
        if identifier in self.tables:
            return self.tables[identifier]
        if self.table is None:
            raise NoSuchTableError(f"Missing table: {identifier!r}")
        return self.table

    def create_table(self, identifier: tuple[str, ...], schema: object) -> FakeTable:
        self.created_identifiers.append(identifier)
        self.table = FakeTable()
        self.tables[identifier] = self.table
        return self.table

    def list_tables(self, namespace: tuple[str, ...]) -> list[tuple[str, ...]]:
        return [
            identifier for identifier in self.tables if identifier[:-1] == namespace
        ]


class FakeCatalog(_BaseFakeCatalog):
    def create_namespace_if_not_exists(self, namespace: tuple[str, ...]) -> None:
        self.created_namespaces.append(namespace)


class FakeCreateNamespaceCatalog(_BaseFakeCatalog):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.namespaces: set[tuple[str, ...]] = set()

    def create_namespace(self, namespace: tuple[str, ...]) -> None:
        if namespace in self.namespaces:
            raise NamespaceAlreadyExistsError(
                f"Namespace already exists: {namespace!r}"
            )
        self.namespaces.add(namespace)
        self.created_namespaces.append(namespace)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fakes.py` around lines 54 - 92, Extract the shared table handling into
a small base class (e.g. _BaseFakeCatalog) and move the duplicated
implementations of load_table, create_table, list_tables and the common __init__
fields (table, tables, created_identifiers, created_namespaces) there; then have
FakeCatalog and FakeCreateNamespaceCatalog inherit from _BaseFakeCatalog, keep
FakeCatalog.create_namespace_if_not_exists as-is, and in
FakeCreateNamespaceCatalog keep the specialized __init__ (to init namespaces
set) and create_namespace (raising NamespaceAlreadyExistsError) but remove the
duplicated methods. Reference classes/methods: FakeCreateNamespaceCatalog,
FakeCatalog, _BaseFakeCatalog, load_table, create_table, list_tables,
create_namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_catalog_integration.py`:
- Around line 315-323: The test stub's parameter type for join_keys is
incorrect: change the type annotation on the stub function parameter join_keys
from tuple[str, ...] to a compatible type (preferably
collections.abc.Sequence[str] or list[str]) so it matches the actual value
produced by _normalize_columns (which returns a list); also add the necessary
import for Sequence if you choose that option and update any references to
join_keys in the stub to reflect the new Sequence typing.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 830e9caa-72c3-413e-ac7e-7297a2068238

📥 Commits

Reviewing files that changed from the base of the PR and between b341887 and 7b2885f.

📒 Files selected for processing (5)
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • tests/fakes.py
  • tests/test_catalog_integration.py
  • tests/test_publishing.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_publishing.py
  • src/iceberg_bioimage/publishing/image_assets.py

Comment thread tests/test_catalog_integration.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/test_publishing.py (1)

93-126: Consider adding assertion for created_namespaces.

Unlike test_publish_image_assets_creates_missing_table (line 85), this test doesn't assert on fake_catalog.created_namespaces. Adding an explicit assertion would document whether namespaces are created when using an existing legacy table, making the expected behavior clearer.

📝 Proposed addition
     assert row_count == 1
+    assert fake_catalog.created_namespaces == []  # or expected value if namespaces are created
     assert fake_catalog.created_identifiers == []
     assert len(legacy_table.appends) == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_publishing.py` around lines 93 - 126, In
test_publish_image_assets_uses_existing_legacy_table, add an assertion that
fake_catalog.created_namespaces is empty (e.g., assert
fake_catalog.created_namespaces == []) to document and verify that no new
namespaces are created when publishing to an existing legacy table; place this
assertion after the existing assertions on created_identifiers and
legacy_table.appends in the test function.
tests/fakes.py (1)

54-58: Remove redundant method override.

FakeCatalog.create_namespace_if_not_exists is identical to the inherited _BaseFakeCatalog.create_namespace_if_not_exists (lines 45-46). This override can be removed.

🔧 Proposed fix
 class FakeCatalog(_BaseFakeCatalog):
     """Minimal catalog stub."""
-
-    def create_namespace_if_not_exists(self, namespace: tuple[str, ...]) -> None:
-        self.created_namespaces.append(namespace)
+    pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fakes.py` around lines 54 - 58, The FakeCatalog class contains a
redundant override of create_namespace_if_not_exists that duplicates
_BaseFakeCatalog.create_namespace_if_not_exists; remove the method from
FakeCatalog so it inherits the base implementation (leave FakeCatalog definition
and other methods unchanged), and run tests to confirm no behavioral change.
src/iceberg_bioimage/integrations/cytomining.py (1)

366-374: Consider documenting memory implications for large datasets.

When source is a path, ds.dataset(source).to_table() loads the entire dataset into memory. For large profile datasets, this could cause memory pressure. Consider adding a docstring note about this limitation, or in future iterations, supporting streaming/batched writes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 366 - 374, Add
a brief docstring to _metadata_source_to_table noting that when source is a
filesystem path the call ds.dataset(source).to_table() materializes the entire
dataset into memory and can cause high memory usage for large profile datasets;
mention the limitation and recommend using smaller partitions or a
streaming/batched approach in future iterations to avoid memory pressure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 188-196: The function load_profile_column_aliases silently returns
an empty dict when the TOML payload lacks the microscopy.aliases section; update
load_profile_column_aliases to detect when alias_section is missing or empty
(after tomllib.loads and alias_section = payload.get("microscopy",
{}).get("aliases", {})) and either emit a clear warning via the module logger
(create or use an existing logger) indicating the missing [microscopy.aliases]
section and the file path, or raise a descriptive exception (e.g., ValueError)
so callers know aliases were not loaded; ensure the chosen behavior is
consistent with existing project error-handling patterns.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5170f16-4ebc-4ed4-ba66-a1652e5a89c8

📥 Commits

Reviewing files that changed from the base of the PR and between 7b2885f and 85b7255.

📒 Files selected for processing (13)
  • .pre-commit-config.yaml
  • docs/src/duckdb.md
  • docs/src/examples/basic-workflow.ipynb
  • docs/src/examples/basic-workflow.py
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/__init__.py
  • src/iceberg_bioimage/integrations/cytomining.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/validation/contracts.py
  • tests/fakes.py
  • tests/test_catalog_integration.py
  • tests/test_cytomining_integration.py
  • tests/test_publishing.py
✅ Files skipped from review due to trivial changes (5)
  • docs/src/duckdb.md
  • .pre-commit-config.yaml
  • tests/test_catalog_integration.py
  • docs/src/examples/basic-workflow.ipynb
  • docs/src/examples/basic-workflow.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/iceberg_bioimage/integrations/__init__.py
  • tests/test_cytomining_integration.py
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/duckdb.py

Comment thread src/iceberg_bioimage/validation/contracts.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/src/examples/basic-workflow.py (2)

69-73: Prevent silent overwrite in DemoCatalog.create_table.

Line 69 currently replaces existing tables without signaling an error. That can mask duplicate-create bugs in warehouse ingestion and discard prior in-memory appended data in this demo harness.

Suggested change
 def create_table(self, identifier: tuple[str, ...], schema: object) -> DemoTable:
+    if identifier in self.tables:
+        raise ValueError(f"Table already exists: {identifier!r}")
     self.created_identifiers.append(identifier)
     table = DemoTable()
     self.tables[identifier] = table
     return table
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/examples/basic-workflow.py` around lines 69 - 73,
DemoCatalog.create_table currently overwrites an existing table silently; change
it to first check if identifier already exists in self.tables and raise an
explicit error (e.g., ValueError) instead of replacing the entry so duplicates
are not lost. Ensure the existence check happens before appending to
self.created_identifiers and before instantiating DemoTable so
created_identifiers isn't duplicated when a table already exists.

95-108: Use an auto-cleaned temp directory for the example workspace.

Line 95 uses tempfile.mkdtemp(...) without cleanup. For local doc runs this leaves artifacts behind; wrapping with TemporaryDirectory keeps the example self-cleaning.

Suggested direction
-tmpdir = Path(tempfile.mkdtemp(prefix="iceberg-bioimage-demo-"))
+with tempfile.TemporaryDirectory(prefix="iceberg-bioimage-demo-") as tmp:
+    tmpdir = Path(tmp)
+    # keep the existing workflow inside this context
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/examples/basic-workflow.py` around lines 95 - 108, The example
creates a temporary workspace with tempfile.mkdtemp into tmpdir which is never
cleaned up; replace that usage with a context-managed TemporaryDirectory (from
tempfile) and assign tmpdir = Path(tempdir_ctx) inside the context so the
tmpdir, zarr_path, and tiff_path work inside the with block; update surrounding
code that uses tmpdir, zarr_path, root, zarr.open_group, and tifffile.imwrite to
live inside the with TemporaryDirectory() as tmpdir_ctx: block so the temp
directory is automatically removed when the example finishes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/run-tests.yml:
- Around line 76-80: The "Check notebook review pairs" step currently runs
"python -m jupytext --sync docs/src/examples/*.ipynb" and then only runs "git
diff --exit-code docs/src/examples", which misses new untracked paired files;
update that job step to run the same shell sequence used in pyproject.toml's
tasks.notebooks.shell by invoking "jupytext --sync" and then running "git status
--porcelain docs/src/examples" (or just "git status --porcelain") and failing if
it reports any changes/untracked files so the workflow detects both modified and
newly created paired files after "jupytext --sync".
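A hypothetical workflow step matching that description (step name and paths follow the PR's convention but are illustrative):

```yaml
- name: Check notebook review pairs
  run: |
    python -m jupytext --sync docs/src/examples/*.ipynb
    # Fail on modified *and* newly created paired files.
    test -z "$(git status --porcelain docs/src/examples)"
```

Unlike `git diff --exit-code`, `git status --porcelain` also reports untracked files, so a newly generated pair fails the check.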


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 92d855be-c870-45d7-b5b6-2d2ac58cd84a

📥 Commits

Reviewing files that changed from the base of the PR and between 85b7255 and 4cfe9da.

📒 Files selected for processing (2)
  • .github/workflows/run-tests.yml
  • docs/src/examples/basic-workflow.py

Comment thread .github/workflows/run-tests.yml Outdated
@d33bs d33bs merged commit 7fa773c into main Apr 8, 2026
9 checks passed
@d33bs d33bs deleted the cytomining branch April 8, 2026 18:12
@coderabbitai coderabbitai Bot mentioned this pull request Apr 9, 2026
11 tasks