
Initial development#1

Merged
d33bs merged 7 commits into main from init
Mar 28, 2026

Conversation

@d33bs
Member

@d33bs d33bs commented Mar 28, 2026

Summary by CodeRabbit

  • New Features

    • Package for scanning/registering Zarr and OME‑TIFF bioimages, publishing metadata to Iceberg, CLI for scan/register/publish/validate, optional DuckDB query helpers, and catalog-loading utilities; included runnable examples.
  • Documentation

    • Full docs site, API reference, quickstart, workflow and DuckDB guides, contributing guide, code of conduct, and citation metadata.
  • Tests

    • Extensive test suite covering adapters, publishing, registration, integrations, CLI, and validation.
  • Chores

    • CI workflows (tests, docs, releases), Dependabot, pre-commit and tooling configs, packaging metadata.

@d33bs d33bs requested a review from Copilot March 28, 2026 14:00
@coderabbitai

coderabbitai Bot commented Mar 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new Python package iceberg-bioimage with adapters for Zarr and OME‑TIFF, canonical scan models, publishing to Iceberg, DuckDB query/join helpers, CLI and examples, Sphinx docs, CI/CD workflows, pre-commit/tooling, and a comprehensive test suite.

Changes

Cohort / File(s) Summary
Repository CI / GitHub config
.github/ISSUE_TEMPLATE/issue.yml, .github/PULL_REQUEST_TEMPLATE.md, .github/dependabot.yml, .github/release-drafter.yml, .github/workflows/draft-release.yml, .github/workflows/publish-docs.yml, .github/workflows/publish-pypi.yml, .github/workflows/run-tests.yml
Added issue/PR templates, Dependabot, release-drafter, and workflows for drafting releases, publishing docs, publishing to PyPI, and running tests.
Project metadata & community files
CITATION.cff, CODE_OF_CONDUCT.md, CONTRIBUTING.md, README.md, .python-version, .gitignore
Added citation metadata, code of conduct, contribution guide, expanded README/quickstart, Python version hint, and minor .gitignore formatting edits.
Pre-commit & tooling
.pre-commit-config.yaml, pyproject.toml
Added pre-commit hooks and a full pyproject config (dependencies, dev/docs/duckdb groups, tooling config, entry point).
Documentation (Sphinx)
docs/src/conf.py, docs/src/_static/custom.css, docs/src/index.md, docs/src/python-api.md, docs/src/duckdb.md, docs/src/workflow.md
Added Sphinx configuration, theme/static assets, and multiple docs pages (API, DuckDB integration, workflow, index).
Core package init & models
src/iceberg_bioimage/__init__.py, src/iceberg_bioimage/models/__init__.py, src/iceberg_bioimage/models/scan_result.py
Introduced package public surface and canonical dataclasses: ImageAsset, ScanResult, ContractValidationResult, RegistrationResult with JSON helpers.
Adapter framework
src/iceberg_bioimage/adapters/base.py, src/iceberg_bioimage/adapters/zarr_v2.py, src/iceberg_bioimage/adapters/ome_tiff.py
Added BaseAdapter and concrete ZarrV2Adapter and OMETiffAdapter for auto-detected scanning of Zarr and OME‑TIFF stores.
Public API & registration
src/iceberg_bioimage/api.py
Added adapter registry _build_adapters(), scan_store(uri) adapter dispatch, and register_store(...) to publish image assets and optional chunk index.
Publishing layer
src/iceberg_bioimage/publishing/image_assets.py, src/iceberg_bioimage/publishing/chunk_index.py, src/iceberg_bioimage/publishing/__init__.py
Implemented conversion of ScanResult -> canonical rows, table resolution/create/append logic, and chunk-index row generation (including byte-length computation).
Validation
src/iceberg_bioimage/validation/__init__.py, src/iceberg_bioimage/validation/contracts.py
Added scan result validation and microscopy profile contract validators (required/recommended join keys and helpers to validate Arrow/Parquet datasets).
Optional integrations
src/iceberg_bioimage/integrations/__init__.py, src/iceberg_bioimage/integrations/catalog.py, src/iceberg_bioimage/integrations/duckdb.py
Added PyIceberg catalog helpers to load/scan tables into Arrow and DuckDB helpers for querying and joining metadata sources (Parquet/Arrow/row-lists).
CLI & examples
src/iceberg_bioimage/cli.py, examples/quickstart.py, examples/synthetic_workflow.py, examples/catalog_duckdb.py
Added iceberg-bioimage CLI (scan, register, validate-contract, publish-chunks) and three example scripts demonstrating quickstart, synthetic local workflow, and catalog-backed DuckDB join.
Notebooks & examples
src/notebooks/example_notebook.ipynb, src/notebooks/example_notebook.py
Added Jupytext-compatible example notebook demonstrating ScanResult/ImageAsset usage.
Test suite & test helpers
tests/fakes.py, tests/test_adapters.py, tests/test_api.py, tests/test_catalog_integration.py, tests/test_chunk_index.py, tests/test_cli.py, tests/test_duckdb.py, tests/test_publishing.py, tests/test_registration.py, tests/test_validation.py
Comprehensive tests covering adapters, scan/register workflows, catalog integration (stubs), chunk-index generation, CLI behavior, DuckDB integration, publishing flows, and validation.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant CLI as CLI
    participant Adapter as Adapter<br/>(Zarr/OME‑TIFF)
    participant Validator as Validator
    participant Publisher as Publisher<br/>(ImageAssets/ChunkIndex)
    participant Catalog as Iceberg<br/>Catalog

    User->>CLI: register_store(uri, catalog, namespace)
    CLI->>Adapter: scan(uri)
    Adapter-->>CLI: ScanResult
    CLI->>Validator: validate_scan_result(ScanResult)
    Validator-->>CLI: validation result
    CLI->>Publisher: publish_image_assets(ScanResult, catalog, namespace)
    Publisher->>Catalog: load_table/create_table/append(rows)
    Publisher-->>CLI: image_assets_rows_published
    CLI->>Publisher: publish_chunk_index(ScanResult, catalog, namespace)
    Publisher->>Catalog: load_table/create_table/append(chunk_rows)
    Publisher-->>CLI: chunk_rows_published
    CLI-->>User: RegistrationResult(source_uri, counts)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I hopped through Zarr and TIFF with glee,
I counted chunks and made a key,
Tables drafted, docs take flight,
DuckDB joins work through the night,
Metadata dances—publish, and we’re free! 🎉

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 28.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that lack them.
  • Title check — ❓ Inconclusive: the PR title 'Initial development' is vague and does not convey what the changeset (spanning infrastructure, documentation, APIs, and integrations) does. Resolution: use a more specific title, such as 'Add initial iceberg-bioimage package implementation with CLI and integrations', or split the work into multiple focused PRs.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.




Copilot AI left a comment


Pull request overview

Introduces the initial iceberg-bioimage package: canonical scan models, format adapters (Zarr/OME-TIFF), publishing helpers for Iceberg metadata tables, contract validation, optional DuckDB query/integration helpers, plus CLI/docs/examples and a CI setup.

Changes:

  • Add core library modules (adapters, models, publishing, validation, integrations) and a CLI entrypoint.
  • Add end-to-end tests covering scanning, publishing, joining/querying helpers, and CLI commands.
  • Add project scaffolding: docs, examples, pre-commit/CI workflows, and packaging metadata.

Reviewed changes

Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_validation.py Tests microscopy join-contract column + table validation.
tests/test_registration.py Tests high-level register_store workflow with monkeypatching.
tests/test_publishing.py Tests image asset row building and table creation behavior.
tests/test_duckdb.py Tests optional DuckDB helpers (skipped if duckdb missing).
tests/test_cli.py CLI smoke tests for scan/validate/register/publish-chunks.
tests/test_chunk_index.py Tests chunk index row derivation and publish behavior.
tests/test_catalog_integration.py Tests catalog-facing helpers and join workflow integration.
tests/test_api.py Tests adapter selection and canonical scanning behavior.
tests/conftest.py Pytest config/fixtures.
src/notebooks/example_notebook.py Jupytext-backed example notebook (Python).
src/notebooks/example_notebook.ipynb Example notebook (ipynb).
src/iceberg_bioimage/validation/contracts.py ScanResult + microscopy contract validation utilities.
src/iceberg_bioimage/validation/__init__.py Validation package marker.
src/iceberg_bioimage/publishing/image_assets.py Publishing helpers for canonical image_assets table.
src/iceberg_bioimage/publishing/chunk_index.py Publishing helpers for canonical chunk_index table.
src/iceberg_bioimage/publishing/__init__.py Publishing package exports.
src/iceberg_bioimage/models/scan_result.py Canonical dataclasses (ScanResult, ImageAsset, etc.).
src/iceberg_bioimage/models/__init__.py Models package marker.
src/iceberg_bioimage/integrations/duckdb.py Optional DuckDB query/join helpers over metadata tables.
src/iceberg_bioimage/integrations/catalog.py Catalog-facing helpers (load/scan tables; join workflow).
src/iceberg_bioimage/integrations/__init__.py Integrations package exports.
src/iceberg_bioimage/cli.py CLI implementation (scan/register/validate-contract/publish-chunks).
src/iceberg_bioimage/api.py Public API entry points (scan_store, register_store).
src/iceberg_bioimage/adapters/zarr_v2.py Zarr v2 scanning adapter.
src/iceberg_bioimage/adapters/ome_tiff.py OME-TIFF scanning adapter.
src/iceberg_bioimage/adapters/base.py Adapter interface contract.
src/iceberg_bioimage/adapters/__init__.py Adapters package marker.
src/iceberg_bioimage/__init__.py Top-level public exports for the package.
pyproject.toml Packaging metadata, deps, tooling config (ruff/pytest/poe/jupytext/vulture).
examples/synthetic_workflow.py End-to-end synthetic workflow example (local + DuckDB join).
examples/quickstart.py Minimal quickstart example.
examples/catalog_duckdb.py Example joining catalog-backed metadata using DuckDB helpers.
docs/src/workflow.md Workflow + scope documentation.
docs/src/python-api.md Sphinx autodoc API page.
docs/src/index.md Docs index/toctree.
docs/src/duckdb.md DuckDB integration documentation.
docs/src/conf.py Sphinx configuration.
docs/src/_static/custom.css Docs theme override stub.
README.md Expanded project overview + usage examples.
CONTRIBUTING.md Contributor workflow and local checks.
CODE_OF_CONDUCT.md Contributor Covenant code of conduct.
CITATION.cff Citation metadata.
.python-version Default Python version.
.pre-commit-config.yaml Pre-commit hooks (ruff-format/check, vulture, etc.).
.github/workflows/run-tests.yml CI: pre-commit, test matrix, docs build, optional duckdb tests.
.github/workflows/publish-pypi.yml Release publish workflow to PyPI.
.github/workflows/publish-docs.yml Docs publish workflow to GitHub Pages.
.github/workflows/draft-release.yml Release drafter workflow.
.github/release-drafter.yml Release drafter config.
.github/dependabot.yml Dependabot config.
.github/PULL_REQUEST_TEMPLATE.md PR template.
.github/ISSUE_TEMPLATE/issue.yml Issue template.


Comment threads:
  • tests/conftest.py (outdated)
  • src/iceberg_bioimage/adapters/zarr_v2.py (outdated)
  • src/iceberg_bioimage/integrations/catalog.py
  • src/notebooks/example_notebook.py (outdated)
  • src/notebooks/example_notebook.ipynb

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 19

🧹 Nitpick comments (16)
tests/conftest.py (1)

8-10: Consider replacing the placeholder-style fixture with an intent-focused one.

Lines 8-10 are valid, but my_data/"Hello, differently!" reads generic and can make test intent harder to follow as coverage grows.

♻️ Optional cleanup
-@pytest.fixture
-def my_data() -> str:
-    return "Hello, differently!"
+@pytest.fixture
+def sample_greeting() -> str:
+    return "Hello, world!"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/conftest.py` around lines 8 - 10, The fixture my_data returning "Hello,
differently!" is too generic—rename it to reflect its test intent (for example
success_message, greeting_fixture, or error_payload depending on usage) and
replace the placeholder string with a value that matches that intent; update all
tests that use my_data to reference the new fixture name (and, if needed, adjust
assertions to the new fixture value) so the fixture name and returned data
clearly communicate purpose (refer to the my_data fixture in tests/conftest.py
and any tests that import/use it).
src/iceberg_bioimage/publishing/image_assets.py (2)

125-189: Non-sequential field IDs in schema are intentional.

The field IDs (1, 2, 9, 11, 13, 14, 15, 16, 19) appear non-sequential. This is fine for Iceberg schemas as field IDs are stable identifiers, but a comment explaining the gaps (reserved for future fields or removed fields) would aid maintainability.

📝 Suggested documentation
 def _build_image_assets_schema() -> object:
+    # Field IDs are stable identifiers. Gaps are reserved for future columns
+    # or represent removed columns. Do not reuse IDs.
     try:
         from pyiceberg.schema import Schema
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 125 - 189, In
_build_image_assets_schema(), add a concise comment above the Schema
construction explaining that the non-sequential NestedField field_id values
(1,2,9,11,13,14,15,16,19) are intentional: they are stable Iceberg field
identifiers with gaps reserved for removed or future fields to preserve
compatibility; reference the function name _build_image_assets_schema and the
NestedField entries so reviewers can see this is deliberate and for
maintainability.

98-107: Fragile exception handling via class name string comparison.

Line 101 checks exc.__class__.__name__ != "NoSuchTableError" which is fragile - if PyIceberg renames or restructures exceptions, this will silently break. Consider importing and catching the specific exception type.

♻️ Suggested improvement
 def _load_or_create_table(
     catalog_or_name: str | SupportsCatalog,
     namespace: str | Iterable[str],
     table_name: str,
     *,
     schema_builder: Callable[[], object] | None = None,
 ) -> SupportsAppend:
     catalog = _resolve_catalog(catalog_or_name)
     identifier = (*_normalize_namespace(namespace), table_name)
 
+    try:
+        from pyiceberg.exceptions import NoSuchTableError
+    except ImportError:
+        NoSuchTableError = None  # type: ignore[misc, assignment]
+
     try:
         return catalog.load_table(identifier)
-    except Exception as exc:  # pragma: no cover - depends on active catalog backend
-        if getattr(exc.__class__, "__name__", "") != "NoSuchTableError":
-            raise
+    except Exception as exc:
+        if NoSuchTableError is None or not isinstance(exc, NoSuchTableError):
+            raise
 
     build_schema = (
         _build_image_assets_schema if schema_builder is None else schema_builder
     )
     return catalog.create_table(identifier, schema=build_schema())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 98 - 107, The
code uses a fragile string comparison on exc.__class__.__name__ to detect a
missing table; instead import and catch the concrete NoSuchTableError from the
PyIceberg/catalog backend and handle only that exception. Replace the broad
"except Exception as exc" block with an explicit "except NoSuchTableError:" (and
keep letting any other exceptions propagate), leaving the rest of the logic that
picks build_schema (schema_builder vs _build_image_assets_schema) and calls
catalog.create_table(identifier, schema=build_schema()) unchanged; reference
catalog.load_table, catalog.create_table, schema_builder and
_build_image_assets_schema when making the change.
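The fragility of name-based matching can be shown with a small standalone sketch. The classes here are local stand-ins, not PyIceberg's real exception hierarchy: a backend-specific subclass slips past a class-name comparison but is still caught by isinstance.

```python
# Local stand-ins, not PyIceberg's real exception hierarchy.
class NoSuchTableError(Exception):
    """Stand-in for the catalog backend's missing-table error."""


class BackendTableMissingError(NoSuchTableError):
    """A backend-specific subclass, as a derived/renamed error might appear."""


def is_missing_table_by_name(exc: Exception) -> bool:
    # Fragile: compares the class name string only.
    return exc.__class__.__name__ == "NoSuchTableError"


def is_missing_table_by_type(exc: Exception) -> bool:
    # Robust: honors the exception hierarchy.
    return isinstance(exc, NoSuchTableError)


exc = BackendTableMissingError("table not found")
print(is_missing_table_by_name(exc))  # False: the subclass slips through
print(is_missing_table_by_type(exc))  # True
```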
src/iceberg_bioimage/models/__init__.py (1)

1-1: Consider re-exporting key models for API convenience.

The __init__.py only contains a docstring. For a cleaner public API, consider re-exporting the main dataclasses so users can do from iceberg_bioimage.models import ScanResult instead of from iceberg_bioimage.models.scan_result import ScanResult.

♻️ Suggested addition
 """Canonical data models."""
+
+from iceberg_bioimage.models.scan_result import (
+    ContractValidationResult,
+    ImageAsset,
+    RegistrationResult,
+    ScanResult,
+)
+
+__all__ = [
+    "ContractValidationResult",
+    "ImageAsset",
+    "RegistrationResult",
+    "ScanResult",
+]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/models/__init__.py` at line 1, Add re-exports for the
package's main dataclasses so users can import them from
iceberg_bioimage.models; specifically import ScanResult from its module (e.g.,
from .scan_result import ScanResult) in __init__.py and add it to the module's
public API via __all__ (e.g., __all__ = ["ScanResult", ...]) or by setting
globals, repeating for any other primary model dataclasses you want to expose to
provide convenient top-level imports.
src/iceberg_bioimage/models/scan_result.py (1)

47-50: Type hint for json_kwargs is imprecise.

**json_kwargs: object allows any type, but json.dumps expects specific keyword arguments like indent: int, sort_keys: bool, etc. Consider using typing.Any or typing.Unpack with a TypedDict for stricter typing.

♻️ Suggested improvement
-    def to_json(self, **json_kwargs: object) -> str:
+    def to_json(self, **json_kwargs: Any) -> str:

This is a minor typing nit - Any is more conventional for pass-through kwargs.

Also applies to: 85-88, 108-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/models/scan_result.py` around lines 47 - 50, The type
hint for the pass-through kwargs on the to_json methods is too broad; change the
parameter annotation from **json_kwargs: object to **json_kwargs: Any (and add
"from typing import Any" at top) for conventional typing of JSON kwargs, or
alternatively use typing.Unpack with a TypedDict if you need stricter argument
shapes; update the signatures of the to_json methods (the one named to_json in
scan_result.py and the other occurrences referenced around lines 85–88 and
108–111) accordingly and add the import for Any.
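The pass-through pattern in question can be sketched with a hypothetical dataclass (ScanSummary below is illustrative, not the package's actual ScanResult): kwargs annotated as Any are forwarded verbatim to json.dumps.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ScanSummary:
    """Illustrative stand-in for the package's canonical dataclasses."""

    source_uri: str
    asset_count: int

    def to_json(self, **json_kwargs: Any) -> str:
        # Kwargs are forwarded verbatim to json.dumps (indent, sort_keys, ...).
        return json.dumps(asdict(self), **json_kwargs)


summary = ScanSummary(source_uri="file:///data/plate.zarr", asset_count=3)
print(summary.to_json(sort_keys=True))
```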
src/iceberg_bioimage/adapters/zarr_v2.py (1)

61-78: Redundant assignment on line 67.

path = array_path or None is redundant since array_path is already typed as str | None. If it's an empty string, converting to None may be intentional, but this should be explicit.

♻️ Clarify intent
     def _build_asset(
         self,
         uri: str,
         array_path: str | None,
         array: object,
     ) -> ImageAsset:
-        path = array_path or None
+        # Treat empty string as None for consistency
+        path = array_path if array_path else None
         metadata = {"store_name": Path(uri).name}

Or if empty strings are not expected, simply use array_path directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/adapters/zarr_v2.py` around lines 61 - 78, In
_build_asset the line "path = array_path or None" is redundant or ambiguous;
either remove the temporary and pass array_path directly to ImageAsset (use
array_path where path is referenced) if empty strings are not expected, or make
the intent explicit by replacing that assignment with an explicit empty-string
check (e.g., set path = None if array_path == "" else array_path) so callers
like self._coerce_chunks(...), self._image_id(uri, path) and the
ImageAsset(image_path=...) receive the intended value.
src/iceberg_bioimage/integrations/duckdb.py (1)

58-103: Hardcoded chunk_index columns may break if schema evolves.

Lines 81-82 hardcode chunk_key, chunk_coords_json, and byte_length for the chunk_index join. If the chunk_index schema changes, this will silently produce incorrect results or errors.

Consider defining these as a constant (similar to DEFAULT_JOIN_KEYS) for maintainability.

♻️ Suggested refactor
 DEFAULT_JOIN_KEYS = ("dataset_id", "image_id")
+CHUNK_INDEX_COLUMNS = ("chunk_key", "chunk_coords_json", "byte_length")
 
 ...
 
     if chunk_index is not None:
         _register_source(duckdb_connection, "chunk_index", chunk_index)
-        query.append(
-            "     , ci.chunk_key, ci.chunk_coords_json, ci.byte_length"
-        )
+        ci_cols = ", ".join(f"ci.{col}" for col in CHUNK_INDEX_COLUMNS)
+        query.append(f"     , {ci_cols}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 58 - 103, The
join_image_assets_with_profiles function hardcodes chunk_index column names
(chunk_key, chunk_coords_json, byte_length), which is fragile if chunk_index
schema changes; define a module-level constant (e.g., CHUNK_INDEX_FIELDS =
("chunk_key", "chunk_coords_json", "byte_length")) similar to DEFAULT_JOIN_KEYS
and replace the literal column list in the query-building (the portion that
appends " , ci.chunk_key, ci.chunk_coords_json, ci.byte_length") with a join of
that constant (", ".join(CHUNK_INDEX_FIELDS)) so the field list is centralized
and easier to update and reuse when registering or selecting from chunk_index.
src/iceberg_bioimage/validation/contracts.py (1)

84-91: Missing error handling for invalid dataset paths.

ds.dataset(path) will raise an exception if the path doesn't exist or isn't a valid dataset. Consider wrapping with a more descriptive error or documenting the expected exceptions.

🛡️ Suggested improvement
 def validate_microscopy_profile_table(path: str) -> ContractValidationResult:
     """Validate a local profile table file against the microscopy join contract."""
 
-    dataset = ds.dataset(path)
+    try:
+        dataset = ds.dataset(path)
+    except Exception as exc:
+        raise ValueError(
+            f"Failed to read dataset from {path!r}: {exc}"
+        ) from exc
     return validate_microscopy_profile_columns(
         list(dataset.schema.names),
         target=str(Path(path)),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/validation/contracts.py` around lines 84 - 91, The
function validate_microscopy_profile_table calls ds.dataset(path) without
guarding against invalid paths; wrap the ds.dataset(path) call in a try/except
that catches exceptions (e.g., OSError/ValueError/RuntimeError or a broad
Exception) and convert the error into a clear ContractValidationResult failure
(e.g., return ContractValidationResult(valid=False, errors=[f"Invalid dataset
path: {path}: {e}"])) or re-raise a more descriptive exception; update
validate_microscopy_profile_table to return that failure result when dataset
creation fails and keep the existing call to validate_microscopy_profile_columns
when successful.
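The `raise ... from exc` chaining suggested above preserves the low-level failure as `__cause__`; a standalone sketch with a stand-in loader (load_profile_dataset is hypothetical and simulates the backend failing rather than calling PyArrow):

```python
def load_profile_dataset(path: str) -> None:
    """Hypothetical loader that re-raises low-level errors descriptively."""
    try:
        # Stand-in for ds.dataset(path); simulate the backend failing.
        raise FileNotFoundError(path)
    except Exception as exc:
        raise ValueError(f"Failed to read dataset from {path!r}: {exc}") from exc


try:
    load_profile_dataset("/missing/profiles.parquet")
except ValueError as err:
    # The original exception survives on the chain for debugging.
    print(type(err.__cause__).__name__)  # FileNotFoundError
```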
src/iceberg_bioimage/publishing/chunk_index.py (1)

46-55: Validate input in the public row-conversion helper.

scan_result_to_chunk_rows is exported and callable directly; adding contract validation here gives consistent, high-level failures instead of lower-level shape/dtype exceptions.

Proposed refactor
 def scan_result_to_chunk_rows(scan_result: ScanResult) -> list[dict[str, object]]:
     """Convert a scan result into canonical chunk_index rows."""
+    raise_for_invalid_scan_result(scan_result)
 
     dataset_id = _dataset_id(scan_result.source_uri)
     rows: list[dict[str, object]] = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/publishing/chunk_index.py` around lines 46 - 55,
scan_result_to_chunk_rows should validate its public input before processing:
check that scan_result is not None, is an instance of ScanResult (or at least
has expected attributes), and that scan_result.source_uri is a non-empty string
and scan_result.image_assets is an iterable of asset-like objects; if validation
fails raise a clear ValueError/TypeError. Update scan_result_to_chunk_rows to
perform these precondition checks (before calling _dataset_id or iterating), and
surface a descriptive error message mentioning scan_result, source_uri, or
image_assets as appropriate so callers get high-level, consistent failures
instead of downstream shape/dtype exceptions from _dataset_id or
_asset_to_chunk_rows.
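The suggested preconditions can be sketched generically (chunk_rows_guarded and its checks are illustrative; the real helper's contract may differ): fail fast with high-level messages before any row derivation runs.

```python
from types import SimpleNamespace


def chunk_rows_guarded(scan_result: object) -> list:
    # Illustrative precondition checks: surface descriptive errors
    # instead of downstream attribute/shape exceptions.
    if scan_result is None:
        raise ValueError("scan_result must not be None")
    source_uri = getattr(scan_result, "source_uri", None)
    if not isinstance(source_uri, str) or not source_uri:
        raise ValueError("scan_result.source_uri must be a non-empty string")
    assets = getattr(scan_result, "image_assets", None)
    if assets is None:
        raise ValueError("scan_result.image_assets must be an iterable of assets")
    return list(assets)


ok = SimpleNamespace(source_uri="file:///data/store.zarr", image_assets=[])
print(chunk_rows_guarded(ok))  # []
```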
src/iceberg_bioimage/integrations/catalog.py (2)

16-22: Define a scan-focused catalog protocol in this module.

Reusing SupportsCatalog from publishing couples this reader path to append/create semantics. A local protocol that guarantees load_table(...) -> SupportsIcebergTable makes typing and intent tighter.

Also applies to: 56-65

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/catalog.py` around lines 16 - 22, This
module imports and reuses SupportsCatalog (aliased as SupportsLoadTableCatalog)
from the publishing package which conflates reader behavior with append/create
semantics; define a local protocol in this module (e.g., SupportsIcebergTable
and a new ScanCatalog protocol) that exposes only load_table(self, name: str, /,
...) -> SupportsIcebergTable (or appropriate return type) and replace references
to SupportsLoadTableCatalog with this new ScanCatalog in this file (also update
the usage locations around the other import block at lines ~56-65) so typing
reflects scan/read-only intent rather than publishing semantics.

90-118: Consider pushdown options in catalog joins to avoid full-table loads.

join_catalog_image_assets_with_profiles currently scans full image_assets and optional chunk_index into memory. For large catalogs, expose CatalogScanOptions (projection/filter/limit) per source to reduce IO and memory pressure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/catalog.py` around lines 90 - 118, The
function join_catalog_image_assets_with_profiles currently loads entire tables;
add optional CatalogScanOptions parameters (e.g., image_assets_scan_options:
CatalogScanOptions | None = None and chunk_index_scan_options:
CatalogScanOptions | None = None) to the join_catalog_image_assets_with_profiles
signature, pass image_assets_scan_options into the catalog_table_to_arrow call
for "image_assets" and pass chunk_index_scan_options into the
catalog_table_to_arrow call for the chunk_index (when chunk_index_table is
provided), and propagate these options into join_image_assets_with_profiles (or
add corresponding parameters) so callers can supply projection/filter/limit to
avoid full-table scans. Ensure types reference CatalogScanOptions and defaults
remain None so behavior is backward-compatible.
tests/test_registration.py (1)

64-80: Harden the no-chunk path with an explicit “must not call” assertion.

In test_register_store_without_chunks, consider monkeypatching iceberg_bioimage.api.publish_chunk_index to raise if invoked. That makes the intent explicit and catches accidental regressions immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_registration.py` around lines 64 - 80, In
test_register_store_without_chunks, ensure publish_chunk_index is explicitly
forbidden by monkeypatching iceberg_bioimage.api.publish_chunk_index to raise an
exception if called (so any accidental invocation fails the test); keep the
existing monkeypatch for publish_image_assets and the call to register_store,
then assert result.chunk_rows_published == 0 — this makes the no-chunk path
intent explicit and detects regressions immediately.
tests/test_chunk_index.py (1)

9-9: Avoid cross-test imports for shared fakes.

Importing FakeCatalog from tests/test_publishing.py couples two test modules. Consider moving shared fakes to a dedicated helper (for example, tests/fakes.py or tests/conftest.py) to keep tests independently maintainable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_chunk_index.py` at line 9, The test imports FakeCatalog from
another test module which couples tests; extract the FakeCatalog class into a
shared test helper (e.g., create tests/fakes.py or add it to tests/conftest.py)
and update tests/test_chunk_index.py to import FakeCatalog from that new helper
instead of tests/test_publishing; ensure the symbol name FakeCatalog remains
unchanged so existing tests (including any references in
tests/test_publishing.py) can import from the shared location and update any
import paths accordingly.
.github/workflows/draft-release.yml (1)

21-23: Pin action to a commit SHA for supply-chain safety.

Using a floating major tag (@v6) is convenient but less secure. Pinning to a specific commit SHA ensures the workflow always uses an immutable, auditable version. Current v6 resolves to the latest patch release.

To find the commit SHA for v6, check the release on GitHub: https://github.com/release-drafter/release-drafter/releases (look for the tag under "v6" or the latest v6.x release).

Suggested change
-      - uses: release-drafter/release-drafter@v6
+      - uses: release-drafter/release-drafter@<commit-sha>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/draft-release.yml around lines 21 - 23, Replace the
floating tag "release-drafter/release-drafter@v6" with a specific commit SHA to
pin the action; locate the v6 release on the release-drafter GitHub releases
page, copy the commit SHA for the v6.x tag you want, and update the workflow
line that currently reads uses: release-drafter/release-drafter@v6 to uses:
release-drafter/release-drafter@<commit-sha>; this ensures the workflow uses an
immutable, auditable version while keeping the environment variable GITHUB_TOKEN
unchanged.
README.md (1)

59-61: Optional readability polish for repeated sentence openings.

Lines 59–61 all start with “See”. Consider converting to a short bullet list for smoother scanning.

✍️ Suggested wording tweak
-See `examples/quickstart.py` for a minimal script.
-See `examples/catalog_duckdb.py` for a catalog-backed query example.
-See `examples/synthetic_workflow.py` for a self-contained local workflow.
+Example scripts:
+- `examples/quickstart.py` — minimal usage
+- `examples/catalog_duckdb.py` — catalog-backed query flow
+- `examples/synthetic_workflow.py` — self-contained local workflow
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 59 - 61, The three consecutive lines that each begin
with "See" (referencing examples/quickstart.py, examples/catalog_duckdb.py, and
examples/synthetic_workflow.py) should be converted into a short bullet list to
improve readability; replace the three separate "See ..." sentences with a
concise unordered list where each bullet names the example file and a one-line
description (e.g., "examples/quickstart.py — minimal script") to avoid repeated
sentence openings and make scanning easier.
tests/test_cli.py (1)

13-14: Use pytest's public MonkeyPatch import.

_pytest.monkeypatch is private pytest internals, so this annotation can break on a pytest refactor even if the public API stays stable. Please verify against the pytest version pinned in the repo, but the supported import here should be the public one.

♻️ Suggested change
-from _pytest.monkeypatch import MonkeyPatch
-from pytest import CaptureFixture
+from pytest import CaptureFixture, MonkeyPatch
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 13 - 14, Replace the private import of
MonkeyPatch with pytest's public export: change the import that currently reads
from _pytest.monkeypatch import MonkeyPatch to the public form (e.g., from
pytest import MonkeyPatch or use pytest.MonkeyPatch) so tests reference the
public symbol MonkeyPatch; confirm the project’s pinned pytest version exports
MonkeyPatch and update any test annotations that reference the private import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/dependabot.yml:
- Line 7: Add a YAML document start marker to the Dependabot config by inserting
a leading `---` above the existing content so the linter no longer reports a
document-start warning; update the file containing the `version: 2` entry
(referencing the `version` key in this Dependabot config) to begin with `---` on
its own line.

In @.github/ISSUE_TEMPLATE/issue.yml:
- Around line 17-19: Update the checkbox label string under the YAML key "label"
that currently reads "I found no existing covering this topic." to the clearer
wording "I found no existing issue covering this topic." so the label text (the
value for label) is corrected; locate the label entry in the ISSUE template
where the multiline label value is defined and replace the phrase accordingly.
- Around line 1-2: Add a YAML document start marker to the top of the GitHub
issue template to satisfy yamllint's document-start rule by inserting the
three-dash YAML document separator (`---`) as the first line of the
.github/ISSUE_TEMPLATE/issue.yml file so the file begins with the YAML document
start before the existing comment and content.

In @.github/PULL_REQUEST_TEMPLATE.md:
- Line 14: Fix the grammar typo in the pull request template by updating the
sentence in .github/PULL_REQUEST_TEMPLATE.md that currently reads "you may used"
to "you may use" so the guidance is grammatically correct; locate the exact line
containing "Are there any issues which are related to this pull request (you may
used a `#<digit>` to reference GitHub issues as links within this description)?"
and replace "used" with "use".

In @.github/workflows/draft-release.yml:
- Around line 6-10: The top-level YAML key on in draft-release.yml should be
quoted to silence the yamllint truthy value warning; update the file to change
the mapping key from on: to 'on': (preserving the same nested push->branches
configuration) so the workflow behavior remains identical but the linter warning
is resolved.

In @.github/workflows/publish-docs.yml:
- Line 34: The CI failure is due to an over-long single-line run step containing
the command "uv run --group docs --frozen sphinx-build -b html docs/src
docs/_build/html"; fix it by turning that run into a YAML multiline block (use
run: |) and either split the command across lines or add a backslash
continuation so each line is under the max length, locating the step that
contains the "uv run --group docs --frozen sphinx-build -b html docs/src
docs/_build/html" string and replacing the single long line with the multiline
block.

In @.github/workflows/run-tests.yml:
- Line 4: The workflow currently has a bare "on:" (which trips the truthy YAML
lint rule) and very long "run:" command lines that exceed the line-length rule;
change the trigger to an explicit event list (e.g., replace "on:" with "on:
[push, pull_request]" or specific events you need) and split or fold long shell
commands in the "run:" steps (use multiple run steps or YAML block scalars like
| or > to break long commands) so each command line is under the configured
length; update the same pattern for the other long run steps referenced (the
second occurrence around the later run block) and ensure indentation and quoting
remain valid for the workflow parser.
- Around line 14-20: Replace the mutable tag refs in the workflow's "uses:"
entries with their corresponding full 40-character commit SHAs for each
occurrence of actions/checkout@v4, actions/setup-python@v5 and
astral-sh/setup-uv@v6 (they appear in three job sections); update each "uses:"
to the immutable SHA for the exact release you want (found on each action's
GitHub releases/tags page) and keep the original tag as a comment if desired for
context (e.g., actions/checkout@<full-sha>  # v4.x.y).

In `@CODE_OF_CONDUCT.md`:
- Around line 61-64: Replace the vague GitHub-profile-email instruction in the
“Please reach out to the maintainers…” sentence with a concrete, monitored
reporting channel: add a dedicated, monitored contact (e.g.,
security@yourproject.org or a secure reporting form URL) and update the sentence
to direct reporters to that channel; also include who monitors it (e.g.,
“maintainers and the Code of Conduct committee”), expected acknowledgement
timeframe (e.g., “we will acknowledge within 48 hours”), and a note that reports
will be handled privately to ensure the contact is actionable and monitored.

In `@pyproject.toml`:
- Line 7: The pyproject currently defines a static version = "0.0.1" while also
including a [tool.setuptools_scm] block that writes a generated _version.py
which is never produced or imported; pick one strategy and remove the
conflicting bits accordingly: either remove the static version line and enable
dynamic SCM versioning (keep and ensure [tool.setuptools_scm] is correctly
configured and that the generated _version.py or equivalent is imported by your
package), or delete the entire [tool.setuptools_scm] section so the project uses
the explicit version field; update references to _version.py (if any) to match
the chosen approach.

In `@src/iceberg_bioimage/adapters/ome_tiff.py`:
- Around line 53-62: can_handle performs a case-insensitive check but _image_id
strips suffixes case-sensitively, so names like "image.OME.TIFF" won't have the
suffix removed; update _image_id to perform case-insensitive suffix removal (for
example, operate on a lowercased filename or use a case-insensitive match)
before applying removesuffix logic and still preserve the original stem for
return; locate the _image_id method and change the stem extraction to use a
casefold()/lower() variant for suffix checks (or regex with re.IGNORECASE) so
.ome.tiff/.OME.TIFF/.OmE.Tif etc. are all handled the same as can_handle.

In `@src/iceberg_bioimage/adapters/zarr_v2.py`:
- Around line 86-88: can_handle does case-insensitive matching but _image_id
strips the ".zarr" suffix case-sensitively; update _image_id (function
_image_id, variable stem) to normalize the store name before removing the suffix
(e.g., use a lower()/casefold()-based check or a case-insensitive endswith) so
"data.ZARR" and "data.zarr" both yield the same stem, then append array_path as
before.

In `@src/iceberg_bioimage/api.py`:
- Around line 61-63: The RegistrationResult currently sets source_uri=uri which
may differ from the canonical URI produced by the adapter; change it to use
scan_result.source_uri instead. Locate the return that constructs
RegistrationResult (the variables uri, scan_result, and image_assets_rows are in
scope) and replace the source_uri argument so it passes scan_result.source_uri,
ensuring the returned metadata reflects the actual scanned/published URI.

In `@src/iceberg_bioimage/cli.py`:
- Around line 76-81: Update main to catch handler exceptions and translate them
into CLI-friendly stderr messages and non-zero exit codes: wrap the call to
args.handler(args) in a try/except that catches ValueError (and any additional
catalog-specific exceptions you want to expose) raised by scan_store/api code,
print a concise error message to stderr (include the exception text) and return
a non-zero int; re-raise unexpected exceptions or return a different non-zero
code if you want to preserve tracebacks for debugging. Ensure you reference main
and args.handler in the change so the handler invocation is wrapped by the
error-handling logic.

In `@src/iceberg_bioimage/integrations/duckdb.py`:
- Around line 33-56: The where string passed into query_metadata_table is used
directly in relation.filter(), creating an SQL injection risk; either
require/trust only internal callers or implement input sanitization/safer API:
replace the raw where param with a validated filter or structured filter API
(e.g., a dict of column->value or a list of (column, op, value) tuples) and
build the expression with proper escaping/quoting, or validate the where string
against a strict whitelist of allowed tokens/characters (columns from
_relation_for_source.schema names, safe operators, numeric and quoted literals,
parentheses) before calling relation.filter(); update query_metadata_table to
use the new validated/structured filter input and refuse/unwrap unsafe where
values.
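One way to realize the structured-filter idea above is a small builder that accepts `(column, op, value)` tuples and renders a predicate with whitelisted operators and escaped literals; `ALLOWED_OPS` and `build_filter` are illustrative names, not part of the package's API.

```python
# Hypothetical sketch: render validated (column, op, value) tuples into a SQL
# predicate instead of accepting a raw `where` string.
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">=", "LIKE"}


def build_filter(conditions):
    """Build a safe SQL predicate from structured conditions."""
    clauses = []
    for column, op, value in conditions:
        if not column.isidentifier():
            raise ValueError(f"unsafe column name: {column!r}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        if isinstance(value, bool) or not isinstance(value, (int, float, str)):
            raise TypeError(f"unsupported literal type: {type(value)!r}")
        if isinstance(value, str):
            # Double any single quotes so the literal cannot break out.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        clauses.append(f'"{column}" {op} {literal}')
    return " AND ".join(clauses)
```

The resulting string could then be passed to `relation.filter()`, with column names additionally checked against the relation's schema if stricter validation is wanted.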

In `@src/iceberg_bioimage/publishing/chunk_index.py`:
- Around line 33-40: The code calls _load_or_create_table before checking for an
empty rows list which can create or mutate the catalog unnecessarily; move the
empty-row guard so that the function (the caller that currently invokes
_load_or_create_table — e.g., publish_chunk_index or the surrounding publish
function) returns 0 if not rows before any call to _load_or_create_table,
ensuring no table creation or catalog mutation occurs when there are zero chunk
rows.
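The guard ordering described above can be sketched as below; `_load_or_create_table` is stubbed here purely to show that the empty-rows path never touches the catalog, and the function bodies are illustrative rather than the package's real implementation.

```python
# Hypothetical sketch: return early on empty input so no table is created or
# mutated when there are zero chunk rows.
calls = []


def _load_or_create_table(catalog, identifier):
    """Stand-in for the real loader; records that it was invoked."""
    calls.append(identifier)

    class _Table:
        def append(self, rows):
            pass

    return _Table()


def publish_chunk_index(rows, catalog, identifier):
    """Publish chunk rows, guarding BEFORE any catalog mutation."""
    if not rows:
        return 0
    _load_or_create_table(catalog, identifier).append(rows)
    return len(rows)
```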

In `@src/notebooks/example_notebook.ipynb`:
- Line 21: Update the notebook markdown cell that contains the string
"browser-based Juypyter interface\n" by correcting the typo "Juypyter" to
"Jupyter" so the cell reads "browser-based Jupyter interface\n"; locate the
offending string in the notebook (e.g., the markdown cell content) and replace
the misspelled word.
- Around line 42-85: The notebook contains stale outputs (the HTML/text output
showing "Hello, notebook!" dataframe) that no longer match the subsequent cell
which constructs a ScanResult; open the notebook, clear the existing outputs for
the affected cells (the output block with the "Hello, notebook!" message) and
re-run the cells so the displayed outputs reflect the current code path
(including the ScanResult construction), then save the notebook before
committing so the saved outputs are consistent with the ScanResult-producing
cells.

In `@src/notebooks/example_notebook.py`:
- Line 28: Replace the misspelled user-facing string "Juypyter" with the correct
spelling "Jupyter" wherever it appears in the notebook text (look for the exact
token "Juypyter" in src/notebooks/example_notebook.py) so the displayed comment
reads "Jupyter" instead of "Juypyter".

---

Nitpick comments:
In @.github/workflows/draft-release.yml:
- Around line 21-23: Replace the floating tag
"release-drafter/release-drafter@v6" with a specific commit SHA to pin the
action; locate the v6 release on the release-drafter GitHub releases page, copy
the commit SHA for the v6.x tag you want, and update the workflow line that
currently reads uses: release-drafter/release-drafter@v6 to uses:
release-drafter/release-drafter@<commit-sha>; this ensures the workflow uses an
immutable, auditable version while keeping the environment variable GITHUB_TOKEN
unchanged.

In `@README.md`:
- Around line 59-61: The three consecutive lines that each begin with "See"
(referencing examples/quickstart.py, examples/catalog_duckdb.py, and
examples/synthetic_workflow.py) should be converted into a short bullet list to
improve readability; replace the three separate "See ..." sentences with a
concise unordered list where each bullet names the example file and a one-line
description (e.g., "examples/quickstart.py — minimal script") to avoid repeated
sentence openings and make scanning easier.

In `@src/iceberg_bioimage/adapters/zarr_v2.py`:
- Around line 61-78: In _build_asset the line "path = array_path or None" is
redundant or ambiguous; either remove the temporary and pass array_path directly
to ImageAsset (use array_path where path is referenced) if empty strings are not
expected, or make the intent explicit by replacing that assignment with an
explicit empty-string check (e.g., set path = None if array_path == "" else
array_path) so callers like self._coerce_chunks(...), self._image_id(uri, path)
and the ImageAsset(image_path=...) receive the intended value.

In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 16-22: This module imports and reuses SupportsCatalog (aliased as
SupportsLoadTableCatalog) from the publishing package which conflates reader
behavior with append/create semantics; define a local protocol in this module
(e.g., SupportsIcebergTable and a new ScanCatalog protocol) that exposes only
load_table(self, name: str, /, ...) -> SupportsIcebergTable (or appropriate
return type) and replace references to SupportsLoadTableCatalog with this new
ScanCatalog in this file (also update the usage locations around the other
import block at lines ~56-65) so typing reflects scan/read-only intent rather
than publishing semantics.
- Around line 90-118: The function join_catalog_image_assets_with_profiles
currently loads entire tables; add optional CatalogScanOptions parameters (e.g.,
image_assets_scan_options: CatalogScanOptions | None = None and
chunk_index_scan_options: CatalogScanOptions | None = None) to the
join_catalog_image_assets_with_profiles signature, pass
image_assets_scan_options into the catalog_table_to_arrow call for
"image_assets" and pass chunk_index_scan_options into the catalog_table_to_arrow
call for the chunk_index (when chunk_index_table is provided), and propagate
these options into join_image_assets_with_profiles (or add corresponding
parameters) so callers can supply projection/filter/limit to avoid full-table
scans. Ensure types reference CatalogScanOptions and defaults remain None so
behavior is backward-compatible.

In `@src/iceberg_bioimage/integrations/duckdb.py`:
- Around line 58-103: The join_image_assets_with_profiles function hardcodes
chunk_index column names (chunk_key, chunk_coords_json, byte_length), which is
fragile if chunk_index schema changes; define a module-level constant (e.g.,
CHUNK_INDEX_FIELDS = ("chunk_key", "chunk_coords_json", "byte_length")) similar
to DEFAULT_JOIN_KEYS and replace the literal column list in the query-building
(the portion that appends " , ci.chunk_key, ci.chunk_coords_json,
ci.byte_length") with a join of that constant (", ".join(CHUNK_INDEX_FIELDS)) so
the field list is centralized and easier to update and reuse when registering or
selecting from chunk_index.

In `@src/iceberg_bioimage/models/__init__.py`:
- Line 1: Add re-exports for the package's main dataclasses so users can import
them from iceberg_bioimage.models; specifically import ScanResult from its
module (e.g., from .scan_result import ScanResult) in __init__.py and add it to
the module's public API via __all__ (e.g., __all__ = ["ScanResult", ...]) or by
setting globals, repeating for any other primary model dataclasses you want to
expose to provide convenient top-level imports.

In `@src/iceberg_bioimage/models/scan_result.py`:
- Around line 47-50: The type hint for the pass-through kwargs on the to_json
methods is too broad; change the parameter annotation from **json_kwargs: object
to **json_kwargs: Any (and add "from typing import Any" at top) for conventional
typing of JSON kwargs, or alternatively use typing.Unpack with a TypedDict if
you need stricter argument shapes; update the signatures of the to_json methods
(the one named to_json in scan_result.py and the other occurrences referenced
around lines 85–88 and 108–111) accordingly and add the import for Any.

In `@src/iceberg_bioimage/publishing/chunk_index.py`:
- Around line 46-55: scan_result_to_chunk_rows should validate its public input
before processing: check that scan_result is not None, is an instance of
ScanResult (or at least has expected attributes), and that
scan_result.source_uri is a non-empty string and scan_result.image_assets is an
iterable of asset-like objects; if validation fails raise a clear
ValueError/TypeError. Update scan_result_to_chunk_rows to perform these
precondition checks (before calling _dataset_id or iterating), and surface a
descriptive error message mentioning scan_result, source_uri, or image_assets as
appropriate so callers get high-level, consistent failures instead of downstream
shape/dtype exceptions from _dataset_id or _asset_to_chunk_rows.
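The precondition checks suggested above might look like the following; the `ScanResult` dataclass here is a minimal stand-in for the real model, and the row shape is illustrative only.

```python
# Hypothetical sketch: validate public inputs up front so callers get clear,
# high-level errors instead of downstream shape/dtype failures.
from dataclasses import dataclass, field


@dataclass
class ScanResult:
    """Minimal stand-in for the package's ScanResult model."""
    source_uri: str
    image_assets: list = field(default_factory=list)


def scan_result_to_chunk_rows(scan_result):
    """Convert a scan result to chunk rows, failing fast on bad input."""
    if not isinstance(scan_result, ScanResult):
        raise TypeError("scan_result must be a ScanResult instance")
    if not isinstance(scan_result.source_uri, str) or not scan_result.source_uri:
        raise ValueError("scan_result.source_uri must be a non-empty string")
    return [(scan_result.source_uri, asset) for asset in scan_result.image_assets]
```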

In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 125-189: In _build_image_assets_schema(), add a concise comment
above the Schema construction explaining that the non-sequential NestedField
field_id values (1,2,9,11,13,14,15,16,19) are intentional: they are stable
Iceberg field identifiers with gaps reserved for removed or future fields to
preserve compatibility; reference the function name _build_image_assets_schema
and the NestedField entries so reviewers can see this is deliberate and for
maintainability.
- Around line 98-107: The code uses a fragile string comparison on
exc.__class__.__name__ to detect a missing table; instead import and catch the
concrete NoSuchTableError from the PyIceberg/catalog backend and handle only
that exception. Replace the broad "except Exception as exc" block with an
explicit "except NoSuchTableError:" (and keep letting any other exceptions
propagate), leaving the rest of the logic that picks build_schema
(schema_builder vs _build_image_assets_schema) and calls
catalog.create_table(identifier, schema=build_schema()) unchanged; reference
catalog.load_table, catalog.create_table, schema_builder and
_build_image_assets_schema when making the change.
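The exception-narrowing pattern above can be sketched like this; `NoSuchTableError` and `FakeCatalog` are stand-ins shaped like the PyIceberg types, not imports from the real library.

```python
# Hypothetical sketch: catch a concrete missing-table exception instead of
# string-matching on exc.__class__.__name__.
class NoSuchTableError(Exception):
    """Stand-in for pyiceberg.exceptions.NoSuchTableError."""


class FakeCatalog:
    """Catalog double that starts with no tables."""

    def __init__(self):
        self.created = []

    def load_table(self, identifier):
        raise NoSuchTableError(identifier)

    def create_table(self, identifier, schema):
        self.created.append((identifier, schema))
        return f"table:{identifier}"


def load_or_create(catalog, identifier, build_schema):
    """Load a table, creating it only on the explicit missing-table error."""
    try:
        return catalog.load_table(identifier)
    except NoSuchTableError:
        # Any other exception (auth failure, network error) propagates.
        return catalog.create_table(identifier, schema=build_schema())
```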

In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 84-91: The function validate_microscopy_profile_table calls
ds.dataset(path) without guarding against invalid paths; wrap the
ds.dataset(path) call in a try/except that catches exceptions (e.g.,
OSError/ValueError/RuntimeError or a broad Exception) and convert the error into
a clear ContractValidationResult failure (e.g., return
ContractValidationResult(valid=False, errors=[f"Invalid dataset path: {path}:
{e}"])) or re-raise a more descriptive exception; update
validate_microscopy_profile_table to return that failure result when dataset
creation fails and keep the existing call to validate_microscopy_profile_columns
when successful.

In `@tests/conftest.py`:
- Around line 8-10: The fixture my_data returning "Hello, differently!" is too
generic—rename it to reflect its test intent (for example success_message,
greeting_fixture, or error_payload depending on usage) and replace the
placeholder string with a value that matches that intent; update all tests that
use my_data to reference the new fixture name (and, if needed, adjust assertions
to the new fixture value) so the fixture name and returned data clearly
communicate purpose (refer to the my_data fixture in tests/conftest.py and any
tests that import/use it).

In `@tests/test_chunk_index.py`:
- Line 9: The test imports FakeCatalog from another test module which couples
tests; extract the FakeCatalog class into a shared test helper (e.g., create
tests/fakes.py or add it to tests/conftest.py) and update
tests/test_chunk_index.py to import FakeCatalog from that new helper instead of
tests/test_publishing; ensure the symbol name FakeCatalog remains unchanged so
existing tests (including any references in tests/test_publishing.py) can import
from the shared location and update any import paths accordingly.

In `@tests/test_cli.py`:
- Around line 13-14: Replace the private import of MonkeyPatch with pytest's
public export: change the import that currently reads from _pytest.monkeypatch
import MonkeyPatch to the public form (e.g., from pytest import MonkeyPatch or
use pytest.MonkeyPatch) so tests reference the public symbol MonkeyPatch;
confirm the project’s pinned pytest version exports MonkeyPatch and update any
test annotations that reference the private import.

In `@tests/test_registration.py`:
- Around line 64-80: In test_register_store_without_chunks, ensure
publish_chunk_index is explicitly forbidden by monkeypatching
iceberg_bioimage.api.publish_chunk_index to raise an exception if called (so any
accidental invocation fails the test); keep the existing monkeypatch for
publish_image_assets and the call to register_store, then assert
result.chunk_rows_published == 0 — this makes the no-chunk path intent explicit
and detects regressions immediately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b32f078-70c2-4e11-ade2-b06d877a3ded

📥 Commits

Reviewing files that changed from the base of the PR and between 79bda02 and af57a2e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (52)
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/dependabot.yml
  • .github/release-drafter.yml
  • .github/workflows/draft-release.yml
  • .github/workflows/publish-docs.yml
  • .github/workflows/publish-pypi.yml
  • .github/workflows/run-tests.yml
  • .pre-commit-config.yaml
  • .python-version
  • CITATION.cff
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • README.md
  • docs/src/_static/custom.css
  • docs/src/conf.py
  • docs/src/duckdb.md
  • docs/src/index.md
  • docs/src/python-api.md
  • docs/src/workflow.md
  • examples/catalog_duckdb.py
  • examples/quickstart.py
  • examples/synthetic_workflow.py
  • pyproject.toml
  • src/iceberg_bioimage/__init__.py
  • src/iceberg_bioimage/adapters/__init__.py
  • src/iceberg_bioimage/adapters/base.py
  • src/iceberg_bioimage/adapters/ome_tiff.py
  • src/iceberg_bioimage/adapters/zarr_v2.py
  • src/iceberg_bioimage/api.py
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/__init__.py
  • src/iceberg_bioimage/integrations/catalog.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/models/__init__.py
  • src/iceberg_bioimage/models/scan_result.py
  • src/iceberg_bioimage/publishing/__init__.py
  • src/iceberg_bioimage/publishing/chunk_index.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • src/iceberg_bioimage/validation/__init__.py
  • src/iceberg_bioimage/validation/contracts.py
  • src/notebooks/example_notebook.ipynb
  • src/notebooks/example_notebook.py
  • tests/conftest.py
  • tests/test_api.py
  • tests/test_catalog_integration.py
  • tests/test_chunk_index.py
  • tests/test_cli.py
  • tests/test_duckdb.py
  • tests/test_publishing.py
  • tests/test_registration.py
  • tests/test_validation.py


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/iceberg_bioimage/integrations/catalog.py (1)

92-124: Consider making the image_assets table name configurable.

The function hardcodes "image_assets" at line 107. For consistency with other APIs that accept table names as parameters (e.g., register_store), consider adding an image_assets_table parameter.

♻️ Suggested change
 def join_catalog_image_assets_with_profiles(
     catalog: str | SupportsScanCatalog,
     namespace: str | Sequence[str],
     profiles: MetadataSource,
     *,
+    image_assets_table: str = "image_assets",
     chunk_index_table: str | None = None,
     join_keys: Sequence[str] = DEFAULT_JOIN_KEYS,
     image_assets_scan_options: CatalogScanOptions | None = None,
     chunk_index_scan_options: CatalogScanOptions | None = None,
 ) -> pa.Table:
     """Join catalog-backed image metadata to a profile table."""

     image_assets = catalog_table_to_arrow(
         catalog,
         namespace,
-        "image_assets",
+        image_assets_table,
         scan_options=image_assets_scan_options,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/catalog.py` around lines 92 - 124, The
function join_catalog_image_assets_with_profiles currently hardcodes the table
name "image_assets"; add a new parameter image_assets_table: str | None =
"image_assets" to the function signature and use that parameter when calling
catalog_table_to_arrow instead of the literal "image_assets". Update the call:
image_assets = catalog_table_to_arrow(catalog, namespace, image_assets_table,
scan_options=image_assets_scan_options) and preserve existing defaults and
behavior so callers remain backward compatible; ensure the new parameter name
appears in the docstring and function signature alongside existing parameters
like chunk_index_table and join_keys.
tests/test_cli.py (1)

162-172: Consider a more readable approach for throwing in a lambda.

The generator-throw idiom (_ for _ in ()).throw(ValueError(...)) works but is obscure. A helper function would be clearer.

♻️ Suggested alternative
+def _raise_value_error(args: object) -> int:
+    raise ValueError("bad dataset")
+
 def test_main_returns_cli_error_for_value_error(
     monkeypatch: MonkeyPatch,
     capsys: CaptureFixture[str],
 ) -> None:
     monkeypatch.setattr(
         cli_module,
         "_handle_scan",
-        lambda args: (_ for _ in ()).throw(ValueError("bad dataset")),
+        _raise_value_error,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 162 - 172, Replace the obscure
generator-throw lambda used when patching cli_module._handle_scan with a simple
helper that explicitly raises the error: define a small function (e.g.,
raise_value_error) that takes the same args and raises ValueError("bad
dataset"), then monkeypatch.setattr cli_module, "_handle_scan", to that function
so the test remains equivalent but far more readable; keep assertions against
cli_module.main and CLI_VALUE_ERROR_EXIT_CODE unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 92-124: The function join_catalog_image_assets_with_profiles
currently hardcodes the table name "image_assets"; add a new parameter
image_assets_table: str | None = "image_assets" to the function signature and
use that parameter when calling catalog_table_to_arrow instead of the literal
"image_assets". Update the call: image_assets = catalog_table_to_arrow(catalog,
namespace, image_assets_table, scan_options=image_assets_scan_options) and
preserve existing defaults and behavior so callers remain backward compatible;
ensure the new parameter name appears in the docstring and function signature
alongside existing parameters like chunk_index_table and join_keys.

In `@tests/test_cli.py`:
- Around line 162-172: Replace the obscure generator-throw lambda used when
patching cli_module._handle_scan with a simple helper that explicitly raises the
error: define a small function (e.g., raise_value_error) that takes the same
args and raises ValueError("bad dataset"), then monkeypatch.setattr cli_module,
"_handle_scan", to that function so the test remains equivalent but far more
readable; keep assertions against cli_module.main and CLI_VALUE_ERROR_EXIT_CODE
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a4a2109-d674-4dcf-918b-f75a65b361e1

📥 Commits

Reviewing files that changed from the base of the PR and between af57a2e and 8b79584.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/dependabot.yml
  • .github/workflows/draft-release.yml
  • .github/workflows/publish-docs.yml
  • .github/workflows/publish-pypi.yml
  • .github/workflows/run-tests.yml
  • .gitignore
  • .pre-commit-config.yaml
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • README.md
  • pyproject.toml
  • src/iceberg_bioimage/adapters/ome_tiff.py
  • src/iceberg_bioimage/adapters/zarr_v2.py
  • src/iceberg_bioimage/api.py
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/catalog.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/models/__init__.py
  • src/iceberg_bioimage/models/scan_result.py
  • src/iceberg_bioimage/publishing/chunk_index.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • src/iceberg_bioimage/validation/contracts.py
  • src/notebooks/example_notebook.ipynb
  • src/notebooks/example_notebook.py
  • tests/conftest.py
  • tests/fakes.py
  • tests/test_adapters.py
  • tests/test_catalog_integration.py
  • tests/test_chunk_index.py
  • tests/test_cli.py
  • tests/test_duckdb.py
  • tests/test_publishing.py
  • tests/test_registration.py
  • tests/test_validation.py
✅ Files skipped from review due to trivial changes (14)
  • .gitignore
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/dependabot.yml
  • .github/workflows/publish-pypi.yml
  • README.md
  • src/notebooks/example_notebook.ipynb
  • .pre-commit-config.yaml
  • src/notebooks/example_notebook.py
  • .github/workflows/publish-docs.yml
  • .github/workflows/run-tests.yml
  • tests/fakes.py
  • tests/test_duckdb.py
  • .github/PULL_REQUEST_TEMPLATE.md
  • src/iceberg_bioimage/adapters/ome_tiff.py
🚧 Files skipped from review as they are similar to previous changes (11)
  • tests/conftest.py
  • .github/workflows/draft-release.yml
  • src/iceberg_bioimage/models/__init__.py
  • tests/test_catalog_integration.py
  • pyproject.toml
  • src/iceberg_bioimage/cli.py
  • src/iceberg_bioimage/integrations/duckdb.py
  • src/iceberg_bioimage/publishing/image_assets.py
  • tests/test_registration.py
  • tests/test_chunk_index.py
  • src/iceberg_bioimage/publishing/chunk_index.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
src/iceberg_bioimage/integrations/catalog.py (1)

99-100: Validate join_keys before forwarding.

A quick non-empty check here avoids generating invalid downstream SQL when join_keys is empty.

Proposed fix
 def join_catalog_image_assets_with_profiles(
@@
 ) -> pa.Table:
@@
+    if not join_keys:
+        raise ValueError("join_keys must contain at least one column")
+
     image_assets = catalog_table_to_arrow(

Also applies to: 131-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/iceberg_bioimage/integrations/catalog.py` around lines 99 - 100, Add a
guard that validates the join_keys parameter is non-empty before it is forwarded
or used: in the function signatures that accept join_keys (the one with
parameters "join_keys: Sequence[str] = DEFAULT_JOIN_KEYS,
image_assets_scan_options: CatalogScanOptions | None = None" and the other
occurrence around lines 131-135), check if not join_keys (or len(join_keys) ==
0) and raise a ValueError with a clear message like "join_keys must be a
non-empty sequence of column names" so downstream SQL generation never receives
an empty join key set.
tests/test_cli.py (2)

71-90: Use a structured fake return object to reduce interface drift risk.

Returning an ad-hoc inner class here is a bit loose; using a structured object with explicit fields (including source_uri) makes the test contract clearer and safer during CLI payload evolution.

♻️ Proposed refinement
+from types import SimpleNamespace
@@
-        class _Result:
-            image_assets_rows_published = 1
-            chunk_rows_published = 4
-
-        return _Result()
+        return SimpleNamespace(
+            source_uri=uri,
+            image_assets_rows_published=1,
+            chunk_rows_published=4,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 71 - 90, Replace the ad-hoc inner class
returned by _fake_register_store with a structured, explicit return object
(e.g., a dataclass, typing.NamedTuple, or types.SimpleNamespace) that declares
the expected fields (at minimum image_assets_rows_published,
chunk_rows_published, and source_uri) so the test surface is explicit and
resilient to CLI payload changes; update _fake_register_store to construct and
return that structured object instead of the inner class to reduce
interface-drift risk.
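As a standalone illustration of this suggestion, a fake built on `types.SimpleNamespace` could look like the sketch below (the fake function and its field values are hypothetical test scaffolding, not code from the repository):

```python
from types import SimpleNamespace


def _fake_register_store(uri: str) -> SimpleNamespace:
    # Explicit fields make the fake's contract visible to the test reader
    # and keep it aligned with the CLI payload as it evolves.
    return SimpleNamespace(
        source_uri=uri,
        image_assets_rows_published=1,
        chunk_rows_published=4,
    )


result = _fake_register_store("file:///data/example.zarr")
print(result.source_uri)  # → file:///data/example.zarr
```

Unlike an ad-hoc inner class, attribute access on a missing field fails loudly, so the test surfaces payload drift immediately.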

124-134: Strengthen the scan_result assertion in the publish-chunks fake.

`assert scan_result is not None` is weak and can miss URI-wiring regressions. Assert a concrete field tied to this test input.

♻️ Suggested assertion upgrade
-        assert scan_result is not None
+        assert getattr(scan_result, "source_uri", None) == str(store_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 124 - 134, The fake function
_fake_publish_chunk_index currently uses a weak assertion (assert scan_result is
not None); update it to assert a concrete field on scan_result that ties to the
test fixture (for example assert scan_result["uri"] == <expected_uri> or assert
getattr(scan_result, "uri", None) == <expected_uri>) so the test fails if the
scan_result URI wiring breaks—replace the None check with a lookup of the
specific field (e.g., "uri" or the actual fixture field name used by the test)
and compare it to the expected value from the test input.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 54-55: The columns parameter currently accepts Sequence[str] but a
plain string will be treated as an iterable of characters, so guard the columns
parameter where it is defined/used: detect if isinstance(columns, str) and
convert it to a single-item sequence (e.g., [columns]) or raise a clear
TypeError; apply this check both at the parameter site (the columns argument
declaration) and immediately before any projection-building logic that uses
columns so downstream code receives a proper Sequence[str].
- Around line 97-121: The image_assets_table parameter is incorrectly typed as
nullable (str | None) but is always used as a required table name when calling
catalog_table_to_arrow; change the function signature to make image_assets_table
a non-nullable str (e.g., image_assets_table: str = "image_assets") so callers
and type checkers know it cannot be None, and ensure any internal uses (the call
to catalog_table_to_arrow) and callers are updated to pass a concrete string
rather than None.

In `@tests/test_cli.py`:
- Around line 26-31: The subprocess.run calls in tests/test_cli.py (the
invocations of subprocess.run for "[sys.executable, '-m',
'iceberg_bioimage.cli', 'scan', str(store_path)]" and the other CLI invocation
around lines 50-61) are missing a timeout and can hang CI; add a reasonable
timeout argument (e.g., timeout=30) to each subprocess.run call to ensure tests
fail fast on hangs, keeping capture_output=True, text=True, and check=True
as-is.
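The timeout fix can be sketched as follows; the module invocation from the tests is replaced here with an inline `-c` script so the example runs standalone:

```python
import subprocess
import sys

# timeout=30 makes the test fail fast with subprocess.TimeoutExpired
# instead of hanging CI if the child process never exits.
result = subprocess.run(
    [sys.executable, "-c", "print('scan ok')"],
    capture_output=True,
    text=True,
    check=True,
    timeout=30,
)
print(result.stdout.strip())  # → scan ok
```

`check=True` still raises `CalledProcessError` on a nonzero exit code, so the existing failure semantics are unchanged.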

---

Nitpick comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 99-100: Add a guard that validates the join_keys parameter is
non-empty before it is forwarded or used: in the function signatures that accept
join_keys (the one with parameters "join_keys: Sequence[str] =
DEFAULT_JOIN_KEYS, image_assets_scan_options: CatalogScanOptions | None = None"
and the other occurrence around lines 131-135), check if not join_keys (or
len(join_keys) == 0) and raise a ValueError with a clear message like "join_keys
must be a non-empty sequence of column names" so downstream SQL generation never
receives an empty join key set.

In `@tests/test_cli.py`:
- Around line 71-90: Replace the ad-hoc inner class returned by
_fake_register_store with a structured, explicit return object (e.g., a
dataclass, typing.NamedTuple, or types.SimpleNamespace) that declares the
expected fields (at minimum image_assets_rows_published, chunk_rows_published,
and source_uri) so the test surface is explicit and resilient to CLI payload
changes; update _fake_register_store to construct and return that structured
object instead of the inner class to reduce interface-drift risk.
- Around line 124-134: The fake function _fake_publish_chunk_index currently
uses a weak assertion (assert scan_result is not None); update it to assert a
concrete field on scan_result that ties to the test fixture (for example assert
scan_result["uri"] == <expected_uri> or assert getattr(scan_result, "uri", None)
== <expected_uri>) so the test fails if the scan_result URI wiring
breaks—replace the None check with a lookup of the specific field (e.g., "uri"
or the actual fixture field name used by the test) and compare it to the
expected value from the test input.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f153d068-f20e-4bf3-ac83-7d2328503919

📥 Commits

Reviewing files that changed from the base of the PR and between 8b79584 and 3659c25.

📒 Files selected for processing (3)
  • src/iceberg_bioimage/integrations/catalog.py
  • tests/test_catalog_integration.py
  • tests/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_catalog_integration.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 100-117: The code currently allows a bare string in join_keys
which iterates into characters later; call the existing helper
_normalize_columns on join_keys to convert strings into a sequence of column
names and then validate non-empty. Specifically, replace the current join_keys
check with: normalized = _normalize_columns(join_keys, "join_keys"); if not
normalized: raise ValueError(...); join_keys = normalized — this ensures the
join clause (used in duckdb.py:87) receives a proper list of column names.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 320bc2c0-7b8b-41a2-86a8-a1ff60baece6

📥 Commits

Reviewing files that changed from the base of the PR and between 12c7828 and c010a1a.

📒 Files selected for processing (3)
  • src/iceberg_bioimage/integrations/catalog.py
  • tests/test_catalog_integration.py
  • tests/test_cli.py

@d33bs d33bs merged commit 579cfb0 into main Mar 28, 2026
17 checks passed
@d33bs d33bs deleted the init branch March 28, 2026 23:51