Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews, or the checkboxes below for quick actions:
📝 Walkthrough

Adds a new Python package, `iceberg-bioimage`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant CLI as CLI
    participant Adapter as Adapter<br/>(Zarr/OME‑TIFF)
    participant Validator as Validator
    participant Publisher as Publisher<br/>(ImageAssets/ChunkIndex)
    participant Catalog as Iceberg<br/>Catalog
    User->>CLI: register_store(uri, catalog, namespace)
    CLI->>Adapter: scan(uri)
    Adapter-->>CLI: ScanResult
    CLI->>Validator: validate_scan_result(ScanResult)
    Validator-->>CLI: validation result
    CLI->>Publisher: publish_image_assets(ScanResult, catalog, namespace)
    Publisher->>Catalog: load_table/create_table/append(rows)
    Publisher-->>CLI: image_assets_rows_published
    CLI->>Publisher: publish_chunk_index(ScanResult, catalog, namespace)
    Publisher->>Catalog: load_table/create_table/append(chunk_rows)
    Publisher-->>CLI: chunk_rows_published
    CLI-->>User: RegistrationResult(source_uri, counts)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Pull request overview
Introduces the initial iceberg-bioimage package: canonical scan models, format adapters (Zarr/OME-TIFF), publishing helpers for Iceberg metadata tables, contract validation, optional DuckDB query/integration helpers, plus CLI/docs/examples and a CI setup.
Changes:
- Add core library modules (adapters, models, publishing, validation, integrations) and a CLI entrypoint.
- Add end-to-end tests covering scanning, publishing, joining/querying helpers, and CLI commands.
- Add project scaffolding: docs, examples, pre-commit/CI workflows, and packaging metadata.
Reviewed changes
Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_validation.py | Tests microscopy join-contract column + table validation. |
| tests/test_registration.py | Tests high-level register_store workflow with monkeypatching. |
| tests/test_publishing.py | Tests image asset row building and table creation behavior. |
| tests/test_duckdb.py | Tests optional DuckDB helpers (skipped if duckdb missing). |
| tests/test_cli.py | CLI smoke tests for scan/validate/register/publish-chunks. |
| tests/test_chunk_index.py | Tests chunk index row derivation and publish behavior. |
| tests/test_catalog_integration.py | Tests catalog-facing helpers and join workflow integration. |
| tests/test_api.py | Tests adapter selection and canonical scanning behavior. |
| tests/conftest.py | Pytest config/fixtures. |
| src/notebooks/example_notebook.py | Jupytext-backed example notebook (Python). |
| src/notebooks/example_notebook.ipynb | Example notebook (ipynb). |
| src/iceberg_bioimage/validation/contracts.py | ScanResult + microscopy contract validation utilities. |
| src/iceberg_bioimage/validation/__init__.py | Validation package marker. |
| src/iceberg_bioimage/publishing/image_assets.py | Publishing helpers for canonical image_assets table. |
| src/iceberg_bioimage/publishing/chunk_index.py | Publishing helpers for canonical chunk_index table. |
| src/iceberg_bioimage/publishing/__init__.py | Publishing package exports. |
| src/iceberg_bioimage/models/scan_result.py | Canonical dataclasses (ScanResult, ImageAsset, etc.). |
| src/iceberg_bioimage/models/__init__.py | Models package marker. |
| src/iceberg_bioimage/integrations/duckdb.py | Optional DuckDB query/join helpers over metadata tables. |
| src/iceberg_bioimage/integrations/catalog.py | Catalog-facing helpers (load/scan tables; join workflow). |
| src/iceberg_bioimage/integrations/__init__.py | Integrations package exports. |
| src/iceberg_bioimage/cli.py | CLI implementation (scan/register/validate-contract/publish-chunks). |
| src/iceberg_bioimage/api.py | Public API entry points (scan_store, register_store). |
| src/iceberg_bioimage/adapters/zarr_v2.py | Zarr v2 scanning adapter. |
| src/iceberg_bioimage/adapters/ome_tiff.py | OME-TIFF scanning adapter. |
| src/iceberg_bioimage/adapters/base.py | Adapter interface contract. |
| src/iceberg_bioimage/adapters/__init__.py | Adapters package marker. |
| src/iceberg_bioimage/__init__.py | Top-level public exports for the package. |
| pyproject.toml | Packaging metadata, deps, tooling config (ruff/pytest/poe/jupytext/vulture). |
| examples/synthetic_workflow.py | End-to-end synthetic workflow example (local + DuckDB join). |
| examples/quickstart.py | Minimal quickstart example. |
| examples/catalog_duckdb.py | Example joining catalog-backed metadata using DuckDB helpers. |
| docs/src/workflow.md | Workflow + scope documentation. |
| docs/src/python-api.md | Sphinx autodoc API page. |
| docs/src/index.md | Docs index/toctree. |
| docs/src/duckdb.md | DuckDB integration documentation. |
| docs/src/conf.py | Sphinx configuration. |
| docs/src/_static/custom.css | Docs theme override stub. |
| README.md | Expanded project overview + usage examples. |
| CONTRIBUTING.md | Contributor workflow and local checks. |
| CODE_OF_CONDUCT.md | Contributor Covenant code of conduct. |
| CITATION.cff | Citation metadata. |
| .python-version | Default Python version. |
| .pre-commit-config.yaml | Pre-commit hooks (ruff-format/check, vulture, etc.). |
| .github/workflows/run-tests.yml | CI: pre-commit, test matrix, docs build, optional duckdb tests. |
| .github/workflows/publish-pypi.yml | Release publish workflow to PyPI. |
| .github/workflows/publish-docs.yml | Docs publish workflow to GitHub Pages. |
| .github/workflows/draft-release.yml | Release drafter workflow. |
| .github/release-drafter.yml | Release drafter config. |
| .github/dependabot.yml | Dependabot config. |
| .github/PULL_REQUEST_TEMPLATE.md | PR template. |
| .github/ISSUE_TEMPLATE/issue.yml | Issue template. |
Actionable comments posted: 19
🧹 Nitpick comments (16)
tests/conftest.py (1)
8-10: Consider replacing the placeholder-style fixture with an intent-focused one.

Lines 8-10 are valid, but `my_data` / `"Hello, differently!"` reads generic and can make test intent harder to follow as coverage grows.

♻️ Optional cleanup

```diff
-@pytest.fixture
-def my_data() -> str:
-    return "Hello, differently!"
+@pytest.fixture
+def sample_greeting() -> str:
+    return "Hello, world!"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/conftest.py` around lines 8 - 10, The fixture my_data returning "Hello, differently!" is too generic—rename it to reflect its test intent (for example success_message, greeting_fixture, or error_payload depending on usage) and replace the placeholder string with a value that matches that intent; update all tests that use my_data to reference the new fixture name (and, if needed, adjust assertions to the new fixture value) so the fixture name and returned data clearly communicate purpose (refer to the my_data fixture in tests/conftest.py and any tests that import/use it).

src/iceberg_bioimage/publishing/image_assets.py (2)
125-189: Non-sequential field IDs in schema are intentional.

The field IDs (1, 2, 9, 11, 13, 14, 15, 16, 19) appear non-sequential. This is fine for Iceberg schemas as field IDs are stable identifiers, but a comment explaining the gaps (reserved for future fields or removed fields) would aid maintainability.

📝 Suggested documentation

```diff
 def _build_image_assets_schema() -> object:
+    # Field IDs are stable identifiers. Gaps are reserved for future columns
+    # or represent removed columns. Do not reuse IDs.
     try:
         from pyiceberg.schema import Schema
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 125 - 189, In _build_image_assets_schema(), add a concise comment above the Schema construction explaining that the non-sequential NestedField field_id values (1,2,9,11,13,14,15,16,19) are intentional: they are stable Iceberg field identifiers with gaps reserved for removed or future fields to preserve compatibility; reference the function name _build_image_assets_schema and the NestedField entries so reviewers can see this is deliberate and for maintainability.
98-107: Fragile exception handling via class name string comparison.

Line 101 checks `exc.__class__.__name__ != "NoSuchTableError"`, which is fragile: if PyIceberg renames or restructures exceptions, this will silently break. Consider importing and catching the specific exception type.

♻️ Suggested improvement

```diff
 def _load_or_create_table(
     catalog_or_name: str | SupportsCatalog,
     namespace: str | Iterable[str],
     table_name: str,
     *,
     schema_builder: Callable[[], object] | None = None,
 ) -> SupportsAppend:
     catalog = _resolve_catalog(catalog_or_name)
     identifier = (*_normalize_namespace(namespace), table_name)
+    try:
+        from pyiceberg.exceptions import NoSuchTableError
+    except ImportError:
+        NoSuchTableError = None  # type: ignore[misc, assignment]
+
     try:
         return catalog.load_table(identifier)
-    except Exception as exc:  # pragma: no cover - depends on active catalog backend
-        if getattr(exc.__class__, "__name__", "") != "NoSuchTableError":
-            raise
+    except Exception as exc:
+        if NoSuchTableError is None or not isinstance(exc, NoSuchTableError):
+            raise
     build_schema = (
         _build_image_assets_schema if schema_builder is None else schema_builder
     )
     return catalog.create_table(identifier, schema=build_schema())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 98 - 107, The code uses a fragile string comparison on exc.__class__.__name__ to detect a missing table; instead import and catch the concrete NoSuchTableError from the PyIceberg/catalog backend and handle only that exception. Replace the broad "except Exception as exc" block with an explicit "except NoSuchTableError:" (and keep letting any other exceptions propagate), leaving the rest of the logic that picks build_schema (schema_builder vs _build_image_assets_schema) and calls catalog.create_table(identifier, schema=build_schema()) unchanged; reference catalog.load_table, catalog.create_table, schema_builder and _build_image_assets_schema when making the change.

src/iceberg_bioimage/models/__init__.py (1)
1-1: Consider re-exporting key models for API convenience.

The `__init__.py` only contains a docstring. For a cleaner public API, consider re-exporting the main dataclasses so users can do `from iceberg_bioimage.models import ScanResult` instead of `from iceberg_bioimage.models.scan_result import ScanResult`.

♻️ Suggested addition

```diff
 """Canonical data models."""
+
+from iceberg_bioimage.models.scan_result import (
+    ContractValidationResult,
+    ImageAsset,
+    RegistrationResult,
+    ScanResult,
+)
+
+__all__ = [
+    "ContractValidationResult",
+    "ImageAsset",
+    "RegistrationResult",
+    "ScanResult",
+]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/models/__init__.py` at line 1, Add re-exports for the package's main dataclasses so users can import them from iceberg_bioimage.models; specifically import ScanResult from its module (e.g., from .scan_result import ScanResult) in __init__.py and add it to the module's public API via __all__ (e.g., __all__ = ["ScanResult", ...]) or by setting globals, repeating for any other primary model dataclasses you want to expose to provide convenient top-level imports.

src/iceberg_bioimage/models/scan_result.py (1)
47-50: Type hint for `json_kwargs` is imprecise.

`**json_kwargs: object` allows any type, but `json.dumps` expects specific keyword arguments like `indent: int`, `sort_keys: bool`, etc. Consider using `typing.Any` or `typing.Unpack` with a TypedDict for stricter typing.

♻️ Suggested improvement

```diff
-    def to_json(self, **json_kwargs: object) -> str:
+    def to_json(self, **json_kwargs: Any) -> str:
```

This is a minor typing nit - `Any` is more conventional for pass-through kwargs.

Also applies to: 85-88, 108-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/models/scan_result.py` around lines 47 - 50, The type hint for the pass-through kwargs on the to_json methods is too broad; change the parameter annotation from **json_kwargs: object to **json_kwargs: Any (and add "from typing import Any" at top) for conventional typing of JSON kwargs, or alternatively use typing.Unpack with a TypedDict if you need stricter argument shapes; update the signatures of the to_json methods (the one named to_json in scan_result.py and the other occurrences referenced around lines 85–88 and 108–111) accordingly and add the import for Any.

src/iceberg_bioimage/adapters/zarr_v2.py (1)
61-78: Redundant assignment on line 67.

`path = array_path or None` is redundant since `array_path` is already typed as `str | None`. If it's an empty string, converting to `None` may be intentional, but this should be explicit.

♻️ Clarify intent

```diff
 def _build_asset(
     self,
     uri: str,
     array_path: str | None,
     array: object,
 ) -> ImageAsset:
-    path = array_path or None
+    # Treat empty string as None for consistency
+    path = array_path if array_path else None
     metadata = {"store_name": Path(uri).name}
```

Or if empty strings are not expected, simply use `array_path` directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/adapters/zarr_v2.py` around lines 61 - 78, In _build_asset the line "path = array_path or None" is redundant or ambiguous; either remove the temporary and pass array_path directly to ImageAsset (use array_path where path is referenced) if empty strings are not expected, or make the intent explicit by replacing that assignment with an explicit empty-string check (e.g., set path = None if array_path == "" else array_path) so callers like self._coerce_chunks(...), self._image_id(uri, path) and the ImageAsset(image_path=...) receive the intended value.

src/iceberg_bioimage/integrations/duckdb.py (1)
58-103: Hardcoded chunk_index columns may break if schema evolves.

Lines 81-82 hardcode `chunk_key`, `chunk_coords_json`, and `byte_length` for the chunk_index join. If the chunk_index schema changes, this will silently produce incorrect results or errors.

Consider defining these as a constant (similar to `DEFAULT_JOIN_KEYS`) for maintainability.

♻️ Suggested refactor

```diff
 DEFAULT_JOIN_KEYS = ("dataset_id", "image_id")
+CHUNK_INDEX_COLUMNS = ("chunk_key", "chunk_coords_json", "byte_length")
 ...
     if chunk_index is not None:
         _register_source(duckdb_connection, "chunk_index", chunk_index)
-        query.append(
-            " , ci.chunk_key, ci.chunk_coords_json, ci.byte_length"
-        )
+        ci_cols = ", ".join(f"ci.{col}" for col in CHUNK_INDEX_COLUMNS)
+        query.append(f" , {ci_cols}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 58 - 103, The join_image_assets_with_profiles function hardcodes chunk_index column names (chunk_key, chunk_coords_json, byte_length), which is fragile if chunk_index schema changes; define a module-level constant (e.g., CHUNK_INDEX_FIELDS = ("chunk_key", "chunk_coords_json", "byte_length")) similar to DEFAULT_JOIN_KEYS and replace the literal column list in the query-building (the portion that appends " , ci.chunk_key, ci.chunk_coords_json, ci.byte_length") with a join of that constant (", ".join(CHUNK_INDEX_FIELDS)) so the field list is centralized and easier to update and reuse when registering or selecting from chunk_index.

src/iceberg_bioimage/validation/contracts.py (1)
84-91: Missing error handling for invalid dataset paths.

`ds.dataset(path)` will raise an exception if the path doesn't exist or isn't a valid dataset. Consider wrapping with a more descriptive error or documenting the expected exceptions.

🛡️ Suggested improvement

```diff
 def validate_microscopy_profile_table(path: str) -> ContractValidationResult:
     """Validate a local profile table file against the microscopy join contract."""
-    dataset = ds.dataset(path)
+    try:
+        dataset = ds.dataset(path)
+    except Exception as exc:
+        raise ValueError(
+            f"Failed to read dataset from {path!r}: {exc}"
+        ) from exc
     return validate_microscopy_profile_columns(
         list(dataset.schema.names),
         target=str(Path(path)),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/validation/contracts.py` around lines 84 - 91, The function validate_microscopy_profile_table calls ds.dataset(path) without guarding against invalid paths; wrap the ds.dataset(path) call in a try/except that catches exceptions (e.g., OSError/ValueError/RuntimeError or a broad Exception) and convert the error into a clear ContractValidationResult failure (e.g., return ContractValidationResult(valid=False, errors=[f"Invalid dataset path: {path}: {e}"])) or re-raise a more descriptive exception; update validate_microscopy_profile_table to return that failure result when dataset creation fails and keep the existing call to validate_microscopy_profile_columns when successful.

src/iceberg_bioimage/publishing/chunk_index.py (1)
46-55: Validate input in the public row-conversion helper.

`scan_result_to_chunk_rows` is exported and callable directly; adding contract validation here gives consistent, high-level failures instead of lower-level shape/dtype exceptions.

Proposed refactor

```diff
 def scan_result_to_chunk_rows(scan_result: ScanResult) -> list[dict[str, object]]:
     """Convert a scan result into canonical chunk_index rows."""
+    raise_for_invalid_scan_result(scan_result)
     dataset_id = _dataset_id(scan_result.source_uri)
     rows: list[dict[str, object]] = []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/publishing/chunk_index.py` around lines 46 - 55, scan_result_to_chunk_rows should validate its public input before processing: check that scan_result is not None, is an instance of ScanResult (or at least has expected attributes), and that scan_result.source_uri is a non-empty string and scan_result.image_assets is an iterable of asset-like objects; if validation fails raise a clear ValueError/TypeError. Update scan_result_to_chunk_rows to perform these precondition checks (before calling _dataset_id or iterating), and surface a descriptive error message mentioning scan_result, source_uri, or image_assets as appropriate so callers get high-level, consistent failures instead of downstream shape/dtype exceptions from _dataset_id or _asset_to_chunk_rows.

src/iceberg_bioimage/integrations/catalog.py (2)
16-22: Define a scan-focused catalog protocol in this module.

Reusing `SupportsCatalog` from publishing couples this reader path to append/create semantics. A local protocol that guarantees `load_table(...) -> SupportsIcebergTable` makes typing and intent tighter.

Also applies to: 56-65
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/catalog.py` around lines 16 - 22, This module imports and reuses SupportsCatalog (aliased as SupportsLoadTableCatalog) from the publishing package which conflates reader behavior with append/create semantics; define a local protocol in this module (e.g., SupportsIcebergTable and a new ScanCatalog protocol) that exposes only load_table(self, name: str, /, ...) -> SupportsIcebergTable (or appropriate return type) and replace references to SupportsLoadTableCatalog with this new ScanCatalog in this file (also update the usage locations around the other import block at lines ~56-65) so typing reflects scan/read-only intent rather than publishing semantics.
90-118: Consider pushdown options in catalog joins to avoid full-table loads.

`join_catalog_image_assets_with_profiles` currently scans full `image_assets` and optional `chunk_index` into memory. For large catalogs, expose `CatalogScanOptions` (projection/filter/limit) per source to reduce IO and memory pressure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/catalog.py` around lines 90 - 118, The function join_catalog_image_assets_with_profiles currently loads entire tables; add optional CatalogScanOptions parameters (e.g., image_assets_scan_options: CatalogScanOptions | None = None and chunk_index_scan_options: CatalogScanOptions | None = None) to the join_catalog_image_assets_with_profiles signature, pass image_assets_scan_options into the catalog_table_to_arrow call for "image_assets" and pass chunk_index_scan_options into the catalog_table_to_arrow call for the chunk_index (when chunk_index_table is provided), and propagate these options into join_image_assets_with_profiles (or add corresponding parameters) so callers can supply projection/filter/limit to avoid full-table scans. Ensure types reference CatalogScanOptions and defaults remain None so behavior is backward-compatible.

tests/test_registration.py (1)
64-80: Harden the no-chunk path with an explicit “must not call” assertion.

In `test_register_store_without_chunks`, consider monkeypatching `iceberg_bioimage.api.publish_chunk_index` to raise if invoked. That makes the intent explicit and catches accidental regressions immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_registration.py` around lines 64 - 80, In test_register_store_without_chunks, ensure publish_chunk_index is explicitly forbidden by monkeypatching iceberg_bioimage.api.publish_chunk_index to raise an exception if called (so any accidental invocation fails the test); keep the existing monkeypatch for publish_image_assets and the call to register_store, then assert result.chunk_rows_published == 0 — this makes the no-chunk path intent explicit and detects regressions immediately.

tests/test_chunk_index.py (1)
9-9: Avoid cross-test imports for shared fakes.

Importing `FakeCatalog` from `tests/test_publishing.py` couples two test modules. Consider moving shared fakes to a dedicated helper (for example, `tests/fakes.py` or `tests/conftest.py`) to keep tests independently maintainable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_chunk_index.py` at line 9, The test imports FakeCatalog from another test module which couples tests; extract the FakeCatalog class into a shared test helper (e.g., create tests/fakes.py or add it to tests/conftest.py) and update tests/test_chunk_index.py to import FakeCatalog from that new helper instead of tests/test_publishing; ensure the symbol name FakeCatalog remains unchanged so existing tests (including any references in tests/test_publishing.py) can import from the shared location and update any import paths accordingly.

.github/workflows/draft-release.yml (1)
21-23: Pin action to a commit SHA for supply-chain safety.

Using a floating major tag (`@v6`) is convenient but less secure. Pinning to a specific commit SHA ensures the workflow always uses an immutable, auditable version. Current v6 resolves to the latest patch release.

To find the commit SHA for v6, check the release on GitHub: https://github.com/release-drafter/release-drafter/releases (look for the tag under "v6" or the latest v6.x release).

Suggested change

```diff
-      - uses: release-drafter/release-drafter@v6
+      - uses: release-drafter/release-drafter@<commit-sha>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/draft-release.yml around lines 21 - 23, Replace the floating tag "release-drafter/release-drafter@v6" with a specific commit SHA to pin the action; locate the v6 release on the release-drafter GitHub releases page, copy the commit SHA for the v6.x tag you want, and update the workflow line that currently reads uses: release-drafter/release-drafter@v6 to uses: release-drafter/release-drafter@<commit-sha>; this ensures the workflow uses an immutable, auditable version while keeping the environment variable GITHUB_TOKEN unchanged.

README.md (1)
59-61: Optional readability polish for repeated sentence openings.

Lines 59–61 all start with “See”. Consider converting to a short bullet list for smoother scanning.

✍️ Suggested wording tweak

```diff
-See `examples/quickstart.py` for a minimal script.
-See `examples/catalog_duckdb.py` for a catalog-backed query example.
-See `examples/synthetic_workflow.py` for a self-contained local workflow.
+Example scripts:
+- `examples/quickstart.py` — minimal usage
+- `examples/catalog_duckdb.py` — catalog-backed query flow
+- `examples/synthetic_workflow.py` — self-contained local workflow
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 59 - 61, The three consecutive lines that each begin with "See" (referencing examples/quickstart.py, examples/catalog_duckdb.py, and examples/synthetic_workflow.py) should be converted into a short bullet list to improve readability; replace the three separate "See ..." sentences with a concise unordered list where each bullet names the example file and a one-line description (e.g., "examples/quickstart.py — minimal script") to avoid repeated sentence openings and make scanning easier.

tests/test_cli.py (1)
13-14: Use pytest's public `MonkeyPatch` import.

`_pytest.monkeypatch` is private pytest internals, so this annotation can break on a pytest refactor even if the public API stays stable. Please verify against the pytest version pinned in the repo, but the supported import here should be the public one.

♻️ Suggested change

```diff
-from _pytest.monkeypatch import MonkeyPatch
-from pytest import CaptureFixture
+from pytest import CaptureFixture, MonkeyPatch
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 13 - 14, Replace the private import of MonkeyPatch with pytest's public export: change the import that currently reads from _pytest.monkeypatch import MonkeyPatch to the public form (e.g., from pytest import MonkeyPatch or use pytest.MonkeyPatch) so tests reference the public symbol MonkeyPatch; confirm the project’s pinned pytest version exports MonkeyPatch and update any test annotations that reference the private import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/dependabot.yml:
- Line 7: Add a YAML document start marker to the Dependabot config by inserting
a leading `---` above the existing content so the linter no longer reports a
document-start warning; update the file containing the `version: 2` entry
(referencing the `version` key in this Dependabot config) to begin with `---` on
its own line.
In @.github/ISSUE_TEMPLATE/issue.yml:
- Around line 17-19: Update the checkbox label string under the YAML key "label"
that currently reads "I found no existing covering this topic." to the clearer
wording "I found no existing issue covering this topic." so the label text (the
value for label) is corrected; locate the label entry in the ISSUE template
where the multiline label value is defined and replace the phrase accordingly.
- Around line 1-2: Add a YAML document start marker to the top of the GitHub
issue template to satisfy yamllint's document-start rule by inserting the
three-dash YAML document separator (`---`) as the first line of the
.github/ISSUE_TEMPLATE/issue.yml file so the file begins with the YAML document
start before the existing comment and content.
In @.github/PULL_REQUEST_TEMPLATE.md:
- Line 14: Fix the grammar typo in the pull request template by updating the
sentence in .github/PULL_REQUEST_TEMPLATE.md that currently reads "you may used"
to "you may use" so the guidance is grammatically correct; locate the exact line
containing "Are there any issues which are related to this pull request (you may
used a `#<digit>` to reference GitHub issues as links within this description)?"
and replace "used" with "use".
In @.github/workflows/draft-release.yml:
- Around line 6-10: The top-level YAML key on in draft-release.yml should be
quoted to silence the yamllint truthy value warning; update the file to change
the mapping key from on: to 'on': (preserving the same nested push->branches
configuration) so the workflow behavior remains identical but the linter warning
is resolved.
In @.github/workflows/publish-docs.yml:
- Line 34: The CI failure is due to an over-long single-line run step containing
the command "uv run --group docs --frozen sphinx-build -b html docs/src
docs/_build/html"; fix it by turning that run into a YAML multiline block (use
run: |) and either split the command across lines or add a backslash
continuation so each line is under the max length, locating the step that
contains the "uv run --group docs --frozen sphinx-build -b html docs/src
docs/_build/html" string and replacing the single long line with the multiline
block.
In @.github/workflows/run-tests.yml:
- Line 4: The workflow currently has a bare "on:" (which trips the truthy YAML
lint rule) and very long "run:" command lines that exceed the line-length rule;
change the trigger to an explicit event list (e.g., replace "on:" with "on:
[push, pull_request]" or specific events you need) and split or fold long shell
commands in the "run:" steps (use multiple run steps or YAML block scalars like
| or > to break long commands) so each command line is under the configured
length; update the same pattern for the other long run steps referenced (the
second occurrence around the later run block) and ensure indentation and quoting
remain valid for the workflow parser.
- Around line 14-20: Replace the mutable tag refs in the workflow's "uses:"
entries with their corresponding full 40-character commit SHAs for each
occurrence of actions/checkout@v4, actions/setup-python@v5 and
astral-sh/setup-uv@v6 (they appear in three job sections); update each "uses:"
to the immutable SHA for the exact release you want (found on each action's
GitHub releases/tags page) and keep the original tag as a comment if desired for
context (e.g., actions/checkout@<full-sha> # v4.x.y).
In `@CODE_OF_CONDUCT.md`:
- Around line 61-64: Replace the vague GitHub-profile-email instruction in the
“Please reach out to the maintainers…” sentence with a concrete, monitored
reporting channel: add a dedicated, monitored contact (e.g.,
security@yourproject.org or a secure reporting form URL) and update the sentence
to direct reporters to that channel; also include who monitors it (e.g.,
“maintainers and the Code of Conduct committee”), expected acknowledgement
timeframe (e.g., “we will acknowledge within 48 hours”), and a note that reports
will be handled privately to ensure the contact is actionable and monitored.
In `@pyproject.toml`:
- Line 7: The pyproject currently defines a static version = "0.0.1" while also
including a [tool.setuptools_scm] block that writes a generated _version.py
which is never produced or imported; pick one strategy and remove the
conflicting bits accordingly: either remove the static version line and enable
dynamic SCM versioning (keep and ensure [tool.setuptools_scm] is correctly
configured and that the generated _version.py or equivalent is imported by your
package), or delete the entire [tool.setuptools_scm] section so the project uses
the explicit version field; update references to _version.py (if any) to match
the chosen approach.
In `@src/iceberg_bioimage/adapters/ome_tiff.py`:
- Around line 53-62: can_handle performs a case-insensitive check but _image_id
strips suffixes case-sensitively, so names like "image.OME.TIFF" won't have the
suffix removed; update _image_id to perform case-insensitive suffix removal (for
example, operate on a lowercased filename or use a case-insensitive match)
before applying removesuffix logic and still preserve the original stem for
return; locate the _image_id method and change the stem extraction to use a
casefold()/lower() variant for suffix checks (or regex with re.IGNORECASE) so
.ome.tiff/.OME.TIFF/.OmE.Tif etc. are all handled the same as can_handle.
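A minimal sketch of what a case-insensitive suffix strip could look like; `image_stem` and the regex are illustrative, not the actual `_image_id` implementation:

```python
import re

# Matches .tif/.tiff with an optional .ome segment, ignoring case.
_OME_SUFFIX = re.compile(r"\.(?:ome\.)?tiff?$", re.IGNORECASE)

def image_stem(filename: str) -> str:
    """Strip .ome.tiff / .OME.TIFF / .OmE.Tif etc., preserving the stem's original casing."""
    return _OME_SUFFIX.sub("", filename)
```

Because only the suffix match is case-insensitive, the original stem casing survives intact for the returned identifier.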
In `@src/iceberg_bioimage/adapters/zarr_v2.py`:
- Around line 86-88: can_handle does case-insensitive matching but _image_id
strips the ".zarr" suffix case-sensitively; update _image_id (function
_image_id, variable stem) to normalize the store name before removing the suffix
(e.g., use a lower()/casefold()-based check or a case-insensitive endswith) so
"data.ZARR" and "data.zarr" both yield the same stem, then append array_path as
before.
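A non-regex alternative for the Zarr case, sketched with a hypothetical helper name; the point is normalizing only for the comparison while slicing the original string:

```python
def strip_suffix_ci(name: str, suffix: str = ".zarr") -> str:
    """Remove `suffix` case-insensitively while keeping the stem's original casing."""
    if name.lower().endswith(suffix.lower()):
        return name[: len(name) - len(suffix)]
    return name
```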
In `@src/iceberg_bioimage/api.py`:
- Around line 61-63: The RegistrationResult currently sets source_uri=uri which
may differ from the canonical URI produced by the adapter; change it to use
scan_result.source_uri instead. Locate the return that constructs
RegistrationResult (the variables uri, scan_result, and image_assets_rows are in
scope) and replace the source_uri argument so it passes scan_result.source_uri,
ensuring the returned metadata reflects the actual scanned/published URI.
In `@src/iceberg_bioimage/cli.py`:
- Around line 76-81: Update main to catch handler exceptions and translate them
into CLI-friendly stderr messages and non-zero exit codes: wrap the call to
args.handler(args) in a try/except that catches ValueError (and any additional
catalog-specific exceptions you want to expose) raised by scan_store/api code,
print a concise error message to stderr (include the exception text) and return
a non-zero int; re-raise unexpected exceptions or return a different non-zero
code if you want to preserve tracebacks for debugging. Ensure you reference main
and args.handler in the change so the handler invocation is wrapped by the
error-handling logic.
In `@src/iceberg_bioimage/integrations/duckdb.py`:
- Around line 33-56: The where string passed into query_metadata_table is used
directly in relation.filter(), creating an SQL injection risk; either
require/trust only internal callers or implement input sanitization/safer API:
replace the raw where param with a validated filter or structured filter API
(e.g., a dict of column->value or a list of (column, op, value) tuples) and
build the expression with proper escaping/quoting, or validate the where string
against a strict whitelist of allowed tokens/characters (columns from
_relation_for_source.schema names, safe operators, numeric and quoted literals,
parentheses) before calling relation.filter(); update query_metadata_table to
use the new validated/structured filter input and refuse/unwrap unsafe where
values.
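One way the structured-filter alternative could look, as a sketch; `build_filter`, `ALLOWED_OPS`, and the `(column, op, value)` tuple shape are illustrative, not the module's API:

```python
# Operators permitted in generated filter expressions.
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">=", "LIKE"}

def build_filter(filters, allowed_columns):
    """Build a SQL filter from (column, op, value) tuples, validating every part."""
    clauses = []
    for column, op, value in filters:
        if column not in allowed_columns:
            raise ValueError(f"unknown column: {column!r}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        if isinstance(value, (int, float)):
            literal = repr(value)
        elif isinstance(value, str):
            # Escape embedded single quotes before quoting the literal.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            raise TypeError(f"unsupported literal type: {type(value).__name__}")
        clauses.append(f"{column} {op} {literal}")
    return " AND ".join(clauses)
```

The `allowed_columns` set would come from the relation's schema names, so only known columns, whitelisted operators, and escaped literals ever reach `relation.filter()`.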
In `@src/iceberg_bioimage/publishing/chunk_index.py`:
- Around line 33-40: The code calls _load_or_create_table before checking for an
empty rows list which can create or mutate the catalog unnecessarily; move the
empty-row guard so that the function (the caller that currently invokes
_load_or_create_table — e.g., publish_chunk_index or the surrounding publish
function) returns 0 if not rows before any call to _load_or_create_table,
ensuring no table creation or catalog mutation occurs when there are zero chunk
rows.
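The reordered guard can be sketched like this, with fake collaborators standing in for the real catalog and table helpers:

```python
def publish_chunk_index_sketch(rows, catalog, load_or_create_table):
    """Return early on empty input so the catalog is never created or mutated
    when there is nothing to write."""
    if not rows:
        return 0
    table = load_or_create_table(catalog)  # only reached when rows exist
    table.append(rows)
    return len(rows)
```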
In `@src/notebooks/example_notebook.ipynb`:
- Line 21: Update the notebook markdown cell that contains the string
"browser-based Juypyter interface\n" by correcting the typo "Juypyter" to
"Jupyter" so the cell reads "browser-based Jupyter interface\n"; locate the
offending string in the notebook (e.g., the markdown cell content) and replace
the misspelled word.
- Around line 42-85: The notebook contains stale outputs (the HTML/text output
showing "Hello, notebook!" dataframe) that no longer match the subsequent cell
which constructs a ScanResult; open the notebook, clear the existing outputs for
the affected cells (the output block with the "Hello, notebook!" message) and
re-run the cells so the displayed outputs reflect the current code path
(including the ScanResult construction), then save the notebook before
committing so the saved outputs are consistent with the ScanResult-producing
cells.
In `@src/notebooks/example_notebook.py`:
- Line 28: Replace the misspelled user-facing string "Juypyter" with the correct
spelling "Jupyter" wherever it appears in the notebook text (look for the exact
token "Juypyter" in src/notebooks/example_notebook.py) so the displayed comment
reads "Jupyter" instead of "Juypyter".
---
Nitpick comments:
In @.github/workflows/draft-release.yml:
- Around line 21-23: Replace the floating tag
"release-drafter/release-drafter@v6" with a specific commit SHA to pin the
action; locate the v6 release on the release-drafter GitHub releases page, copy
the commit SHA for the v6.x tag you want, and update the workflow line that
currently reads uses: release-drafter/release-drafter@v6 to uses:
release-drafter/release-drafter@<commit-sha>; this ensures the workflow uses an
immutable, auditable version while keeping the environment variable GITHUB_TOKEN
unchanged.
In `@README.md`:
- Around line 59-61: The three consecutive lines that each begin with "See"
(referencing examples/quickstart.py, examples/catalog_duckdb.py, and
examples/synthetic_workflow.py) should be converted into a short bullet list to
improve readability; replace the three separate "See ..." sentences with a
concise unordered list where each bullet names the example file and a one-line
description (e.g., "examples/quickstart.py — minimal script") to avoid repeated
sentence openings and make scanning easier.
In `@src/iceberg_bioimage/adapters/zarr_v2.py`:
- Around line 61-78: In _build_asset the line "path = array_path or None" is
redundant or ambiguous; either remove the temporary and pass array_path directly
to ImageAsset (use array_path where path is referenced) if empty strings are not
expected, or make the intent explicit by replacing that assignment with an
explicit empty-string check (e.g., set path = None if array_path == "" else
array_path) so callers like self._coerce_chunks(...), self._image_id(uri, path)
and the ImageAsset(image_path=...) receive the intended value.
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 16-22: This module imports and reuses SupportsCatalog (aliased as
SupportsLoadTableCatalog) from the publishing package which conflates reader
behavior with append/create semantics; define a local protocol in this module
(e.g., SupportsIcebergTable and a new ScanCatalog protocol) that exposes only
load_table(self, name: str, /, ...) -> SupportsIcebergTable (or appropriate
return type) and replace references to SupportsLoadTableCatalog with this new
ScanCatalog in this file (also update the usage locations around the other
import block at lines ~56-65) so typing reflects scan/read-only intent rather
than publishing semantics.
- Around line 90-118: The function join_catalog_image_assets_with_profiles
currently loads entire tables; add optional CatalogScanOptions parameters (e.g.,
image_assets_scan_options: CatalogScanOptions | None = None and
chunk_index_scan_options: CatalogScanOptions | None = None) to the
join_catalog_image_assets_with_profiles signature, pass
image_assets_scan_options into the catalog_table_to_arrow call for
"image_assets" and pass chunk_index_scan_options into the catalog_table_to_arrow
call for the chunk_index (when chunk_index_table is provided), and propagate
these options into join_image_assets_with_profiles (or add corresponding
parameters) so callers can supply projection/filter/limit to avoid full-table
scans. Ensure types reference CatalogScanOptions and defaults remain None so
behavior is backward-compatible.
In `@src/iceberg_bioimage/integrations/duckdb.py`:
- Around line 58-103: The join_image_assets_with_profiles function hardcodes
chunk_index column names (chunk_key, chunk_coords_json, byte_length), which is
fragile if chunk_index schema changes; define a module-level constant (e.g.,
CHUNK_INDEX_FIELDS = ("chunk_key", "chunk_coords_json", "byte_length")) similar
to DEFAULT_JOIN_KEYS and replace the literal column list in the query-building
(the portion that appends " , ci.chunk_key, ci.chunk_coords_json,
ci.byte_length") with a join of that constant (", ".join(CHUNK_INDEX_FIELDS)) so
the field list is centralized and easier to update and reuse when registering or
selecting from chunk_index.
In `@src/iceberg_bioimage/models/__init__.py`:
- Line 1: Add re-exports for the package's main dataclasses so users can import
them from iceberg_bioimage.models; specifically import ScanResult from its
module (e.g., from .scan_result import ScanResult) in __init__.py and add it to
the module's public API via __all__ (e.g., __all__ = ["ScanResult", ...]) or by
setting globals, repeating for any other primary model dataclasses you want to
expose to provide convenient top-level imports.
In `@src/iceberg_bioimage/models/scan_result.py`:
- Around line 47-50: The type hint for the pass-through kwargs on the to_json
methods is too broad; change the parameter annotation from **json_kwargs: object
to **json_kwargs: Any (and add "from typing import Any" at top) for conventional
typing of JSON kwargs, or alternatively use typing.Unpack with a TypedDict if
you need stricter argument shapes; update the signatures of the to_json methods
(the one named to_json in scan_result.py and the other occurrences referenced
around lines 85–88 and 108–111) accordingly and add the import for Any.
In `@src/iceberg_bioimage/publishing/chunk_index.py`:
- Around line 46-55: scan_result_to_chunk_rows should validate its public input
before processing: check that scan_result is not None, is an instance of
ScanResult (or at least has expected attributes), and that
scan_result.source_uri is a non-empty string and scan_result.image_assets is an
iterable of asset-like objects; if validation fails raise a clear
ValueError/TypeError. Update scan_result_to_chunk_rows to perform these
precondition checks (before calling _dataset_id or iterating), and surface a
descriptive error message mentioning scan_result, source_uri, or image_assets as
appropriate so callers get high-level, consistent failures instead of downstream
shape/dtype exceptions from _dataset_id or _asset_to_chunk_rows.
In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 125-189: In _build_image_assets_schema(), add a concise comment
above the Schema construction explaining that the non-sequential NestedField
field_id values (1,2,9,11,13,14,15,16,19) are intentional: they are stable
Iceberg field identifiers with gaps reserved for removed or future fields to
preserve compatibility; reference the function name _build_image_assets_schema
and the NestedField entries so reviewers can see this is deliberate and for
maintainability.
- Around line 98-107: The code uses a fragile string comparison on
exc.__class__.__name__ to detect a missing table; instead import and catch the
concrete NoSuchTableError from the PyIceberg/catalog backend and handle only
that exception. Replace the broad "except Exception as exc" block with an
explicit "except NoSuchTableError:" (and keep letting any other exceptions
propagate), leaving the rest of the logic that picks build_schema
(schema_builder vs _build_image_assets_schema) and calls
catalog.create_table(identifier, schema=build_schema()) unchanged; reference
catalog.load_table, catalog.create_table, schema_builder and
_build_image_assets_schema when making the change.
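The resulting control flow could be sketched as below; a stand-in `NoSuchTableError` keeps the snippet self-contained, whereas real code would import the concrete class from `pyiceberg.exceptions` instead of matching `exc.__class__.__name__` strings:

```python
# Stand-in for pyiceberg.exceptions.NoSuchTableError.
class NoSuchTableError(Exception):
    pass

def load_or_create_table(catalog, identifier, build_schema):
    """Load the table; create it only when the backend reports it missing.
    Any other exception propagates unchanged."""
    try:
        return catalog.load_table(identifier)
    except NoSuchTableError:
        return catalog.create_table(identifier, schema=build_schema())
```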
In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 84-91: The function validate_microscopy_profile_table calls
ds.dataset(path) without guarding against invalid paths; wrap the
ds.dataset(path) call in a try/except that catches exceptions (e.g.,
OSError/ValueError/RuntimeError or a broad Exception) and convert the error into
a clear ContractValidationResult failure (e.g., return
ContractValidationResult(valid=False, errors=[f"Invalid dataset path: {path}:
{e}"])) or re-raise a more descriptive exception; update
validate_microscopy_profile_table to return that failure result when dataset
creation fails and keep the existing call to validate_microscopy_profile_columns
when successful.
In `@tests/conftest.py`:
- Around line 8-10: The fixture my_data returning "Hello, differently!" is too
generic—rename it to reflect its test intent (for example success_message,
greeting_fixture, or error_payload depending on usage) and replace the
placeholder string with a value that matches that intent; update all tests that
use my_data to reference the new fixture name (and, if needed, adjust assertions
to the new fixture value) so the fixture name and returned data clearly
communicate purpose (refer to the my_data fixture in tests/conftest.py and any
tests that import/use it).
In `@tests/test_chunk_index.py`:
- Line 9: The test imports FakeCatalog from another test module which couples
tests; extract the FakeCatalog class into a shared test helper (e.g., create
tests/fakes.py or add it to tests/conftest.py) and update
tests/test_chunk_index.py to import FakeCatalog from that new helper instead of
tests/test_publishing; ensure the symbol name FakeCatalog remains unchanged so
existing tests (including any references in tests/test_publishing.py) can import
from the shared location and update any import paths accordingly.
In `@tests/test_cli.py`:
- Around line 13-14: Replace the private import of MonkeyPatch with pytest's
public export: change the import that currently reads from _pytest.monkeypatch
import MonkeyPatch to the public form (e.g., from pytest import MonkeyPatch or
use pytest.MonkeyPatch) so tests reference the public symbol MonkeyPatch;
confirm the project’s pinned pytest version exports MonkeyPatch and update any
test annotations that reference the private import.
In `@tests/test_registration.py`:
- Around line 64-80: In test_register_store_without_chunks, ensure
publish_chunk_index is explicitly forbidden by monkeypatching
iceberg_bioimage.api.publish_chunk_index to raise an exception if called (so any
accidental invocation fails the test); keep the existing monkeypatch for
publish_image_assets and the call to register_store, then assert
result.chunk_rows_published == 0 — this makes the no-chunk path intent explicit
and detects regressions immediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1b32f078-70c2-4e11-ade2-b06d877a3ded
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (52)
- .github/ISSUE_TEMPLATE/issue.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/dependabot.yml
- .github/release-drafter.yml
- .github/workflows/draft-release.yml
- .github/workflows/publish-docs.yml
- .github/workflows/publish-pypi.yml
- .github/workflows/run-tests.yml
- .pre-commit-config.yaml
- .python-version
- CITATION.cff
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- README.md
- docs/src/_static/custom.css
- docs/src/conf.py
- docs/src/duckdb.md
- docs/src/index.md
- docs/src/python-api.md
- docs/src/workflow.md
- examples/catalog_duckdb.py
- examples/quickstart.py
- examples/synthetic_workflow.py
- pyproject.toml
- src/iceberg_bioimage/__init__.py
- src/iceberg_bioimage/adapters/__init__.py
- src/iceberg_bioimage/adapters/base.py
- src/iceberg_bioimage/adapters/ome_tiff.py
- src/iceberg_bioimage/adapters/zarr_v2.py
- src/iceberg_bioimage/api.py
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/__init__.py
- src/iceberg_bioimage/integrations/catalog.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/models/__init__.py
- src/iceberg_bioimage/models/scan_result.py
- src/iceberg_bioimage/publishing/__init__.py
- src/iceberg_bioimage/publishing/chunk_index.py
- src/iceberg_bioimage/publishing/image_assets.py
- src/iceberg_bioimage/validation/__init__.py
- src/iceberg_bioimage/validation/contracts.py
- src/notebooks/example_notebook.ipynb
- src/notebooks/example_notebook.py
- tests/conftest.py
- tests/test_api.py
- tests/test_catalog_integration.py
- tests/test_chunk_index.py
- tests/test_cli.py
- tests/test_duckdb.py
- tests/test_publishing.py
- tests/test_registration.py
- tests/test_validation.py
🧹 Nitpick comments (2)
src/iceberg_bioimage/integrations/catalog.py (1)
92-124: Consider making the `image_assets` table name configurable.

The function hardcodes `"image_assets"` at line 107. For consistency with other APIs that accept table names as parameters (e.g., `register_store`), consider adding an `image_assets_table` parameter.

♻️ Suggested change

```diff
 def join_catalog_image_assets_with_profiles(
     catalog: str | SupportsScanCatalog,
     namespace: str | Sequence[str],
     profiles: MetadataSource,
     *,
+    image_assets_table: str = "image_assets",
     chunk_index_table: str | None = None,
     join_keys: Sequence[str] = DEFAULT_JOIN_KEYS,
     image_assets_scan_options: CatalogScanOptions | None = None,
     chunk_index_scan_options: CatalogScanOptions | None = None,
 ) -> pa.Table:
     """Join catalog-backed image metadata to a profile table."""
     image_assets = catalog_table_to_arrow(
         catalog,
         namespace,
-        "image_assets",
+        image_assets_table,
         scan_options=image_assets_scan_options,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/catalog.py` around lines 92 - 124, The function join_catalog_image_assets_with_profiles currently hardcodes the table name "image_assets"; add a new parameter image_assets_table: str | None = "image_assets" to the function signature and use that parameter when calling catalog_table_to_arrow instead of the literal "image_assets". Update the call: image_assets = catalog_table_to_arrow(catalog, namespace, image_assets_table, scan_options=image_assets_scan_options) and preserve existing defaults and behavior so callers remain backward compatible; ensure the new parameter name appears in the docstring and function signature alongside existing parameters like chunk_index_table and join_keys.

tests/test_cli.py (1)

162-172: Consider a more readable approach for throwing in a lambda.

The generator-throw idiom `(_ for _ in ()).throw(ValueError(...))` works but is obscure. A helper function would be clearer.

♻️ Suggested alternative
```diff
+def _raise_value_error(args: object) -> int:
+    raise ValueError("bad dataset")
+
 def test_main_returns_cli_error_for_value_error(
     monkeypatch: MonkeyPatch,
     capsys: CaptureFixture[str],
 ) -> None:
     monkeypatch.setattr(
         cli_module,
         "_handle_scan",
-        lambda args: (_ for _ in ()).throw(ValueError("bad dataset")),
+        _raise_value_error,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 162 - 172, Replace the obscure generator-throw lambda used when patching cli_module._handle_scan with a simple helper that explicitly raises the error: define a small function (e.g., raise_value_error) that takes the same args and raises ValueError("bad dataset"), then monkeypatch.setattr cli_module, "_handle_scan", to that function so the test remains equivalent but far more readable; keep assertions against cli_module.main and CLI_VALUE_ERROR_EXIT_CODE unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 92-124: The function join_catalog_image_assets_with_profiles
currently hardcodes the table name "image_assets"; add a new parameter
image_assets_table: str | None = "image_assets" to the function signature and
use that parameter when calling catalog_table_to_arrow instead of the literal
"image_assets". Update the call: image_assets = catalog_table_to_arrow(catalog,
namespace, image_assets_table, scan_options=image_assets_scan_options) and
preserve existing defaults and behavior so callers remain backward compatible;
ensure the new parameter name appears in the docstring and function signature
alongside existing parameters like chunk_index_table and join_keys.
In `@tests/test_cli.py`:
- Around line 162-172: Replace the obscure generator-throw lambda used when
patching cli_module._handle_scan with a simple helper that explicitly raises the
error: define a small function (e.g., raise_value_error) that takes the same
args and raises ValueError("bad dataset"), then monkeypatch.setattr cli_module,
"_handle_scan", to that function so the test remains equivalent but far more
readable; keep assertions against cli_module.main and CLI_VALUE_ERROR_EXIT_CODE
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a4a2109-d674-4dcf-918b-f75a65b361e1
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
- .github/ISSUE_TEMPLATE/issue.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/dependabot.yml
- .github/workflows/draft-release.yml
- .github/workflows/publish-docs.yml
- .github/workflows/publish-pypi.yml
- .github/workflows/run-tests.yml
- .gitignore
- .pre-commit-config.yaml
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- README.md
- pyproject.toml
- src/iceberg_bioimage/adapters/ome_tiff.py
- src/iceberg_bioimage/adapters/zarr_v2.py
- src/iceberg_bioimage/api.py
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/catalog.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/models/__init__.py
- src/iceberg_bioimage/models/scan_result.py
- src/iceberg_bioimage/publishing/chunk_index.py
- src/iceberg_bioimage/publishing/image_assets.py
- src/iceberg_bioimage/validation/contracts.py
- src/notebooks/example_notebook.ipynb
- src/notebooks/example_notebook.py
- tests/conftest.py
- tests/fakes.py
- tests/test_adapters.py
- tests/test_catalog_integration.py
- tests/test_chunk_index.py
- tests/test_cli.py
- tests/test_duckdb.py
- tests/test_publishing.py
- tests/test_registration.py
- tests/test_validation.py
✅ Files skipped from review due to trivial changes (14)
- .gitignore
- .github/ISSUE_TEMPLATE/issue.yml
- .github/dependabot.yml
- .github/workflows/publish-pypi.yml
- README.md
- src/notebooks/example_notebook.ipynb
- .pre-commit-config.yaml
- src/notebooks/example_notebook.py
- .github/workflows/publish-docs.yml
- .github/workflows/run-tests.yml
- tests/fakes.py
- tests/test_duckdb.py
- .github/PULL_REQUEST_TEMPLATE.md
- src/iceberg_bioimage/adapters/ome_tiff.py
🚧 Files skipped from review as they are similar to previous changes (11)
- tests/conftest.py
- .github/workflows/draft-release.yml
- src/iceberg_bioimage/models/__init__.py
- tests/test_catalog_integration.py
- pyproject.toml
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/publishing/image_assets.py
- tests/test_registration.py
- tests/test_chunk_index.py
- src/iceberg_bioimage/publishing/chunk_index.py
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/iceberg_bioimage/integrations/catalog.py (1)
99-100: Validate `join_keys` before forwarding.

A quick non-empty check here avoids generating invalid downstream SQL when `join_keys` is empty.

Proposed fix

```diff
 def join_catalog_image_assets_with_profiles(
@@
 ) -> pa.Table:
@@
+    if not join_keys:
+        raise ValueError("join_keys must contain at least one column")
+
     image_assets = catalog_table_to_arrow(
```

Also applies to: 131-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/catalog.py` around lines 99 - 100, Add a guard that validates the join_keys parameter is non-empty before it is forwarded or used: in the function signatures that accept join_keys (the one with parameters "join_keys: Sequence[str] = DEFAULT_JOIN_KEYS, image_assets_scan_options: CatalogScanOptions | None = None" and the other occurrence around lines 131-135), check if not join_keys (or len(join_keys) == 0) and raise a ValueError with a clear message like "join_keys must be a non-empty sequence of column names" so downstream SQL generation never receives an empty join key set.

tests/test_cli.py (2)

71-90: Use a structured fake return object to reduce interface drift risk.

Returning an ad-hoc inner class here is a bit loose; using a structured object with explicit fields (including `source_uri`) makes the test contract clearer and safer during CLI payload evolution.

♻️ Proposed refinement

```diff
+from types import SimpleNamespace
@@
-        class _Result:
-            image_assets_rows_published = 1
-            chunk_rows_published = 4
-
-        return _Result()
+        return SimpleNamespace(
+            source_uri=uri,
+            image_assets_rows_published=1,
+            chunk_rows_published=4,
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 71 - 90, Replace the ad-hoc inner class returned by _fake_register_store with a structured, explicit return object (e.g., a dataclass, typing.NamedTuple, or types.SimpleNamespace) that declares the expected fields (at minimum image_assets_rows_published, chunk_rows_published, and source_uri) so the test surface is explicit and resilient to CLI payload changes; update _fake_register_store to construct and return that structured object instead of the inner class to reduce interface-drift risk.
124-134: Strengthen the `scan_result` assertion in the publish-chunks fake.

`assert scan_result is not None` is weak and can miss URI wiring regressions. Assert a concrete field tied to this test input.

♻️ Suggested assertion upgrade

```diff
-        assert scan_result is not None
+        assert getattr(scan_result, "source_uri", None) == str(store_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 124 - 134, The fake function _fake_publish_chunk_index currently uses a weak assertion (assert scan_result is not None); update it to assert a concrete field on scan_result that ties to the test fixture (for example assert scan_result["uri"] == <expected_uri> or assert getattr(scan_result, "uri", None) == <expected_uri>) so the test fails if the scan_result URI wiring breaks—replace the None check with a lookup of the specific field (e.g., "uri" or the actual fixture field name used by the test) and compare it to the expected value from the test input.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 54-55: The columns parameter currently accepts Sequence[str] but a
plain string will be treated as an iterable of characters, so guard the columns
parameter where it is defined/used: detect if isinstance(columns, str) and
convert it to a single-item sequence (e.g., [columns]) or raise a clear
TypeError; apply this check both at the parameter site (the columns argument
declaration) and immediately before any projection-building logic that uses
columns so downstream code receives a proper Sequence[str].
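A guard along these lines would cover both sites; `normalize_columns` is an illustrative name, not necessarily the helper used in the module:

```python
from collections.abc import Sequence

def normalize_columns(columns, param_name="columns"):
    """Coerce a lone string into a single-item list; reject non-sequences."""
    if columns is None:
        return None
    if isinstance(columns, str):
        # A bare string would otherwise iterate into single characters.
        return [columns]
    if isinstance(columns, Sequence):
        return list(columns)
    raise TypeError(f"{param_name} must be a string or a sequence of strings")
```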
- Around line 97-121: The image_assets_table parameter is incorrectly typed as
nullable (str | None) but is always used as a required table name when calling
catalog_table_to_arrow; change the function signature to make image_assets_table
a non-nullable str (e.g., image_assets_table: str = "image_assets") so callers
and type checkers know it cannot be None, and ensure any internal uses (the call
to catalog_table_to_arrow) and callers are updated to pass a concrete string
rather than None.
In `@tests/test_cli.py`:
- Around line 26-31: The subprocess.run calls in tests/test_cli.py (the
invocations of subprocess.run for "[sys.executable, '-m',
'iceberg_bioimage.cli', 'scan', str(store_path)]" and the other CLI invocation
around lines 50-61) are missing a timeout and can hang CI; add a reasonable
timeout argument (e.g., timeout=30) to each subprocess.run call to ensure tests
fail fast on hangs, keeping capture_output=True, text=True, and check=True
as-is.
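The fixed invocation shape, sketched with a stand-in command in place of the real `python -m iceberg_bioimage.cli scan <path>` call; the point is the `timeout=` argument, which makes a hung subprocess fail fast instead of stalling CI:

```python
import subprocess
import sys

# Stand-in for the real CLI invocation; only the keyword arguments matter here.
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True,
    text=True,
    check=True,
    timeout=30,  # fail fast if the subprocess hangs
)
```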
---
Nitpick comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 99-100: Add a guard that validates the join_keys parameter is
non-empty before it is forwarded or used: in the function signatures that accept
join_keys (the one with parameters "join_keys: Sequence[str] =
DEFAULT_JOIN_KEYS, image_assets_scan_options: CatalogScanOptions | None = None"
and the other occurrence around lines 131-135), check if not join_keys (or
len(join_keys) == 0) and raise a ValueError with a clear message like "join_keys
must be a non-empty sequence of column names" so downstream SQL generation never
receives an empty join key set.
In `@tests/test_cli.py`:
- Around line 71-90: Replace the ad-hoc inner class returned by
_fake_register_store with a structured, explicit return object (e.g., a
dataclass, typing.NamedTuple, or types.SimpleNamespace) that declares the
expected fields (at minimum image_assets_rows_published, chunk_rows_published,
and source_uri) so the test surface is explicit and resilient to CLI payload
changes; update _fake_register_store to construct and return that structured
object instead of the inner class to reduce interface-drift risk.
- Around line 124-134: The fake function _fake_publish_chunk_index currently
uses a weak assertion (assert scan_result is not None); update it to assert a
concrete field on scan_result that ties to the test fixture (for example assert
scan_result["uri"] == <expected_uri> or assert getattr(scan_result, "uri", None)
== <expected_uri>) so the test fails if the scan_result URI wiring
breaks—replace the None check with a lookup of the specific field (e.g., "uri"
or the actual fixture field name used by the test) and compare it to the
expected value from the test input.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f153d068-f20e-4bf3-ac83-7d2328503919
📒 Files selected for processing (3)
- src/iceberg_bioimage/integrations/catalog.py
- tests/test_catalog_integration.py
- tests/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_catalog_integration.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 100-117: The code currently allows a bare string in join_keys
which iterates into characters later; call the existing helper
_normalize_columns on join_keys to convert strings into a sequence of column
names and then validate non-empty. Specifically, replace the current join_keys
check with: normalized = _normalize_columns(join_keys, "join_keys"); if not
normalized: raise ValueError(...); join_keys = normalized — this ensures the
join clause (used in duckdb.py:87) receives a proper list of column names.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 320bc2c0-7b8b-41a2-86a8-a1ff60baece6
📒 Files selected for processing (3)
- src/iceberg_bioimage/integrations/catalog.py
- tests/test_catalog_integration.py
- tests/test_cli.py
Summary by CodeRabbit
New Features
Documentation
Tests
Chores