Conversation

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds Cytomining-compatible Parquet warehouse export and ingestion across API, CLI, and integrations; implements profile-column alias resolution and optional profile_dataset_id propagation; adds namespace-fallback/legacy handling, manifest management, notebook build checks, models, examples, and tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as iceberg_bioimage.api
    participant Publishing as Publishing Layer
    participant Catalog as Iceberg Catalog
    participant Warehouse as Parquet Warehouse
    rect rgba(100,200,100,0.5)
        Note over Client,Warehouse: Warehouse Ingestion Flow
        Client->>API: ingest_stores_to_warehouse(uris, catalog, namespace)
        API->>API: scan_store(uri)
        API->>Publishing: publish_image_assets(scan_result,...)
        Publishing->>Catalog: load_table / create_table (namespace candidates)
        Catalog->>Warehouse: append image_assets rows
        API->>Publishing: publish_chunk_index(scan_result,...)
        Publishing->>Catalog: load_table / create_table
        Catalog->>Warehouse: append chunk_index rows
        API-->>Client: WarehouseIngestResult
    end
```
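From the client side, the ingestion flow above reduces to a single call. The sketch below is illustrative only: the function and result names come from the diagram and from comments later in this review (`ingest_stores_to_warehouse`, `WarehouseIngestResult.to_json()`, the canonical `bioimage.cytotable` namespace), while the store URI and catalog name are placeholders:

```python
from iceberg_bioimage.api import ingest_stores_to_warehouse

# Placeholder store URI and catalog name; the namespace follows the canonical
# "bioimage.cytotable" convention discussed in the README comments below.
result = ingest_stores_to_warehouse(
    ["s3://bucket/plate-001.zarr"],
    catalog="local",
    namespace="bioimage.cytotable",
)
print(result.to_json())  # WarehouseIngestResult exposes to_json()
```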
```mermaid
sequenceDiagram
    participant Client
    participant Cytom as integrations.cytomining
    participant Profiles as Profile Resolver
    participant Parquet as Parquet Writer
    participant Manifest as Manifest Manager
    rect rgba(100,150,200,0.5)
        Note over Client,Parquet: Cytomining Export Flow
        Client->>Cytom: export_store_to_cytomining_warehouse(uri, warehouse_root,...)
        Cytom->>Cytom: scan_store(uri)
        Cytom->>Parquet: export_table_to_cytomining_warehouse(image_assets)
        Parquet->>Parquet: write parquet files under warehouse_root/image_assets
        alt profiles provided
            Cytom->>Profiles: _normalize_profiles_table(profiles, profile_dataset_id)
            Profiles->>Profiles: resolve_microscopy_profile_columns(alias_map)
            Profiles-->>Cytom: normalized profiles table
            Cytom->>Cytom: join profiles with image_assets
            Cytom->>Parquet: export_table_to_cytomining_warehouse(joined_profiles)
        end
        Cytom->>Manifest: _update_manifest(warehouse_manifest.json)
        Manifest-->>Client: CytominingWarehouseResult(manifest_path, tables_written, row_counts)
    end
```
```mermaid
sequenceDiagram
    participant Client
    participant DuckDB as join_image_assets_with_profiles
    participant Resolver as resolve_microscopy_profile_columns
    participant SQL as SQL Engine
    rect rgba(200,150,100,0.5)
        Note over Client,SQL: Profile Join with Alias Resolution
        Client->>DuckDB: join_image_assets_with_profiles(profiles, join_keys, profile_dataset_id?)
        DuckDB->>Resolver: resolve_microscopy_profile_columns(profile_columns, alias_map)
        Resolver-->>DuckDB: mapping dict (canonical->source)
        DuckDB->>DuckDB: build profiles projection (aliasing, inject dataset_id if needed)
        DuckDB->>SQL: CREATE VIEW profiles AS (projection)
        DuckDB->>SQL: INNER JOIN image_assets USING (join_keys)
        SQL-->>DuckDB: joined result
        DuckDB-->>Client: normalized joined table
    end
```
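For reference, a client-side sketch of the join shown above, using the `join_image_assets_with_profiles` signature quoted later in this review; the input paths are placeholders, and `profile_dataset_id` is the fallback injected when the profiles source lacks a `dataset_id` column:

```python
from iceberg_bioimage.integrations.duckdb import join_image_assets_with_profiles

joined = join_image_assets_with_profiles(
    "warehouse/image_assets",                   # placeholder assets dataset path
    "tests/data/profiles_pycytominer.parquet",  # placeholder profiles table
    profile_dataset_id="plate-001",             # fallback when dataset_id is absent
)
print(joined.num_rows)  # the join returns a pyarrow.Table
```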
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
---
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/iceberg_bioimage/integrations/duckdb.py (1)
87-124: ⚠️ Potential issue | 🟡 Minor: Close owned DuckDB connections on every exit path.
`_register_profiles_source()` adds new failure points before line 123, so an internally created connection can stay open on bad profile inputs. Wrap the body after `_get_connection()` in `try`/`finally` and only close when `owns_connection` is true.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 87 - 124, The code may leak an internally created DuckDB connection returned by _get_connection when later calls like _register_profiles_source or _register_source fail; wrap the logic after calling _get_connection(…) in a try/finally and in the finally block close duckdb_connection only if owns_connection is True so the connection is always closed on every exit path; ensure this covers all calls including _register_source, _register_profiles_source, building/executing the query and calling _as_arrow_table so any exception still triggers the conditional close.
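Concretely, the requested shape is something like the sketch below. The `owns_connection` flag and the Arrow fetch mirror the review comment; the function name and body are stand-ins for the real registration and query logic:

```python
import duckdb
import pyarrow as pa


def _query_profiles(
    sql: str, connection: duckdb.DuckDBPyConnection | None = None
) -> pa.Table:
    # Stand-in for _get_connection(): reuse the caller's connection or own a new one.
    owns_connection = connection is None
    duckdb_connection = duckdb.connect() if owns_connection else connection
    try:
        # Any registration or query step here may raise; the finally block
        # guarantees the owned connection is closed on every exit path.
        return duckdb_connection.execute(sql).fetch_arrow_table()
    finally:
        if owns_connection:
            duckdb_connection.close()
```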
🧹 Nitpick comments (4)
pyproject.toml (1)
142-144: Consider making `tasks.notebooks` fail when sync changes files.

Right now it always syncs but doesn't enforce a clean post-sync state, which weakens drift detection for contributors running this task locally.
♻️ Suggested enhancement
```diff
 tasks.notebooks.shell = """
 python -m jupytext --sync docs/src/examples/*.ipynb
+git diff --exit-code -- docs/src/examples
 """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 142 - 144, Update the tasks.notebooks.shell entry so that after running the jupytext sync it verifies the repository is unchanged and fails if any files were modified: run the existing jupytext --sync step (the tasks.notebooks shell command), then perform a git check for uncommitted/changed files and if any are found print a clear message and exit with a non-zero status so the task fails; this enforces drift detection for contributors running tasks.notebooks locally.

src/iceberg_bioimage/integrations/catalog.py (1)
128: Document the new `profile_dataset_id` parameter.

Line 128 adds a public keyword, but the docstring still stops at `chunk_index_scan_options`. Please add it so `help()` and generated docs stay complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/catalog.py` at line 128, The public parameter profile_dataset_id was added to the signature (profile_dataset_id: str | None = None) but its description is missing from the function/class docstring; update the docstring where parameters are listed (near chunk_index_scan_options) to add a brief entry for profile_dataset_id describing its type (str | None), default (None), and purpose (which dataset id to use for profiling), so help() and generated docs include it; ensure formatting matches the existing docstring style and ordering.

src/iceberg_bioimage/integrations/cytomining.py (2)
303-317: Consider using a `Literal` type for the `mode` parameter.

Using `Literal["overwrite", "append"]` instead of `str` would provide better IDE autocompletion and type checking support across all functions accepting this parameter.

💡 Suggested type improvement
```diff
+from typing import Literal
+
+WriteMode = Literal["overwrite", "append"]
+
 def _write_parquet_dataset(
     table: pa.Table,
     dataset_path: Path,
     *,
-    mode: str,
+    mode: WriteMode,
 ) -> None:
```

Then update the type hint for `mode` in the public functions as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 303 - 317, Change the mode parameter's type from plain str to a Literal union to get better type-checking and IDE help: update the signature of _write_parquet_dataset to use typing.Literal["overwrite", "append"] for mode and import Literal, and propagate the same Literal["overwrite","append"] annotation to all public functions that accept the mode parameter so callers and static checkers see the narrowed type.
398-402: Use `pa.repeat()` for memory-efficient array construction with scalar values.

For large tables, creating an intermediate Python list with `[profile_dataset_id] * normalized.num_rows` allocates unnecessary memory. PyArrow provides `pa.repeat()`, which directly repeats a scalar value:

`pa.repeat(pa.scalar(profile_dataset_id), normalized.num_rows)`

This avoids the intermediate list allocation and is more efficient for large tables. PyArrow 18+ (your project's minimum version) supports this function.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 398 - 402, Replace the intermediate Python list used when appending the dataset_id column with a PyArrow repeated scalar to avoid large-list allocation: in the block that checks `if canonical == "dataset_id" and profile_dataset_id is not None`, change the value passed to `normalized.append_column` so it uses `pa.repeat(pa.scalar(profile_dataset_id), normalized.num_rows)` instead of `[profile_dataset_id] * normalized.num_rows`, keeping the same call to `normalized.append_column`.
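A self-contained comparison of the two constructions, with toy data standing in for the real `normalized` table:

```python
import pyarrow as pa

normalized = pa.table({"feature": [0.1, 0.2, 0.3]})
profile_dataset_id = "plate-001"

# Intermediate Python list: allocates num_rows references before Arrow sees them.
via_list = normalized.append_column(
    "dataset_id", pa.array([profile_dataset_id] * normalized.num_rows)
)

# pa.repeat builds the constant column directly in Arrow memory.
via_repeat = normalized.append_column(
    "dataset_id", pa.repeat(pa.scalar(profile_dataset_id), normalized.num_rows)
)

assert via_list.equals(via_repeat)
```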
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/src/examples/basic-workflow.py`:
- Line 42: The long import line containing ingest_stores_to_warehouse,
join_profiles_with_store, scan_store, summarize_store (and other
over-88-character import lines in this file) must be wrapped so no line exceeds
88 characters to satisfy Ruff; break the import across multiple lines (one
symbol per line or grouped) in the Python file
(docs/src/examples/basic-workflow.py) and then regenerate or sync the
corresponding .ipynb cell so both sources match; locate the import statements by
the function names ingest_stores_to_warehouse, join_profiles_with_store,
scan_store, and summarize_store to apply the change consistently.
In `@pyproject.toml`:
- Around line 110-115: The lint.per-file-ignores block in pyproject.toml is not
ordered as expected by pyproject-fmt; reorder the entries so the keys are sorted
lexicographically (e.g. ensure "src/iceberg_bioimage/cli.py" comes in correct
alphabetical position relative to "src/iceberg_bioimage/integrations/..." and
"src/iceberg_bioimage/publishing/..."), preserve the exact key quoting and value
arrays (PLC0415, PLR0915) and formatting style used elsewhere in the file, then
run the formatter/hook to confirm the section is stable.
In `@README.md`:
- Around line 79-80: Update the examples that pass the namespace string to use
the canonical "bioimage.cytotable" instead of the legacy "bioimage" (e.g.,
replace occurrences of "bioimage" in the arrays shown and in the surrounding
register examples), or explicitly mark those examples as using the legacy
fallback; ensure all examples that call register use "bioimage.cytotable" so the
new resolver won't emit warnings (apply the same change to the occurrences noted
around lines 79-80 and 97-98).
- Around line 77-82: The README example calls warehouse.to_dict() but the result
type WarehouseIngestResult only exposes to_json(); update the quickstart to call
warehouse.to_json() (or, alternatively, add a to_dict method on
WarehouseIngestResult that returns the same dict as to_json()). Locate the
example using ingest_stores_to_warehouse and replace the to_dict() call with
to_json() to match the existing WarehouseIngestResult API (or implement a
to_dict wrapper in class ScanResult/WarehouseIngestResult to forward to_json()).
In `@src/iceberg_bioimage/integrations/catalog.py`:
- Around line 72-77: The public function load_catalog_table declares it returns
SupportsIcebergTable but currently returns _load_table_with_namespace_fallback
which is typed to return SupportsAppend; fix by making
_load_table_with_namespace_fallback generic over the table type (e.g., TypeVar T
bounded to table protocol) and accept a catalog type parameter bounded to the
matching SupportsCatalog/SupportsScanCatalog so its load_table() returns the
same T, or alternatively add a thin type-correct wrapper that calls the existing
helper and casts/validates the result to SupportsIcebergTable before returning;
update the annotations for _load_table_with_namespace_fallback,
load_catalog_table, and any intermediate variables so load_table() signatures
(SupportsCatalog vs SupportsScanCatalog) and methods (append vs scan) line up
with the declared return type.
In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 301-306: The current code around catalog.create_namespace()
catches all exceptions and only re-raises NoSuchNamespaceError, which hides real
errors; change the try/except to catch specifically NamespaceAlreadyExistsError
(from pyiceberg) and suppress it (no-op), remove the broad "except Exception"
block and any special-case re-raising of NoSuchNamespaceError so that any other
exceptions (auth/permission/backend failures) propagate normally; update imports
or references to ensure NamespaceAlreadyExistsError is available where
create_namespace() is called.
In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 157-175: The code currently replaces
MICROSCOPY_PROFILE_COLUMN_ALIASES when alias_map is provided, losing default
aliases; instead merge them: start from a copy of
MICROSCOPY_PROFILE_COLUMN_ALIASES and for each key in alias_map append/extend
its aliases (preserving order) so both built-in and custom aliases are
considered; then use that merged "aliases" map in the existing resolution loop
(symbols to update: alias_map, MICROSCOPY_PROFILE_COLUMN_ALIASES, aliases,
resolved, MICROSCOPY_REQUIRED_JOIN_KEYS, MICROSCOPY_RECOMMENDED_JOIN_KEYS); a sketch of this merge follows after this list.
- Around line 203-208: The code currently calls load_warehouse_manifest(root)
without handling exceptions, so a malformed or invalid JSON/schema will raise
and crash instead of returning a WarehouseValidationResult; wrap the call to
load_warehouse_manifest(root) in a try/except, catch the specific exceptions it
can raise (e.g., JSONDecodeError, validation exceptions) and append an
appropriate error to result.errors (or populate result with failure info), then
return result; ensure you still return the validated manifest when no exception
occurs so callers of manifest_path, load_warehouse_manifest, and the enclosing
function get a proper WarehouseValidationResult on both success and failure.
In `@tests/test_cytomining_integration.py`:
- Around line 72-74: Tests use zarr.Group.create_array which is not available in
zarr 2.18.4; replace calls to root.create_array("0", data=data, chunks=(2, 2))
with root.create_dataset("0", data=data, chunks=(2, 2)) (and do the same for
other occurrences around lines 101-113) so the test fixture uses the documented
zarr 2.x API; update all instances where create_array is used (e.g., the
variables root, store_path, data) to call create_dataset with the same
arguments.
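The alias-merge fix for `contracts.py` referenced above amounts to the following pattern: a sketch assuming `MICROSCOPY_PROFILE_COLUMN_ALIASES` maps canonical names to lists of aliases; the helper name is hypothetical:

```python
def merge_alias_maps(
    defaults: dict[str, list[str]],
    overrides: dict[str, list[str]] | None,
) -> dict[str, list[str]]:
    # Start from a copy of the built-in aliases so defaults survive.
    merged = {key: list(values) for key, values in defaults.items()}
    for key, extra in (overrides or {}).items():
        known = merged.setdefault(key, [])
        # Extend in order, skipping aliases that are already present.
        known.extend(alias for alias in extra if alias not in known)
    return merged
```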
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e122b7c1-ff8d-417b-823a-39824120bb0f
⛔ Files ignored due to path filters (2)
- `tests/data/profiles_cosmicqc.parquet` is excluded by `!**/*.parquet`
- `tests/data/profiles_pycytominer.parquet` is excluded by `!**/*.parquet`
📒 Files selected for processing (33)
- .github/workflows/run-tests.yml
- CONTRIBUTING.md
- README.md
- docs/src/conf.py
- docs/src/cytomining.md
- docs/src/duckdb.md
- docs/src/examples/basic-workflow.ipynb
- docs/src/examples/basic-workflow.py
- docs/src/examples/metadata-workflow.ipynb
- docs/src/examples/metadata-workflow.py
- docs/src/index.md
- docs/src/python-api.md
- docs/src/workflow.md
- pyproject.toml
- src/iceberg_bioimage/__init__.py
- src/iceberg_bioimage/api.py
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/__init__.py
- src/iceberg_bioimage/integrations/catalog.py
- src/iceberg_bioimage/integrations/cytomining.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/models/scan_result.py
- src/iceberg_bioimage/publishing/image_assets.py
- src/iceberg_bioimage/validation/contracts.py
- tests/fakes.py
- tests/test_api.py
- tests/test_catalog_integration.py
- tests/test_chunk_index.py
- tests/test_cli.py
- tests/test_cytomining_integration.py
- tests/test_duckdb.py
- tests/test_publishing.py
- tests/test_validation.py
---
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/iceberg_bioimage/integrations/duckdb.py (1)
83-85: Consider documenting the new `profile_dataset_id` parameter.

The new parameter enables profile joining when profiles lack a `dataset_id` column, but the docstring doesn't describe its purpose or behavior.

📝 Suggested docstring update
```diff
 def join_image_assets_with_profiles(  # noqa: PLR0913
     image_assets: MetadataSource,
     profiles: MetadataSource,
     *,
     join_keys: Sequence[str] = DEFAULT_JOIN_KEYS,
     chunk_index: MetadataSource | None = None,
     connection: DuckDBPyConnection | None = None,
     profile_dataset_id: str | None = None,
 ) -> pa.Table:
-    """Join image metadata to a profile table using the canonical join keys."""
+    """Join image metadata to a profile table using the canonical join keys.
+
+    Parameters
+    ----------
+    profile_dataset_id : str | None
+        Fallback dataset identifier to use when the profiles source lacks a
+        ``dataset_id`` column and no alias can be resolved.
+    """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 83 - 85, The docstring for the function that "Join image metadata to a profile table using the canonical join keys" is missing documentation for the new profile_dataset_id parameter; update the function's docstring to describe profile_dataset_id (type: str | None, default None), explain that when provided it will be used to populate or override the profile's dataset_id for performing the join when the profile table lacks a dataset_id column, and state that if None the function expects dataset_id to exist on the profile; reference the parameter name profile_dataset_id and keep the description concise and include any effects on join behavior or preconditions.

src/iceberg_bioimage/publishing/image_assets.py (2)
114-145: Make `_load_or_create_table` reuse the load-fallback helper.

Lines 117-130 duplicate the candidate iteration and warning path already implemented in `_load_table_with_namespace_fallback()`. Drift between the two copies will be easy to miss the next time namespace resolution changes.

♻️ Suggested simplification
```diff
 def _load_or_create_table(
     catalog_or_name: str | SupportsCatalog,
     namespace: str | Iterable[str],
     table_name: str,
@@
 ) -> SupportsAppend:
     catalog = _resolve_catalog(catalog_or_name)
     requested_namespace = _normalize_namespace(namespace)
     candidate_namespaces = _namespace_candidates(requested_namespace)
-    for resolved_namespace in candidate_namespaces:
-        identifier = (*resolved_namespace, table_name)
-        try:
-            table = catalog.load_table(identifier)
-        except NoSuchTableError:  # pragma: no cover - depends on active catalog backend
-            continue
-
-        _warn_for_namespace_resolution(
-            requested_namespace,
-            resolved_namespace,
-            table_name,
-            operation="publishing",
-        )
-        return table
+    try:
+        return _load_table_with_namespace_fallback(
+            catalog,
+            requested_namespace,
+            table_name,
+            operation="publishing",
+        )
+    except NoSuchTableError:
+        pass

     build_schema = (
         _build_image_assets_schema if schema_builder is None else schema_builder
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 114 - 145, Replace the duplicated candidate-iteration and warning logic in _load_or_create_table with a call to the existing helper _load_table_with_namespace_fallback: call _load_table_with_namespace_fallback(catalog, requested_namespace, table_name) (or the appropriate signature) to attempt to load the table and capture its returned table and resolved_namespace when present; if it returns a table, return it immediately, otherwise proceed to use candidate_namespaces[0] as preferred_namespace, call _warn_for_namespace_resolution with creating=True, ensure the namespace with _ensure_namespace_exists, and create the table with catalog.create_table(identifier, schema=build_schema()) (where build_schema is _build_image_assets_schema or schema_builder) so all namespace resolution logic is centralized in _load_table_with_namespace_fallback.
307-319: Cover the `create_namespace` fallback with a dedicated fake.

`tests/fakes.py:19-51` only implements `create_namespace_if_not_exists`, so the fallback branch here and the `NamespaceAlreadyExistsError` suppression can regress without test feedback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/publishing/image_assets.py` around lines 307 - 319, The fallback branch in _ensure_namespace_exists that calls catalog.create_namespace and catches NamespaceAlreadyExistsError isn’t covered by tests because tests/fakes.py only implements create_namespace_if_not_exists; update tests/fakes.py to add a fake method named create_namespace (and storage to track namespaces) that creates the namespace or raises the same NamespaceAlreadyExistsError used in src/iceberg_bioimage/publishing/image_assets.py when the namespace already exists, and add a test case that exercises the create_namespace fallback path to ensure the exception suppression behavior is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/iceberg_bioimage/publishing/image_assets.py`:
- Around line 339-348: The warning message can produce a leading dot when
_normalize_namespace() collapses to an empty tuple; update the message
construction so it conditionally joins namespace and table without a leading '.'
when the namespace is empty. Specifically, change how expected_namespace (from
_namespace_candidates(requested_namespace)[0]) and resolved_namespace_name are
combined with table_name: compute expected_full and resolved_full that use
"namespace + '.' + table_name" only if namespace is non-empty, otherwise just
use table_name, and then use those variables in the warnings.warn call
(referencing expected_namespace_parts, expected_namespace,
resolved_namespace_name, requested_namespace, table_name, and the existing
action/operation variables).
- Around line 286-300: The loop over _namespace_candidates currently calls
_warn_for_namespace_resolution each time identifiers are found, producing
multiple UserWarnings for a single catalog.list_tables call; introduce a flag
(e.g., warned_for_list = False) before the for loop and, inside the loop where
identifiers is truthy, call _warn_for_namespace_resolution only if
warned_for_list is False, then set the flag to True so subsequent matching
resolved_namespace iterations for the same list_tables call do not emit
additional warnings; reference the existing variables/functions: discovered,
_namespace_candidates(requested_namespace), catalog.list_tables,
_warn_for_namespace_resolution, requested_namespace, resolved_namespace (a sketch of this flag pattern follows below).
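A self-contained sketch of that warn-once flag pattern, with toy stand-ins (`candidates`, `list_tables`) for `_namespace_candidates` and `catalog.list_tables`:

```python
import warnings


def discover_tables(candidates, list_tables):
    """Emit at most one namespace-resolution warning per discovery pass."""
    discovered: list[tuple[str, ...]] = []
    warned_for_list = False
    for resolved_namespace in candidates:
        identifiers = list_tables(resolved_namespace)
        if not identifiers:
            continue
        if not warned_for_list:
            warnings.warn(
                f"Resolved tables under fallback namespace {resolved_namespace!r}",
                UserWarning,
                stacklevel=2,
            )
            warned_for_list = True
        discovered.extend(identifiers)
    return discovered
```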
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 161a9419-87fb-4e2f-b9d0-406a41f475bc
📒 Files selected for processing (10)
- README.md
- docs/src/examples/basic-workflow.ipynb
- docs/src/examples/basic-workflow.py
- pyproject.toml
- src/iceberg_bioimage/integrations/catalog.py
- src/iceberg_bioimage/integrations/cytomining.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/publishing/image_assets.py
- src/iceberg_bioimage/validation/contracts.py
- tests/test_cytomining_integration.py
✅ Files skipped from review due to trivial changes (4)
- docs/src/examples/basic-workflow.ipynb
- docs/src/examples/basic-workflow.py
- tests/test_cytomining_integration.py
- src/iceberg_bioimage/validation/contracts.py
🚧 Files skipped from review as they are similar to previous changes (3)
- src/iceberg_bioimage/integrations/catalog.py
- README.md
- src/iceberg_bioimage/integrations/cytomining.py
---
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/iceberg_bioimage/integrations/duckdb.py (1)
172-184: Correct approach, with a minor inefficiency.

The pattern of registering the raw source then creating a normalized view is clean. However, for file-path sources, the schema is read twice: once by `_register_source` (via `from_parquet`) and again by `_columns_for_source` (via `ds.dataset`). This is a minor inefficiency but not a correctness issue.

💡 Optional optimization to read schema once
Consider caching the relation or extracting columns from the already-registered view:
```diff
 def _register_profiles_source(
     connection: DuckDBPyConnection,
     source: MetadataSource,
     *,
     dataset_id: str | None = None,
 ) -> None:
     _register_source(connection, "profiles_source", source)
-    columns = _columns_for_source(source)
+    columns = connection.execute(
+        "SELECT * FROM profiles_source LIMIT 0"
+    ).columns
     projection = _profile_projection(columns, dataset_id=dataset_id)
     connection.execute(
         "CREATE OR REPLACE VIEW profiles AS "
         f"SELECT {projection} FROM profiles_source"
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/duckdb.py` around lines 172 - 184, The current implementation reads file-path schemas twice: once in _register_source and again in _columns_for_source; to avoid this, change _register_source to return the registered relation identifier or a DuckDB relation object (instead of only registering) and update _register_profiles_source to pass that relation into _columns_for_source (or use it directly to build the projection) so _columns_for_source/_profile_projection can reuse the already-loaded schema; update function signatures for _register_source and _columns_for_source accordingly and ensure _register_profiles_source calls the new return value when building projection.

tests/fakes.py (1)
54-92: Consider extracting a base class to reduce duplication.

`FakeCreateNamespaceCatalog` duplicates `load_table`, `create_table`, and `list_tables` implementations verbatim from `FakeCatalog`. While acceptable in test code, a base class could reduce maintenance burden if these stubs evolve.

Sketch of potential refactor
```python
class _BaseFakeCatalog:
    """Shared catalog stub behavior."""

    def __init__(
        self,
        table: FakeTable | None = None,
        tables: dict[tuple[str, ...], FakeTable] | None = None,
    ) -> None:
        self.table = table
        self.tables = {} if tables is None else dict(tables)
        self.created_identifiers: list[tuple[str, ...]] = []
        self.created_namespaces: list[tuple[str, ...]] = []

    def load_table(self, identifier: tuple[str, ...]) -> FakeTable:
        if identifier in self.tables:
            return self.tables[identifier]
        if self.table is None:
            raise NoSuchTableError(f"Missing table: {identifier!r}")
        return self.table

    def create_table(self, identifier: tuple[str, ...], schema: object) -> FakeTable:
        self.created_identifiers.append(identifier)
        self.table = FakeTable()
        self.tables[identifier] = self.table
        return self.table

    def list_tables(self, namespace: tuple[str, ...]) -> list[tuple[str, ...]]:
        return [
            identifier
            for identifier in self.tables
            if identifier[:-1] == namespace
        ]


class FakeCatalog(_BaseFakeCatalog):
    def create_namespace_if_not_exists(self, namespace: tuple[str, ...]) -> None:
        self.created_namespaces.append(namespace)


class FakeCreateNamespaceCatalog(_BaseFakeCatalog):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.namespaces: set[tuple[str, ...]] = set()

    def create_namespace(self, namespace: tuple[str, ...]) -> None:
        if namespace in self.namespaces:
            raise NamespaceAlreadyExistsError(
                f"Namespace already exists: {namespace!r}"
            )
        self.namespaces.add(namespace)
        self.created_namespaces.append(namespace)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fakes.py` around lines 54 - 92, Extract the shared table handling into a small base class (e.g. _BaseFakeCatalog) and move the duplicated implementations of load_table, create_table, list_tables and the common __init__ fields (table, tables, created_identifiers, created_namespaces) there; then have FakeCatalog and FakeCreateNamespaceCatalog inherit from _BaseFakeCatalog, keep FakeCatalog.create_namespace_if_not_exists as-is, and in FakeCreateNamespaceCatalog keep the specialized __init__ (to init namespaces set) and create_namespace (raising NamespaceAlreadyExistsError) but remove the duplicated methods. Reference classes/methods: FakeCreateNamespaceCatalog, FakeCatalog, _BaseFakeCatalog, load_table, create_table, list_tables, create_namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_catalog_integration.py`:
- Around line 315-323: The test stub's parameter type for join_keys is
incorrect: change the type annotation on the stub function parameter join_keys
from tuple[str, ...] to a compatible type (preferably
collections.abc.Sequence[str] or list[str]) so it matches the actual value
produced by _normalize_columns (which returns a list); also add the necessary
import for Sequence if you choose that option and update any references to
join_keys in the stub to reflect the new Sequence typing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 830e9caa-72c3-413e-ac7e-7297a2068238
📒 Files selected for processing (5)
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/publishing/image_assets.py
- tests/fakes.py
- tests/test_catalog_integration.py
- tests/test_publishing.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_publishing.py
- src/iceberg_bioimage/publishing/image_assets.py
---
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/test_publishing.py (1)
93-126: Consider adding an assertion for `created_namespaces`.

Unlike `test_publish_image_assets_creates_missing_table` (line 85), this test doesn't assert on `fake_catalog.created_namespaces`. Adding an explicit assertion would document whether namespaces are created when using an existing legacy table, making the expected behavior clearer.

📝 Proposed addition
```diff
     assert row_count == 1
+    assert fake_catalog.created_namespaces == []  # or expected value if namespaces are created
     assert fake_catalog.created_identifiers == []
     assert len(legacy_table.appends) == 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_publishing.py` around lines 93 - 126, In test_publish_image_assets_uses_existing_legacy_table, add an assertion that fake_catalog.created_namespaces is empty (e.g., assert fake_catalog.created_namespaces == []) to document and verify that no new namespaces are created when publishing to an existing legacy table; place this assertion after the existing assertions on created_identifiers and legacy_table.appends in the test function.tests/fakes.py (1)
54-58: Remove redundant method override.

`FakeCatalog.create_namespace_if_not_exists` is identical to the inherited `_BaseFakeCatalog.create_namespace_if_not_exists` (lines 45-46). This override can be removed.

🔧 Proposed fix
```diff
 class FakeCatalog(_BaseFakeCatalog):
     """Minimal catalog stub."""
-
-    def create_namespace_if_not_exists(self, namespace: tuple[str, ...]) -> None:
-        self.created_namespaces.append(namespace)
+    pass
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fakes.py` around lines 54 - 58, The FakeCatalog class contains a redundant override of create_namespace_if_not_exists that duplicates _BaseFakeCatalog.create_namespace_if_not_exists; remove the method from FakeCatalog so it inherits the base implementation (leave FakeCatalog definition and other methods unchanged), and run tests to confirm no behavioral change.src/iceberg_bioimage/integrations/cytomining.py (1)
366-374: Consider documenting memory implications for large datasets.

When `source` is a path, `ds.dataset(source).to_table()` loads the entire dataset into memory. For large profile datasets, this could cause memory pressure. Consider adding a docstring note about this limitation, or in future iterations, supporting streaming/batched writes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/iceberg_bioimage/integrations/cytomining.py` around lines 366 - 374, Add a brief docstring to _metadata_source_to_table noting that when source is a filesystem path the call ds.dataset(source).to_table() materializes the entire dataset into memory and can cause high memory usage for large profile datasets; mention the limitation and recommend using smaller partitions or a streaming/batched approach in future iterations to avoid memory pressure.
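A minimal sketch of the batched alternative hinted at above, assuming `source` points at a Parquet dataset on disk; the function name and destination path are illustrative:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def write_batched(source: str, destination: str) -> None:
    dataset = ds.dataset(source)
    writer = None
    try:
        # Stream record batches instead of materializing the whole table.
        for batch in dataset.to_batches():
            if writer is None:
                writer = pq.ParquetWriter(destination, batch.schema)
            writer.write_batch(batch)
    finally:
        if writer is not None:
            writer.close()
```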
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/iceberg_bioimage/validation/contracts.py`:
- Around line 188-196: The function load_profile_column_aliases silently returns
an empty dict when the TOML payload lacks the microscopy.aliases section; update
load_profile_column_aliases to detect when alias_section is missing or empty
(after tomllib.loads and alias_section = payload.get("microscopy",
{}).get("aliases", {})) and either emit a clear warning via the module logger
(create or use an existing logger) indicating the missing [microscopy.aliases]
section and the file path, or raise a descriptive exception (e.g., ValueError)
so callers know aliases were not loaded; ensure the chosen behavior is
consistent with existing project error-handling patterns.
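The warning variant of that fix could look like the sketch below; the payload access mirrors the prompt, while the return type and file encoding are assumptions:

```python
import logging
import tomllib
from pathlib import Path

logger = logging.getLogger(__name__)


def load_profile_column_aliases(path: Path) -> dict[str, list[str]]:
    payload = tomllib.loads(path.read_text(encoding="utf-8"))
    alias_section = payload.get("microscopy", {}).get("aliases", {})
    if not alias_section:
        # Surface the silent-empty case instead of returning {} quietly.
        logger.warning("No [microscopy.aliases] section found in %s", path)
    return alias_section
```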
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5170f16-4ebc-4ed4-ba66-a1652e5a89c8
📒 Files selected for processing (13)
- .pre-commit-config.yaml
- docs/src/duckdb.md
- docs/src/examples/basic-workflow.ipynb
- docs/src/examples/basic-workflow.py
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/__init__.py
- src/iceberg_bioimage/integrations/cytomining.py
- src/iceberg_bioimage/integrations/duckdb.py
- src/iceberg_bioimage/validation/contracts.py
- tests/fakes.py
- tests/test_catalog_integration.py
- tests/test_cytomining_integration.py
- tests/test_publishing.py
✅ Files skipped from review due to trivial changes (5)
- docs/src/duckdb.md
- .pre-commit-config.yaml
- tests/test_catalog_integration.py
- docs/src/examples/basic-workflow.ipynb
- docs/src/examples/basic-workflow.py
🚧 Files skipped from review as they are similar to previous changes (4)
- `src/iceberg_bioimage/integrations/__init__.py`
- tests/test_cytomining_integration.py
- src/iceberg_bioimage/cli.py
- src/iceberg_bioimage/integrations/duckdb.py
---
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/src/examples/basic-workflow.py (2)
69-73: Prevent silent overwrite in `DemoCatalog.create_table`.

Line 69 currently replaces existing tables without signaling an error. That can mask duplicate-create bugs in warehouse ingestion and discard prior in-memory appended data in this demo harness.
Suggested change
```diff
     def create_table(self, identifier: tuple[str, ...], schema: object) -> DemoTable:
+        if identifier in self.tables:
+            raise ValueError(f"Table already exists: {identifier!r}")
         self.created_identifiers.append(identifier)
         table = DemoTable()
         self.tables[identifier] = table
         return table
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/examples/basic-workflow.py` around lines 69 - 73, DemoCatalog.create_table currently overwrites an existing table silently; change it to first check if identifier already exists in self.tables and raise an explicit error (e.g., ValueError) instead of replacing the entry so duplicates are not lost. Ensure the existence check happens before appending to self.created_identifiers and before instantiating DemoTable so created_identifiers isn't duplicated when a table already exists.
95-108: Use an auto-cleaned temp directory for the example workspace.

Line 95 uses `tempfile.mkdtemp(...)` without cleanup. For local doc runs this leaves artifacts behind; wrapping with `TemporaryDirectory` keeps the example self-cleaning.

Suggested direction
```diff
-tmpdir = Path(tempfile.mkdtemp(prefix="iceberg-bioimage-demo-"))
+with tempfile.TemporaryDirectory(prefix="iceberg-bioimage-demo-") as tmp:
+    tmpdir = Path(tmp)
+    # keep the existing workflow inside this context
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/examples/basic-workflow.py` around lines 95 - 108, The example creates a temporary workspace with tempfile.mkdtemp into tmpdir which is never cleaned up; replace that usage with a context-managed TemporaryDirectory (from tempfile) and assign tmpdir = Path(tempdir_ctx) inside the context so the tmpdir, zarr_path, and tiff_path work inside the with block; update surrounding code that uses tmpdir, zarr_path, root, zarr.open_group, and tifffile.imwrite to live inside the with TemporaryDirectory() as tmpdir_ctx: block so the temp directory is automatically removed when the example finishes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/run-tests.yml:
- Around line 76-80: The "Check notebook review pairs" step currently runs
"python -m jupytext --sync docs/src/examples/*.ipynb" and then only runs "git
diff --exit-code docs/src/examples", which misses new untracked paired files;
update that job step to run the same shell sequence used in pyproject.toml's
tasks.notebooks.shell by invoking "jupytext --sync" and then running "git status
--porcelain docs/src/examples" (or just "git status --porcelain") and failing if
it reports any changes/untracked files so the workflow detects both modified and
newly created paired files after "jupytext --sync".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 92d855be-c870-45d7-b5b6-2d2ac58cd84a
📒 Files selected for processing (2)
- .github/workflows/run-tests.yml
- docs/src/examples/basic-workflow.py
Summary by CodeRabbit
- New Features
- Documentation
- Tests
- Chores