[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire (#62805)#63020
[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire (#62805)#63020liaoxin01 wants to merge 1 commit intoapache:branch-4.0from
Conversation
…ire (apache#62805) Fix two related bugs on the event-driven warm-up path. Together they stall BE's heavy work pool when a warm-up job is cancelled or expired while BE still has its id in `_tablet_replica_cache`. `be/src/cloud/cloud_warm_up_manager.cpp` used `st.is<CANCELED>()` (one L). `CANCELED` is not in `ErrorCode`; ADL resolved it to `PCacheStatus::CANCELED = 9` from a proto enum, so the check compared against 9 and was always false. When FE returned `TStatusCode.CANCELLED` (value 1) to tell BE a job was done, BE never pushed the `job_id` into `cancelled_jobs`, leaving a zombie entry in `_tablet_replica_cache` that every subsequent `commit_rowset` re-queried. Fix: use `st.is<ErrorCode::CANCELLED>()`, matching the same namespace-qualified form used elsewhere in the file. ```java if (job == null || job.isDone()) { LOG.info("warmup job {} is not running, notify caller BE {} to cancel job", job.getJobId(), clientAddr); // NPE when job == null ... } ``` Once a cancelled job was removed from `cloudWarmUpJobs` past `history_cloud_warm_up_job_keep_max_second`, `job` is null and the log call NPE'd. With bug apache#1 keeping the stale id in BE cache, BE kept hitting this RPC forever; each failure took the `apache::thrift::TException` branch in `thrift_rpc_helper.cpp`, which sleeps 2s inside `CloudWarmUpManager::_mtx`. That serialised `bthread_fork_join(commit_rowset)`, blocked heavy-pool threads in `CloudTabletsChannel::close`, and backed up the heavy-pool queue — leading to load timeouts and query `Fragment RPC Phase1` latency in the 10s range. Fix: log `request.getWarmUpJobId()` instead; it is guaranteed set by the enclosing `request.isSetWarmUpJobId()` check. (cherry picked from commit b1c1a5a)
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a cloud warm-up cancellation/expiry interaction where BE could keep stale warm-up job cache entries and FE could throw an NPE when a warm-up job has already been expired/removed, leading to repeated RPC failures and potential commit slowdowns. It also adds regression and unit tests to cover the cancellation + expiry scenario and the FE null-job path.
Changes:
- FE: avoid dereferencing a null warm-up job in
getTabletReplicaInfosand returnCANCELLEDfor missing/done jobs. - BE: treat FE’s cancellation status correctly (
ErrorCode::CANCELLED) so cancelled warm-up jobs are purged from_tablet_replica_cache. - Tests: add a new cloud regression test for cancel+expire behavior and a FE unit test to ensure no NPE when the warm-up job is missing.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy | Adds a multi-cluster regression test covering warm-up cancel + FE expiry and validating BE cache purge behavior. |
| fe/fe-core/src/test/java/org/apache/doris/service/FrontendServiceImplTest.java | Adds a unit test ensuring getTabletReplicaInfos returns CANCELLED (and does not NPE) when the warm-up job is missing. |
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Fixes null dereference in log message and returns CANCELLED when warm-up job is null/done. |
| be/src/cloud/cloud_warm_up_manager.cpp | Fixes cancellation status check so BE removes cancelled job cache entries correctly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| url = url.replace("http://", "https://") + " --cert " + context.config.otherConfigs.get("trustCert") + " --cacert " + context.config.otherConfigs.get("trustCACert") + " --key " + context.config.otherConfigs.get("trustCAKey") | ||
| } | ||
| def metrics = new URL(url).text |
|
run buildall |
Pick #62805