Skip to content

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire (#62805)#63020

Open
liaoxin01 wants to merge 1 commit intoapache:branch-4.0from
liaoxin01:pick-pr-62805-to-branch-4.0
Open

[fix](cloud) avoid NPE and clear stale cache on warmup job cancel/expire (#62805)#63020
liaoxin01 wants to merge 1 commit intoapache:branch-4.0from
liaoxin01:pick-pr-62805-to-branch-4.0

Conversation

@liaoxin01
Copy link
Copy Markdown
Contributor

Pick #62805

…ire (apache#62805)

Fix two related bugs on the event-driven warm-up path. Together they
stall BE's heavy work pool when a warm-up job is cancelled or expired
while BE still has its id in `_tablet_replica_cache`.

`be/src/cloud/cloud_warm_up_manager.cpp` used `st.is<CANCELED>()`
(one L). `CANCELED` is not in `ErrorCode`; ADL resolved it to
`PCacheStatus::CANCELED = 9` from a proto enum, so the check compared
against 9 and was always false. When FE returned `TStatusCode.CANCELLED`
(value 1) to tell BE a job was done, BE never pushed the `job_id` into
`cancelled_jobs`, leaving a zombie entry in `_tablet_replica_cache`
that every subsequent `commit_rowset` re-queried.

Fix: use `st.is<ErrorCode::CANCELLED>()`, matching the same
namespace-qualified form used elsewhere in the file.

```java
if (job == null || job.isDone()) {
    LOG.info("warmup job {} is not running, notify caller BE {} to cancel job",
            job.getJobId(), clientAddr);   // NPE when job == null
    ...
}
```

Once a cancelled job was removed from `cloudWarmUpJobs` past
`history_cloud_warm_up_job_keep_max_second`, `job` is null and the log
call NPE'd. With bug apache#1 keeping the stale id in BE cache, BE kept
hitting this RPC forever; each failure took the
`apache::thrift::TException` branch in `thrift_rpc_helper.cpp`, which
sleeps 2s inside `CloudWarmUpManager::_mtx`. That serialised
`bthread_fork_join(commit_rowset)`, blocked heavy-pool threads in
`CloudTabletsChannel::close`, and backed up the heavy-pool queue —
leading to load timeouts and query `Fragment RPC Phase1` latency in
the 10s range.

Fix: log `request.getWarmUpJobId()` instead; it is guaranteed set by
the enclosing `request.isSetWarmUpJobId()` check.

(cherry picked from commit b1c1a5a)
@liaoxin01 liaoxin01 requested a review from yiguolei as a code owner May 6, 2026 08:46
Copilot AI review requested due to automatic review settings May 6, 2026 08:46
@liaoxin01 liaoxin01 requested a review from morningman as a code owner May 6, 2026 08:46
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a cloud warm-up cancellation/expiry interaction where BE could keep stale warm-up job cache entries and FE could throw an NPE when a warm-up job has already been expired/removed, leading to repeated RPC failures and potential commit slowdowns. It also adds regression and unit tests to cover the cancellation + expiry scenario and the FE null-job path.

Changes:

  • FE: avoid dereferencing a null warm-up job in getTabletReplicaInfos and return CANCELLED for missing/done jobs.
  • BE: treat FE’s cancellation status correctly (ErrorCode::CANCELLED) so cancelled warm-up jobs are purged from _tablet_replica_cache.
  • Tests: add a new cloud regression test for cancel+expire behavior and a FE unit test to ensure no NPE when the warm-up job is missing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/cluster/test_warm_up_cluster_event_cancel_expired.groovy Adds a multi-cluster regression test covering warm-up cancel + FE expiry and validating BE cache purge behavior.
fe/fe-core/src/test/java/org/apache/doris/service/FrontendServiceImplTest.java Adds a unit test ensuring getTabletReplicaInfos returns CANCELLED (and does not NPE) when the warm-up job is missing.
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Fixes null dereference in log message and returns CANCELLED when warm-up job is null/done.
be/src/cloud/cloud_warm_up_manager.cpp Fixes cancellation status check so BE removes cancelled job cache entries correctly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +69 to +71
url = url.replace("http://", "https://") + " --cert " + context.config.otherConfigs.get("trustCert") + " --cacert " + context.config.otherConfigs.get("trustCACert") + " --key " + context.config.otherConfigs.get("trustCAKey")
}
def metrics = new URL(url).text
@liaoxin01
Copy link
Copy Markdown
Contributor Author

run buildall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants