
Commit 49b6fbf

therealalephclaude committed
fix: v1.9.9 — Android second disconnect crash + tunnel-node drain correctness
Android (#700 from @ilok67):
- Reordered MhrvVpnService.teardown() to call Native.stopProxy() FIRST. The previous order (tun2proxy.stop → tun.close → join → stopProxy) crashed with SIGSEGV ~2s after Disconnect: tun2proxy's worker thread was blocked in native code on a SOCKS5 socket read; after the 2s+4s timeouts expired with the worker still alive, Native.stopProxy freed the runtime, including that socket, and the worker hit a use-after-free on the next read. The old comment claimed "runtime shutdown will knock the rest of the world over" — wrong: Native.stopProxy can't forcibly terminate a separate native thread, it just frees memory the other thread is still using. The new order closes the socket first, the worker's blocking read returns with EOF, the worker exits cleanly through its error path, and the join is then near-instant.

tunnel-node (PR #695 from @dazzling-no-more, merged):
- Cleanup now tracks eof'd sids from drain_now's return value, not the raw atomic — it was silently dropping the tail on >16 MiB buffers when EOF arrived between polls.
- The phase-1 `data` op no longer holds the sessions map across upstream write/flush — it was head-of-line-blocking every other batch op.
- The mixed TCP+UDP batch wait switched from tokio::join! to tokio::select! — it was paying the UDP LONGPOLL_DEADLINE (15 s) on TCP-ready bursts.
- Watcher tasks are now wrapped in an AbortOnDrop newtype — select!'s loser arm was detaching watchers that kept Arc<Inner> alive and could steal notify permits.
- 2 new regression tests, 35/35 pass.

Example configs:
- config.exit-node.example.json: added aistudio.google.com + ai.google.dev to default hosts (#701 — AI Studio sanctions Iran IPs).
- config.fronting-groups.example.json: PR #696 from @Shjpr9 added Reddit/Fastly/Pinterest/CNN/BuzzFeed family domains on the Fastly 151.101.x.x edge.

Tests: 179 lib + 35 tunnel-node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d8d03be commit 49b6fbf

5 files changed

Lines changed: 76 additions & 33 deletions

File tree

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "mhrv-rs"
-version = "1.9.8"
+version = "1.9.9"
 edition = "2021"
 description = "Rust port of MasterHttpRelayVPN -- DPI bypass via Google Apps Script relay with domain fronting"
 license = "MIT"

android/app/src/main/java/com/therealaleph/mhrv/MhrvVpnService.kt

Lines changed: 49 additions & 30 deletions
@@ -332,19 +332,37 @@ class MhrvVpnService : VpnService() {
      *
      * Steps, with the bound on each one called out so a hung native call
      * cannot stall the whole teardown thread:
-     * 1. Signal tun2proxy to stop (cooperative). Bounded by a 2s
-     *    side-thread join — if the JNI call hangs we proceed anyway.
-     * 2. Drop our `ParcelFileDescriptor` reference. Because we already
+     * 1. Shut down the Rust proxy FIRST. This closes the listening
+     *    SOCKS5 socket that tun2proxy's worker thread is blocked on
+     *    a read() from. Killing the upstream socket is what makes the
+     *    worker's blocking native call return — we have no other lever
+     *    to wake it. Bounded by `rt.shutdown_timeout(3s)` Rust-side.
+     * 2. Signal tun2proxy to stop (cooperative). Mostly redundant after
+     *    step 1, but cheap and covers the rare path where the worker is
+     *    blocked on something other than its socket read (e.g. a
+     *    smoltcp internal queue waiting on a wake). Bounded by a 2s
+     *    side-thread join.
+     * 3. Drop our `ParcelFileDescriptor` reference. Because we already
      *    called detachFd() at startup, this is a no-op for the
      *    underlying fd — the worker (closeFdOnDrop=true) owns it.
      *    We keep the call only so the PROXY_ONLY / failed-establish
      *    paths still null out the field cleanly.
-     * 3. Join the tun2proxy thread, bounded at 4s. If the worker is
-     *    stuck we log and move on — the runtime shutdown below will
-     *    knock the rest of the world over.
-     * 4. Shut down the Rust proxy runtime, bounded by `rt.shutdown_timeout`
-     *    on the Rust side (5s). This is the hard backstop: the listener
-     *    socket is released here regardless of what the worker is doing.
+     * 4. Join the tun2proxy thread, bounded at 4s. With step 1 having
+     *    already closed the socket the worker was reading from, this
+     *    join almost always completes well under the deadline.
+     *
+     * History (#700 from @ilok67): the original order was
+     * tun2proxy → tun.close → join → stopProxy. That ordering crashed
+     * SIGSEGV ~2s after Disconnect because Native.stopProxy() freed the
+     * Rust runtime (including the SOCKS5 listener) while tun2proxy's
+     * worker was still in a blocking native read against it — classic
+     * use-after-free. The previous comment claimed "the runtime shutdown
+     * below will knock the rest of the world over," but Native.stopProxy
+     * cannot forcibly terminate a separate native thread; it just frees
+     * memory the other thread is still using. Reversing the order means
+     * the worker's blocking read returns with an EOF / socket-closed
+     * error, the worker exits through its own error path, and the join
+     * is effectively just confirming a clean shutdown.
      */
     private fun teardown() {
         // Idempotency guard. Without this, onDestroy racing the
@@ -363,11 +381,24 @@ class MhrvVpnService : VpnService() {
             "(tun2proxy running=${tun2proxyRunning.get()}, proxyHandle=$proxyHandle)",
         )

-        // 1. Cooperative stop signal — bounded so a hung Rust call cannot
-        //    stall the entire teardown thread. We've never observed
-        //    Tun2proxy.stop() block in practice, but the contract isn't
-        //    documented as bounded and the rest of teardown already takes
-        //    care to be timeout-bounded; this closes the gap.
+        // 1. Stop the Rust proxy FIRST. Closing the SOCKS5 listener is
+        //    what makes tun2proxy's worker thread's blocking read return
+        //    — without this the worker stays in native code and a later
+        //    Native.stopProxy would race it into use-after-free (#700).
+        val handle = proxyHandle
+        proxyHandle = 0L
+        if (handle != 0L) {
+            Log.i(TAG, "teardown: stopping proxy handle=$handle")
+            try { Native.stopProxy(handle) } catch (t: Throwable) {
+                Log.e(TAG, "Native.stopProxy threw: ${t.message}", t)
+            }
+        }
+
+        // 2. Cooperative stop signal — mostly redundant now that step 1
+        //    has yanked the socket out from under the worker, but cheap
+        //    and covers any future code path where the worker might be
+        //    blocked on something other than its upstream socket read.
+        //    Bounded so a hung JNI call can't stall teardown.
         if (tun2proxyRunning.get()) {
             val stopper = Thread({
                 try { Tun2proxy.stop() } catch (t: Throwable) {
@@ -380,7 +411,7 @@ class MhrvVpnService : VpnService() {
             }
         }

-        // 2. Drop our PFD reference. detachFd at startup means this
+        // 3. Drop our PFD reference. detachFd at startup means this
        //    close() is a no-op for the underlying fd — tun2proxy owns
        //    it (closeFdOnDrop = true) and closes it on return from
        //    run(). The call is kept only to null the field cleanly on
@@ -391,9 +422,9 @@ class MhrvVpnService : VpnService() {
         }
         tun = null

-        // 3. Join the worker. 4s is enough in the happy case; if tun2proxy
-        //    is stuck on something untoward we'd rather move on and force
-        //    the runtime shutdown than hang forever.
+        // 4. Join the worker. With step 1 having killed its upstream this
+        //    almost always completes immediately; the 4s budget is just
+        //    headroom for tun2proxy's internal close path to drain.
         try {
             tun2proxyThread?.join(4_000)
         } catch (_: InterruptedException) {}
@@ -403,18 +434,6 @@ class MhrvVpnService : VpnService() {
             Log.w(TAG, "tun2proxy thread still alive after join timeout — proceeding anyway")
         }

-        // 4. Shut down the Rust proxy. Backed by `rt.shutdown_timeout(3s)`
-        //    on the Rust side, so this is bounded even if the runtime
-        //    has in-flight tasks (common when the Apps Script relay has
-        //    piled up pending 30s timeouts).
-        val handle = proxyHandle
-        proxyHandle = 0L
-        if (handle != 0L) {
-            Log.i(TAG, "teardown: stopping proxy handle=$handle")
-            try { Native.stopProxy(handle) } catch (t: Throwable) {
-                Log.e(TAG, "Native.stopProxy threw: ${t.message}", t)
-            }
-        }
         // Flip UI state last — the button reverts to Connect only after
         // the native-side cleanup actually happened, not optimistically.
         VpnState.setProxyHandle(0L)

config.exit-node.example.json

Lines changed: 3 additions & 1 deletion
@@ -27,7 +27,9 @@
       "claude.ai",
       "x.com",
       "grok.com",
-      "openai.com"
+      "openai.com",
+      "aistudio.google.com",
+      "ai.google.dev"
     ]
   }
 }

docs/changelog/v1.9.9.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
<!-- see docs/changelog/v1.1.0.md for the file format: Persian, then `---`, then English. -->
[Persian section, translated:]
• Fix v1.9.8 Android: new crash ~2 s after Disconnect ([#700](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/700) from @ilok67, with full root cause + fix): despite the v1.9.8 fix for the lifecycle race (#666), a separate crash remained in `MhrvVpnService.teardown()`. Previous order: tun2proxy.stop → tun.close → join → Native.stopProxy. Problem: the tun2proxy worker thread is blocked in native code on a socket read from the SOCKS5 proxy. After Tun2proxy.stop is called, its 2 s timeout passes, and the 4 s join timeout passes (worker still alive), Native.stopProxy shuts down the Rust runtime including the listener socket — the worker thread, still in a blocking native read on that same socket → use-after-free → SIGSEGV. The old code comment claimed "runtime shutdown will knock the rest of the world over", which was wrong — Native.stopProxy cannot force-terminate another native thread. New order: **Native.stopProxy first** (closes the socket → the worker's blocking read returns with an error → the worker exits cleanly through its error path), then Tun2proxy.stop (cooperative, redundant but cheap) → tun.close → join (almost always immediate, since the worker has already finished). Thanks again to @ilok67 for the precise triage of the second crash.
• Fix tunnel-node batch drain correctness + lock contention (PR [#695](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/695) from @dazzling-no-more): four bugs — two correctness, two latency.
  - **Cleanup race dropped tail bytes:** for a session with buffer > 16 MiB plus EOF, `drain_now` correctly returned `eof=false` so the tail would drain on the next poll, but the cleanup loop read the same atomic, saw `true`, removed the session, aborted `reader_task`, and the tail was lost. Cleanup now follows `drain_now`'s return value — the session is removed only after the drain that reports `eof=true` has shipped. Fixes silent data loss on 1 Gbps+ VPSes where the buffer filled between polls.
  - **Sessions-map lock held across upstream awaits:** the phase-1 `data` op held the global sessions map across `last_active.lock`, `writer.lock`, `write_all`, and `flush` — head-of-line-blocking every other batch and connect/close op. Now (like `udp_data`, which was already correct) the Arc is cloned under the map lock, the lock is dropped, then write/flush.
  - **TCP+UDP batch paid the UDP deadline:** `tokio::join!(wait_tcp, wait_udp)` is conjunctive — a TCP-ready burst still paid UDP's 15 s LONGPOLL_DEADLINE before responding. The comment said "either side"; the code did "both sides". Switched to `select!`. The new test `batch_tcp_ready_does_not_pay_udp_longpoll_deadline` locks this in.
  - **Watcher tasks leaked under `select!` cancellation:** `wait_for_any_drainable` only aborted its watchers in a trailing loop — past all the cancellation points. With the phase-2 wait converted to `select!`, the loser arm's future is dropped and its watchers *detach* (dropping a `JoinHandle` does not abort). Each orphan held an `Arc<...Inner>` and could steal a `notify_one()` permit from the next batch. Fix: an `AbortOnDrop` newtype over every watcher `JoinHandle`.
  2 new tests + 35/35 pass.
• Example exit-node config now lists `aistudio.google.com` and `ai.google.dev` — requested in [#701](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/701). AI Studio sanctions Iran IPs (not our Apps Script side). The exit-node makes the destination see val.town's IP, which is neither Iran nor a Google datacenter.
• Example fronting-groups config gained Reddit / Fastly / Pinterest / CNN / BuzzFeed family domains (PR [#696](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/696) from @Shjpr9). All on Fastly Anycast 151.101.x.x — users can pick up more domains from the example and keep whichever work on their network.
• Tests: 179 lib + 35 tunnel-node, all passing.
---
• Fix Android crash ~2 s after Disconnect, introduced in v1.9.8 ([#700](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/700) by @ilok67 with full root cause + fix): despite the v1.9.8 fix for the lifecycle race (#666), a separate crash inside `MhrvVpnService.teardown()` remained. Old order was tun2proxy.stop → tun.close → join → Native.stopProxy. Problem: tun2proxy's worker thread is blocked in native code on a socket read from the proxy's SOCKS5 port. After `Tun2proxy.stop()`'s 2s timeout and the 4s thread join both expire (worker still alive), `Native.stopProxy()` shuts down the Rust runtime — including the listener socket — and the worker, still reading from that socket in native code, hits a use-after-free → SIGSEGV. The old code comment claimed "the runtime shutdown will knock the rest of the world over," which was wrong: `Native.stopProxy` cannot forcibly terminate a separate native thread. New order: **`Native.stopProxy` FIRST** (closes the socket → worker's blocking read returns with EOF/error → worker exits cleanly through its error path), then `Tun2proxy.stop` (cooperative, mostly redundant but cheap), `tun.close`, then `join` (almost always immediate now). Thanks @ilok67 again for the precise root-cause work on the second crash.
• Fix tunnel-node batch drain correctness + lock contention (PR [#695](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/695) from @dazzling-no-more): four bugs — two correctness, two latency.
  - **Cleanup race dropped tail bytes:** when a session's read buffer > 16 MiB and upstream signaled EOF, `drain_now` correctly returned `eof=false` and left the tail for the next poll, but the cleanup loop read the raw atomic, saw `true`, removed the session, aborted `reader_task`, dropped the tail. Cleanup now tracks eof'd sids from `drain_now`'s return value — the session is only removed once the drain that returned `eof=true` has shipped to the client. Silent data loss on 1 Gbps+ VPSes that filled the buffer between polls — fixed.
  - **Sessions-map lock held across upstream awaits:** the phase-1 `data` op held the global sessions map across `last_active.lock`, `writer.lock`, `write_all`, and `flush` — head-of-line-blocking every other batch and connect/close op. Now (mirroring `udp_data`'s already-correct shape) it clones the `Arc` under the map lock, drops the lock, then awaits.
  - **Mixed TCP+UDP batch paid the slower side's deadline:** `tokio::join!(wait_tcp, wait_udp)` is conjunctive — a TCP-ready burst still paid the UDP `LONGPOLL_DEADLINE` (15 s) before responding. The comment said "either side"; the code did "both sides". Switched to `tokio::select!`. The new test `batch_tcp_ready_does_not_pay_udp_longpoll_deadline` locks down the regression.
  - **Watcher tasks leaked under `select!` cancellation:** `wait_for_any_drainable` only aborted its watcher tasks in a trailing loop, past every cancellation point. With the phase-2 wait flipped to `select!`, the loser arm's future drops and *detaches* its watchers (dropping a `JoinHandle` doesn't abort). Each orphan held an `Arc<...Inner>` and could steal a `notify_one()` permit from a future batch. Fix: an `AbortOnDrop` newtype wraps every watcher `JoinHandle`.
  2 new tests + 35/35 pass.
• Example config exit-node now lists `aistudio.google.com` and `ai.google.dev` — requested in [#701](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/701). AI Studio sanctions Iran IPs (independently of any Apps Script issue on our side). Routing it through the exit-node makes the destination see val.town's IP, which is neither Iran nor a Google datacenter.
• Example config fronting-groups gained Reddit / Fastly / Pinterest / CNN / BuzzFeed family domains (PR [#696](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/696) from @Shjpr9). All on the Fastly Anycast `151.101.x.x` edge — gives users a richer starter list to trim down based on what works in their network.
• Tests: 179 lib + 35 tunnel-node tests all passing.
