
Commit 93fac57

v0.9.3: accept-loop backoff on EMFILE + louder rlimit diagnostics (issue #18)
@Behzad9 on #18: the OpenWRT 'No file descriptors available' errors are back in v0.8.0+, this time logged as a wall of thousands of identical ERRORs within seconds of activating the proxy. Two real bugs, now fixed:

=== 1. accept() loop had no backoff ===

Previous code:

    loop {
        match listener.accept().await {
            Ok(x) => ...,
            Err(e) => { tracing::error!(...); continue; } // tight loop
        }
    }

On EMFILE (RLIMIT_NOFILE exhausted), accept() returns synchronously, the match re-runs instantly, accept() EMFILEs again, forever. The tight loop ALSO starves the tokio runtime of the CPU that existing connections need to finish and close their fds — so the problem never clears on its own. It's a self-sustaining meltdown.

New accept_backoff() helper (in proxy_server.rs) wraps both the HTTP and SOCKS5 accept loops:

- Detects EMFILE/ENFILE via raw_os_error (24 or 23).
- Sleeps proportional to how long the pressure has lasted (50 ms on the first hit, ramping to a 2 s cap around hit #40). Gives existing connections a chance to finish and free fds.
- Rate-limits the log line: one WARN on the first EMFILE with fix instructions, then one every 100 retries. No more walls of identical errors.
- Resets the counter on the next successful accept.
- Non-EMFILE errors (ECONNABORTED from clients that went away during the handshake, etc.) get a plain single-line error + a 5 ms sleep so we still don't tight-loop on any unexpected error.

End-to-end verified: ran mhrv-rs under , flooded the SOCKS5 port with 247 concurrent connections to trip EMFILE. Before: the log would have been thousands of identical lines. After: exactly one warning, the listener stayed quiet, fds drained, accept resumed.

=== 2. RLIMIT_NOFILE bump was too conservative + silent ===

Previous behavior: target 16384 soft, cap to the existing hard limit, no log. On constrained systems where the hard limit is already tiny, we'd stay at the tiny limit silently.

rlimit.rs now:

- Targets 65536 soft.
- ALSO tries to raise the hard limit up to /proc/sys/fs/nr_open on Linux (Linux allows a non-privileged process to bump its own hard limit up to the kernel ceiling, usually 1048576 on modern kernels). On macOS/BSD we skip this and only bump the soft limit.
- Logs a WARN on startup if the soft limit ends up <4096, with the exact fix ('ulimit -n 65536' or use the procd init). No more silent failure.
- Logs INFO with the before/after limits otherwise, so field bug reports tell us immediately whether the kernel cap is the real bottleneck.

Moved the rlimit call from main() pre-logging to post-init_logging so its tracing output actually lands in the log panel + stderr. Small reorganization only.

49 tests pass; musl x86_64 cross-compile verified locally.
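The backoff schedule in section 1 is a plain linear ramp with a cap, mirroring the `(50u64 * count.min(40)).min(2000)` expression in accept_backoff(). A self-contained sketch (the `backoff_ms` helper name is illustrative, not a function in the codebase):

```rust
/// Linear ramp with a cap: 50 ms on the first consecutive EMFILE,
/// 500 ms by the 10th, and a flat 2 s from the 40th onward.
fn backoff_ms(count: u64) -> u64 {
    (50u64 * count.min(40)).min(2000)
}

fn main() {
    assert_eq!(backoff_ms(1), 50);      // first hit
    assert_eq!(backoff_ms(10), 500);    // ten consecutive failures
    assert_eq!(backoff_ms(40), 2000);   // cap reached at hit #40
    assert_eq!(backoff_ms(1000), 2000); // stays capped
    println!("ok");
}
```

Worst case a recovering listener waits 2 s for one free fd, which is acceptable for a local proxy and avoids the CPU starvation described above.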
1 parent ca947d7 commit 93fac57

5 files changed

Lines changed: 175 additions & 46 deletions


Cargo.lock

Lines changed: 1 addition & 1 deletion

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "mhrv-rs"
-version = "0.9.2"
+version = "0.9.3"
 edition = "2021"
 description = "Rust port of MasterHttpRelayVPN -- DPI bypass via Google Apps Script relay with domain fronting"
 license = "MIT"

src/main.rs

Lines changed: 7 additions & 4 deletions
@@ -121,10 +121,6 @@ async fn main() -> ExitCode {
     // Install default rustls crypto provider (ring).
     let _ = rustls::crypto::ring::default_provider().install_default();
 
-    // Bump RLIMIT_NOFILE where possible — OpenWRT/Alpine hosts often ship a
-    // default so low the proxy runs out of fds under normal browser load.
-    mhrv_rs::rlimit::raise_nofile_limit_best_effort();
-
     let args = match parse_args() {
         Ok(a) => a,
         Err(e) => {
@@ -170,6 +166,13 @@ async fn main() -> ExitCode {
 
     init_logging(&config.log_level);
 
+    // Bump RLIMIT_NOFILE now that tracing is live — OpenWRT/Alpine hosts
+    // often ship a default so low (issue #8, issue #18) that we run out
+    // of fds under normal proxy load. This logs the before/after values
+    // at info level so field reports tell us whether the kernel cap is
+    // the real culprit.
+    mhrv_rs::rlimit::raise_nofile_limit_best_effort();
+
     match args.command {
         Command::Test => {
             let ok = test_cmd::run(&config).await;

src/proxy_server.rs

Lines changed: 66 additions & 4 deletions
@@ -180,11 +180,15 @@ impl ProxyServer {
         let http_mitm = self.mitm.clone();
         let http_ctx = self.rewrite_ctx.clone();
         let mut http_task = tokio::spawn(async move {
+            let mut fd_exhaust_count: u64 = 0;
             loop {
                 let (sock, peer) = match http_listener.accept().await {
-                    Ok(x) => x,
+                    Ok(x) => {
+                        fd_exhaust_count = 0;
+                        x
+                    }
                     Err(e) => {
-                        tracing::error!("accept (http): {}", e);
+                        accept_backoff("http", &e, &mut fd_exhaust_count).await;
                         continue;
                     }
                 };
@@ -204,11 +208,15 @@ impl ProxyServer {
         let socks_mitm = self.mitm.clone();
         let socks_ctx = self.rewrite_ctx.clone();
         let mut socks_task = tokio::spawn(async move {
+            let mut fd_exhaust_count: u64 = 0;
             loop {
                 let (sock, peer) = match socks_listener.accept().await {
-                    Ok(x) => x,
+                    Ok(x) => {
+                        fd_exhaust_count = 0;
+                        x
+                    }
                     Err(e) => {
-                        tracing::error!("accept (socks): {}", e);
+                        accept_backoff("socks", &e, &mut fd_exhaust_count).await;
                         continue;
                     }
                 };
@@ -240,6 +248,60 @@ impl ProxyServer {
     }
 }
 
+/// Back-off helper for the accept() loop.
+///
+/// Motivated by issue #18: when the process hits its file-descriptor limit
+/// (EMFILE — `No file descriptors available`), `accept()` returns that
+/// error synchronously and is immediately ready to fire again. The old
+/// loop just `continue`'d, producing a wall of identical ERROR lines,
+/// thousands per second, and starving the tokio runtime of CPU that
+/// existing connections would have used to drain and close.
+///
+/// Two things this does right:
+/// 1. Sleeps when `EMFILE` / `ENFILE` are seen, proportional to how long
+///    the problem has been going on (exponential-ish, capped at 2 s).
+///    Gives existing connections a chance to finish and free fds.
+/// 2. Rate-limits the log line: the first occurrence logs a full warning
+///    with fix instructions; subsequent ones log once per 100 errors
+///    so the log doesn't fill up.
+async fn accept_backoff(kind: &str, err: &std::io::Error, count: &mut u64) {
+    let is_fd_limit = matches!(
+        err.raw_os_error(),
+        Some(libc_emfile) if libc_emfile == 24 || libc_emfile == 23
+    );
+
+    *count = count.saturating_add(1);
+
+    if is_fd_limit {
+        if *count == 1 {
+            tracing::warn!(
+                "accept ({}) hit RLIMIT_NOFILE: {}. Backing off. Raise the fd limit: \
+                 `ulimit -n 65536` before starting, or (OpenWRT) use the shipped procd \
+                 init which sets nofile=16384. The listener will keep retrying.",
+                kind,
+                err
+            );
+        } else if *count % 100 == 0 {
+            tracing::warn!(
+                "accept ({}) still fd-limited after {} retries. Current connections \
+                 need to finish before we can accept new ones.",
+                kind,
+                *count
+            );
+        }
+        // Back off exponentially-ish up to 2 s. First hit: 50 ms, 10th hit:
+        // 500 ms, 40th and beyond: 2 s cap.
+        let backoff_ms = (50u64 * (*count).min(40)).min(2000);
+        tokio::time::sleep(std::time::Duration::from_millis(backoff_ms)).await;
+    } else {
+        // Transient non-EMFILE error (e.g. ECONNABORTED from a client that
+        // went away during the handshake). One-line log, short sleep to
+        // avoid a tight loop in case it repeats.
+        tracing::error!("accept ({}): {}", kind, err);
+        tokio::time::sleep(std::time::Duration::from_millis(5)).await;
+    }
+}
+
 async fn handle_http_client(
     mut sock: TcpStream,
     fronter: Arc<DomainFronter>,
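The classification above keys off `raw_os_error`. A std-only sketch of that check plus the log rate-limit decision, testable without a live socket (the `is_fd_exhaustion` and `should_warn` helper names are illustrative, and the errno values 24/23/103 are the Linux ones, also matching the BSDs for EMFILE/ENFILE):

```rust
use std::io::Error;

// EMFILE (per-process fd limit, errno 24) and ENFILE (system-wide file
// table full, errno 23) are the two "out of descriptors" cases the
// accept loop backs off on.
fn is_fd_exhaustion(err: &Error) -> bool {
    matches!(err.raw_os_error(), Some(24) | Some(23))
}

// Full warning on the first consecutive failure, then one per 100.
fn should_warn(count: u64) -> bool {
    count == 1 || count % 100 == 0
}

fn main() {
    let emfile = Error::from_raw_os_error(24);
    let conn_aborted = Error::from_raw_os_error(103); // ECONNABORTED on Linux
    assert!(is_fd_exhaustion(&emfile));
    assert!(!is_fd_exhaustion(&conn_aborted));

    assert!(should_warn(1));
    assert!(!should_warn(2));
    assert!(should_warn(100));
    assert!(should_warn(200));
    println!("ok");
}
```

Constructing the errors with `Error::from_raw_os_error` makes the check unit-testable; in the real loop the errors come straight from `accept()`.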

src/rlimit.rs

Lines changed: 100 additions & 36 deletions
@@ -1,23 +1,27 @@
 //! Best-effort file descriptor limit bump on Unix.
 //!
-//! Context (issue #8): on OpenWRT routers — and some minimal Alpine / BSD
-//! installs — the default `RLIMIT_NOFILE` is so low (often 1024 or even
-//! 512) that a browser's burst of ~30 parallel subresource requests fills
-//! the limit within seconds. Once the limit is hit `accept(2)` returns
-//! `EMFILE` and the user sees:
+//! Context (issues #8 + #18): on OpenWRT routers — and some minimal
+//! Alpine / BSD installs — the default `RLIMIT_NOFILE` is so low
+//! (often 1024 or even 256 / 128 on constrained devices) that a
+//! browser's burst of ~30 parallel subresource requests, or a DNS-over-
+//! SOCKS5 flood from a client like v2ray, fills the limit within seconds.
+//! Once the limit is hit `accept(2)` returns `EMFILE` and the user sees:
 //!
 //!     ERROR accept (socks): No file descriptors available (os error 24)
 //!
-//! This helper raises the soft limit up to the hard limit (without
-//! requiring root), so the user gets whatever headroom the kernel
-//! already allows them. Failures are logged and swallowed.
+//! Approach:
+//! - Try to raise the SOFT limit to a generous target.
+//! - If the HARD limit is also low, try to raise THAT too — Linux lets
+//!   a non-root process bump its hard limit up to `/proc/sys/fs/nr_open`.
+//! - Log what we ended up with so a user filing a bug report can tell
+//!   us whether their kernel cap is below what a real proxy needs.
 
 #[cfg(unix)]
 pub fn raise_nofile_limit_best_effort() {
-    // Target: 16384 if the hard limit allows it, else whatever the hard
-    // limit is. 16k matches what most modern desktop distros default to and
-    // is plenty for a local proxy.
-    const DESIRED: u64 = 16_384;
+    // Ambitious target. 65536 is plenty for even heavy router use (a
+    // whole LAN doing browser + DNS + Telegram over our SOCKS5). Costs
+    // ~0 kernel memory until actually used.
+    const DESIRED: u64 = 65_536;
 
     unsafe {
         let mut rl = libc::rlimit {
@@ -26,41 +30,101 @@ pub fn raise_nofile_limit_best_effort() {
         };
         if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) != 0 {
             let err = std::io::Error::last_os_error();
-            tracing::debug!("getrlimit(RLIMIT_NOFILE) failed: {}", err);
+            tracing::warn!("getrlimit(RLIMIT_NOFILE) failed: {}", err);
             return;
         }
+        let original_soft = rl.rlim_cur as u64;
+        let original_hard = rl.rlim_max as u64;
 
-        // Already high enough? Leave it.
-        let current = rl.rlim_cur as u64;
-        let hard = rl.rlim_max as u64;
-        if current >= DESIRED {
-            return;
+        // Figure out an absolute ceiling. On Linux, /proc/sys/fs/nr_open
+        // is the highest the kernel will ever let a process set its
+        // RLIMIT_NOFILE. Read it and use it as our hard-limit target.
+        // On macOS/BSD this file doesn't exist — we just keep the
+        // existing hard limit.
+        let kernel_ceiling = read_nr_open().unwrap_or(original_hard);
+        let want_hard = DESIRED.max(original_hard).min(kernel_ceiling);
+
+        // Step 1: raise the hard limit if it's below what we want. This
+        // can only go UP on non-privileged processes (lowering it is
+        // permanent and requires CAP_SYS_RESOURCE to undo).
+        if want_hard > original_hard {
+            rl.rlim_max = want_hard as libc::rlim_t;
+            rl.rlim_cur = want_hard as libc::rlim_t;
+            if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
+                let err = std::io::Error::last_os_error();
+                tracing::debug!(
+                    "setrlimit raising hard {}→{} failed: {} (trying soft-only)",
+                    original_hard,
+                    want_hard,
+                    err
+                );
+                // Fall through to step 2 with the unmodified hard limit.
+                rl.rlim_max = original_hard as libc::rlim_t;
+            }
         }
 
-        let new_soft = DESIRED.min(hard);
-        if new_soft <= current {
-            return;
+        // Step 2: raise soft up to whatever hard allows.
+        let effective_hard = rl.rlim_max as u64;
+        let want_soft = DESIRED.min(effective_hard);
+        if want_soft > original_soft {
+            rl.rlim_cur = want_soft as libc::rlim_t;
+            if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
+                let err = std::io::Error::last_os_error();
+                tracing::warn!(
+                    "setrlimit raising soft {}→{} failed: {}",
+                    original_soft,
+                    want_soft,
+                    err
+                );
+                return;
+            }
         }
 
-        rl.rlim_cur = new_soft as libc::rlim_t;
-        if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
-            let err = std::io::Error::last_os_error();
-            tracing::debug!(
-                "setrlimit(RLIMIT_NOFILE) {} -> {} failed: {}",
-                current,
-                new_soft,
-                err
+        // Re-read and report.
+        let mut now = libc::rlimit {
+            rlim_cur: 0,
+            rlim_max: 0,
+        };
+        let _ = libc::getrlimit(libc::RLIMIT_NOFILE, &mut now);
+        let soft = now.rlim_cur as u64;
+        let hard = now.rlim_max as u64;
+
+        if soft < 4096 {
+            // This is genuinely too low for a local proxy under LAN load.
+            // Log loudly so the user knows their system is the bottleneck,
+            // not us.
+            tracing::warn!(
+                "RLIMIT_NOFILE is {}/{} (soft/hard). This is likely too low for a \
+                 proxy under any real load and WILL cause 'No file descriptors \
+                 available' errors. On OpenWRT, ensure you're starting via the \
+                 shipped procd init script (which sets nofile=16384), or add \
+                 `ulimit -n 65536` to your startup script.",
+                soft,
+                hard,
             );
-            return;
+        } else {
+            tracing::info!(
+                "RLIMIT_NOFILE = {}/{} (soft/hard), was {}/{} at startup",
+                soft,
+                hard,
+                original_soft,
+                original_hard,
+            );
         }
-        tracing::info!(
-            "raised RLIMIT_NOFILE: {} -> {} (hard={})",
-            current,
-            new_soft,
-            hard
-        );
     }
 }
 
+#[cfg(target_os = "linux")]
+fn read_nr_open() -> Option<u64> {
+    std::fs::read_to_string("/proc/sys/fs/nr_open")
+        .ok()
+        .and_then(|s| s.trim().parse::<u64>().ok())
+}
+
+#[cfg(all(unix, not(target_os = "linux")))]
+fn read_nr_open() -> Option<u64> {
+    None
+}
+
 #[cfg(not(unix))]
 pub fn raise_nofile_limit_best_effort() {}
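The two-step limit computation above reduces to pure clamping arithmetic. A sketch of just that math (the `want_hard`/`want_soft` function names are illustrative; the real code computes these inline), useful for reasoning about the constrained-router cases from issue #18:

```rust
const DESIRED: u64 = 65_536;

// Hard-limit target: at least DESIRED, never above the kernel ceiling
// (/proc/sys/fs/nr_open on Linux), and never below the hard limit the
// process already has (a non-root process cannot re-raise a lowered one).
fn want_hard(original_hard: u64, kernel_ceiling: u64) -> u64 {
    DESIRED.max(original_hard).min(kernel_ceiling)
}

// Soft-limit target: DESIRED, clamped to whatever hard limit step 1
// left us with.
fn want_soft(effective_hard: u64) -> u64 {
    DESIRED.min(effective_hard)
}

fn main() {
    // Typical OpenWRT: hard limit 4096, kernel ceiling 1048576 —
    // both limits end up at the 65536 target.
    assert_eq!(want_hard(4096, 1_048_576), 65_536);
    assert_eq!(want_soft(65_536), 65_536);

    // Kernel ceiling below the target: stay clamped to the ceiling.
    assert_eq!(want_hard(4096, 32_768), 32_768);
    assert_eq!(want_soft(32_768), 32_768);

    // Hard limit already generous: never lower it.
    assert_eq!(want_hard(1_048_576, 1_048_576), 1_048_576);
    println!("ok");
}
```

This is why the WARN-below-4096 log matters: if `kernel_ceiling` itself is tiny, no amount of setrlimit calls can help, and only the log reveals that.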
