# Linux kernel crash report Crash reported by Anderson Nascimento via LKML: [CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com](https://lore.kernel.org/r/CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com) ## Key elements | Field | Value | Implication | |-----------------|-------|-------------| | UNAME | `6.18.13-200.fc43.x86_64` | | | DISTRO | Fedora | | | DISTRO_VERSION | fc43 | | | PROCESS | `krxrpcio/7001` | Kernel I/O thread for the rxrpc local endpoint on UDP port 7001 | | HARDWARE | VMware Virtual Platform/440BX Desktop Reference Platform | | | TAINT | Not tainted | | | CRASH_TYPE | `BUG()` — invalid opcode (UD2) | Unconditional kernel assertion failure | | CRASH_LOCATION | `net/rxrpc/conn_client.c:64` | | | CONFIG_REQUIRED | (unconditional — fires in all builds) | Plain `BUG()`, no `CONFIG_` guard | | MSGID | `CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com` | | | MSGID_URL | [CAPhRvky…@mail.gmail.com](https://lore.kernel.org/r/CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com) | | | SOURCEDIR | `oops-workdir/linux` @ tag `kernel-6.18.13-0` | | | VMLINUX | `oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/vmlinux` | | | MODULES_DIR | `oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/kernel` | | | INTRODUCED-BY | [`9d35d880e0e4`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d35d880e0e4a3ab32d8c12f9e4d76198aadd42d) — rxrpc: Move client call connection to the I/O thread | Restructured connection setup; introduced the race window | | FIXED-BY | [`b1fdb0bb3b65`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b1fdb0bb3b6513f5bd26f92369fd6ac1a2422d8b) — rxrpc: Fix missing locking causing hanging calls | Fixes the missing `client_call_lock` in the disconnect-before-connect path | ## Kernel modules | Module | Flags | Backtrace | Location | Flag Implication | |--------|-------|-----------|----------|------------------| | rxrpc | | Y | 
`oops-workdir/fedora/files-6.18.13/.../net/rxrpc/rxrpc.ko.debug` | | | vsock_diag | | | | | | fcrypt | | | | | | pcbc | | | | | | ip6_udp_tunnel | | | | | | krb5 | | | | | | udp_tunnel | | | | | | rfkill | | | | | | *(remaining modules omitted — none appear in backtrace)* | | | | | ## Backtrace | Address | Function | Offset | Size | Context | Module | Source location | |---------|----------|--------|------|---------|--------|-----------------| | `0x160d8 (0x16080 + 0x58)` | `rxrpc_purge_client_connections` | `0x58` | `0xa0` | Task | rxrpc | [conn_client.c:64](#1-rxrpc_destroy_client_conn_ids-inlined--crash-site-conn_clientc64) | | `0x21ab9 (0x219f0 + 0xc9)` | `rxrpc_destroy_local` | `0xc9` | `0xe0` | Task | rxrpc | [local_object.c:451](#2-rxrpc_destroy_local--call-site-local_objectc451) | | `0x1f3cd (0x1ed70 + 0x65d)` | `rxrpc_io_thread` | `0x65d` | `0x750` | Task | rxrpc | io_thread.c:598 | | *(skipped `? __pfx_rxrpc_io_thread`)* | | | | | | | | `0xffffffff813f24ec (0xffffffff813f23f0 + 0xfc)` | `kthread` | `0xfc` | `0x240` | Task | vmlinux | kthread.c:463 | | *(skipped `? 
__pfx_kthread`)* | | | | | | | | `0xffffffff8132ab54 (0xffffffff8132aa60 + 0xf4)` | `ret_from_fork` | `0xf4` | `0x110` | Task | vmlinux | process.c:158 | | `0xffffffff812d8dca (0xffffffff812d8db0 + 0x1a)` | `ret_from_fork_asm` | `0x1a` | `0x30` | Task | vmlinux | entry_64.S:245 | ## CPU registers | Register | Value | Note | |----------|-------|------| | RIP | `0010:rxrpc_purge_client_connections+0x58/0xa0 [rxrpc]` | Crash site | | RSP | `ffffc900159cfdd8` | | | EFLAGS | `00010246` | ZF set (zero flag) — last comparison was equal | | RAX | `0000000000000000` | Return value 0, likely NULL from the final `idr_get_next()` of the leak-report loop | | RBX | `ffff88810a6b4800` | `struct rxrpc_local *local` | | RDI | `ffff88810a6b4920` | Pointer into local (offset 0x120 = `conn_ids` field) | | CR2 | `00007faf20630030` | User-space address — not relevant to this crash | **Code bytes:** `… 0f 85 49 dd 01 00 <0f> 0b …` The `<0f> 0b` sequence is the UD2 instruction — the x86 encoding of `BUG()`. The preceding `0f 85` (JNZ) was not taken: ZF is set, matching EFLAGS above, so execution fell through onto the UD2. That JNZ is most plausibly the continuation test of the `idr_for_each_entry()` leak-report loop, which exits (and reaches the `BUG()`) once `idr_get_next()` returns NULL. ## Backtrace source code ### 1. `rxrpc_destroy_client_conn_ids` (inlined) — crash site (`conn_client.c:64`) addr2line resolves `rxrpc_purge_client_connections+0x58` to `rxrpc_destroy_client_conn_ids` inlined into `rxrpc_purge_client_connections` at `conn_client.c:145`.
The crash fires inside the inlined callee: [net/rxrpc/conn_client.c — kernel-6.18.13-0](https://gitlab.com/cki-project/kernel-ark/-/blob/kernel-6.18.13-0/net/rxrpc/conn_client.c?ref_type=tags#L54) ```c 54 static void rxrpc_destroy_client_conn_ids(struct rxrpc_local *local) 55 { 56 struct rxrpc_connection *conn; 57 int id; 58 59 if (!idr_is_empty(&local->conn_ids)) { // ← branch taken: conn_ids is NOT empty 60 idr_for_each_entry(&local->conn_ids, conn, id) { 61 pr_err("AF_RXRPC: Leaked client conn %p {%d}\n", 62 conn, refcount_read(&conn->ref)); 63 } 64 BUG(); // ← crash here — refcount={1}, one connection leaked 65 } 66 67 idr_destroy(&local->conn_ids); 68 } ``` One connection was logged: `00000000bf02a6a7 {1}` — refcount=1 means one reference holder has not released it. ### 2. `rxrpc_destroy_local` — call site (`local_object.c:451`) [net/rxrpc/local_object.c — kernel-6.18.13-0](https://gitlab.com/cki-project/kernel-ark/-/blob/kernel-6.18.13-0/net/rxrpc/local_object.c?ref_type=tags#L420) ```c 420 void rxrpc_destroy_local(struct rxrpc_local *local) 421 { ... 432 rxrpc_clean_up_local_conns(local); // ← only cleans idle_client_conns list 433 rxrpc_service_connection_reaper(&rxnet->service_conn_reaper); 434 ASSERT(!local->service); ... 451 rxrpc_purge_client_connections(local); // ← call site → crash 452 page_frag_cache_drain(&local->tx_alloc); 453 } ``` `rxrpc_clean_up_local_conns()` (called at line 432) only drains the `local->idle_client_conns` list. A connection that has been allocated and added to `conn_ids` but has not yet reached the idle state is invisible to it. ### 3. 
`rxrpc_clean_up_local_conns` — cleanup gap (`conn_client.c:813`) [net/rxrpc/conn_client.c — kernel-6.18.13-0](https://gitlab.com/cki-project/kernel-ark/-/blob/kernel-6.18.13-0/net/rxrpc/conn_client.c?ref_type=tags#L813) ```c 813 void rxrpc_clean_up_local_conns(struct rxrpc_local *local) 814 { 815 struct rxrpc_connection *conn; 816 817 local->kill_all_client_conns = true; 818 timer_delete_sync(&local->client_conn_reap_timer); 819 820 while ((conn = list_first_entry_or_null(&local->idle_client_conns, // ← only idle! 821 struct rxrpc_connection, cache_link))) { 822 list_del_init(&conn->cache_link); 823 atomic_dec(&conn->active); 824 trace_rxrpc_client(conn, -1, rxrpc_client_discard); 825 rxrpc_unbundle_conn(conn); 826 rxrpc_put_connection(conn, rxrpc_conn_put_local_dead); 827 } 828 } ``` Connections move to `idle_client_conns` only in `rxrpc_disconnect_client_call()`, after all channels on the connection go idle. An in-flight connection never reaches this list before cleanup runs. ## What-how-where analysis ### What The `BUG()` at `net/rxrpc/conn_client.c:64` fires because `local->conn_ids` IDR is non-empty when the local rxrpc endpoint is destroyed. The kernel logged one leaked connection (`00000000bf02a6a7`, refcount=1), meaning exactly one reference holder has not released the connection before destruction. The IDR `conn_ids` is the definitive registry of all live client connections on a local endpoint. Its non-emptiness at destruction time means a connection was allocated (`idr_alloc_cyclic`) but never freed (`idr_remove` via `rxrpc_put_client_connection_id`). That release only happens inside `rxrpc_unbundle_conn()`, which is only reachable via the idle-connection path. ### How The race has three actors: the application thread, the rxrpc I/O thread (`krxrpcio/7001`), and the socket-close thread (spawned by `pthread_create` in the reproducer's server). 1. 
**Application thread** calls `sendmsg()` with `RXRPC_CHARGE_ACCEPT`, queuing a call onto `local->new_client_calls`. 2. **I/O thread** runs `rxrpc_connect_client_calls()`: it allocates a `rxrpc_bundle`, moves the call to `bundle->waiting_calls` (under `client_call_lock`), and calls `rxrpc_activate_channels()` → `rxrpc_alloc_client_connection()` → **`idr_alloc_cyclic()` adds the new connection to `conn_ids`**. 3. **Close thread** calls `close(sk)` concurrently. This wakes the I/O thread's exit path: `rxrpc_io_thread()` returns → `rxrpc_destroy_local()`: - `rxrpc_clean_up_local_conns()` drains `idle_client_conns` — **the just-allocated connection is not there yet** (it is on `bundle->conns[]`, not idle). - `rxrpc_purge_client_connections()` → `rxrpc_destroy_client_conn_ids()` → `conn_ids` is non-empty → **`BUG()`**. The missing lock identified by commit [`b1fdb0bb3b65`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b1fdb0bb3b6513f5bd26f92369fd6ac1a2422d8b) widens this window: without `client_call_lock` protecting `list_del_init(&call->wait_link)` in the `conn == NULL` abort path of `rxrpc_disconnect_client_call()`, a concurrent abort can corrupt the `new_client_calls` list. This leaves a stale call visible to the I/O thread, which then allocates a connection for it — a connection that will never be disconnected or freed before the local is destroyed. **Root cause:** The connection lifecycle has a gap: connections that are allocated (in `conn_ids`) but have not yet been moved to `idle_client_conns` are invisible to `rxrpc_clean_up_local_conns()`. The missing `client_call_lock` in the abort-before-connect path is the proximate trigger that makes this race reproducible. 
### Where **Immediate fix:** Commit [`b1fdb0bb3b65`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b1fdb0bb3b6513f5bd26f92369fd6ac1a2422d8b) (upstream: [`fc9de52de38f`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fc9de52de38f656399d2ce40f7349a6b5f86e787)) closes the missing-lock race that makes the bug reliably triggerable. It is present in `fedora/master` but **not in `6.18.13-200.fc43.x86_64`**. **Deeper fix:** `rxrpc_clean_up_local_conns()` should also sweep connections that are still bundled (active state, in `bundle->conns[]`) and not yet idle. The safest approach would be to walk `conn_ids` directly during cleanup — the same IDR that `rxrpc_destroy_client_conn_ids()` iterates — and force-unbundle any remaining connections before calling `rxrpc_purge_client_connections()`. This would close the window regardless of lock correctness. ## Bug introduction Commit [`9d35d880e0e4`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d35d880e0e4a3ab32d8c12f9e4d76198aadd42d) — *"rxrpc: Move client call connection to the I/O thread"* (David Howells, Oct 2022) — restructured connection setup so that the actual `idr_alloc_cyclic` into `conn_ids` happens on the I/O thread rather than the application thread. This made the connection-allocation window overlap with the socket-close path, introducing the race. The commit's own message notes: *"This also completes the fix for a race that exists between call connection and call disconnection"* — acknowledging that races in this area were a known concern. The missing `client_call_lock` in the abort path ([`b1fdb0bb3b65`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b1fdb0bb3b6513f5bd26f92369fd6ac1a2422d8b), which itself carries `Fixes: 9d35d880e0e4`) is a direct consequence of the same restructuring. 
## Analysis, conclusions and recommendations **Summary:** A client `rxrpc` connection is allocated on the I/O thread and registered in `conn_ids`, but the socket is closed concurrently before the connection can be moved to the idle list. `rxrpc_clean_up_local_conns()` only drains the idle list, so the active connection survives into `rxrpc_destroy_client_conn_ids()` which hits `BUG()`. **Recommendations:** 1. **Apply [`b1fdb0bb3b65`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b1fdb0bb3b6513f5bd26f92369fd6ac1a2422d8b)** to `6.18.13-200.fc43`. This is a stable-tagged fix already present in `fedora/master`; it should land in the next Fedora 43 kernel update. 2. **Upstream investigation:** The cleanup gap in `rxrpc_clean_up_local_conns()` (only idle connections) should be reported to David Howells. A robust fix would have the cleanup path iterate `conn_ids` directly and force-disconnect any non-idle connections, removing the dependency on the connection having reached the idle state before destruction. 3. **Reproducer confirmed:** Anderson Nascimento's server+client reproducer reliably triggers this. It should be forwarded to the rxrpc maintainer alongside the above analysis.