Crash reported by Anderson Nascimento via LKML: CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com
| Field | Value | Implication |
|---|---|---|
| UNAME | 6.18.13-200.fc43.x86_64 | |
| DISTRO | Fedora | |
| DISTRO_VERSION | fc43 | |
| PROCESS | krxrpcio/7001 | Kernel I/O thread for the rxrpc local endpoint on UDP port 7001 |
| HARDWARE | VMware Virtual Platform/440BX Desktop Reference Platform | |
| TAINT | Not tainted | |
| CRASH_TYPE | BUG() — invalid opcode (UD2) | Unconditional kernel assertion failure |
| CRASH_LOCATION | net/rxrpc/conn_client.c:64 | |
| CONFIG_REQUIRED | (unconditional — fires in all builds) | Plain BUG(), no CONFIG_ guard |
| MSGID | `<CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com>` | |
| MSGID_URL | CAPhRvky…@mail.gmail.com | |
| SOURCEDIR | oops-workdir/linux @ tag kernel-6.18.13-0 | |
| VMLINUX | oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/vmlinux | |
| MODULES_DIR | oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/kernel | |
| INTRODUCED-BY | 9d35d880e0e4 — rxrpc: Move client call connection to the I/O thread | Restructured connection setup; introduced the race window |
| FIXED-BY | b1fdb0bb3b65 — rxrpc: Fix missing locking causing hanging calls | Fixes the missing client_call_lock in the disconnect-before-connect path |
| Module | Flags | Backtrace | Location | Flag Implication |
|---|---|---|---|---|
| rxrpc | Y | oops-workdir/fedora/files-6.18.13/.../net/rxrpc/rxrpc.ko.debug | | |
| vsock_diag | | | | |
| fcrypt | | | | |
| pcbc | | | | |
| ip6_udp_tunnel | | | | |
| krb5 | | | | |
| udp_tunnel | | | | |
| rfkill | | | | |
| (remaining modules omitted — none appear in backtrace) | | | | |
| Address | Function | Offset | Size | Context | Module | Source location |
|---|---|---|---|---|---|---|
| 0x160d8 (0x16080 + 0x58) | rxrpc_purge_client_connections | 0x58 | 0xa0 | Task | rxrpc | conn_client.c:64 |
| 0x21ab9 (0x219f0 + 0xc9) | rxrpc_destroy_local | 0xc9 | 0xe0 | Task | rxrpc | local_object.c:451 |
| 0x1f3cd (0x1ed70 + 0x65d) | rxrpc_io_thread | 0x65d | 0x750 | Task | rxrpc | io_thread.c:598 |
| (skipped: __pfx_rxrpc_io_thread) | | | | | | |
| 0xffffffff813f24ec (0xffffffff813f23f0 + 0xfc) | kthread | 0xfc | 0x240 | Task | vmlinux | kthread.c:463 |
| (skipped: __pfx_kthread) | | | | | | |
| 0xffffffff8132ab54 (0xffffffff8132aa60 + 0xf4) | ret_from_fork | 0xf4 | 0x110 | Task | vmlinux | process.c:158 |
| 0xffffffff812d8dca (0xffffffff812d8db0 + 0x1a) | ret_from_fork_asm | 0x1a | 0x30 | Task | vmlinux | entry_64.S:245 |
| Register | Value | Note |
|---|---|---|
| RIP | 0010:rxrpc_purge_client_connections+0x58/0xa0 [rxrpc] | Crash site |
| RSP | ffffc900159cfdd8 | |
| EFLAGS | 00010246 | ZF set (zero flag) — last comparison was equal |
| RAX | 0000000000000000 | Return value = 0 (IDR empty check result) |
| RBX | ffff88810a6b4800 | struct rxrpc_local *local |
| RDI | ffff88810a6b4920 | Pointer into local (offset 0x120 = conn_ids field) |
| CR2 | 00007faf20630030 | User-space address — not relevant to this crash |
**Code bytes:** `… 0f 85 49 dd 01 00 <0f> 0b …` The `<0f> 0b` sequence is the UD2 instruction, the x86 encoding of BUG(). The preceding `0f 85` (JNZ) is the `if (!idr_is_empty(...))` branch that was taken.

**rxrpc_destroy_client_conn_ids (inlined) — crash site (conn_client.c:64).** addr2line resolves `rxrpc_purge_client_connections+0x58` to `rxrpc_destroy_client_conn_ids` inlined into `rxrpc_purge_client_connections` at conn_client.c:145. The crash fires inside the inlined callee:
```c
/* net/rxrpc/conn_client.c — kernel-6.18.13-0 */
 54 static void rxrpc_destroy_client_conn_ids(struct rxrpc_local *local)
 55 {
 56         struct rxrpc_connection *conn;
 57         int id;
 58
 59         if (!idr_is_empty(&local->conn_ids)) {      /* ← branch taken: conn_ids is NOT empty */
 60                 idr_for_each_entry(&local->conn_ids, conn, id) {
 61                         pr_err("AF_RXRPC: Leaked client conn %p {%d}\n",
 62                                conn, refcount_read(&conn->ref));
 63                 }
 64                 BUG();                              /* ← crash here — refcount={1}, one connection leaked */
 65         }
 66
 67         idr_destroy(&local->conn_ids);
 68 }
```

One connection was logged: `00000000bf02a6a7 {1}` — refcount=1 means one reference holder has not released it.
**rxrpc_destroy_local — call site (local_object.c:451).**

```c
/* net/rxrpc/local_object.c — kernel-6.18.13-0 */
420 void rxrpc_destroy_local(struct rxrpc_local *local)
421 {
        ...
432         rxrpc_clean_up_local_conns(local);          /* ← only cleans idle_client_conns list */
433         rxrpc_service_connection_reaper(&rxnet->service_conn_reaper);
434         ASSERT(!local->service);
        ...
451         rxrpc_purge_client_connections(local);      /* ← call site → crash */
452         page_frag_cache_drain(&local->tx_alloc);
453 }
```

`rxrpc_clean_up_local_conns()` (called at line 432) only drains the `local->idle_client_conns` list. A connection that has been allocated and added to `conn_ids` but has not yet reached the idle state is invisible to it.
**rxrpc_clean_up_local_conns — cleanup gap (conn_client.c:813).**

```c
/* net/rxrpc/conn_client.c — kernel-6.18.13-0 */
813 void rxrpc_clean_up_local_conns(struct rxrpc_local *local)
814 {
815         struct rxrpc_connection *conn;
816
817         local->kill_all_client_conns = true;
818         timer_delete_sync(&local->client_conn_reap_timer);
819
820         while ((conn = list_first_entry_or_null(&local->idle_client_conns,  /* ← only idle! */
821                                                 struct rxrpc_connection, cache_link))) {
822                 list_del_init(&conn->cache_link);
823                 atomic_dec(&conn->active);
824                 trace_rxrpc_client(conn, -1, rxrpc_client_discard);
825                 rxrpc_unbundle_conn(conn);
826                 rxrpc_put_connection(conn, rxrpc_conn_put_local_dead);
827         }
828 }
```

Connections move to `idle_client_conns` only in `rxrpc_disconnect_client_call()`, after all channels on the connection go idle. An in-flight connection never reaches this list before cleanup runs.
The BUG() at net/rxrpc/conn_client.c:64 fires because the `local->conn_ids` IDR is non-empty when the local rxrpc endpoint is destroyed. The kernel logged one leaked connection (`00000000bf02a6a7`, refcount=1), meaning exactly one reference holder had not released the connection before destruction.

The `conn_ids` IDR is the definitive registry of all live client connections on a local endpoint. Its non-emptiness at destruction time means a connection was allocated (`idr_alloc_cyclic()`) but never freed (`idr_remove()` via `rxrpc_put_client_connection_id()`). That release only happens inside `rxrpc_unbundle_conn()`, which is only reachable via the idle-connection path.
The race has three actors: the application thread, the rxrpc I/O thread (`krxrpcio/7001`), and the socket-close thread (spawned by `pthread_create` in the reproducer's server).

1. **Application thread** calls `sendmsg()` with `RXRPC_CHARGE_ACCEPT`, queuing a call onto `local->new_client_calls`.
2. **I/O thread** runs `rxrpc_connect_client_calls()`: it allocates a `rxrpc_bundle`, moves the call to `bundle->waiting_calls` (under `client_call_lock`), and calls `rxrpc_activate_channels()` → `rxrpc_alloc_client_connection()` → `idr_alloc_cyclic()`, which adds the new connection to `conn_ids`.
3. **Close thread** calls `close(sk)` concurrently. This wakes the I/O thread's exit path: `rxrpc_io_thread()` returns → `rxrpc_destroy_local()`:
   - `rxrpc_clean_up_local_conns()` drains `idle_client_conns` — the just-allocated connection is not there yet (it is on `bundle->conns[]`, not idle).
   - `rxrpc_purge_client_connections()` → `rxrpc_destroy_client_conn_ids()` → `conn_ids` is non-empty → `BUG()`.

The missing lock identified by commit b1fdb0bb3b65 widens this window: without `client_call_lock` protecting `list_del_init(&call->wait_link)` in the `conn == NULL` abort path of `rxrpc_disconnect_client_call()`, a concurrent abort can corrupt the `new_client_calls` list. This leaves a stale call visible to the I/O thread, which then allocates a connection for it — a connection that will never be disconnected or freed before the local is destroyed.
**Root cause:** The connection lifecycle has a gap: connections that are allocated (in `conn_ids`) but have not yet been moved to `idle_client_conns` are invisible to `rxrpc_clean_up_local_conns()`. The missing `client_call_lock` in the abort-before-connect path is the proximate trigger that makes this race reproducible.

**Immediate fix:** Commit b1fdb0bb3b65 (upstream: fc9de52de38f) closes the missing-lock race that makes the bug reliably triggerable. It is present in fedora/master but not in 6.18.13-200.fc43.x86_64.

**Deeper fix:** `rxrpc_clean_up_local_conns()` should also sweep connections that are still bundled (active state, in `bundle->conns[]`) and not yet idle. The safest approach would be to walk `conn_ids` directly during cleanup — the same IDR that `rxrpc_destroy_client_conn_ids()` iterates — and force-unbundle any remaining connections before calling `rxrpc_purge_client_connections()`. This would close the window regardless of lock correctness.
Commit 9d35d880e0e4 — "rxrpc: Move client call connection to the I/O thread" (David Howells, Oct 2022) — restructured connection setup so that the actual `idr_alloc_cyclic()` into `conn_ids` happens on the I/O thread rather than the application thread. This made the connection-allocation window overlap with the socket-close path, introducing the race.

The commit's own message notes: "This also completes the fix for a race that exists between call connection and call disconnection" — acknowledging that races in this area were a known concern. The missing `client_call_lock` in the abort path (b1fdb0bb3b65, which itself carries `Fixes: 9d35d880e0e4`) is a direct consequence of the same restructuring.
**Summary:** A client rxrpc connection is allocated on the I/O thread and registered in `conn_ids`, but the socket is closed concurrently before the connection can be moved to the idle list. `rxrpc_clean_up_local_conns()` only drains the idle list, so the active connection survives into `rxrpc_destroy_client_conn_ids()`, which hits `BUG()`.
**Recommendations:**

1. **Apply b1fdb0bb3b65 to 6.18.13-200.fc43.** This is a stable-tagged fix already present in fedora/master; it should land in the next Fedora 43 kernel update.
2. **Upstream investigation:** The cleanup gap in `rxrpc_clean_up_local_conns()` (only idle connections are swept) should be reported to David Howells. A robust fix would have the cleanup path iterate `conn_ids` directly and force-disconnect any non-idle connections, removing the dependency on the connection having reached the idle state before destruction.
3. **Reproducer confirmed:** Anderson Nascimento's server+client reproducer reliably triggers this. It should be forwarded to the rxrpc maintainer alongside the above analysis.