rxrpc BUG() — leaked client connection

Linux kernel crash report

Crash reported by Anderson Nascimento via LKML: CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com

Key elements

Field Value Implication
UNAME 6.18.13-200.fc43.x86_64
DISTRO Fedora
DISTRO_VERSION fc43
PROCESS krxrpcio/7001 Kernel I/O thread for the rxrpc local endpoint on UDP port 7001
HARDWARE VMware Virtual Platform/440BX Desktop Reference Platform
TAINT Not tainted
CRASH_TYPE BUG() — invalid opcode (UD2) Unconditional kernel assertion failure
CRASH_LOCATION net/rxrpc/conn_client.c:64
CONFIG_REQUIRED (unconditional — fires in all builds) Plain BUG(), no CONFIG_ guard
MSGID <CAPhRvkyZGKHRTBhV3P2PCCRxmRKGEvJQ0W5a9SMW3qwS2hp2Qw@mail.gmail.com>
MSGID_URL CAPhRvky…@mail.gmail.com
SOURCEDIR oops-workdir/linux @ tag kernel-6.18.13-0
VMLINUX oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/vmlinux
MODULES_DIR oops-workdir/fedora/files-6.18.13/usr/lib/debug/lib/modules/6.18.13-200.fc43.x86_64/kernel
INTRODUCED-BY 9d35d880e0e4 — rxrpc: Move client call connection to the I/O thread Restructured connection setup; introduced the race window
FIXED-BY b1fdb0bb3b65 — rxrpc: Fix missing locking causing hanging calls Fixes the missing client_call_lock in the disconnect-before-connect path

Kernel modules

Module Flags Backtrace Location Flag Implication
rxrpc Y oops-workdir/fedora/files-6.18.13/.../net/rxrpc/rxrpc.ko.debug
vsock_diag
fcrypt
pcbc
ip6_udp_tunnel
krb5
udp_tunnel
rfkill
(remaining modules omitted — none appear in backtrace)

Backtrace

Address Function Offset Size Context Module Source location
0x160d8 (0x16080 + 0x58) rxrpc_purge_client_connections 0x58 0xa0 Task rxrpc conn_client.c:64
0x21ab9 (0x219f0 + 0xc9) rxrpc_destroy_local 0xc9 0xe0 Task rxrpc local_object.c:451
0x1f3cd (0x1ed70 + 0x65d) rxrpc_io_thread 0x65d 0x750 Task rxrpc io_thread.c:598
(skipped: __pfx_rxrpc_io_thread)
0xffffffff813f24ec (0xffffffff813f23f0 + 0xfc) kthread 0xfc 0x240 Task vmlinux kthread.c:463
(skipped: __pfx_kthread)
0xffffffff8132ab54 (0xffffffff8132aa60 + 0xf4) ret_from_fork 0xf4 0x110 Task vmlinux process.c:158
0xffffffff812d8dca (0xffffffff812d8db0 + 0x1a) ret_from_fork_asm 0x1a 0x30 Task vmlinux entry_64.S:245

CPU registers

Register Value Note
RIP 0010:rxrpc_purge_client_connections+0x58/0xa0 [rxrpc] Crash site
RSP ffffc900159cfdd8
EFLAGS 00010246 ZF set (zero flag) — last comparison was equal
RAX 0000000000000000 Return value = 0 (IDR empty check result)
RBX ffff88810a6b4800 struct rxrpc_local *local
RDI ffff88810a6b4920 Pointer into local (offset 0x120 = conn_ids field)
CR2 00007faf20630030 User-space address — not relevant to this crash

Code bytes: … 0f 85 49 dd 01 00 <0f> 0b … The <0f> 0b sequence is the UD2 instruction — the x86 encoding of BUG(). The preceding 0f 85 (JNZ) is a conditional branch from the surrounding idr emptiness/iteration check; it was not taken here (EFLAGS shows ZF set), so execution fell through into the ud2, i.e. the BUG().

Backtrace source code

1. rxrpc_destroy_client_conn_ids (inlined) — crash site (conn_client.c:64)

addr2line resolves rxrpc_purge_client_connections+0x58 to rxrpc_destroy_client_conn_ids inlined into rxrpc_purge_client_connections at conn_client.c:145. The crash fires inside the inlined callee:

net/rxrpc/conn_client.c — kernel-6.18.13-0

 54 static void rxrpc_destroy_client_conn_ids(struct rxrpc_local *local)
 55 {
 56     struct rxrpc_connection *conn;
 57     int id;
 58
 59     if (!idr_is_empty(&local->conn_ids)) {    // ← branch taken: conn_ids is NOT empty
 60         idr_for_each_entry(&local->conn_ids, conn, id) {
 61             pr_err("AF_RXRPC: Leaked client conn %p {%d}\n",
 62                    conn, refcount_read(&conn->ref));
 63         }
 64         BUG();    // ← crash here — refcount={1}, one connection leaked
 65     }
 66
 67     idr_destroy(&local->conn_ids);
 68 }

One connection was logged: 00000000bf02a6a7 {1} — refcount=1 means one reference holder has not released it.

2. rxrpc_destroy_local — call site (local_object.c:451)

net/rxrpc/local_object.c — kernel-6.18.13-0

 420 void rxrpc_destroy_local(struct rxrpc_local *local)
 421 {
      ...
 432     rxrpc_clean_up_local_conns(local);         // ← only cleans idle_client_conns list
 433     rxrpc_service_connection_reaper(&rxnet->service_conn_reaper);
 434     ASSERT(!local->service);
      ...
 451     rxrpc_purge_client_connections(local);     // ← call site → crash
 452     page_frag_cache_drain(&local->tx_alloc);
 453 }

rxrpc_clean_up_local_conns() (called at line 432) only drains the local->idle_client_conns list. A connection that has been allocated and added to conn_ids but has not yet reached the idle state is invisible to it.

3. rxrpc_clean_up_local_conns — cleanup gap (conn_client.c:813)

net/rxrpc/conn_client.c — kernel-6.18.13-0

 813 void rxrpc_clean_up_local_conns(struct rxrpc_local *local)
 814 {
 815     struct rxrpc_connection *conn;
 816
 817     local->kill_all_client_conns = true;
 818     timer_delete_sync(&local->client_conn_reap_timer);
 819
 820     while ((conn = list_first_entry_or_null(&local->idle_client_conns,  // ← only idle!
 821                                             struct rxrpc_connection, cache_link))) {
 822         list_del_init(&conn->cache_link);
 823         atomic_dec(&conn->active);
 824         trace_rxrpc_client(conn, -1, rxrpc_client_discard);
 825         rxrpc_unbundle_conn(conn);
 826         rxrpc_put_connection(conn, rxrpc_conn_put_local_dead);
 827     }
 828 }

Connections move to idle_client_conns only in rxrpc_disconnect_client_call(), after all channels on the connection go idle. An in-flight connection never reaches this list before cleanup runs.

What-how-where analysis

What

The BUG() at net/rxrpc/conn_client.c:64 fires because local->conn_ids IDR is non-empty when the local rxrpc endpoint is destroyed. The kernel logged one leaked connection (00000000bf02a6a7, refcount=1), meaning exactly one reference holder has not released the connection before destruction.

The IDR conn_ids is the definitive registry of all live client connections on a local endpoint. Its non-emptiness at destruction time means a connection was allocated (idr_alloc_cyclic) but never freed (idr_remove via rxrpc_put_client_connection_id). That release only happens inside rxrpc_unbundle_conn(), which is only reachable via the idle-connection path.

How

The race has three actors: the application thread, the rxrpc I/O thread (krxrpcio/7001), and the socket-close thread (spawned by pthread_create in the reproducer’s server).

  1. Application thread calls sendmsg() with RXRPC_CHARGE_ACCEPT, queuing a call onto local->new_client_calls.

  2. I/O thread runs rxrpc_connect_client_calls(): it allocates an rxrpc_bundle, moves the call to bundle->waiting_calls (under client_call_lock), and calls rxrpc_activate_channels() → rxrpc_alloc_client_connection() → idr_alloc_cyclic(), which adds the new connection to conn_ids.

  3. Close thread calls close(sk) concurrently. This wakes the I/O thread’s exit path: rxrpc_io_thread() returns and calls rxrpc_destroy_local(), which runs rxrpc_clean_up_local_conns() (idle list only, so the just-allocated connection is missed) and then rxrpc_purge_client_connections(), which finds conn_ids non-empty and hits the BUG().

The missing lock identified by commit b1fdb0bb3b65 widens this window: without client_call_lock protecting list_del_init(&call->wait_link) in the conn == NULL abort path of rxrpc_disconnect_client_call(), a concurrent abort can corrupt the new_client_calls list. This leaves a stale call visible to the I/O thread, which then allocates a connection for it — a connection that will never be disconnected or freed before the local is destroyed.

Root cause: The connection lifecycle has a gap: connections that are allocated (in conn_ids) but have not yet been moved to idle_client_conns are invisible to rxrpc_clean_up_local_conns(). The missing client_call_lock in the abort-before-connect path is the proximate trigger that makes this race reproducible.

Where

Immediate fix: Commit b1fdb0bb3b65 (upstream: fc9de52de38f) closes the missing-lock race that makes the bug reliably triggerable. It is present in fedora/master but not in 6.18.13-200.fc43.x86_64.

Deeper fix: rxrpc_clean_up_local_conns() should also sweep connections that are still bundled (active state, in bundle->conns[]) and not yet idle. The safest approach would be to walk conn_ids directly during cleanup — the same IDR that rxrpc_destroy_client_conn_ids() iterates — and force-unbundle any remaining connections before calling rxrpc_purge_client_connections(). This would close the window regardless of lock correctness.

Bug introduction

Commit 9d35d880e0e4 ("rxrpc: Move client call connection to the I/O thread", David Howells, Oct 2022) restructured connection setup so that the actual idr_alloc_cyclic into conn_ids happens on the I/O thread rather than the application thread. This made the connection-allocation window overlap with the socket-close path, introducing the race.

The commit’s own message notes: “This also completes the fix for a race that exists between call connection and call disconnection” — acknowledging that races in this area were a known concern. The missing client_call_lock in the abort path (b1fdb0bb3b65, which itself carries Fixes: 9d35d880e0e4) is a direct consequence of the same restructuring.

Analysis, conclusions and recommendations

Summary: A client rxrpc connection is allocated on the I/O thread and registered in conn_ids, but the socket is closed concurrently before the connection can be moved to the idle list. rxrpc_clean_up_local_conns() only drains the idle list, so the active connection survives into rxrpc_destroy_client_conn_ids() which hits BUG().

Recommendations:

  1. Apply b1fdb0bb3b65 to 6.18.13-200.fc43. This is a stable-tagged fix already present in fedora/master; it should land in the next Fedora 43 kernel update.

  2. Upstream investigation: The cleanup gap in rxrpc_clean_up_local_conns() (only idle connections) should be reported to David Howells. A robust fix would have the cleanup path iterate conn_ids directly and force-disconnect any non-idle connections, removing the dependency on the connection having reached the idle state before destruction.

  3. Reproducer confirmed: Anderson Nascimento’s server+client reproducer reliably triggers this. It should be forwarded to the rxrpc maintainer alongside the above analysis.