Syzbot report via email, MSGID 69ed492c.050a0220.e51af.0005.GAE@google.com.
Subject: [syzbot] [bluetooth?] WARNING in hci_send_cmd (4).
Dashboard:
https://syzkaller.appspot.com/bug?extid=00f5a866124dc44cce14
Root cause:
hci_send_cmd() lacks an HCI_UP guard, allowing it to queue work onto hdev->workqueue after the workqueue has entered the draining/destroying state during HCI device shutdown.
| Field | Value | Implication |
|---|---|---|
| CRASH_TYPE | WARNING | |
| WARNING_MSG | workqueue: cannot queue hci_cmd_work on wq hci0 | |
| WARNING_SRC | kernel/workqueue.c:2297 | WARN_ONCE in __queue_work: wq is being destroyed/drained |
| UNAME | syzkaller #0 PREEMPT(full) | |
| DISTRO | syzbot / mainline upstream | |
| COMPILER | Debian clang version 21.1.8 | |
| HEAD_COMMIT | b4e07588e743 | |
| MSGID | <69ed492c.050a0220.e51af.0005.GAE@google.com> | |
| MSGID_URL | 69ed492c.050a0220.e51af.0005.GAE@google.com | |
| BUG_URL | https://syzkaller.appspot.com/bug?extid=00f5a866124dc44cce14 | |
| SYZBOT_OCCURRENCE | 4th occurrence | Recurring; not the first time syzbot has seen this |
| INTRODUCED-BY | c347b765fe70 | queue_work on hdev->workqueue without HCI_UP guard (2011) |
| PROCESS | kworker/0:3 (PID 1378, CPU 0) | Kernel worker thread |
| WQ_CONTEXT | Workqueue: events l2cap_info_timeout | Running in the events workqueue |
| HARDWARE | QEMU Standard PC (Q35 + ICH9, 2009) | syzbot VM |
| BIOS | 1.16.3-debian-1.16.3-2 04/01/2014 | |
| TAINT | Not tainted | No taint flags set |
| CONFIG_REQUIRED | CONFIG_BUG (unconditional when enabled; standard in all builds) | |
| VMLINUX | oops-workdir/syzbot/vmlinux-b4e07588 | |
| SOURCEDIR | oops-workdir/linux | checked out to b4e07588e743 |
| Module | Flags | Backtrace | Location | Flag Implication |
|---|---|---|---|---|
| (module list not available in this report) |
| Address | Function | Offset | Size | Context | Module | Source location |
|---|---|---|---|---|---|---|
| 0xffffffff818d5d2a (0xffffffff818d4fe0 + 0xd4a) | __queue_work | 0xd4a | 0xfc0 | Task | (built-in) | kernel/workqueue.c:2297 |
| 0xffffffff818d4f06 (0xffffffff818d4e00 + 0x106) | queue_work_on | 0x106 | 0x1d0 | Task | (built-in) | kernel/workqueue.c:2432 |
| | queue_work (inlined) | | | Task | (built-in) | include/linux/workqueue.h:696 |
| 0xffffffff8aaa3767 (0xffffffff8aaa36b0 + 0xb7) | hci_send_cmd | 0xb7 | 0x1a0 | Task | (built-in) | net/bluetooth/hci_core.c:3111 |
| | hci_conn_auth (inlined) | | | Task | (built-in) | net/bluetooth/hci_conn.c:2459 |
| 0xffffffff8aabbac9 (0xffffffff8aabb530 + 0x599) | hci_conn_security | 0x599 | 0xa80 | Task | (built-in) | net/bluetooth/hci_conn.c:2551 |
| 0xffffffff8ab70f0c (0xffffffff8ab70b50 + 0x3bc) | l2cap_conn_start | 0x3bc | 0xf20 | Task | (built-in) | net/bluetooth/l2cap_core.c:1534 |
| 0xffffffff8ab707c8 (0xffffffff8ab70760 + 0x68) | l2cap_info_timeout | 0x68 | 0xa0 | Task | (built-in) | net/bluetooth/l2cap_core.c:1685 |
| | process_one_work (inlined) | | | Task | (built-in) | kernel/workqueue.c:3302 |
| 0xffffffff818eb2ed (0xffffffff818ea790 + 0xb5d) | process_scheduled_works | 0xb5d | 0x1860 | Task | (built-in) | kernel/workqueue.c:3385 |
| 0xffffffff818f3353 (0xffffffff818f2900 + 0xa53) | worker_thread | 0xa53 | 0xfc0 | Task | (built-in) | kernel/workqueue.c:3466 |
| 0xffffffff8190ae58 (0xffffffff8190aad0 + 0x388) | kthread | 0x388 | 0x470 | Task | (built-in) | kernel/kthread.c:436 |
| 0xffffffff816bfae4 (0xffffffff816bf5d0 + 0x514) | ret_from_fork | 0x514 | 0xb70 | Task | (built-in) | arch/x86/kernel/process.c:158 |
| 0xffffffff813370aa (0xffffffff81337090 + 0x1a) | ret_from_fork_asm | 0x1a | 0x30 | Task | (built-in) | arch/x86/entry/entry_64.S:245 |
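Each address in the backtrace is written as symbol start plus offset, with the symbol size alongside. As a minimal sketch of how those three numbers relate, the following C fragment recomputes a frame's absolute address and checks that the offset lies inside the symbol. The `struct frame` type and both helper names are hypothetical, introduced only for this illustration; the numeric values come from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One resolved backtrace frame (hypothetical type for illustration). */
struct frame {
    uint64_t sym_start; /* address of the first byte of the function */
    uint64_t offset;    /* offset of the frame address into the function */
    uint64_t sym_size;  /* total size of the function in bytes */
};

/* Absolute address of the frame: symbol start plus offset. */
static uint64_t frame_addr(const struct frame *f)
{
    return f->sym_start + f->offset;
}

/* A well-formed frame has its offset strictly inside the symbol. */
static bool frame_is_consistent(const struct frame *f)
{
    return f->offset < f->sym_size;
}
```

For the top frame, `__queue_work` starts at 0xffffffff818d4fe0 and the offset 0xd4a is below the size 0xfc0, so the frame resolves to 0xffffffff818d5d2a, the address reported in RIP.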
RIP: 0010:__queue_work+0xd4a/0xfc0 kernel/workqueue.c:2296
RSP: 0018:ffffc9000257f720 EFLAGS: 00010082
RAX: 1ffff110081cc181 [KASAN shadow address]
RBX: 0000000000000008 [work flags / small integer]
RCX: ffff888000260000
RDX: ffff888040182170 [→ wq->name (R15), points to "hci0" string after offset add]
RSI: ffffffff8aa9ccd0 [= hci_cmd_work — work->func loaded from *R13]
RDI: ffffffff90368d70 [= __start___bug_table+0xd530 (R14), BUG table entry for WARN]
RBP: 0000000000000020
R08: ffff888040e60bf7
R09: 1ffff110081cc17e [KASAN shadow address]
R10: dffffc0000000000 [KASAN shadow offset]
R11: ffffed10081cc17f [KASAN address]
R12: dffffc0000000000 [KASAN shadow offset]
R13: ffff888040e60c08 [→ work struct; *R13 = work->func = hci_cmd_work]
R14: ffffffff90368d70 [= __start___bug_table+0xd530, BUG table entry]
R15: ffff888040182170 [→ wq->name after add $0x170 = "hci0"]
FS: 0000000000000000(0000) GS:ffff88808c80c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000005dc0
CR3: 0000000012a31000 CR4: 0000000000352ef0
Code bytes at crash:
Code: 83 c5 18 4c 89 e8 48 c1 e8 03 42 80 3c 20 00 74 08 4c 89 ef e8 17 4d a5 00 49 8b 75 00
49 81 c7 70 01 00 00 4c 89 f7 4c 89 fa <67> 48 0f b9 3a ← trapping instruction (ud1)
48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc
Disassembly window from vmlinux around crash site
(__queue_work+0xd4a):
ffffffff818d5cec: mov 0x18(%rsp),%r15
ffffffff818d5cf1: jmp ffffffff818d5cf8 <__queue_work+0xd18>
ffffffff818d5cf3: call ffffffff81c5e0f0 <__sanitizer_cov_trace_pc>
ffffffff818d5cf8: lea 0xea93071(%rip),%r14 # ffffffff90368d70 <__start___bug_table+0xd530>
ffffffff818d5cff: add $0x18,%r13
ffffffff818d5d03: mov %r13,%rax
ffffffff818d5d06: shr $0x3,%rax
ffffffff818d5d0a: cmpb $0x0,(%rax,%r12,1) ; KASAN shadow check
ffffffff818d5d0f: je ffffffff818d5d19 <__queue_work+0xd39>
ffffffff818d5d11: mov %r13,%rdi
ffffffff818d5d14: call ffffffff8232aa30 <__asan_report_load8_noabort>
ffffffff818d5d19: mov 0x0(%r13),%rsi ; RSI = work->func
ffffffff818d5d1d: add $0x170,%r15 ; R15 = wq->name
ffffffff818d5d24: mov %r14,%rdi ; RDI = BUG table entry
ffffffff818d5d27: mov %r15,%rdx ; RDX = wq->name
ffffffff818d5d2a: call ffffffff8bbd4e98 <__SCT__WARN_trap> <<<< crash
ffffffff818d5d2f: add $0x58,%rsp
ffffffff818d5d33: pop %rbx
ffffffff818d5d34: pop %r12
ffffffff818d5d36: pop %r13
ffffffff818d5d38: pop %r14
Note: the code bytes show ud1 (%edx),%rdi (67 48 0f b9 3a)
at the crash address, while the vmlinux disassembly shows
call __SCT__WARN_trap at the same address. Both encodings are
5 bytes long, which is consistent with the static-call site
having been patched at runtime into the ud1 that implements the WARN trap.
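The `shr $0x3,%rax` / `cmpb $0x0,(%rax,%r12,1)` pair in the disassembly is the KASAN shadow lookup for the pointer in R13. A minimal sketch of that arithmetic follows, assuming the shadow offset is the value seen in R10/R12 and that one shadow byte covers 8 bytes of memory (the shift by 3). The two function names are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Shadow offset as it appears in R10 and R12 of the register dump. */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL
/* One shadow byte describes 8 bytes of memory, hence the shift by 3. */
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Index into the shadow region: the value left in RAX after the shr. */
static uint64_t kasan_shadow_index(uint64_t addr)
{
    return addr >> KASAN_SHADOW_SCALE_SHIFT;
}

/* Address of the shadow byte that cmpb reads via (%rax,%r12,1). */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
    return kasan_shadow_index(addr) + KASAN_SHADOW_OFFSET;
}
```

Applied to R13 = 0xffff888040e60c08, the index is 0x1ffff110081cc181, which matches RAX in the dump, and the shadow byte sits at 0xffffed10081cc181.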
__queue_work: crash site (kernel/workqueue.c:2297)

kernel/workqueue.c @ b4e07588e743
2275 static void __queue_work(int cpu, struct workqueue_struct *wq,
2276 struct work_struct *work)
2277 {
2278 struct pool_workqueue *pwq;
2279 struct worker_pool *last_pool, *pool;
2280 unsigned int work_flags;
2281 unsigned int req_cpu = cpu;
2282
2283 /*
2284 * While a work item is PENDING && off queue, a task trying to
2285 * steal the PENDING will busy-loop waiting for it to either get
2286 * queued or lose PENDING. Grabbing PENDING and queueing should
2287 * happen with IRQ disabled.
2288 */
2289 lockdep_assert_irqs_disabled();
2290
2291 /*
2292 * For a draining wq, only works from the same workqueue are
2293 * allowed. The __WQ_DESTROYING helps to spot the issue that
2294 * queues a new work item to a wq after destroy_workqueue(wq).
2295 */
2296 if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
2297 → WARN_ONCE(!is_chained_work(wq), "workqueue: cannot queue %ps on wq %s\n",
2298 work->func, wq->name))) {
2299 return;
2300 }

queue_work_on (kernel/workqueue.c:2432)

kernel/workqueue.c @ b4e07588e743
2422 bool queue_work_on(int cpu, struct workqueue_struct *wq,
2423 struct work_struct *work)
2424 {
2425 bool ret = false;
2426 unsigned long irq_flags;
2427
2428 local_irq_save(irq_flags);
2429
2430 if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
2431 !clear_pending_if_disabled(work)) {
2432 → __queue_work(cpu, wq, work);
2433 ret = true;
2434 }
2435
2436 local_irq_restore(irq_flags);
2437 return ret;
2438 }

hci_send_cmd (net/bluetooth/hci_core.c:3111)

net/bluetooth/hci_core.c @ b4e07588e743
3092 int hci_send_cmd(struct hci_dev *hdev, __u16 opcode, __u32 plen,
3093 const void *param)
3094 {
3095 struct sk_buff *skb;
3096
3097 BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen);
3098
3099 skb = hci_cmd_sync_alloc(hdev, opcode, plen, param, NULL);
3100 if (!skb) {
3101 bt_dev_err(hdev, "no memory for command");
3102 return -ENOMEM;
3103 }
3104
3105 /* Stand-alone HCI commands must be flagged as
3106 * single-command requests.
3107 */
3108 bt_cb(skb)->hci.req_flags |= HCI_REQ_START;
3109
3110 skb_queue_tail(&hdev->cmd_q, skb);
3111 → queue_work(hdev->workqueue, &hdev->cmd_work);
3112
3113 return 0;
3114 }

The inlined caller hci_conn_auth (net/bluetooth/hci_conn.c:2459) calls hci_send_cmd with HCI_OP_AUTH_REQUESTED:

net/bluetooth/hci_conn.c @ b4e07588e743
2438 static int hci_conn_auth(struct hci_conn *conn, __u8 sec_level, __u8 auth_type)
2439 {
2440 ...
2455 if (!test_and_set_bit(HCI_CONN_AUTH_PEND, &conn->flags)) {
2456 struct hci_cp_auth_requested cp;
2457
2458 cp.handle = cpu_to_le16(conn->handle);
2459 → hci_send_cmd(conn->hdev, HCI_OP_AUTH_REQUESTED,
2460 sizeof(cp), &cp);
2461 ...
2462 }
2463 return 0;
2464 }

hci_conn_security (net/bluetooth/hci_conn.c:2551)

net/bluetooth/hci_conn.c @ b4e07588e743
2487 int hci_conn_security(struct hci_conn *conn, __u8 sec_level, __u8 auth_type,
2488 bool initiator)
2489 {
2490 ...
2544 auth:
2545 if (test_bit(HCI_CONN_ENCRYPT_PEND, &conn->flags))
2546 return 0;
2547
2548 if (initiator)
2549 set_bit(HCI_CONN_AUTH_INITIATOR, &conn->flags);
2550
2551 → if (!hci_conn_auth(conn, sec_level, auth_type))
2552 return 0;
2553
2554 encrypt:
2555 if (test_bit(HCI_CONN_ENCRYPT, &conn->flags)) {
2556 if (!conn->enc_key_size)
2557 return 0;
2558 return 1;
2559 }
2560
2561 hci_conn_encrypt(conn);
2562 return 0;
2563 }

hci_send_cmd() at net/bluetooth/hci_core.c:3111 calls
queue_work(hdev->workqueue, &hdev->cmd_work)
after hdev->workqueue has already entered the draining
or destroying state during HCI device shutdown. The workqueue core's
WARN_ONCE in __queue_work() at
kernel/workqueue.c:2296–2298 detects this and fires,
printing:

workqueue: cannot queue hci_cmd_work on wq hci0

The crashing instruction is a ud1 emitted by the WARN
trap mechanism. The function being queued is confirmed by
RSI = 0xffffffff8aa9ccd0 = hci_cmd_work, loaded from the work
struct pointer in R13. R15 (after 0x170 is added just before the WARN)
points to wq->name = "hci0", and the same pointer is in RDX at
crash time.

hci_send_cmd does not check the HCI_UP flag in hdev->flags before
calling queue_work. Other entry points into the HCI device
(e.g. hci_recv_frame) guard with
!test_bit(HCI_UP, &hdev->flags); this function is
missing that guard.
Q1: How does queue_work(hdev->workqueue, …) get called on a
draining/destroying workqueue?

A1: hci_send_cmd() calls
queue_work(hdev->workqueue, &hdev->cmd_work)
unconditionally, without checking whether the device is still up. No
caller in the
l2cap_info_timeout → l2cap_conn_start → hci_conn_security → hci_conn_auth → hci_send_cmd
chain checks this either.

Q2: How does hdev->workqueue end up draining while hci_send_cmd can
still be reached?

A2: Race condition in hci_dev_close_sync()
(net/bluetooth/hci_sync.c):

1. test_and_clear_bit(HCI_UP, &hdev->flags) clears HCI_UP.
2. drain_workqueue(hdev->workqueue) is called, which sets
   __WQ_DRAINING on hdev->workqueue.
3. hci_conn_hash_flush(hdev) → l2cap_conn_del() →
   disable_delayed_work_sync(&conn->info_timer). This would cancel
   l2cap_info_timeout, but it runs after step 2.

l2cap_info_timeout runs on the events workqueue, which is entirely
separate from hdev->workqueue. Draining hdev->workqueue does not
affect the events workqueue. If l2cap_info_timeout was already
scheduled on the events workqueue before step 3 cancels it, the
callback fires in the window between steps 2 and 3:

- l2cap_info_timeout → l2cap_conn_start → hci_conn_security →
  hci_conn_auth → hci_send_cmd
- hci_send_cmd calls queue_work(hdev->workqueue, &hdev->cmd_work)
- hdev->workqueue is in __WQ_DRAINING state → WARN fires

The same race could also occur at
destroy_workqueue(hdev->workqueue) (lines 2681, 2750) if
the device is unregistered while l2cap_info_timeout is in
flight, giving __WQ_DESTROYING instead of
__WQ_DRAINING.
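The ordering described in A2 can be replayed deterministically in user space. The sketch below is a single-threaded toy model, not kernel code: every `toy_` name and `TOY_` flag is hypothetical, the `is_chained_work()` exception is left out, and the timer callback is simply invoked at the point where the race would let it run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for __WQ_DRAINING and __WQ_DESTROYING. */
enum { TOY_WQ_DRAINING = 1 << 0, TOY_WQ_DESTROYING = 1 << 1 };

struct toy_wq {
    unsigned int flags;
    int queued;     /* work items accepted */
    int rejected;   /* work items refused; the kernel would WARN here */
};

struct toy_hdev {
    bool up;                /* models the HCI_UP bit */
    bool info_timer_armed;  /* models a pending l2cap_info_timeout */
    struct toy_wq wq;       /* models hdev->workqueue */
};

/* Simplified form of the __queue_work() check on a draining/destroying wq. */
static bool toy_queue_work(struct toy_wq *wq)
{
    if (wq->flags & (TOY_WQ_DRAINING | TOY_WQ_DESTROYING)) {
        wq->rejected++;
        return false;
    }
    wq->queued++;
    return true;
}

/* Unguarded send path: queues without looking at hdev->up. */
static void toy_send_cmd(struct toy_hdev *hdev)
{
    toy_queue_work(&hdev->wq);
}

/* Timer callback; it runs on a different workqueue than hdev->wq. */
static void toy_info_timeout(struct toy_hdev *hdev)
{
    if (hdev->info_timer_armed)
        toy_send_cmd(hdev);
}

/* Replays steps 1 to 3; returns how many queue attempts were refused. */
static int toy_close_replay(bool fire_in_window)
{
    struct toy_hdev hdev = { .up = true, .info_timer_armed = true };

    hdev.up = false;                    /* step 1: clear HCI_UP */
    hdev.wq.flags |= TOY_WQ_DRAINING;   /* step 2: drain hdev->workqueue */
    if (fire_in_window)
        toy_info_timeout(&hdev);        /* callback lands between 2 and 3 */
    hdev.info_timer_armed = false;      /* step 3: timer is cancelled */
    toy_info_timeout(&hdev);            /* after cancel: nothing is sent */
    return hdev.wq.rejected;
}
```

In this model `toy_close_replay(true)` reports one refused queue attempt, the analogue of the WARN, while `toy_close_replay(false)` reports none, since the refusal depends only on whether the callback lands in the window.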
Fix location: hci_send_cmd() in
net/bluetooth/hci_core.c.
Add an HCI_UP guard at the top of
hci_send_cmd(), matching the pattern already used in
hci_recv_frame() (line 2922) and the
hci_dev_ioctl() handler (line 596). Because
HCI_UP is cleared before
drain_workqueue() in hci_dev_close_sync(),
this guard catches the race precisely.
No resource leaks are introduced: the guard fires before any allocation. No lock imbalance: no locks are held at the guard point. No side effects on callers: all call sites that handle the return value already tolerate negative errno codes.
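The effect of placing the guard first can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel function: `struct toy_dev` and `toy_send_cmd_guarded()` are hypothetical, and two counters stand in for hci_cmd_sync_alloc() and queue_work() so that the "nothing happens when the device is down" property can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_dev {
    bool up;          /* stands in for test_bit(HCI_UP, &hdev->flags) */
    int allocations;  /* counts command buffers that were allocated */
    int queue_calls;  /* counts attempts to queue the command work */
};

/* Guard first: when the device is down, return before doing any work. */
static int toy_send_cmd_guarded(struct toy_dev *dev)
{
    if (!dev->up)
        return -ENETDOWN;

    dev->allocations++;   /* stands in for hci_cmd_sync_alloc() */
    dev->queue_calls++;   /* stands in for queue_work() */
    return 0;
}
```

With the device down the call returns -ENETDOWN and both counters stay at zero, so there is nothing to free and nothing is queued. With the device up the behaviour is unchanged.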
Proposed fix
(PATCH_BASE: b4e07588e743):
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -3092,6 +3092,9 @@ int hci_send_cmd(struct hci_dev *hdev, __u16 opcode, __u32 plen,
BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen);
+ if (!test_bit(HCI_UP, &hdev->flags))
+ return -ENETDOWN;
+
skb = hci_cmd_sync_alloc(hdev, opcode, plen, param, NULL);
if (!skb) {
bt_dev_err(hdev, "no memory for command");

Diff review:
- No resource leak: returned before any allocation.
- No lock imbalance: no locks held at this point.
- No NULL/uninitialized dereference: hdev is always non-NULL at this point.
- No error-path coverage gap: hci_cmd_sync_alloc failure is still handled.
- Caller side effects: hci_conn_auth ignores the return value of hci_send_cmd; returning -ENETDOWN is safe. All other callers that propagate the error already handle negative errno.
- b4e07588e743 (exact; PATCH_BASE matched HEAD)
- git apply --check passed cleanly (exact, no fuzz)
- patch-email.txt: LKML-ready mbox patch email
- git-send-email.sh: ready-to-run send script

queue_work(hdev->workqueue, &hdev->cmd_work)
at hci_core.c:3111 has been present without an
HCI_UP guard since at least commit c347b765fe70
(“Bluetooth: use module workqueue”, Gustavo Padovan, 2011-12-14), which
introduced queue_work calls against
hdev->workqueue. The hci_send_cmd function
itself dates to the original 2.6 import (1da177e4c3f4).

This is a long-standing gap: the HCI_UP guard was added
to many paths but not to hci_send_cmd. The crash was first
surfaced by syzbot fuzzing L2CAP-over-Bluetooth teardown scenarios.

No upstream fix commit was identified within the search budget that
specifically adds the HCI_UP check to
hci_send_cmd.
| Field | Value | Implication |
|---|---|---|
| INTRODUCED-BY | c347b765fe70 | queue_work on hdev->workqueue added without HCI_UP guard |

drain_workqueue(hdev->workqueue) in hci_dev_close_sync().

(Detailed factcheck log: see factcheck.md in the same directory)
Verdict: PASS
All checklist items verified. No serious issues found. The patch
correctly guards hci_send_cmd() against device shutdown by
checking the HCI_UP flag before any allocation or queueing
operation occurs. The guard precedes the first risky operation (line
3099: hci_cmd_sync_alloc()), preventing resource leaks. No
lock imbalance or NULL dereference issues introduced. Error handling is
correct—callers either ignore the return value or already tolerate
negative errno codes; adding -ENETDOWN to the possible
return set breaks no existing contracts. The fix matches the identical
pattern already used in hci_recv_frame() (line 2922) and
hci_dev_ioctl() (line 596). The guard placement is optimal
because HCI_UP is cleared before
drain_workqueue() in hci_dev_close_sync(),
catching the race condition precisely at the vulnerability point.