Linux kernel crash report

Syzbot report via email, MSGID 69ed492c.050a0220.e51af.0005.GAE@google.com. Subject: [syzbot] [bluetooth?] WARNING in hci_send_cmd (4). Dashboard: https://syzkaller.appspot.com/bug?extid=00f5a866124dc44cce14

Root cause: hci_send_cmd() lacks an HCI_UP guard, allowing it to queue work onto hdev->workqueue after the workqueue has entered the draining/destroying state during HCI device shutdown.

Key elements

Field | Value | Implication
CRASH_TYPE | WARNING |
WARNING_MSG | workqueue: cannot queue hci_cmd_work on wq hci0 |
WARNING_SRC | kernel/workqueue.c:2297 | WARN_ONCE in __queue_work: wq is being destroyed/drained
UNAME | syzkaller #0 PREEMPT(full) |
DISTRO | syzbot / mainline upstream |
COMPILER | Debian clang version 21.1.8 |
HEAD_COMMIT | b4e07588e743 |
MSGID | <69ed492c.050a0220.e51af.0005.GAE@google.com> |
MSGID_URL | 69ed492c.050a0220.e51af.0005.GAE@google.com |
BUG_URL | https://syzkaller.appspot.com/bug?extid=00f5a866124dc44cce14 |
SYZBOT_OCCURRENCE | 4th occurrence | Recurring: not the first time syzbot has seen this
INTRODUCED-BY | c347b765fe70 | queue_work on hdev->workqueue without HCI_UP guard (2011)
PROCESS | kworker/0:3 (PID 1378, CPU 0) | Kernel worker thread
WQ_CONTEXT | Workqueue: events l2cap_info_timeout | Running in the events workqueue
HARDWARE | QEMU Standard PC (Q35 + ICH9, 2009) | syzbot VM
BIOS | 1.16.3-debian-1.16.3-2 04/01/2014 |
TAINT | Not tainted | No taint flags set
CONFIG_REQUIRED | CONFIG_BUG | Unconditional when enabled; standard in all builds
VMLINUX | oops-workdir/syzbot/vmlinux-b4e07588 |
SOURCEDIR | oops-workdir/linux | Checked out to b4e07588e743

Kernel modules

Module Flags Backtrace Location Flag Implication
(module list not available in this report)

Backtrace

Address | Function | Offset | Size | Context | Module | Source location
0xffffffff818d5d2a (0xffffffff818d4fe0 + 0xd4a) | __queue_work | 0xd4a | 0xfc0 | Task | (built-in) | kernel/workqueue.c:2297
0xffffffff818d4f06 (0xffffffff818d4e00 + 0x106) | queue_work_on | 0x106 | 0x1d0 | Task | (built-in) | kernel/workqueue.c:2432
- | queue_work (inlined) | - | - | Task | (built-in) | include/linux/workqueue.h:696
0xffffffff8aaa3767 (0xffffffff8aaa36b0 + 0xb7) | hci_send_cmd | 0xb7 | 0x1a0 | Task | (built-in) | net/bluetooth/hci_core.c:3111
- | hci_conn_auth (inlined) | - | - | Task | (built-in) | net/bluetooth/hci_conn.c:2459
0xffffffff8aabbac9 (0xffffffff8aabb530 + 0x599) | hci_conn_security | 0x599 | 0xa80 | Task | (built-in) | net/bluetooth/hci_conn.c:2551
0xffffffff8ab70f0c (0xffffffff8ab70b50 + 0x3bc) | l2cap_conn_start | 0x3bc | 0xf20 | Task | (built-in) | net/bluetooth/l2cap_core.c:1534
0xffffffff8ab707c8 (0xffffffff8ab70760 + 0x68) | l2cap_info_timeout | 0x68 | 0xa0 | Task | (built-in) | net/bluetooth/l2cap_core.c:1685
- | process_one_work (inlined) | - | - | Task | (built-in) | kernel/workqueue.c:3302
0xffffffff818eb2ed (0xffffffff818ea790 + 0xb5d) | process_scheduled_works | 0xb5d | 0x1860 | Task | (built-in) | kernel/workqueue.c:3385
0xffffffff818f3353 (0xffffffff818f2900 + 0xa53) | worker_thread | 0xa53 | 0xfc0 | Task | (built-in) | kernel/workqueue.c:3466
0xffffffff8190ae58 (0xffffffff8190aad0 + 0x388) | kthread | 0x388 | 0x470 | Task | (built-in) | kernel/kthread.c:436
0xffffffff816bfae4 (0xffffffff816bf5d0 + 0x514) | ret_from_fork | 0x514 | 0xb70 | Task | (built-in) | arch/x86/kernel/process.c:158
0xffffffff813370aa (0xffffffff81337090 + 0x1a) | ret_from_fork_asm | 0x1a | 0x30 | Task | (built-in) | arch/x86/entry/entry_64.S:245
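
Each Address entry above is the function's start address plus the reported offset, and the offset must be smaller than the function size. The following is a minimal sketch of that consistency check for four of the frames; the struct and function names are illustrative, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One backtrace row (illustrative type, not a kernel structure). */
struct frame {
    const char *func;
    uint64_t addr;   /* address reported in the trace */
    uint64_t base;   /* start address of the function */
    uint64_t offset; /* offset within the function */
    uint64_t size;   /* total size of the function */
};

/* Values copied from the backtrace rows above. */
static const struct frame frames[] = {
    { "__queue_work",       0xffffffff818d5d2aULL, 0xffffffff818d4fe0ULL, 0xd4a, 0xfc0 },
    { "queue_work_on",      0xffffffff818d4f06ULL, 0xffffffff818d4e00ULL, 0x106, 0x1d0 },
    { "hci_send_cmd",       0xffffffff8aaa3767ULL, 0xffffffff8aaa36b0ULL, 0xb7,  0x1a0 },
    { "l2cap_info_timeout", 0xffffffff8ab707c8ULL, 0xffffffff8ab70760ULL, 0x68,  0xa0  },
};

/* A row is consistent when addr == base + offset and offset < size. */
bool frame_is_consistent(const struct frame *f)
{
    assert(f != NULL);
    return f->addr == f->base + f->offset && f->offset < f->size;
}

/* Number of rows in frames[] that pass the check. */
size_t count_consistent_frames(void)
{
    size_t n = 0;

    for (size_t i = 0; i < sizeof(frames) / sizeof(frames[0]); i++)
        if (frame_is_consistent(&frames[i]))
            n++;
    return n;
}
```

A row that fails this check would suggest the trace was symbolized against a vmlinux that does not match the running kernel.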

CPU Registers

RIP: 0010:__queue_work+0xd4a/0xfc0 kernel/workqueue.c:2296
RSP: 0018:ffffc9000257f720 EFLAGS: 00010082
RAX: 1ffff110081cc181  [= R13 >> 3; KASAN shadow index, before the shadow offset in R12 is added]
RBX: 0000000000000008  [work flags / small integer]
RCX: ffff888000260000
RDX: ffff888040182170  [→ wq->name (R15), points to "hci0" string after offset add]
RSI: ffffffff8aa9ccd0  [= hci_cmd_work — work->func loaded from *R13]
RDI: ffffffff90368d70  [= __start___bug_table+0xd530 (R14), BUG table entry for WARN]
RBP: 0000000000000020
R08: ffff888040e60bf7
R09: 1ffff110081cc17e  [= R08 >> 3; KASAN shadow index]
R10: dffffc0000000000  [KASAN shadow offset]
R11: ffffed10081cc17f  [KASAN shadow address]
R12: dffffc0000000000  [KASAN shadow offset]
R13: ffff888040e60c08  [= &work->func (work struct + 0x18, after add $0x18); *R13 = hci_cmd_work]
R14: ffffffff90368d70  [= __start___bug_table+0xd530, BUG table entry]
R15: ffff888040182170  [→ wq->name after add $0x170 = "hci0"]
FS:  0000000000000000(0000)  GS:ffff88808c80c000(0000)  knlGS:0000000000000000
CS:  0010  DS: 0000  ES: 0000  CR0: 0000000080050033
CR2: 0000200000005dc0
CR3: 0000000012a31000  CR4: 0000000000352ef0
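
The KASAN-related register values can be cross-checked with a little arithmetic. The sketch below assumes generic KASAN on x86-64, where 8 bytes of memory map to 1 shadow byte, and takes the shadow offset from the value seen in R10/R12. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Shadow offset as it appears in R10/R12 of the register dump. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL
/* Generic KASAN: 8 bytes of memory per shadow byte (assumption). */
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Index computed by the compiled check: the address divided by 8. */
uint64_t kasan_shadow_index(uint64_t addr)
{
    return addr >> KASAN_SHADOW_SCALE_SHIFT;
}

/* Address of the shadow byte that describes 'addr'. */
uint64_t kasan_shadow_addr(uint64_t addr)
{
    return kasan_shadow_index(addr) + KASAN_SHADOW_OFFSET;
}

/* Cross-check against the values in the dump. */
void check_dump_values(void)
{
    const uint64_t r13 = 0xffff888040e60c08ULL; /* pointer checked before the load */
    const uint64_t r08 = 0xffff888040e60bf7ULL;

    assert(kasan_shadow_index(r13) == 0x1ffff110081cc181ULL); /* matches RAX */
    assert(kasan_shadow_index(r08) == 0x1ffff110081cc17eULL); /* matches R09 */
    assert(kasan_shadow_addr(r13) == 0xffffed10081cc181ULL);
}
```

RAX therefore holds the shifted address only; the shadow byte itself is read through the (%rax,%r12,1) operand, which adds the offset.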

Code bytes at crash:

Code: 83 c5 18 4c 89 e8 48 c1 e8 03 42 80 3c 20 00 74 08 4c 89 ef e8 17 4d a5 00 49 8b 75 00
      49 81 c7 70 01 00 00 4c 89 f7 4c 89 fa <67> 48 0f b9 3a  ← trapping instruction (ud1)
      48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc

Disassembly window from vmlinux around crash site (__queue_work+0xd4a):

ffffffff818d5cec:  mov    0x18(%rsp),%r15
ffffffff818d5cf1:  jmp    ffffffff818d5cf8 <__queue_work+0xd18>
ffffffff818d5cf3:  call   ffffffff81c5e0f0 <__sanitizer_cov_trace_pc>
ffffffff818d5cf8:  lea    0xea93071(%rip),%r14   # ffffffff90368d70 <__start___bug_table+0xd530>
ffffffff818d5cff:  add    $0x18,%r13
ffffffff818d5d03:  mov    %r13,%rax
ffffffff818d5d06:  shr    $0x3,%rax
ffffffff818d5d0a:  cmpb   $0x0,(%rax,%r12,1)    ; KASAN shadow check
ffffffff818d5d0f:  je     ffffffff818d5d19 <__queue_work+0xd39>
ffffffff818d5d11:  mov    %r13,%rdi
ffffffff818d5d14:  call   ffffffff8232aa30 <__asan_report_load8_noabort>
ffffffff818d5d19:  mov    0x0(%r13),%rsi         ; RSI = work->func
ffffffff818d5d1d:  add    $0x170,%r15            ; R15 = wq->name
ffffffff818d5d24:  mov    %r14,%rdi              ; RDI = BUG table entry
ffffffff818d5d27:  mov    %r15,%rdx              ; RDX = wq->name
ffffffff818d5d2a:  call   ffffffff8bbd4e98 <__SCT__WARN_trap>   <<<< crash
ffffffff818d5d2f:  add    $0x58,%rsp
ffffffff818d5d33:  pop    %rbx
ffffffff818d5d34:  pop    %r12
ffffffff818d5d36:  pop    %r13
ffffffff818d5d38:  pop    %r14

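Note: The Code: bytes show ud1 (%edx),%rdi (67 48 0f b9 3a) at the crash address, while the vmlinux disassembly shows call __SCT__WARN_trap at the same address. Both encodings are 5 bytes long. The vmlinux image holds the build-time static-call site, whereas the oops shows the bytes the running kernel executed. This is consistent with the static-call site having been patched in place at runtime to the ud1 instruction that implements the WARN trap.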
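
The ud1 operands can be recovered from the last of the five bytes, the ModRM byte 0x3a. The sketch below decodes it using standard x86-64 encoding rules; the type and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 ModRM byte. */
struct modrm {
    uint8_t mod; /* bits 7..6: addressing mode */
    uint8_t reg; /* bits 5..3: register operand */
    uint8_t rm;  /* bits 2..0: register/memory operand */
};

struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m = {
        .mod = (uint8_t)(byte >> 6),
        .reg = (uint8_t)((byte >> 3) & 0x7),
        .rm  = (uint8_t)(byte & 0x7),
    };
    return m;
}

/* The five bytes marked <67> 48 0f b9 3a in the Code: line. */
static const uint8_t trap_insn[5] = { 0x67, 0x48, 0x0f, 0xb9, 0x3a };

void check_trap_encoding(void)
{
    struct modrm m = decode_modrm(trap_insn[4]);

    assert(trap_insn[0] == 0x67);                         /* address-size prefix */
    assert(trap_insn[1] == 0x48);                         /* REX.W */
    assert(trap_insn[2] == 0x0f && trap_insn[3] == 0xb9); /* UD1 opcode */
    assert(m.mod == 0); /* memory operand without displacement */
    assert(m.reg == 7); /* register number 7 is RDI */
    assert(m.rm == 2);  /* register number 2 is RDX, 32-bit EDX under the 0x67 prefix */
}
```

With mod = 0, reg = 7 and rm = 2, the byte decodes to the operand pair (%edx),%rdi. RDI holds the BUG table entry and RDX holds wq->name, as annotated in the register dump.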

Backtrace source code

1. __queue_work — crash site (kernel/workqueue.c:2297)

kernel/workqueue.c @ b4e07588e743

2275  static void __queue_work(int cpu, struct workqueue_struct *wq,
2276                           struct work_struct *work)
2277  {
2278      struct pool_workqueue *pwq;
2279      struct worker_pool *last_pool, *pool;
2280      unsigned int work_flags;
2281      unsigned int req_cpu = cpu;
2282  
2283      /*
2284       * While a work item is PENDING && off queue, a task trying to
2285       * steal the PENDING will busy-loop waiting for it to either get
2286       * queued or lose PENDING.  Grabbing PENDING and queueing should
2287       * happen with IRQ disabled.
2288       */
2289      lockdep_assert_irqs_disabled();
2290  
2291      /*
2292       * For a draining wq, only works from the same workqueue are
2293       * allowed. The __WQ_DESTROYING helps to spot the issue that
2294       * queues a new work item to a wq after destroy_workqueue(wq).
2295       */
2296      if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
2297  →              WARN_ONCE(!is_chained_work(wq), "workqueue: cannot queue %ps on wq %s\n",
2298                             work->func, wq->name))) {
2299          return;
2300      }

2. queue_work_on (kernel/workqueue.c:2432)

kernel/workqueue.c @ b4e07588e743

2422  bool queue_work_on(int cpu, struct workqueue_struct *wq,
2423                     struct work_struct *work)
2424  {
2425      bool ret = false;
2426      unsigned long irq_flags;
2427  
2428      local_irq_save(irq_flags);
2429  
2430      if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
2431          !clear_pending_if_disabled(work)) {
2432  →          __queue_work(cpu, wq, work);
2433          ret = true;
2434      }
2435  
2436      local_irq_restore(irq_flags);
2437      return ret;
2438  }

3. hci_send_cmd (net/bluetooth/hci_core.c:3111)

net/bluetooth/hci_core.c @ b4e07588e743

3092  int hci_send_cmd(struct hci_dev *hdev, __u16 opcode, __u32 plen,
3093                   const void *param)
3094  {
3095      struct sk_buff *skb;
3096  
3097      BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen);
3098  
3099      skb = hci_cmd_sync_alloc(hdev, opcode, plen, param, NULL);
3100      if (!skb) {
3101          bt_dev_err(hdev, "no memory for command");
3102          return -ENOMEM;
3103      }
3104  
3105      /* Stand-alone HCI commands must be flagged as
3106       * single-command requests.
3107       */
3108      bt_cb(skb)->hci.req_flags |= HCI_REQ_START;
3109  
3110      skb_queue_tail(&hdev->cmd_q, skb);
3111  →  queue_work(hdev->workqueue, &hdev->cmd_work);
3112  
3113      return 0;
3114  }

The inlined caller hci_conn_auth (net/bluetooth/hci_conn.c:2459) calls hci_send_cmd with HCI_OP_AUTH_REQUESTED:

net/bluetooth/hci_conn.c @ b4e07588e743

2438  static int hci_conn_auth(struct hci_conn *conn, __u8 sec_level, __u8 auth_type)
2439  {
2440      ...
2455      if (!test_and_set_bit(HCI_CONN_AUTH_PEND, &conn->flags)) {
2456          struct hci_cp_auth_requested cp;
2457  
2458          cp.handle = cpu_to_le16(conn->handle);
2459  →      hci_send_cmd(conn->hdev, HCI_OP_AUTH_REQUESTED,
2460                       sizeof(cp), &cp);
2461          ...
2462      }
2463      return 0;
2464  }

4. hci_conn_security (net/bluetooth/hci_conn.c:2551)

net/bluetooth/hci_conn.c @ b4e07588e743

2487  int hci_conn_security(struct hci_conn *conn, __u8 sec_level, __u8 auth_type,
2488                        bool initiator)
2489  {
2490      ...
2544  auth:
2545      if (test_bit(HCI_CONN_ENCRYPT_PEND, &conn->flags))
2546          return 0;
2547  
2548      if (initiator)
2549          set_bit(HCI_CONN_AUTH_INITIATOR, &conn->flags);
2550  
2551  →  if (!hci_conn_auth(conn, sec_level, auth_type))
2552          return 0;
2553  
2554  encrypt:
2555      if (test_bit(HCI_CONN_ENCRYPT, &conn->flags)) {
2556          if (!conn->enc_key_size)
2557              return 0;
2558          return 1;
2559      }
2560  
2561      hci_conn_encrypt(conn);
2562      return 0;
2563  }

What

hci_send_cmd() at net/bluetooth/hci_core.c:3111 calls queue_work(hdev->workqueue, &hdev->cmd_work) after hdev->workqueue has already entered the draining or destroying state during HCI device shutdown. The workqueue core’s WARN_ONCE in __queue_work() at kernel/workqueue.c:2296–2298 detects this and fires, printing:

workqueue: cannot queue hci_cmd_work on wq hci0

The crashing instruction is a ud1 emitted by the WARN trap mechanism. The function being queued is confirmed by RSI = 0xffffffff8aa9ccd0 = hci_cmd_work, loaded from work->func through the pointer in R13. R15, after the add $0x170 just before the WARN, points to wq->name = "hci0", and the same pointer is in RDX at crash time.

hci_send_cmd() does not check the HCI_UP flag in hdev->flags before calling queue_work(). Other entry points into the HCI device, such as hci_recv_frame(), guard with !test_bit(HCI_UP, &hdev->flags); hci_send_cmd() is missing that guard.
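
The check that fires can be modelled in user space. The sketch below mirrors only the shape of the condition quoted from __queue_work(); the flag values, struct and function names are illustrative assumptions, and the chained parameter stands in for is_chained_work(wq).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; these are not the kernel's actual values. */
#define MODEL_WQ_DESTROYING (1u << 0)
#define MODEL_WQ_DRAINING   (1u << 1)

struct model_wq {
    unsigned int flags;
    int queued;   /* work items accepted */
    int warnings; /* times the "cannot queue" condition was hit */
};

/*
 * A draining or destroying workqueue only accepts work that is chained
 * from one of its own work items. Anything else is rejected.
 */
bool model_queue_work(struct model_wq *wq, bool chained)
{
    assert(wq != NULL);

    if ((wq->flags & (MODEL_WQ_DESTROYING | MODEL_WQ_DRAINING)) && !chained) {
        wq->warnings++; /* the kernel would print the warning once here */
        return false;
    }
    wq->queued++;
    return true;
}
```

In the report, the rejected caller is l2cap_info_timeout running on the events workqueue, so it is not chained work from hdev->workqueue's point of view.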


How

Q1: How does queue_work(hdev->workqueue, …) get called on a draining/destroying workqueue?

A1: hci_send_cmd() calls queue_work(hdev->workqueue, &hdev->cmd_work) unconditionally, without checking whether the device is still up. No caller in the l2cap_info_timeout → hci_conn_security → hci_conn_auth → hci_send_cmd chain checks this either.

Q2: How does hdev->workqueue end up draining while hci_send_cmd can still be reached?

A2: Race condition in hci_dev_close_sync() (net/bluetooth/hci_sync.c):

  1. Line 5322: test_and_clear_bit(HCI_UP, &hdev->flags) clears HCI_UP.
  2. Line 5353: drain_workqueue(hdev->workqueue) is called, which sets __WQ_DRAINING on hdev->workqueue.
  3. Line 5368: hci_conn_hash_flush(hdev) → l2cap_conn_del() → disable_delayed_work_sync(&conn->info_timer). This would cancel l2cap_info_timeout, but it runs after step 2.

l2cap_info_timeout runs on the events workqueue, which is entirely separate from hdev->workqueue. Draining hdev->workqueue does not affect the events workqueue. If l2cap_info_timeout was already scheduled on the events workqueue, the callback can fire after step 2 has set __WQ_DRAINING and before step 3 cancels it, and it then reaches hci_send_cmd() while hdev->workqueue refuses new work.

The same race could also occur at destroy_workqueue(hdev->workqueue) (lines 2681, 2750) if the device is unregistered while l2cap_info_timeout is in flight, giving __WQ_DESTROYING instead of __WQ_DRAINING.
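
The ordering can be replayed in a single-threaded user-space model. This is a sketch under stated assumptions: the struct, its fields and both functions are hypothetical stand-ins for the kernel state described above, and the guarded parameter selects whether the sender checks the up flag first.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_hdev {
    bool up;                 /* stands in for HCI_UP */
    bool wq_draining;        /* stands in for __WQ_DRAINING on hdev->workqueue */
    bool info_timer_pending; /* l2cap_info_timeout still armed on the events wq */
    int queued;              /* work accepted by the model workqueue */
    int warnings;            /* times the workqueue would have warned */
};

/* Send path; 'guarded' enables a check of the up flag before queueing. */
int model_send_cmd(struct model_hdev *hdev, bool guarded)
{
    assert(hdev != NULL);

    if (guarded && !hdev->up)
        return -ENETDOWN;
    if (hdev->wq_draining) {
        hdev->warnings++; /* the workqueue warns and drops the work */
        return 0;         /* the sender ignores that and reports success */
    }
    hdev->queued++;
    return 0;
}

/* Replays steps 1-3 with the timer callback firing after step 2. */
int model_close_race(struct model_hdev *hdev, bool guarded)
{
    int ret = 0;

    assert(hdev != NULL);

    hdev->up = false;                 /* step 1: clear the up flag */
    hdev->wq_draining = true;         /* step 2: drain begins */
    if (hdev->info_timer_pending) {   /* callback runs on the events wq */
        hdev->info_timer_pending = false;
        ret = model_send_cmd(hdev, guarded);
    }
    hdev->info_timer_pending = false; /* step 3: cancel, too late here */
    return ret;
}
```

With guarded set to false the model records one warning and still returns 0, which matches the listing of hci_send_cmd() above, where the result of queue_work() is not checked. With guarded set to true the callback gets -ENETDOWN and never reaches the workqueue.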


Where

Fix location: hci_send_cmd() in net/bluetooth/hci_core.c.

Add an HCI_UP guard near the top of hci_send_cmd(), before the allocation, matching the pattern already used in hci_recv_frame() (line 2922) and the hci_dev_ioctl() handler (line 596). Because hci_dev_close_sync() clears HCI_UP before it calls drain_workqueue(), the guard rejects any hci_send_cmd() call that arrives after shutdown has begun, so such a call never reaches the draining workqueue.

No resource leaks are introduced: the guard fires before any allocation. No lock imbalance: no locks are held at the guard point. No side effects on callers: all call sites that handle the return value already tolerate negative errno codes.

Proposed fix (PATCH_BASE: b4e07588e743):

--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -3092,6 +3092,9 @@ int hci_send_cmd(struct hci_dev *hdev, __u16 opcode, __u32 plen,

    BT_DBG("%s opcode 0x%4.4x plen %d", hdev->name, opcode, plen);

+   if (!test_bit(HCI_UP, &hdev->flags))
+       return -ENETDOWN;
+
    skb = hci_cmd_sync_alloc(hdev, opcode, plen, param, NULL);
    if (!skb) {
        bt_dev_err(hdev, "no memory for command");

Diff review:

  - No resource leak: returned before any allocation.
  - No lock imbalance: no locks held at this point.
  - No NULL/uninitialized dereference: hdev is always non-NULL at this point.
  - No error-path coverage gap: hci_cmd_sync_alloc failure is still handled.
  - Caller side effects: hci_conn_auth ignores the return value of hci_send_cmd; returning -ENETDOWN is safe. All other callers that propagate the error already handle negative errno.
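
The "no resource leak" point depends only on ordering: the guard returns before anything is allocated. A minimal user-space sketch of that ordering follows; the struct, the counters and the function name are hypothetical, and malloc stands in for the skb allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_dev {
    bool up;         /* stands in for HCI_UP */
    int allocations; /* command buffers allocated so far */
    int queued;      /* commands handed to the queue */
};

/* Shape of the patched function: guard, then allocate, then queue. */
int model_send_cmd_patched(struct model_dev *dev, size_t plen)
{
    void *buf;

    assert(dev != NULL);

    if (!dev->up)
        return -ENETDOWN; /* nothing allocated yet, nothing to free */

    buf = malloc(plen ? plen : 1);
    if (!buf)
        return -ENOMEM;
    dev->allocations++;

    dev->queued++; /* the real code would append the buffer to cmd_q */
    free(buf);     /* model only: release immediately */
    return 0;
}
```

Calling it on a device whose up flag is false leaves both counters at zero, which is the property the review item relies on.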


Bug Introduction

queue_work(hdev->workqueue, &hdev->cmd_work) at hci_core.c:3111 has been present without an HCI_UP guard since at least commit c347b765fe70 (“Bluetooth: use module workqueue”, Gustavo Padovan, 2011-12-14), which introduced queue_work calls against hdev->workqueue. The hci_send_cmd function itself dates to the original 2.6 import (1da177e4c3f4).

This is a long-standing gap: the HCI_UP guard was added to many paths but not to hci_send_cmd. The crash was first surfaced by syzbot fuzzing L2CAP-over-Bluetooth teardown scenarios.

No upstream fix commit was identified within the search budget that specifically adds the HCI_UP check to hci_send_cmd.

Field | Value | Implication
INTRODUCED-BY | c347b765fe70 | queue_work on hdev->workqueue added without HCI_UP guard

Fact Check

(Detailed factcheck log: see factcheck.md in the same directory)


Patch Review

Verdict: PASS

All checklist items verified. No serious issues found. The patch correctly guards hci_send_cmd() against device shutdown by checking the HCI_UP flag before any allocation or queueing operation occurs. The guard precedes the first risky operation (line 3099: hci_cmd_sync_alloc()), preventing resource leaks. No lock imbalance or NULL dereference issues introduced. Error handling is correct—callers either ignore the return value or already tolerate negative errno codes; adding -ENETDOWN to the possible return set breaks no existing contracts. The fix matches the identical pattern already used in hci_recv_frame() (line 2922) and hci_dev_ioctl() (line 596). The guard placement is optimal because HCI_UP is cleared before drain_workqueue() in hci_dev_close_sync(), catching the race condition precisely at the vulnerability point.

Fact Check

All checked items verified. Marker normalisation applied to align with reporting standards.