# Linux kernel crash report

Syzbot report: BUG: sleeping function called from invalid context in `shmem_undo_range`, filed against the upstream kernel at HEAD commit `2a4c0c11c019`.

Original report: [69eab803.a00a0220.17a17.004b.GAE@google.com](https://lore.kernel.org/all/69eab803.a00a0220.17a17.004b.GAE@google.com/)

## Key elements

| Field | Value | Implication |
| ----- | ----- | ----------- |
| CRASH_TYPE | `BUG: sleeping function called from invalid context` | `cond_resched()` called while an RCU read-side critical section is active |
| UNAME | `syzkaller #0 PREEMPT(full)` | Full-preemption kernel; `rcu_read_lock()` does NOT disable preemption but increments `current->rcu_read_lock_nesting` |
| HEAD_COMMIT | `2a4c0c11c019` | Merge tag 's390-7.1-1'; upstream 7.1-rc1 merge window |
| GIT_TREE | `upstream` | |
| PROCESS | `rm` | Userspace `unlink(2)` syscall on a tmpfs/shmem file |
| PID | `5904` | |
| HARDWARE | `QEMU Standard PC (Q35 + ICH9, 2009)` | |
| BIOS | `1.16.3-debian-1.16.3-2 04/01/2014` | |
| COMPILER | `gcc (Debian 14.2.0-19) 14.2.0` | |
| TAINT | `W` (first BUG: Not tainted; subsequent WARNINGs: G W) | W = previous warning; G = only GPL modules |
| VMLINUX | `oops-workdir/vmlinux-2a4c0c11` | Downloaded from syzbot assets |
| SOURCEDIR | `oops-workdir/linux` | Local tree at `6596a02b20788` (HEAD, ~2 days after oops commit) |
| MSGID | `<69eab803.a00a0220.17a17.004b.GAE@google.com>` | |
| MSGID_URL | [69eab803.a00a0220.17a17.004b.GAE@google.com](https://lore.kernel.org/all/69eab803.a00a0220.17a17.004b.GAE@google.com/) | |

## Kernel modules

| Module | Flags | Backtrace | Location | Flag Implication |
| ------ | ----- | --------- | -------- | ---------------- |
| *(none; no modules linked in)* | | | | |

## Backtrace

The report contains **three related crash events** from the same process (rm/5904). Only the **first**, the primary BUG, is analysed per fundamentals.md.
The two subsequent WARNINGs are consequences of the same root cause and are noted separately.

### Primary crash: BUG: sleeping function called from invalid context

| Address | Function | Offset | Size | Context | Module | Source location |
| ------- | -------- | ------ | ---- | ------- | ------ | --------------- |
| | `__dump_stack` (inlined) | | | Task | | `lib/dump_stack.c:94` |
| `0xffffffff8121ab04 (0xffffffff8121ab04 + 0x0)` | `dump_stack_lvl` | `0x100` | `0x190` | Task | | `lib/dump_stack.c:120` |
| `0xffffffff8121acd0 (0xffffffff8121aae4 + 0x1ec)` | `__might_resched.cold` | `0x1ec` | `0x232` | Task | | [kernel/sched/core.c:9163](#3-__might_resched-assert-site) |
| `0xffffffff8250f297 (0xffffffff8250ee50 + 0x447)` | `shmem_undo_range` | `0x447` | `0x1570` | Task | | [mm/shmem.c:1150](#1-shmem_undo_range--crash-site-shmemcl1150) |
| | `shmem_truncate_range` (inlined) | | | Task | | `mm/shmem.c:1277` |
| `0xffffffff825108a3 (0xffffffff825104b0 + 0x3f3)` | `shmem_evict_inode` | `0x3f3` | `0xc40` | Task | | [mm/shmem.c:1407](#2-shmem_evict_inode--callsite-shmemcl1407) |
| `0xffffffff8290f9e2 (0xffffffff8290f620 + 0x3c2)` | `evict` | `0x3c2` | `0xad0` | Task | | `fs/inode.c:841` |
| | `iput_final` (inlined) | | | Task | | `fs/inode.c:1960` |
| `0xffffffff829113f5 (0xffffffff82910df0 + 0x605)` | `iput.part.0` | `0x605` | `0xf50` | Task | | `fs/inode.c:2009` |
| `0xffffffff82911d85 (0xffffffff82911d50 + 0x35)` | `iput` | `0x35` | `0x40` | Task | | `fs/inode.c:1975` |
| `0xffffffff828d9216 (0xffffffff828d8db0 + 0x466)` | `filename_unlinkat` | `0x466` | `0x730` | Task | | [fs/namei.c:5572](#4-filename_unlinkat--caller-namei-c5572) |
| | `__do_sys_unlink` (inlined) | | | Task | | `fs/namei.c:5603` |
| | `__se_sys_unlink` (inlined) | | | Task | | `fs/namei.c:5600` |
| `0xffffffff828d97b6 (0xffffffff828d9770 + 0x46)` | `__x64_sys_unlink` | `0x46` | `0x70` | Task | | `fs/namei.c:5600` |
| | `do_syscall_x64` (inlined) | | | Task | | `arch/x86/entry/syscall_64.c:63` |
| `0xffffffff8b97718b (0xffffffff8b977080 + 0x10b)` | `do_syscall_64` | `0x10b` | `0xf80` | Task | | `arch/x86/entry/syscall_64.c:94` |
| | `entry_SYSCALL_64_after_hwframe` | `0x77` | `0x7f` | Task | | `arch/x86/entry/entry_64.S` |

### Subsequent WARNING: lock held when returning to user space

After the BUG fires, the RCU read lock is still held when `rm` returns to userspace. This is a distinct lockdep WARNING recorded in the report.

### Subsequent WARNING: Voluntary context switch within RCU read-side critical section

Triggered by an APIC timer interrupt firing and the scheduler detecting a context switch while the RCU read lock is still held. RIP: `rcu_note_context_switch+0x859/0x19c0 kernel/rcu/tree_plugin.h:332` (the `ud1` trap is the warning at that line firing; as the heading says, this is a WARNING, not a `BUG()`).

---

## Locks held

At the time of the primary BUG crash, two locks were held by `rm/5904`:

| # | Lock name | Flags | Acquired in function | Offset | Source | In backtrace? |
|---|-----------|-------|---------------------|--------|--------|---------------|
| 0 | `sb_writers#5` | `{.+.+}-{0:0}` | `filename_unlinkat` | `0x1ad/0x730` | `fs/namei.c:5545` | Yes |
| 1 | `rcu_read_lock` | `{....}-{1:3}` | `rcu_lock_acquire.constprop.0` | `0x7/0x30` | `include/linux/rcupdate.h:300` | **No** |

**Lock #0** (`sb_writers#5`): acquired at `filename_unlinkat+0x1ad`, which resolves to `fs/namei.c:5545`, the `mnt_want_write(path.mnt)` call. This is the superblock write-access (freeze protection) reference that `mnt_want_write()` takes before modifying the filesystem. It is in the backtrace.

**Lock #1** (`rcu_read_lock`): the acquiring function `rcu_lock_acquire.constprop.0` is **not in the backtrace**. Per lockdep analysis rule, the acquisition site is recorded but not chased further. Crucially, this lock was acquired *after* lock #0 (the list is in acquisition order), meaning it was taken somewhere between `mnt_want_write` at line 5545 and the `iput` call at line 5572 in `filename_unlinkat`. It is never released.

---

## Backtrace source code

### 1. `shmem_undo_range` — crash site (`shmem.c:1150`)

[`mm/shmem.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/shmem.c?id=2a4c0c11c019#n1108)

```c
1104 /*
1105  * Remove range of pages and swap entries from page cache, and free them.
1106  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
1107  */
1108 static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend,
1109                              bool unfalloc)
1110 {
1111     struct address_space *mapping = inode->i_mapping;
1112     struct shmem_inode_info *info = SHMEM_I(inode);
1113     pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
1114     pgoff_t end = (lend + 1) >> PAGE_SHIFT;
1115     struct folio_batch fbatch;
1116     pgoff_t indices[FOLIO_BATCH_SIZE];
1117     struct folio *folio;
1118     bool same_folio;
1119     long nr_swaps_freed = 0;
1120     pgoff_t index;
1121     int i;
1122
1123     if (lend == -1)
1124         end = -1;    /* unsigned, so actually very big */
1125
1126     if (info->fallocend > start && info->fallocend <= end && !unfalloc)
1127         info->fallocend = start;
1128
1129     folio_batch_init(&fbatch);
1130     index = start;
1131     while (index < end && find_lock_entries(mapping, &index, end - 1,
1132             &fbatch, indices)) {
1133         for (i = 0; i < folio_batch_count(&fbatch); i++) {
1134             folio = fbatch.folios[i];
1135
1136             if (xa_is_value(folio)) {
1137                 if (unfalloc)
1138                     continue;
1139                 nr_swaps_freed += shmem_free_swap(mapping, indices[i],
1140                                                   end - 1, folio);
1141                 continue;
1142             }
1143
1144             if (!unfalloc || !folio_test_uptodate(folio))
1145                 truncate_inode_folio(mapping, folio);
1146             folio_unlock(folio);
1147         }
1148         folio_batch_remove_exceptionals(&fbatch);
1149         folio_batch_release(&fbatch);
1150         cond_resched();    // <-- BUG: RCU nest depth 1, expected 0
1151     }
```

### 2. `shmem_evict_inode` — callsite (`shmem.c:1407`)

[`mm/shmem.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/shmem.c?id=2a4c0c11c019#n1397)

```c
1397 static void shmem_evict_inode(struct inode *inode)
1398 {
1399     struct shmem_inode_info *info = SHMEM_I(inode);
1400     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1401     size_t freed = 0;
1402
1403     if (shmem_mapping(inode->i_mapping)) {
1404         shmem_unacct_size(info->flags, inode->i_size);
1405         inode->i_size = 0;
1406         mapping_set_exiting(inode->i_mapping);
1407         shmem_truncate_range(inode, 0, (loff_t)-1);    // <-- called here
...
```

### 3. `__might_resched` — assert site

[`kernel/sched/core.c` near line 9163](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?id=2a4c0c11c019)

`__might_resched()` is an assert/precondition function that checks whether a reschedule point is legal given the current lock state. On a `PREEMPT(full)` kernel the RCU nesting depth is separately tracked in `current->rcu_read_lock_nesting`. When that depth is non-zero, calling `cond_resched()` (which internally calls `__might_resched()`) triggers the BUG message with:

```
in_atomic(): 0, irqs_disabled(): 0, non_block: 0
RCU nest depth: 1, expected: 0
```

### 4. `filename_unlinkat` — caller (`namei.c:5572`)

[`fs/namei.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c?id=2a4c0c11c019#n5526)

```c
5526 int filename_unlinkat(int dfd, struct filename *name)
5527 {
...
5536 retry:
5537     error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
5538     if (error)
5539         return error;
...
5545     error = mnt_want_write(path.mnt);    // <-- lock #0 (sb_writers) acquired here
5546     if (error)
5547         goto exit_path_put;
5548 retry_deleg:
5549     dentry = start_dirop(path.dentry, &last, lookup_flags);
...
5563     inode = dentry->d_inode;
5564     ihold(inode);
5565     error = security_path_unlink(&path, dentry);
5566     if (error)
5567         goto exit_end_dirop;
5568     error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
5569                        dentry, &delegated_inode);    // <-- lock #1 (rcu_read_lock) acquired in here?
5570 exit_end_dirop:
5571     end_dirop(dentry);
5572     iput(inode);    // <-- truncate the inode here → shmem_evict_inode → crash
...
5579     mnt_drop_write(path.mnt);    // <-- lock #0 (sb_writers) released
...
5587 }
```

---

## What-how-where analysis

### What

At `mm/shmem.c:1150`, the function `shmem_undo_range()` calls `cond_resched()` inside a while loop that iterates over the page cache entries of the inode being evicted. `cond_resched()` runs the `__might_resched()` debug check before it attempts any reschedule. On this `PREEMPT(full)` kernel that check compares `current->rcu_read_lock_nesting`, which tracks active RCU read-side critical sections, against the expected depth of 0. At this point the nesting depth is **1** (one active `rcu_read_lock()` that was never matched by an `rcu_read_unlock()`). This is the "sleeping function called from invalid context" check firing:

```
RCU nest depth: 1, expected: 0
```

The kernel logs the BUG message, adds taint flag `W`, and continues execution; the system does **not** panic.

### How

The call chain starts with the `unlink(2)` system call on a tmpfs/shmem file:

```
sys_unlink
  → filename_unlinkat
    → iput
      → iput_final
        → evict
          → shmem_evict_inode
            → shmem_truncate_range
              → shmem_undo_range
                → cond_resched()    ← BUG fires here
```

The lockdep state at crash time shows two locks held (in acquisition order):

1. `sb_writers#5`: acquired at `filename_unlinkat+0x1ad` (line 5545, `mnt_want_write(path.mnt)`). This is the normal filesystem write-access reference.
2. `rcu_read_lock`: acquired at `rcu_lock_acquire.constprop.0` (not present in the backtrace).
   Because lockdep records locks in acquisition order, this lock was taken **after** `mnt_want_write` and **before** `iput`, i.e. somewhere between `filename_unlinkat:5545` and `filename_unlinkat:5572`. The candidates in that window are: `start_dirop()`, `security_path_unlink()`, `vfs_unlink()` (and its callees: `fsnotify_unlink()`, `d_delete_notify()`).

The `rcu_read_lock()` that was acquired in this window was never matched by an `rcu_read_unlock()`. It persists through the entire eviction path and is still held when `rm` returns to userspace, as confirmed by the subsequent "lock held when returning to user space!" warning. A third warning, "Voluntary context switch within RCU read-side critical section!", fires from an APIC timer interrupt when the scheduler tries to switch away from the process while the RCU lock is still held.

This is a **How — Negative How**: there is a missing `rcu_read_unlock()` somewhere in the `filename_unlinkat` call chain (between lines 5545 and 5572). The exact acquisition site is not directly visible in the backtrace; per the lockdep analysis rules the acquisition function (`rcu_lock_acquire.constprop.0`) is not present and is not chased further.

### Where

The fix needs to identify and add the missing `rcu_read_unlock()` in whichever function between `mnt_want_write` and `iput` acquires the RCU lock without releasing it. The prime suspects (all called from `filename_unlinkat` between lines 5549–5571 on the success path) are:

- `vfs_unlink()` (fs/namei.c:5472): calls `fsnotify_unlink()` and `d_delete_notify()` on the success path; either could internally acquire an RCU lock and fail to release it on some code path.
- `start_dirop()` → `lookup_one_qstr_excl()`: does dentry lookup which may enter RCU walk mode.
- `security_path_unlink()`: LSM hooks are a less likely source but should not be excluded.

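One way to work through the suspects listed above is to compare the RCU nesting depth before and after each call and flag the first call that returns with a different depth. The block below is a minimal userspace sketch of that idea, not kernel code. Every `model_*` name is hypothetical, the counter only stands in for `current->rcu_read_lock_nesting`, and the choice of which stand-in leaks is arbitrary (the report does not identify the real culprit). In a kernel build the depth reading would presumably come from `rcu_preempt_depth()` instead.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for current->rcu_read_lock_nesting (assumption: userspace model only). */
static int model_rcu_depth;

static void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);	/* catch an unlock without a lock */
	model_rcu_depth--;
}

/* Hypothetical stand-ins for the three suspects. */
static void model_start_dirop(void)
{
	model_rcu_read_lock();
	model_rcu_read_unlock();	/* balanced */
}

static void model_security_path_unlink(void)
{
	/* takes no RCU read lock in this model */
}

static void model_vfs_unlink(void)
{
	model_rcu_read_lock();
	/* deliberately planted leak: no matching unlock */
}

struct model_step {
	const char *name;
	void (*fn)(void);
};

/*
 * Run the steps in order. Return the index of the first step that
 * returns with a different nesting depth than it was entered with,
 * or -1 if every step is balanced.
 */
static int model_find_unbalanced_step(const struct model_step *steps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int before = model_rcu_depth;

		steps[i].fn();
		if (model_rcu_depth != before)
			return (int)i;
	}
	return -1;
}
```

Calling `model_find_unbalanced_step()` on the three stand-ins in call order reports the index of the planted leak; the same before/after comparison, placed around each real callee, is what would narrow the window between lines 5549 and 5571.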
Recommended debugging approach: add a `WARN_ON_ONCE(lock_is_held(&rcu_lock_map))` or `WARN_ON(rcu_read_lock_held())` in `iput()` (in addition to the existing `might_sleep()` check) to narrow down the acquisition site. Alternatively, rely on lockdep's RCU checking (`CONFIG_PROVE_RCU`) and review the acquisition site that lockdep records for the held lock.

Until the root cause is found, a diagnostic aid (not a fix) would be to add a `WARN_ON(rcu_read_lock_held())` near the top of `shmem_evict_inode()` to produce an earlier, more informative trace on future occurrences.

---

## Bug introduction

The exact acquisition site of the leaked `rcu_read_lock()` is not visible in the backtrace (the acquiring function is not in the call trace). Because the How analysis could not identify a specific root-cause line, the git blame step cannot be targeted. As a result:

**Bug introduction commit not identified: the RCU lock acquisition is not visible in the backtrace, preventing targeted git blame analysis of the acquisition site.**

Upstream fix search (git log across `mm/shmem.c`, `fs/namei.c`, `mm/filemap.c`, `fs/inode.c`) found no commit within the search budget that appears to address this specific RCU lock imbalance in the unlink/eviction path.

---

## Analysis, conclusions and recommendations

An `rcu_read_lock()` is acquired somewhere in the `filename_unlinkat()` call chain (between the `mnt_want_write()` and `iput()` calls) without a matching `rcu_read_unlock()`. This leaked lock is carried through the entire tmpfs inode eviction path until `shmem_undo_range()` calls `cond_resched()`, which on a `PREEMPT(full)` kernel checks the RCU nesting depth and triggers:

> BUG: sleeping function called from invalid context at mm/shmem.c:1150
>
> RCU nest depth: 1, expected: 0

The same leaked lock is still held on return to userspace ("lock held when returning to user space!") and causes a "Voluntary context switch within RCU read-side critical section!" warning from a timer interrupt.
All three messages are symptoms of a single leaked `rcu_read_lock()` call.

**Immediate action**: The mm/shmem maintainers and reviewers (akpm, baolin.wang, hughd) should investigate recent changes to `vfs_unlink()` and its fsnotify/dcache callees for an `rcu_read_lock()` without a matching `rcu_read_unlock()`. A `WARN_ON(rcu_read_lock_held())` check added to `iput()` (alongside the existing `might_sleep()`) would produce a more actionable backtrace on reproduction.
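The failure mode summarised above, a read lock leaked early and only noticed later at a sleep check, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not a reproduction of the kernel code: all `toy_*` names are hypothetical, the struct keeps only a counter named after `rcu_read_lock_nesting`, and the "early return without unlock" path is an invented example of how such a leak could arise, since the report does not identify the real path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy task: only the one counter this report is about. */
struct toy_task {
	int rcu_read_lock_nesting;
};

static void toy_rcu_read_lock(struct toy_task *t)
{
	t->rcu_read_lock_nesting++;
}

static void toy_rcu_read_unlock(struct toy_task *t)
{
	assert(t->rcu_read_lock_nesting > 0);	/* unlock must follow a lock */
	t->rcu_read_lock_nesting--;
}

/* Models the depth comparison: true means the sleep check would complain. */
static bool toy_sleep_check_complains(const struct toy_task *t)
{
	return t->rcu_read_lock_nesting != 0;	/* expected depth is 0 */
}

/*
 * Hypothetical buggy callee: takes the read lock on entry and forgets
 * to drop it on one path (invented for illustration).
 */
static int toy_leaky_callee(struct toy_task *t, bool take_early_return)
{
	toy_rcu_read_lock(t);
	if (take_early_return)
		return -1;		/* leak: no unlock on this path */
	toy_rcu_read_unlock(t);
	return 0;
}

/*
 * Models the unlink flow: the callee runs first, the eviction path with
 * its sleep check runs afterwards. Returns true if that later check
 * would complain.
 */
static bool toy_unlink_trips_check(struct toy_task *t, bool take_early_return)
{
	toy_leaky_callee(t, take_early_return);
	return toy_sleep_check_complains(t);
}
```

In this model the complaint is raised in the eviction step even though the defect sits in the earlier callee, which mirrors why the backtrace in this report points at `shmem_undo_range()` rather than at the function that took the lock.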