# Linux kernel crash report

Syzbot report: BUG: sleeping function called from invalid context in
`shmem_undo_range`, filed against the upstream kernel at HEAD commit
`2a4c0c11c019`.

Original report: [69eab803.a00a0220.17a17.004b.GAE@google.com](https://lore.kernel.org/all/69eab803.a00a0220.17a17.004b.GAE@google.com/)

## Key elements

| Field | Value | Implication |
| ----- | ----- | ----------- |
| CRASH_TYPE | `BUG: sleeping function called from invalid context` | `cond_resched()` called while an RCU read-side critical section is active |
| UNAME | `syzkaller #0 PREEMPT(full)` | Full-preemption kernel; `rcu_read_lock()` does NOT disable preemption but increments `current->rcu_read_lock_nesting` |
| HEAD_COMMIT | `2a4c0c11c019` | Merge tag 's390-7.1-1' — upstream 7.1-rc1 merge window |
| GIT_TREE | `upstream` | |
| PROCESS | `rm` | Userspace `unlink(2)` syscall on a tmpfs/shmem file |
| PID | `5904` | |
| HARDWARE | `QEMU Standard PC (Q35 + ICH9, 2009)` | |
| BIOS | `1.16.3-debian-1.16.3-2 04/01/2014` | |
| COMPILER | `gcc (Debian 14.2.0-19) 14.2.0` | |
| TAINT | `W` (first BUG — Not tainted; subsequent WARNINGs — G W) | W = previous warning; G = only GPL modules |
| VMLINUX | `oops-workdir/vmlinux-2a4c0c11` | Downloaded from syzbot assets |
| SOURCEDIR | `oops-workdir/linux` | Local tree at `6596a02b20788` (HEAD, ~2 days after oops commit) |
| MSGID | `<69eab803.a00a0220.17a17.004b.GAE@google.com>` | |
| MSGID_URL | [69eab803.a00a0220.17a17.004b.GAE@google.com](https://lore.kernel.org/all/69eab803.a00a0220.17a17.004b.GAE@google.com/) | |


## Kernel modules

| Module | Flags | Backtrace | Location | Flag Implication |
| ------ | ----- | --------- | -------- | ---------------- |
| *(none — no modules linked in)* | | | | |


## Backtrace

The report contains **three related crash events** from the same process (rm/5904).
Only the **first** — the primary BUG — is analysed per fundamentals.md. The two
subsequent WARNINGs are consequences of the same root cause and are noted separately.

### Primary crash: BUG: sleeping function called from invalid context

| Address | Function | Offset | Size | Context | Module | Source location |
| ------- | -------- | ------ | ---- | ------- | ------ | --------------- |
| | `__dump_stack` (inlined) | | | Task | | `lib/dump_stack.c:94` |
| `0xffffffff8121ab04 (0xffffffff8121ab04 + 0x0)` | `dump_stack_lvl` | `0x100` | `0x190` | Task | | `lib/dump_stack.c:120` |
| `0xffffffff8121acd0 (0xffffffff8121aae4 + 0x1ec)` | `__might_resched.cold` | `0x1ec` | `0x232` | Task | | [kernel/sched/core.c:9163](#3-__might_resched-assert-site) |
| `0xffffffff8250f297 (0xffffffff8250ee50 + 0x447)` | `shmem_undo_range` | `0x447` | `0x1570` | Task | | [mm/shmem.c:1150](#1-shmem_undo_range--crash-site-shmemcl1150) |
| | `shmem_truncate_range` (inlined) | | | Task | | `mm/shmem.c:1277` |
| `0xffffffff825108a3 (0xffffffff825104b0 + 0x3f3)` | `shmem_evict_inode` | `0x3f3` | `0xc40` | Task | | [mm/shmem.c:1407](#2-shmem_evict_inode--callsite-shmemcl1407) |
| `0xffffffff8290f9e2 (0xffffffff8290f620 + 0x3c2)` | `evict` | `0x3c2` | `0xad0` | Task | | `fs/inode.c:841` |
| | `iput_final` (inlined) | | | Task | | `fs/inode.c:1960` |
| `0xffffffff829113f5 (0xffffffff82910df0 + 0x605)` | `iput.part.0` | `0x605` | `0xf50` | Task | | `fs/inode.c:2009` |
| `0xffffffff82911d85 (0xffffffff82911d50 + 0x35)` | `iput` | `0x35` | `0x40` | Task | | `fs/inode.c:1975` |
| `0xffffffff828d9216 (0xffffffff828d8db0 + 0x466)` | `filename_unlinkat` | `0x466` | `0x730` | Task | | [fs/namei.c:5572](#4-filename_unlinkat--caller-namei-c5572) |
| | `__do_sys_unlink` (inlined) | | | Task | | `fs/namei.c:5603` |
| | `__se_sys_unlink` (inlined) | | | Task | | `fs/namei.c:5600` |
| `0xffffffff828d97b6 (0xffffffff828d9770 + 0x46)` | `__x64_sys_unlink` | `0x46` | `0x70` | Task | | `fs/namei.c:5600` |
| | `do_syscall_x64` (inlined) | | | Task | | `arch/x86/entry/syscall_64.c:63` |
| `0xffffffff8b97718b (0xffffffff8b977080 + 0x10b)` | `do_syscall_64` | `0x10b` | `0xf80` | Task | | `arch/x86/entry/syscall_64.c:94` |
| | `entry_SYSCALL_64_after_hwframe` | `0x77` | `0x7f` | Task | | `arch/x86/entry/entry_64.S` |
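
Each populated address cell pairs the absolute return address with its `base + offset` decomposition, so a row can be sanity-checked by recomputing the sum and confirming that the offset lies inside the function's size. The following is a minimal sketch of that check. `struct bt_frame`, `frame_address()` and `frame_is_consistent()` are hypothetical helper names introduced here for illustration; they are not part of any kernel tooling.

```c
#include <stdbool.h>
#include <stdint.h>

/* One backtrace row: function start, offset into it, and function size.
 * Hypothetical helper type, introduced only for this illustration. */
struct bt_frame {
    uint64_t base;   /* address of the function's first byte */
    uint64_t offset; /* return address minus base */
    uint64_t size;   /* total size of the function in bytes */
};

/* Recompute the absolute address shown in the "Address" column. */
static uint64_t frame_address(const struct bt_frame *f)
{
    return f->base + f->offset;
}

/* A frame is plausible only if the offset falls inside the function. */
static bool frame_is_consistent(const struct bt_frame *f)
{
    return f->offset < f->size;
}
```

For the `shmem_undo_range` row, `0xffffffff8250ee50 + 0x447` gives `0xffffffff8250f297`, and `0x447` is below the function size of `0x1570`.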

### Subsequent WARNING: lock held when returning to user space

After the BUG fires, the RCU read lock is still held when `rm` returns to
userspace. This is a distinct lockdep WARNING recorded in the report.

### Subsequent WARNING: Voluntary context switch within RCU read-side critical section

Triggered when an APIC timer interrupt fires and the scheduler detects a
context switch while the RCU read lock is still held. RIP:
`rcu_note_context_switch+0x859/0x19c0 kernel/rcu/tree_plugin.h:332` (the
trapping instruction is `ud1`; as the heading indicates, this is a WARNING,
not a `BUG()`).

---

## Locks held

At the time of the primary BUG crash, two locks were held by `rm/5904`:

| # | Lock name | Flags | Acquired in function | Offset | Source | In backtrace? |
|---|-----------|-------|---------------------|--------|--------|---------------|
| 0 | `sb_writers#5` | `{.+.+}-{0:0}` | `filename_unlinkat` | `0x1ad/0x730` | `fs/namei.c:5545` | Yes |
| 1 | `rcu_read_lock` | `{....}-{1:3}` | `rcu_lock_acquire.constprop.0` | `0x7/0x30` | `include/linux/rcupdate.h:300` | **No** |

**Lock #0** (`sb_writers#5`): acquired at `filename_unlinkat+0x1ad`, which resolves
to `fs/namei.c:5545`, the `mnt_want_write(path.mnt)` call. This is the
superblock write-access (freeze protection) lock that `mnt_want_write()` takes
before the filesystem is modified. The acquiring function is in the backtrace.

**Lock #1** (`rcu_read_lock`): the acquiring function `rcu_lock_acquire.constprop.0`
is **not in the backtrace**. Per the lockdep analysis rule, the acquisition site is
recorded but not chased further. Crucially, this lock was acquired *after* lock #0
(the list is in acquisition order), meaning it was taken somewhere between
`mnt_want_write` at line 5545 and the `iput` call at line 5572 in
`filename_unlinkat`. It is never released.
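
The ordering argument can be illustrated with a toy model of a per-task held-lock list. This is a sketch under simplifying assumptions, not lockdep's actual data structure: `struct held_list`, `held_acquire()`, `held_release()` and `held_index()` are hypothetical names, and locks are identified by string only.

```c
#include <string.h>

#define MAX_HELD 8

/* Toy per-task list of held locks, kept in acquisition order. */
struct held_list {
    const char *name[MAX_HELD];
    int depth;
};

/* Append a lock; returns its index, or -1 if the list is full. */
static int held_acquire(struct held_list *h, const char *name)
{
    if (h->depth >= MAX_HELD)
        return -1;
    h->name[h->depth] = name;
    return h->depth++;
}

/* Position of a lock in the list, or -1 if it is not held. */
static int held_index(const struct held_list *h, const char *name)
{
    for (int i = 0; i < h->depth; i++)
        if (strcmp(h->name[i], name) == 0)
            return i;
    return -1;
}

/* Remove a lock and close the gap so later entries keep their order. */
static void held_release(struct held_list *h, const char *name)
{
    int i = held_index(h, name);

    if (i < 0)
        return;
    memmove(&h->name[i], &h->name[i + 1],
            (size_t)(h->depth - i - 1) * sizeof(h->name[0]));
    h->depth--;
}
```

In this model, a lock that is taken and released between two others leaves no entry behind, while a leaked one stays in the list at the position where it was acquired. That is the shape seen in the report: `sb_writers#5` at index 0 and the unreleased `rcu_read_lock` at index 1.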

---

## Backtrace source code

### 1. `shmem_undo_range` — crash site (`shmem.c:1150`)

[`mm/shmem.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/shmem.c?id=2a4c0c11c019#n1108)

```c
1104 /*
1105  * Remove range of pages and swap entries from page cache, and free them.
1106  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
1107  */
1108 static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend,
1109                                                          bool unfalloc)
1110 {
1111     struct address_space *mapping = inode->i_mapping;
1112     struct shmem_inode_info *info = SHMEM_I(inode);
1113     pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
1114     pgoff_t end = (lend + 1) >> PAGE_SHIFT;
1115     struct folio_batch fbatch;
1116     pgoff_t indices[FOLIO_BATCH_SIZE];
1117     struct folio *folio;
1118     bool same_folio;
1119     long nr_swaps_freed = 0;
1120     pgoff_t index;
1121     int i;
1122
1123     if (lend == -1)
1124         end = -1;   /* unsigned, so actually very big */
1125
1126     if (info->fallocend > start && info->fallocend <= end && !unfalloc)
1127         info->fallocend = start;
1128
1129     folio_batch_init(&fbatch);
1130     index = start;
1131     while (index < end && find_lock_entries(mapping, &index, end - 1,
1132             &fbatch, indices)) {
1133         for (i = 0; i < folio_batch_count(&fbatch); i++) {
1134             folio = fbatch.folios[i];
1135
1136             if (xa_is_value(folio)) {
1137                 if (unfalloc)
1138                     continue;
1139                 nr_swaps_freed += shmem_free_swap(mapping, indices[i],
1140                                                   end - 1, folio);
1141                 continue;
1142             }
1143
1144             if (!unfalloc || !folio_test_uptodate(folio))
1145                 truncate_inode_folio(mapping, folio);
1146             folio_unlock(folio);
1147         }
1148         folio_batch_remove_exceptionals(&fbatch);
1149         folio_batch_release(&fbatch);
1150         cond_resched();   // <-- BUG: RCU nest depth 1, expected 0
1151     }
```

### 2. `shmem_evict_inode` — callsite (`shmem.c:1407`)

[`mm/shmem.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/shmem.c?id=2a4c0c11c019#n1397)

```c
1397 static void shmem_evict_inode(struct inode *inode)
1398 {
1399     struct shmem_inode_info *info = SHMEM_I(inode);
1400     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1401     size_t freed = 0;
1402
1403     if (shmem_mapping(inode->i_mapping)) {
1404         shmem_unacct_size(info->flags, inode->i_size);
1405         inode->i_size = 0;
1406         mapping_set_exiting(inode->i_mapping);
1407         shmem_truncate_range(inode, 0, (loff_t)-1);   // <-- called here
...
```

### 3. `__might_resched` — assert site

[`kernel/sched/core.c` near line 9163](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?id=2a4c0c11c019)

`__might_resched()` is a debug assertion that checks whether a reschedule
point is legal in the current context. On a `PREEMPT(full)` kernel the RCU
nesting depth is tracked per task in `current->rcu_read_lock_nesting`. The
`cond_resched()` macro invokes `__might_resched()` before it attempts to
reschedule, so when that depth is non-zero the call triggers the BUG message
with:

```
in_atomic(): 0, irqs_disabled(): 0, non_block: 0
RCU nest depth: 1, expected: 0
```
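
The check can be pictured with a small userspace model. This is only a sketch of the idea, not the kernel implementation: `struct model_task`, `model_rcu_read_lock()`, `model_rcu_read_unlock()` and `model_might_resched()` are hypothetical stand-ins, and the model deliberately ignores the preempt-count and IRQ state that the real check also examines.

```c
#include <stdbool.h>

/* Toy stand-in for the per-task state; only the RCU depth is modelled. */
struct model_task {
    int rcu_read_lock_nesting; /* mirrors current->rcu_read_lock_nesting */
};

static void model_rcu_read_lock(struct model_task *t)
{
    t->rcu_read_lock_nesting++;
}

static void model_rcu_read_unlock(struct model_task *t)
{
    t->rcu_read_lock_nesting--;
}

/* Returns true when a reschedule point would be reported as invalid,
 * i.e. when the observed RCU depth differs from the expected depth. */
static bool model_might_resched(const struct model_task *t, int expected_depth)
{
    return t->rcu_read_lock_nesting != expected_depth;
}
```

A balanced lock/unlock pair leaves the depth at 0 and the check silent. A leaked `model_rcu_read_lock()` leaves it at 1, which corresponds to the `RCU nest depth: 1, expected: 0` line in the report.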

### 4. `filename_unlinkat` — caller (`namei.c:5572`)

[`fs/namei.c` at commit 2a4c0c11c019](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c?id=2a4c0c11c019#n5526)

```c
5526 int filename_unlinkat(int dfd, struct filename *name)
5527 {
...
5536 retry:
5537     error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
5538     if (error)
5539         return error;
...
5545     error = mnt_want_write(path.mnt);   // <-- lock #0 (sb_writers) acquired here
5546     if (error)
5547         goto exit_path_put;
5548 retry_deleg:
5549     dentry = start_dirop(path.dentry, &last, lookup_flags);
...
5563     inode = dentry->d_inode;
5564     ihold(inode);
5565     error = security_path_unlink(&path, dentry);
5566     if (error)
5567         goto exit_end_dirop;
5568     error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
5569                        dentry, &delegated_inode);   // <-- lock #1 (rcu_read_lock) acquired in here?
5570 exit_end_dirop:
5571     end_dirop(dentry);
5572     iput(inode);   // <-- last reference dropped here → evict → shmem_evict_inode → crash
...
5579     mnt_drop_write(path.mnt);   // <-- lock #0 (sb_writers) released
...
5587 }
```

---

## What-how-where analysis

### What

At `mm/shmem.c:1150`, `shmem_undo_range()` calls `cond_resched()` inside the
while loop that walks the page cache entries of the inode being evicted. The
`cond_resched()` macro invokes the debug check `__might_resched()` before it
attempts to reschedule, which is why `__might_resched.cold` appears directly
above `shmem_undo_range` in the backtrace. On this `PREEMPT(full)` kernel that
check compares `current->rcu_read_lock_nesting`, which counts active RCU
read-side critical sections, against the expected depth of 0. At this point
the nesting depth is **1**: one `rcu_read_lock()` has not been matched by an
`rcu_read_unlock()`. The mismatch produces the "sleeping function called from
invalid context" report:

```
RCU nest depth: 1, expected: 0
```

The kernel logs the BUG message, adds taint flag `W`, and continues
execution — the system does **not** panic.

### How

The call chain starts with the `unlink(2)` system call on a tmpfs/shmem
file:

```
sys_unlink → filename_unlinkat → iput → iput_final → evict
          → shmem_evict_inode → shmem_truncate_range → shmem_undo_range
          → cond_resched()   ← BUG fires here
```

The lockdep state at crash time shows two locks held (in acquisition order):

1. `sb_writers#5` — acquired at `filename_unlinkat+0x1ad` (line 5545,
   `mnt_want_write(path.mnt)`) — this is the normal filesystem write lock.
2. `rcu_read_lock` — acquired at `rcu_lock_acquire.constprop.0`
   (not present in the backtrace). Because lockdep records locks in
   acquisition order, this lock was taken **after** `mnt_want_write` and
   **before** `iput` — i.e., somewhere between `filename_unlinkat:5545`
   and `filename_unlinkat:5572`. The candidates in that window are:
   `start_dirop()`, `security_path_unlink()`, `vfs_unlink()` (and its
   callees: `fsnotify_unlink()`, `d_delete_notify()`).

The `rcu_read_lock()` that was acquired in this window was never matched
by an `rcu_read_unlock()`. It persists through the entire eviction path
and is still held when `rm` returns to userspace — confirmed by the
subsequent "lock held when returning to user space!" warning. A third
warning, "Voluntary context switch within RCU read-side critical section!",
fires from an APIC timer interrupt when the scheduler tries to switch away
from the process while the RCU lock is still held.

This is a **How — Negative How**: there is a missing `rcu_read_unlock()`
somewhere in the `filename_unlinkat` call chain (between lines 5545 and
5572). The exact acquisition site is not directly visible in the backtrace;
per the lockdep analysis rules the acquisition function
(`rcu_lock_acquire.constprop.0`) is not present and is not chased further.

### Where

The fix needs to identify and add the missing `rcu_read_unlock()` in
whichever function between `mnt_want_write` and `iput` acquires the RCU
lock without releasing it. The prime suspects (all called from
`filename_unlinkat` between lines 5549–5571 on the success path) are:

- `vfs_unlink()` (fs/namei.c:5472) — calls `fsnotify_unlink()` and
  `d_delete_notify()` on the success path; either could internally acquire
  an RCU lock and fail to release it on some code path.
- `start_dirop()` → `lookup_one_qstr_excl()` — does dentry lookup which
  may enter RCU walk mode.
- `security_path_unlink()` — LSM hooks are a less likely source but should
  not be excluded.

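Recommended debugging approach: add `WARN_ON(rcu_read_lock_held())` to `iput()`
(in addition to the existing `might_sleep()` check), and ideally after each
candidate call in `filename_unlinkat()`, to narrow down the acquisition site.
Note that `lockdep_assert_not_held()` expects a lock object with an embedded
`dep_map`, so it does not apply to `rcu_lock_map` directly;
`lock_is_held(&rcu_lock_map)` is the equivalent low-level query. Both forms are
only meaningful with lockdep enabled. Alternatively, review the acquisition
site that lockdep records for the held lock on a `CONFIG_PROVE_RCU` kernel.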
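
One way to narrow the acquisition site is to check the RCU depth after each candidate call and report the first step that leaves it changed. A userspace sketch of that idea follows. `first_leaking_step()`, `step_fn` and the `step_*` functions are hypothetical placeholders for the real candidates, and which step leaks here is arbitrary, chosen purely to exercise the check.

```c
#include <stddef.h>

/* Modelled RCU depth of the current task (hypothetical stand-in). */
static int model_depth;

typedef void (*step_fn)(void);

/* Placeholder steps; which one "leaks" is arbitrary in this model. */
static void step_balanced(void) { model_depth++; model_depth--; }
static void step_leaky(void)    { model_depth++; /* missing decrement */ }

/* Run the steps in order, checking the depth after each one.
 * Returns the index of the first step that changes the depth,
 * or -1 if every step is balanced. */
static int first_leaking_step(const step_fn *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int before = model_depth;

        steps[i]();
        if (model_depth != before)
            return (int)i;
    }
    return -1;
}
```

In the kernel the equivalent would be a warning placed after each candidate call, so that the first warning to fire identifies the call that returned with the lock still held.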

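Until the root cause is found, a diagnostic aid (not a fix) would be to add a
`WARN_ON(rcu_read_lock_held())` near the top of `shmem_evict_inode()` to
produce an earlier, more informative trace on future occurrences. If that
check stays silent while the `cond_resched()` report still fires, the lock is
being taken inside the eviction path itself rather than in
`filename_unlinkat()`.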

---

## Bug introduction

The exact acquisition site of the leaked `rcu_read_lock()` is not visible in
the backtrace (the acquiring function is not in the call trace). Because the
How analysis could not identify a specific root-cause line, the git blame
step cannot be targeted. As a result:

**Bug introduction commit not identified — the RCU lock acquisition is not
visible in the backtrace, preventing targeted git blame analysis of the
acquisition site.**

Upstream fix search (git log across `mm/shmem.c`, `fs/namei.c`,
`mm/filemap.c`, `fs/inode.c`) found no commit within the search budget that
appears to address this specific RCU lock imbalance in the unlink/eviction
path.

---

## Analysis, conclusions and recommendations

An `rcu_read_lock()` is acquired somewhere in the `filename_unlinkat()` call
chain (between the `mnt_want_write()` and `iput()` calls) without a matching
`rcu_read_unlock()`. This leaked lock is carried through the entire tmpfs inode
eviction path until `shmem_undo_range()` calls `cond_resched()`, which on a
`PREEMPT(full)` kernel checks the RCU nesting depth and triggers:

> BUG: sleeping function called from invalid context at mm/shmem.c:1150
> RCU nest depth: 1, expected: 0

The same leaked lock is still held on return to userspace ("lock held when
returning to user space!") and causes a "Voluntary context switch within RCU
read-side critical section!" warning from a timer interrupt. All three
messages are symptoms of a single leaked `rcu_read_lock()` call.

**Immediate action**: The mm and shmem maintainers (akpm, baolin.wang, hughd)
should investigate recent changes to `vfs_unlink()` and its fsnotify/dcache
callees for an `rcu_read_lock()` without a matching `rcu_read_unlock()`. A
`WARN_ON(rcu_read_lock_held())` check added to `iput()` (alongside the
existing `might_sleep()`) would produce a more actionable backtrace on
reproduction.