Linux kernel crash report

Syzbot report: BUG: sleeping function called from invalid context in shmem_undo_range, filed against the upstream kernel at HEAD commit 2a4c0c11c019.

Original report: 69eab803.a00a0220.17a17.004b.GAE@google.com

Key elements

Field	Value	Implication
CRASH_TYPE	`BUG: sleeping function called from invalid context`	`cond_resched()` called while RCU read-side critical section is active
UNAME	`syzkaller #0 PREEMPT(full)`	Full-preemption kernel; `rcu_read_lock()` does NOT disable preemption but increments `current->rcu_read_lock_nesting`
HEAD_COMMIT	`2a4c0c11c019`	Merge tag ‘s390-7.1-1’ — upstream 7.1-rc1 merge window
GIT_TREE	`upstream`
PROCESS	`rm`	Userspace `unlink(2)` syscall on a tmpfs/shmem file
PID	`5904`
HARDWARE	`QEMU Standard PC (Q35 + ICH9, 2009)`
BIOS	`1.16.3-debian-1.16.3-2 04/01/2014`
COMPILER	`gcc (Debian 14.2.0-19) 14.2.0`
TAINT	`W` (first BUG — Not tainted; subsequent WARNINGs — G W)	W = previous warning; G = only GPL modules
VMLINUX	`oops-workdir/vmlinux-2a4c0c11`	Downloaded from syzbot assets
SOURCEDIR	`oops-workdir/linux`	Local tree at `6596a02b20788` (HEAD, ~2 days after oops commit)
MSGID	`<69eab803.a00a0220.17a17.004b.GAE@google.com>`
MSGID_URL	69eab803.a00a0220.17a17.004b.GAE@google.com

Kernel modules

Module	Flags	Backtrace	Location	Flag Implication
(none — no modules linked in)

Backtrace

The report contains three related crash events from the same process (rm/5904). Only the first — the primary BUG — is analysed per fundamentals.md. The two subsequent WARNINGs are consequences of the same root cause and are noted separately.

Primary crash: BUG: sleeping function called from invalid context

Address	Function	Offset	Size	Context	Source location
	`__dump_stack` (inlined)			Task	`lib/dump_stack.c:94`
`0xffffffff8121ab04 (0xffffffff8121ab04 + 0x0)`	`dump_stack_lvl`	`0x100`	`0x190`	Task	`lib/dump_stack.c:120`
`0xffffffff8121acd0 (0xffffffff8121aae4 + 0x1ec)`	`__might_resched.cold`	`0x1ec`	`0x232`	Task	`kernel/sched/core.c:9163`
`0xffffffff8250f297 (0xffffffff8250ee50 + 0x447)`	`shmem_undo_range`	`0x447`	`0x1570`	Task	`mm/shmem.c:1150`
	`shmem_truncate_range` (inlined)			Task	`mm/shmem.c:1277`
`0xffffffff825108a3 (0xffffffff825104b0 + 0x3f3)`	`shmem_evict_inode`	`0x3f3`	`0xc40`	Task	`mm/shmem.c:1407`
`0xffffffff8290f9e2 (0xffffffff8290f620 + 0x3c2)`	`evict`	`0x3c2`	`0xad0`	Task	`fs/inode.c:841`
	`iput_final` (inlined)			Task	`fs/inode.c:1960`
`0xffffffff829113f5 (0xffffffff82910df0 + 0x605)`	`iput.part.0`	`0x605`	`0xf50`	Task	`fs/inode.c:2009`
`0xffffffff82911d85 (0xffffffff82911d50 + 0x35)`	`iput`	`0x35`	`0x40`	Task	`fs/inode.c:1975`
`0xffffffff828d9216 (0xffffffff828d8db0 + 0x466)`	`filename_unlinkat`	`0x466`	`0x730`	Task	`fs/namei.c:5572`
	`__do_sys_unlink` (inlined)			Task	`fs/namei.c:5603`
	`__se_sys_unlink` (inlined)			Task	`fs/namei.c:5600`
`0xffffffff828d97b6 (0xffffffff828d9770 + 0x46)`	`__x64_sys_unlink`	`0x46`	`0x70`	Task	`fs/namei.c:5600`
	`do_syscall_x64` (inlined)			Task	`arch/x86/entry/syscall_64.c:63`
`0xffffffff8b97718b (0xffffffff8b977080 + 0x10b)`	`do_syscall_64`	`0x10b`	`0xf80`	Task	`arch/x86/entry/syscall_64.c:94`
	`entry_SYSCALL_64_after_hwframe`	`0x77`	`0x7f`	Task	`arch/x86/entry/entry_64.S`
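
Each resolved row in the table can be sanity-checked with plain arithmetic: the frame address should equal the symbol start plus the offset, and the offset should not exceed the symbol size. The following is a minimal userspace sketch of that check. The names `bt_row` and `bt_row_consistent` are hypothetical and exist only for this illustration; the sample values are copied from the `shmem_undo_range` and `evict` rows.

```c
#include <stdbool.h>
#include <stdint.h>

/* One resolved backtrace row (hypothetical helper type for this report). */
struct bt_row {
    uint64_t sym_start; /* start address of the symbol */
    uint64_t offset;    /* offset of the frame inside the symbol */
    uint64_t size;      /* total size of the symbol */
    uint64_t address;   /* absolute address printed for the frame */
};

/* True when start + offset reproduces the address and the offset fits the symbol. */
static bool bt_row_consistent(const struct bt_row *row)
{
    return row->sym_start + row->offset == row->address &&
           row->offset <= row->size;
}

/* Values copied from the backtrace table. */
static const struct bt_row shmem_undo_range_row = {
    0xffffffff8250ee50ULL, 0x447, 0x1570, 0xffffffff8250f297ULL
};
static const struct bt_row evict_row = {
    0xffffffff8290f620ULL, 0x3c2, 0xad0, 0xffffffff8290f9e2ULL
};
```

Applied to the `dump_stack_lvl` row as printed (start `0xffffffff8121ab04`, `+ 0x0`, offset `0x100`), the same check fails, so the address column of that row deserves a second look against the raw report.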

Subsequent WARNING: lock held when returning to user space

After the BUG fires, the RCU read lock is still held when rm returns to userspace. This is a distinct lockdep WARNING recorded in the report.

Subsequent WARNING: Voluntary context switch within RCU read-side critical section

Triggered when an APIC timer interrupt fires and the scheduler detects a context switch while the RCU read lock is still held. RIP: rcu_note_context_switch+0x859/0x19c0 kernel/rcu/tree_plugin.h:332 (a ud1 trap; because this event is reported as a WARNING, the trap belongs to a WARN-style check at that line, not to BUG()).

Locks held

At the time of the primary BUG crash, two locks were held by rm/5904:

#	Lock name	Flags	Acquired in function	Offset	Source	In backtrace?
0	`sb_writers#5`	`{.+.+}-{0:0}`	`filename_unlinkat`	`0x1ad/0x730`	`fs/namei.c:5545`	Yes
1	`rcu_read_lock`	`{....}-{1:3}`	`rcu_lock_acquire.constprop.0`	`0x7/0x30`	`include/linux/rcupdate.h:300`	No

Lock #0 (sb_writers#5): acquired at filename_unlinkat+0x1ad, which resolves to fs/namei.c:5545, the mnt_want_write(path.mnt) call. This is the superblock freeze-protection lock, taken in read mode so that the filesystem cannot be frozen while the write is in progress. The acquiring function is in the backtrace.

Lock #1 (rcu_read_lock): the acquiring function rcu_lock_acquire.constprop.0 is not in the backtrace. Per the lockdep analysis rule, the acquisition site is recorded but not chased further. Crucially, this lock was acquired after lock #0 (the list is in acquisition order), meaning it was taken somewhere between mnt_want_write at line 5545 and the iput call at line 5572 in filename_unlinkat. Strictly, acquisition order only proves that the lock was taken after mnt_want_write and before the cond_resched() at shmem.c:1150; the analysis below assumes the narrower window. The lock is never released.
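
The Flags column in the table above uses lockdep's usage notation: four characters covering hardirq, hardirq-read, softirq and softirq-read, followed by the wait types in braces. The sketch below decodes only the four usage characters, following the wording in Documentation/locking/lockdep-design.rst. The function names are hypothetical, and the reading of `.` as "nothing recorded for this slot" is a simplification of the documented text.

```c
#include <stddef.h>

/* Meaning of one lockdep usage character (hypothetical helper). */
static const char *lockdep_usage_char(char c)
{
    switch (c) {
    case '.': return "no irq-context or irqs-enabled acquisition recorded";
    case '-': return "acquired in irq context";
    case '+': return "acquired with irqs enabled";
    case '?': return "acquired in irq context with irqs enabled";
    default:  return NULL; /* not a usage character */
    }
}

/* Name of the slot at a given position in the four-character string. */
static const char *lockdep_usage_slot(size_t pos)
{
    static const char *const slots[] = {
        "hardirq", "hardirq-read", "softirq", "softirq-read"
    };
    return pos < sizeof(slots) / sizeof(slots[0]) ? slots[pos] : NULL;
}
```

Read this way, `{.+.+}` for sb_writers#5 shows `+` only in the two read slots, which matches a lock taken in read mode with interrupts enabled, and `{....}` for rcu_read_lock carries no usage information at all.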

Backtrace source code

1. `shmem_undo_range` — crash site (`shmem.c:1150`)

mm/shmem.c at commit 2a4c0c11c019

1104 /*
1105  * Remove range of pages and swap entries from page cache, and free them.
1106  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
1107  */
1108 static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend,
1109                                                          bool unfalloc)
1110 {
1111     struct address_space *mapping = inode->i_mapping;
1112     struct shmem_inode_info *info = SHMEM_I(inode);
1113     pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
1114     pgoff_t end = (lend + 1) >> PAGE_SHIFT;
1115     struct folio_batch fbatch;
1116     pgoff_t indices[FOLIO_BATCH_SIZE];
1117     struct folio *folio;
1118     bool same_folio;
1119     long nr_swaps_freed = 0;
1120     pgoff_t index;
1121     int i;
1122
1123     if (lend == -1)
1124         end = -1;   /* unsigned, so actually very big */
1125
1126     if (info->fallocend > start && info->fallocend <= end && !unfalloc)
1127         info->fallocend = start;
1128
1129     folio_batch_init(&fbatch);
1130     index = start;
1131     while (index < end && find_lock_entries(mapping, &index, end - 1,
1132             &fbatch, indices)) {
1133         for (i = 0; i < folio_batch_count(&fbatch); i++) {
1134             folio = fbatch.folios[i];
1135
1136             if (xa_is_value(folio)) {
1137                 if (unfalloc)
1138                     continue;
1139                 nr_swaps_freed += shmem_free_swap(mapping, indices[i],
1140                                                   end - 1, folio);
1141                 continue;
1142             }
1143
1144             if (!unfalloc || !folio_test_uptodate(folio))
1145                 truncate_inode_folio(mapping, folio);
1146             folio_unlock(folio);
1147         }
1148         folio_batch_remove_exceptionals(&fbatch);
1149         folio_batch_release(&fbatch);
1150         cond_resched();   // <-- BUG: RCU nest depth 1, expected 0
1151     }

2. `shmem_evict_inode` — callsite (`shmem.c:1407`)

mm/shmem.c at commit 2a4c0c11c019

1397 static void shmem_evict_inode(struct inode *inode)
1398 {
1399     struct shmem_inode_info *info = SHMEM_I(inode);
1400     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1401     size_t freed = 0;
1402
1403     if (shmem_mapping(inode->i_mapping)) {
1404         shmem_unacct_size(info->flags, inode->i_size);
1405         inode->i_size = 0;
1406         mapping_set_exiting(inode->i_mapping);
1407         shmem_truncate_range(inode, 0, (loff_t)-1);   // <-- called here
...

3. `__might_resched` — assert site

kernel/sched/core.c near line 9163

__might_resched() is a debug assertion (compiled in with CONFIG_DEBUG_ATOMIC_SLEEP) that checks whether a reschedule point is legal in the current context. On a PREEMPT(full) kernel the RCU nesting depth is tracked separately from the preempt count, in current->rcu_read_lock_nesting. When that depth is non-zero, calling cond_resched() (which internally calls __might_resched()) triggers the BUG message with:

in_atomic(): 0, irqs_disabled(): 0, non_block: 0
RCU nest depth: 1, expected: 0
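
The decision behind those two log lines can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel implementation: it omits the system-state and idle-task exemptions, and the names `ctx_state` and `resched_point_ok` are hypothetical.

```c
#include <stdbool.h>

/* Snapshot of the per-task state the check inspects (model only). */
struct ctx_state {
    int  preempt_depth;  /* models the preempt count; 0 means preemptible */
    bool irqs_disabled;  /* models irqs_disabled() */
    int  non_block;      /* models the non_block value in the log */
    int  rcu_depth;      /* models current->rcu_read_lock_nesting */
};

/*
 * Returns true when a reschedule point would be accepted without a splat.
 * expected_preempt and expected_rcu correspond to the "expected:" values
 * printed in the log.
 */
static bool resched_point_ok(const struct ctx_state *s,
                             int expected_preempt, int expected_rcu)
{
    return s->preempt_depth == expected_preempt &&
           s->rcu_depth == expected_rcu &&
           !s->irqs_disabled &&
           s->non_block == 0;
}
```

With the values from the report (in_atomic 0, irqs_disabled 0, non_block 0, RCU depth 1) and expected depths of 0, the model rejects the reschedule point purely because of the RCU depth.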

4. `filename_unlinkat` — caller (`namei.c:5572`)

fs/namei.c at commit 2a4c0c11c019

5526 int filename_unlinkat(int dfd, struct filename *name)
5527 {
...
5536 retry:
5537     error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
5538     if (error)
5539         return error;
...
5545     error = mnt_want_write(path.mnt);   // <-- lock #0 (sb_writers) acquired here
5546     if (error)
5547         goto exit_path_put;
5548 retry_deleg:
5549     dentry = start_dirop(path.dentry, &last, lookup_flags);
...
5563     inode = dentry->d_inode;
5564     ihold(inode);
5565     error = security_path_unlink(&path, dentry);
5566     if (error)
5567         goto exit_end_dirop;
5568     error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
5569                        dentry, &delegated_inode);   // <-- lock #1 (rcu_read_lock) acquired in here?
5570 exit_end_dirop:
5571     end_dirop(dentry);
5572     iput(inode);   // <-- last reference dropped here → evict → shmem_evict_inode → crash
...
5579     mnt_drop_write(path.mnt);   // <-- lock #0 (sb_writers) released
...
5587 }

What-how-where analysis

What

At mm/shmem.c:1150, shmem_undo_range() calls cond_resched() inside a while loop that iterates over the page cache entries of the inode being evicted. cond_resched() runs the __might_resched() debug check before it attempts to reschedule; this is the frame visible in the backtrace. On a PREEMPT(full) kernel that check compares current->rcu_read_lock_nesting, which counts active RCU read-side critical sections, against the expected depth of 0. At this point the nesting depth is 1 (one active rcu_read_lock() that was never matched by an rcu_read_unlock()). This is the “sleeping function called from invalid context” check firing:

RCU nest depth: 1, expected: 0

The kernel logs the BUG message, adds taint flag W, and continues execution — the system does not panic.

How

The call chain starts with the unlink(2) system call on a tmpfs/shmem file:

sys_unlink → filename_unlinkat → iput → iput_final → evict
          → shmem_evict_inode → shmem_truncate_range → shmem_undo_range
          → cond_resched()   ← BUG fires here

The lockdep state at crash time shows two locks held (in acquisition order):

sb_writers#5 — acquired at filename_unlinkat+0x1ad (line 5545, mnt_want_write(path.mnt)) — this is the normal filesystem write lock.
rcu_read_lock — acquired at rcu_lock_acquire.constprop.0 (not present in the backtrace). Because lockdep records locks in acquisition order, this lock was taken after mnt_want_write and before iput — i.e., somewhere between filename_unlinkat:5545 and filename_unlinkat:5572. The candidates in that window are: start_dirop(), security_path_unlink(), vfs_unlink() (and its callees: fsnotify_unlink(), d_delete_notify()).

The rcu_read_lock() that was acquired in this window was never matched by an rcu_read_unlock(). It persists through the entire eviction path and is still held when rm returns to userspace — confirmed by the subsequent “lock held when returning to user space!” warning. A third warning, “Voluntary context switch within RCU read-side critical section!”, fires from an APIC timer interrupt when the scheduler tries to switch away from the process while the RCU lock is still held.

This is a Negative How: an rcu_read_unlock() is missing somewhere in the filename_unlinkat call chain (between lines 5545 and 5572). The acquisition site is not visible in the backtrace. The function lockdep recorded, rcu_lock_acquire.constprop.0, is only the inline annotation helper behind rcu_read_lock(), so it does not identify the real caller, and per the lockdep analysis rules it is not chased further.
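
The general shape of such a leak is an exit path that skips the unlock. The sketch below is a userspace model with a plain counter standing in for current->rcu_read_lock_nesting. Every name in it is hypothetical, and it makes no claim about which kernel function contains the real defect.

```c
#include <stdbool.h>

static int rcu_depth; /* models current->rcu_read_lock_nesting */

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/* Hypothetical callee: one exit path returns without unlocking. */
static int hypothetical_lookup(bool take_early_exit)
{
    model_rcu_read_lock();
    if (take_early_exit)
        return -1;               /* leak: no matching unlock on this path */
    model_rcu_read_unlock();
    return 0;
}

/* Depth left behind by one call, starting from a balanced state. */
static int depth_after_call(bool take_early_exit)
{
    rcu_depth = 0;
    hypothetical_lookup(take_early_exit);
    return rcu_depth;
}
```

On the normal path the depth returns to 0; on the early-exit path it stays at 1, which is the value reported at the cond_resched() call.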

Where

The fix needs to identify and add the missing rcu_read_unlock() in whichever function between mnt_want_write and iput acquires the RCU lock without releasing it. The prime suspects (all called from filename_unlinkat between lines 5549–5571 on the success path) are:

vfs_unlink() (fs/namei.c:5472) — calls fsnotify_unlink() and d_delete_notify() on the success path; either could internally acquire an RCU lock and fail to release it on some code path.
start_dirop() → lookup_one_qstr_excl() — does dentry lookup which may enter RCU walk mode.
security_path_unlink() — LSM hooks are a less likely source but should not be excluded.

Recommended debugging approach: add a WARN_ON_ONCE(lock_is_held(&rcu_lock_map)) or WARN_ON(rcu_read_lock_held()) in iput() (in addition to the existing might_sleep() check) to narrow down the acquisition site. Alternatively, build with CONFIG_PROVE_RCU and check the RCU nesting depth before and after each candidate call to find the first one that returns with the lock still held.
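
The narrowing step amounts to bracketing each candidate call with a depth check and reporting the first call that leaves the depth changed. The following userspace sketch models that idea. The step names and functions are hypothetical placeholders; in the kernel the depth would be read with rcu_preempt_depth() rather than from a global counter.

```c
#include <stddef.h>

static int rcu_depth; /* models current->rcu_read_lock_nesting */

typedef void (*step_fn)(void);

struct step {
    const char *name; /* label used in the diagnostic */
    step_fn fn;       /* the call being bracketed */
};

/* Hypothetical steps: one balanced, one that leaks a read lock. */
static void balanced_step(void) { rcu_depth++; rcu_depth--; }
static void leaky_step(void)    { rcu_depth++; }

/* Run the steps in order; return the first one that changes the depth. */
static const char *first_unbalanced(const struct step *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int before = rcu_depth;

        steps[i].fn();
        if (rcu_depth != before)
            return steps[i].name;
    }
    return NULL; /* every step was balanced */
}
```

Because the check compares the depth before and after each call, it also works when the caller is already inside an outer read-side section.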

Until the root cause is found, a diagnostic aid (not a fix) would be to add a WARN_ON(rcu_read_lock_held()) near the top of shmem_evict_inode(), which would produce an earlier, more informative trace on future occurrences.

Bug introduction

The exact acquisition site of the leaked rcu_read_lock() is not visible in the backtrace (the acquiring function is not in the call trace). Because the How analysis could not identify a specific root-cause line, the git blame step cannot be targeted. As a result:

Bug introduction commit not identified — the RCU lock acquisition is not visible in the backtrace, preventing targeted git blame analysis of the acquisition site.

Upstream fix search (git log across mm/shmem.c, fs/namei.c, mm/filemap.c, fs/inode.c) found no commit within the search budget that appears to address this specific RCU lock imbalance in the unlink/eviction path.

Analysis, conclusions and recommendations

An rcu_read_lock() is acquired somewhere in the filename_unlinkat() call chain (between the mnt_want_write() and iput() calls) without a matching rcu_read_unlock(). This leaked lock is carried through the entire tmpfs inode eviction path until shmem_undo_range() calls cond_resched(), which on a PREEMPT(full) kernel checks the RCU nesting depth and triggers:

BUG: sleeping function called from invalid context at mm/shmem.c:1150
RCU nest depth: 1, expected: 0

The same leaked lock is still held on return to userspace (“lock held when returning to user space!”) and causes a “Voluntary context switch within RCU read-side critical section!” warning from a timer interrupt. All three messages are symptoms of a single leaked rcu_read_lock() call.

Immediate action: the tmpfs/shmem maintainers (akpm, baolin.wang, hughd) should investigate recent changes to vfs_unlink() and its fsnotify/dcache callees for an rcu_read_lock() without a matching rcu_read_unlock(). A WARN_ON(rcu_read_lock_held()) check added to iput() (alongside the existing might_sleep()) would produce a more actionable backtrace on reproduction.