Linux kernel crash report

Syzbot report: BUG: sleeping function called from invalid context in shmem_undo_range, filed against the upstream kernel at HEAD commit 2a4c0c11c019.

Original report: 69eab803.a00a0220.17a17.004b.GAE@google.com

Key elements

Field | Value | Implication
CRASH_TYPE | BUG: sleeping function called from invalid context | cond_resched() called while RCU read-side critical section is active
UNAME | syzkaller #0 PREEMPT(full) | Full-preemption kernel; rcu_read_lock() does NOT disable preemption but increments current->rcu_read_lock_nesting
HEAD_COMMIT | 2a4c0c11c019 | Merge tag ‘s390-7.1-1’ (upstream 7.1-rc1 merge window)
GIT_TREE | upstream | -
PROCESS | rm | Userspace unlink(2) syscall on a tmpfs/shmem file
PID | 5904 | -
HARDWARE | QEMU Standard PC (Q35 + ICH9, 2009) | -
BIOS | 1.16.3-debian-1.16.3-2 04/01/2014 | -
COMPILER | gcc (Debian 14.2.0-19) 14.2.0 | -
TAINT | W (first BUG: Not tainted; subsequent WARNINGs: G W) | W = previous warning; G = only GPL modules
VMLINUX | oops-workdir/vmlinux-2a4c0c11 | Downloaded from syzbot assets
SOURCEDIR | oops-workdir/linux | Local tree at 6596a02b20788 (HEAD, ~2 days after oops commit)
MSGID | <69eab803.a00a0220.17a17.004b.GAE@google.com> | -
MSGID_URL | 69eab803.a00a0220.17a17.004b.GAE@google.com | -

Kernel modules

Module Flags Backtrace Location Flag Implication
(none — no modules linked in)

Backtrace

The report contains three related crash events from the same process (rm/5904). Only the first — the primary BUG — is analysed per fundamentals.md. The two subsequent WARNINGs are consequences of the same root cause and are noted separately.

Primary crash: BUG: sleeping function called from invalid context

Address | Function | Offset | Size | Context | Module | Source location
- | __dump_stack (inlined) | - | - | Task | - | lib/dump_stack.c:94
0xffffffff8121ab04 (0xffffffff8121ab04 + 0x0) | dump_stack_lvl | 0x100 | 0x190 | Task | - | lib/dump_stack.c:120
0xffffffff8121acd0 (0xffffffff8121aae4 + 0x1ec) | __might_resched.cold | 0x1ec | 0x232 | Task | - | kernel/sched/core.c:9163
0xffffffff8250f297 (0xffffffff8250ee50 + 0x447) | shmem_undo_range | 0x447 | 0x1570 | Task | - | mm/shmem.c:1150
- | shmem_truncate_range (inlined) | - | - | Task | - | mm/shmem.c:1277
0xffffffff825108a3 (0xffffffff825104b0 + 0x3f3) | shmem_evict_inode | 0x3f3 | 0xc40 | Task | - | mm/shmem.c:1407
0xffffffff8290f9e2 (0xffffffff8290f620 + 0x3c2) | evict | 0x3c2 | 0xad0 | Task | - | fs/inode.c:841
- | iput_final (inlined) | - | - | Task | - | fs/inode.c:1960
0xffffffff829113f5 (0xffffffff82910df0 + 0x605) | iput.part.0 | 0x605 | 0xf50 | Task | - | fs/inode.c:2009
0xffffffff82911d85 (0xffffffff82911d50 + 0x35) | iput | 0x35 | 0x40 | Task | - | fs/inode.c:1975
0xffffffff828d9216 (0xffffffff828d8db0 + 0x466) | filename_unlinkat | 0x466 | 0x730 | Task | - | fs/namei.c:5572
- | __do_sys_unlink (inlined) | - | - | Task | - | fs/namei.c:5603
- | __se_sys_unlink (inlined) | - | - | Task | - | fs/namei.c:5600
0xffffffff828d97b6 (0xffffffff828d9770 + 0x46) | __x64_sys_unlink | 0x46 | 0x70 | Task | - | fs/namei.c:5600
- | do_syscall_x64 (inlined) | - | - | Task | - | arch/x86/entry/syscall_64.c:63
0xffffffff8b97718b (0xffffffff8b977080 + 0x10b) | do_syscall_64 | 0x10b | 0xf80 | Task | - | arch/x86/entry/syscall_64.c:94
- | entry_SYSCALL_64_after_hwframe | 0x77 | 0x7f | Task | - | arch/x86/entry/entry_64.S
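
Each address in the backtrace table is expected to equal the symbol's start address plus the reported offset, with the offset smaller than the symbol size. The following is a minimal C sketch of that sanity check. The names bt_frame and frame_is_consistent are hypothetical and exist only for illustration; the numeric values are copied from the shmem_undo_range row.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record for one backtrace row (illustrative only). */
struct bt_frame {
    uint64_t addr;    /* address reported in the oops */
    uint64_t base;    /* start address of the symbol */
    uint64_t offset;  /* offset part of func+0xOFF/0xSIZE */
    uint64_t size;    /* size part of func+0xOFF/0xSIZE */
};

/* True if addr == base + offset and the offset lies inside the symbol. */
static bool frame_is_consistent(const struct bt_frame *f)
{
    return f->addr == f->base + f->offset && f->offset < f->size;
}

/* shmem_undo_range+0x447/0x1570, values taken from the table. */
static const struct bt_frame shmem_undo_range_frame = {
    .addr   = 0xffffffff8250f297ULL,
    .base   = 0xffffffff8250ee50ULL,
    .offset = 0x447,
    .size   = 0x1570,
};
```

As printed, the dump_stack_lvl row would not pass this check (its address is shown as base + 0x0 against a listed offset of 0x100), so that row's address column should be read with caution.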

Subsequent WARNING: lock held when returning to user space

After the BUG fires, the RCU read lock is still held when rm returns to userspace. This is a distinct lockdep WARNING recorded in the report.

Subsequent WARNING: Voluntary context switch within RCU read-side critical section

Triggered by an APIC timer interrupt firing and the scheduler detecting a context switch while the RCU read lock is still held. RIP: rcu_note_context_switch+0x859/0x19c0 kernel/rcu/tree_plugin.h:332 (the ud1 is the trap instruction behind the WARN_ONCE() at that line, not a BUG()).


Locks held

At the time of the primary BUG crash, two locks were held by rm/5904:

# | Lock name | Flags | Acquired in function | Offset | Source | In backtrace?
0 | sb_writers#5 | {.+.+}-{0:0} | filename_unlinkat | 0x1ad/0x730 | fs/namei.c:5545 | Yes
1 | rcu_read_lock | {....}-{1:3} | rcu_lock_acquire.constprop.0 | 0x7/0x30 | include/linux/rcupdate.h:300 | No

Lock #0 (sb_writers#5): acquired at filename_unlinkat+0x1ad, which resolves to fs/namei.c:5545, the mnt_want_write(path.mnt) call. This is the superblock's freeze-protection lock (sb_writers), which every writer to the mounted filesystem takes for the duration of the write. Its acquiring function is in the backtrace.

Lock #1 (rcu_read_lock): the acquiring function rcu_lock_acquire.constprop.0 is not in the backtrace. Per the lockdep analysis rule, the acquisition site is recorded but not chased further. Crucially, this lock was acquired after lock #0 (the list is in acquisition order), so it was taken after mnt_want_write at line 5545. This analysis places the acquisition before the iput call at line 5572 in filename_unlinkat, although the held-lock order by itself only bounds it by the crash site at mm/shmem.c:1150. The matching rcu_read_unlock() never runs.


Backtrace source code

1. shmem_undo_range — crash site (shmem.c:1150)

mm/shmem.c at commit 2a4c0c11c019

1104 /*
1105  * Remove range of pages and swap entries from page cache, and free them.
1106  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
1107  */
1108 static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend,
1109                                                          bool unfalloc)
1110 {
1111     struct address_space *mapping = inode->i_mapping;
1112     struct shmem_inode_info *info = SHMEM_I(inode);
1113     pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
1114     pgoff_t end = (lend + 1) >> PAGE_SHIFT;
1115     struct folio_batch fbatch;
1116     pgoff_t indices[FOLIO_BATCH_SIZE];
1117     struct folio *folio;
1118     bool same_folio;
1119     long nr_swaps_freed = 0;
1120     pgoff_t index;
1121     int i;
1122
1123     if (lend == -1)
1124         end = -1;   /* unsigned, so actually very big */
1125
1126     if (info->fallocend > start && info->fallocend <= end && !unfalloc)
1127         info->fallocend = start;
1128
1129     folio_batch_init(&fbatch);
1130     index = start;
1131     while (index < end && find_lock_entries(mapping, &index, end - 1,
1132             &fbatch, indices)) {
1133         for (i = 0; i < folio_batch_count(&fbatch); i++) {
1134             folio = fbatch.folios[i];
1135
1136             if (xa_is_value(folio)) {
1137                 if (unfalloc)
1138                     continue;
1139                 nr_swaps_freed += shmem_free_swap(mapping, indices[i],
1140                                                   end - 1, folio);
1141                 continue;
1142             }
1143
1144             if (!unfalloc || !folio_test_uptodate(folio))
1145                 truncate_inode_folio(mapping, folio);
1146             folio_unlock(folio);
1147         }
1148         folio_batch_remove_exceptionals(&fbatch);
1149         folio_batch_release(&fbatch);
1150         cond_resched();   // <-- BUG: RCU nest depth 1, expected 0
1151     }

2. shmem_evict_inode — callsite (shmem.c:1407)

mm/shmem.c at commit 2a4c0c11c019

1397 static void shmem_evict_inode(struct inode *inode)
1398 {
1399     struct shmem_inode_info *info = SHMEM_I(inode);
1400     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1401     size_t freed = 0;
1402
1403     if (shmem_mapping(inode->i_mapping)) {
1404         shmem_unacct_size(info->flags, inode->i_size);
1405         inode->i_size = 0;
1406         mapping_set_exiting(inode->i_mapping);
1407         shmem_truncate_range(inode, 0, (loff_t)-1);   // <-- called here
...

3. __might_resched — assert site

kernel/sched/core.c near line 9163

__might_resched() is an assert/precondition function that checks whether a reschedule point is legal given the current lock state. On a PREEMPT(full) kernel the RCU nesting depth is separately tracked in current->rcu_read_lock_nesting. When that depth is non-zero, calling cond_resched() (which internally calls __might_resched()) triggers the BUG message with:

in_atomic(): 0, irqs_disabled(): 0, non_block: 0
RCU nest depth: 1, expected: 0
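
The condition behind that message can be illustrated with a small userspace model. This is a simplified sketch, not the kernel implementation: the model_ names are hypothetical, the struct only mirrors the two per-task counters involved, and the real check also considers irqs_disabled() and non_block_count, which are omitted here.

```c
#include <stdbool.h>

/* Userspace model of the per-task counters the check inspects.
 * Field names mirror the kernel's, but this is not kernel code. */
struct model_task {
    int preempt_count;          /* non-zero models atomic context */
    int rcu_read_lock_nesting;  /* depth of nested rcu_read_lock() calls */
};

static void model_rcu_read_lock(struct model_task *t)
{
    t->rcu_read_lock_nesting++;
}

static void model_rcu_read_unlock(struct model_task *t)
{
    t->rcu_read_lock_nesting--;
}

/* True when a cond_resched()-style debug check would report
 * "sleeping function called from invalid context". */
static bool model_might_resched_violation(const struct model_task *t)
{
    bool in_atomic = t->preempt_count != 0;
    bool in_rcu = t->rcu_read_lock_nesting != 0;

    return in_atomic || in_rcu;
}
```

In this model, a task with preempt_count 0 and one unmatched model_rcu_read_lock() reproduces the state in the report: in_atomic() is 0, yet the check fails because the RCU nest depth is 1 where 0 is expected.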

4. filename_unlinkat — caller (namei.c:5572)

fs/namei.c at commit 2a4c0c11c019

5526 int filename_unlinkat(int dfd, struct filename *name)
5527 {
...
5536 retry:
5537     error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
5538     if (error)
5539         return error;
...
5545     error = mnt_want_write(path.mnt);   // <-- lock #0 (sb_writers) acquired here
5546     if (error)
5547         goto exit_path_put;
5548 retry_deleg:
5549     dentry = start_dirop(path.dentry, &last, lookup_flags);
...
5563     inode = dentry->d_inode;
5564     ihold(inode);
5565     error = security_path_unlink(&path, dentry);
5566     if (error)
5567         goto exit_end_dirop;
5568     error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
5569                        dentry, &delegated_inode);   // <-- lock #1 (rcu_read_lock) acquired in here?
5570 exit_end_dirop:
5571     end_dirop(dentry);
5572     iput(inode);   // <-- drops the last reference → evict → shmem_evict_inode → crash
...
5579     mnt_drop_write(path.mnt);   // <-- lock #0 (sb_writers) released
...
5587 }

What-how-where analysis

What

At mm/shmem.c:1150, shmem_undo_range() calls cond_resched() inside the while loop that iterates over the page cache entries of the inode being evicted. cond_resched() runs the __might_resched() debug check before it attempts to reschedule. On a PREEMPT(full) kernel, that check reads current->rcu_read_lock_nesting, which tracks active RCU read-side critical sections. At this point the nesting depth is 1 (one rcu_read_lock() that was never matched by an rcu_read_unlock()), so the “sleeping function called from invalid context” check fires:

RCU nest depth: 1, expected: 0

The kernel logs the BUG message, adds taint flag W, and continues execution — the system does not panic.

How

The call chain starts with the unlink(2) system call on a tmpfs/shmem file:

sys_unlink → filename_unlinkat → iput → iput_final → evict
          → shmem_evict_inode → shmem_truncate_range → shmem_undo_range
          → cond_resched()   ← BUG fires here

The lockdep state at crash time shows two locks held (in acquisition order):

  1. sb_writers#5 — acquired at filename_unlinkat+0x1ad (line 5545, mnt_want_write(path.mnt)) — this is the normal filesystem write lock.
  2. rcu_read_lock — acquired at rcu_lock_acquire.constprop.0 (not present in the backtrace). Because lockdep records locks in acquisition order, this lock was taken after mnt_want_write and before iput — i.e., somewhere between filename_unlinkat:5545 and filename_unlinkat:5572. The candidates in that window are: start_dirop(), security_path_unlink(), vfs_unlink() (and its callees: fsnotify_unlink(), d_delete_notify()).

The rcu_read_lock() that was acquired in this window was never matched by an rcu_read_unlock(). It persists through the entire eviction path and is still held when rm returns to userspace — confirmed by the subsequent “lock held when returning to user space!” warning. A third warning, “Voluntary context switch within RCU read-side critical section!”, fires from an APIC timer interrupt when the scheduler tries to switch away from the process while the RCU lock is still held.

This is a negative How: something that should have happened did not. There is a missing rcu_read_unlock() somewhere in the filename_unlinkat call chain (between lines 5545 and 5572). The exact acquisition site is not directly visible in the backtrace; the recorded acquisition function (rcu_lock_acquire.constprop.0) is not present there and, per the lockdep analysis rules, is not chased further.

Where

The fix needs to identify and add the missing rcu_read_unlock() in whichever function between mnt_want_write and iput acquires the RCU lock without releasing it. The prime suspects (all called from filename_unlinkat between lines 5549–5571 on the success path) are start_dirop(), security_path_unlink(), and vfs_unlink() together with its callees fsnotify_unlink() and d_delete_notify().

Recommended debugging approach: on a lockdep-enabled kernel, add a WARN_ON(rcu_read_lock_held()) in iput() (in addition to the existing might_sleep() check) to narrow down the acquisition site. Note that rcu_read_lock_held() is only meaningful with CONFIG_DEBUG_LOCK_ALLOC; without it the function always returns 1. Alternatively, enable CONFIG_PROVE_RCU and review the acquisition site that lockdep records.

Until the root cause is found, a diagnostic aid (not a fix) would be to add a WARN_ON(rcu_read_lock_held()) near the top of shmem_evict_inode() to produce an earlier, more informative trace on future occurrences.
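
The narrowing-down step can be sketched as a userspace model: record the RCU nesting depth before and after each suspect call and flag the first call that returns with a different depth. Everything below is hypothetical scaffolding. model_rcu_depth stands in for the task's nesting counter (in a kernel with preemptible RCU the equivalent reading would come from rcu_preempt_depth()), and balanced_step and leaky_step stand in for the suspect functions.

```c
#include <stddef.h>

/* Stand-in for the task's RCU nesting depth (hypothetical model). */
static int model_rcu_depth;

typedef void (*step_fn)(void);

/* Hypothetical stand-ins for calls in the suspect window. */
static void balanced_step(void)
{
    model_rcu_depth++;   /* models rcu_read_lock() */
    model_rcu_depth--;   /* models rcu_read_unlock() */
}

static void leaky_step(void)
{
    model_rcu_depth++;   /* models rcu_read_lock() with no unlock */
}

/* Run each step in order and return the index of the first one that
 * returns with a different depth than it was entered with, or -1 if
 * every step is balanced. */
static int first_unbalanced_step(const step_fn *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int before = model_rcu_depth;

        steps[i]();
        if (model_rcu_depth != before)
            return (int)i;
    }
    return -1;
}
```

Applied to the real code, the same before/after comparison around each call in the suspect window would point at the function that leaks the lock, rather than at the later point where the leak is first noticed.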


Bug introduction

The exact acquisition site of the leaked rcu_read_lock() is not visible in the backtrace, because the acquiring function is not in the call trace. Since the How analysis could not identify a specific root-cause line, git blame cannot be targeted at the acquisition site, and the bug introduction commit is not identified.

Upstream fix search (git log across mm/shmem.c, fs/namei.c, mm/filemap.c, fs/inode.c) found no commit within the search budget that appears to address this specific RCU lock imbalance in the unlink/eviction path.


Analysis, conclusions and recommendations

An rcu_read_lock() is acquired somewhere in the filename_unlinkat() call chain (between the mnt_want_write() and iput() calls) without a matching rcu_read_unlock(). This leaked lock is carried through the entire tmpfs inode eviction path until shmem_undo_range() calls cond_resched(), which on a PREEMPT(full) kernel checks the RCU nesting depth and triggers:

BUG: sleeping function called from invalid context at mm/shmem.c:1150
RCU nest depth: 1, expected: 0

The same leaked lock is still held on return to userspace (“lock held when returning to user space!”) and causes a “Voluntary context switch within RCU read-side critical section!” warning from a timer interrupt. All three messages are symptoms of a single leaked rcu_read_lock() call.

Immediate action: The shmem maintainers (akpm, baolin.wang, hughd) should investigate recent changes to vfs_unlink() and its fsnotify/dcache callees for an rcu_read_lock() without a matching rcu_read_unlock(). On a lockdep-enabled kernel, a WARN_ON(rcu_read_lock_held()) check added to iput() (alongside the existing might_sleep()) would produce a more actionable backtrace on reproduction.