Syzbot report: BUG: sleeping function called from invalid context in
shmem_undo_range, filed against the upstream kernel at HEAD
commit 2a4c0c11c019.
Original report: 69eab803.a00a0220.17a17.004b.GAE@google.com

| Field | Value | Implication |
|---|---|---|
| CRASH_TYPE | BUG: sleeping function called from invalid context | cond_resched() called while an RCU read-side critical section is active |
| UNAME | syzkaller #0 PREEMPT(full) | Full-preemption kernel; rcu_read_lock() does NOT disable preemption but increments current->rcu_read_lock_nesting |
| HEAD_COMMIT | 2a4c0c11c019 | Merge tag ‘s390-7.1-1’; upstream 7.1-rc1 merge window |
| GIT_TREE | upstream | |
| PROCESS | rm | Userspace unlink(2) syscall on a tmpfs/shmem file |
| PID | 5904 | |
| HARDWARE | QEMU Standard PC (Q35 + ICH9, 2009) | |
| BIOS | 1.16.3-debian-1.16.3-2 04/01/2014 | |
| COMPILER | gcc (Debian 14.2.0-19) 14.2.0 | |
| TAINT | W (first BUG: Not tainted; subsequent WARNINGs: G W) | W = previous warning; G = only GPL modules |
| VMLINUX | oops-workdir/vmlinux-2a4c0c11 | Downloaded from syzbot assets |
| SOURCEDIR | oops-workdir/linux | Local tree at 6596a02b20788 (HEAD, ~2 days after oops commit) |
| MSGID | <69eab803.a00a0220.17a17.004b.GAE@google.com> | |
| MSGID_URL | 69eab803.a00a0220.17a17.004b.GAE@google.com | |

| Module | Flags | Backtrace | Location | Flag Implication |
|---|---|---|---|---|
| (none; no modules linked in) | | | | |

The report contains three related crash events from the same process (rm/5904). Only the first — the primary BUG — is analysed per fundamentals.md. The two subsequent WARNINGs are consequences of the same root cause and are noted separately.

| Address | Function | Offset | Size | Context | Module | Source location |
|---|---|---|---|---|---|---|
| | __dump_stack (inlined) | | | Task | | lib/dump_stack.c:94 |
| 0xffffffff8121ab04 (0xffffffff8121ab04 + 0x0) | dump_stack_lvl | 0x100 | 0x190 | Task | | lib/dump_stack.c:120 |
| 0xffffffff8121acd0 (0xffffffff8121aae4 + 0x1ec) | __might_resched.cold | 0x1ec | 0x232 | Task | | kernel/sched/core.c:9163 |
| 0xffffffff8250f297 (0xffffffff8250ee50 + 0x447) | shmem_undo_range | 0x447 | 0x1570 | Task | | mm/shmem.c:1150 |
| | shmem_truncate_range (inlined) | | | Task | | mm/shmem.c:1277 |
| 0xffffffff825108a3 (0xffffffff825104b0 + 0x3f3) | shmem_evict_inode | 0x3f3 | 0xc40 | Task | | mm/shmem.c:1407 |
| 0xffffffff8290f9e2 (0xffffffff8290f620 + 0x3c2) | evict | 0x3c2 | 0xad0 | Task | | fs/inode.c:841 |
| | iput_final (inlined) | | | Task | | fs/inode.c:1960 |
| 0xffffffff829113f5 (0xffffffff82910df0 + 0x605) | iput.part.0 | 0x605 | 0xf50 | Task | | fs/inode.c:2009 |
| 0xffffffff82911d85 (0xffffffff82911d50 + 0x35) | iput | 0x35 | 0x40 | Task | | fs/inode.c:1975 |
| 0xffffffff828d9216 (0xffffffff828d8db0 + 0x466) | filename_unlinkat | 0x466 | 0x730 | Task | | fs/namei.c:5572 |
| | __do_sys_unlink (inlined) | | | Task | | fs/namei.c:5603 |
| | __se_sys_unlink (inlined) | | | Task | | fs/namei.c:5600 |
| 0xffffffff828d97b6 (0xffffffff828d9770 + 0x46) | __x64_sys_unlink | 0x46 | 0x70 | Task | | fs/namei.c:5600 |
| | do_syscall_x64 (inlined) | | | Task | | arch/x86/entry/syscall_64.c:63 |
| 0xffffffff8b97718b (0xffffffff8b977080 + 0x10b) | do_syscall_64 | 0x10b | 0xf80 | Task | | arch/x86/entry/syscall_64.c:94 |
| | entry_SYSCALL_64_after_hwframe | 0x77 | 0x7f | Task | | arch/x86/entry/entry_64.S |

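
Each addressed row in the backtrace table pairs a return address with the start of its function plus an offset, and the offset should fall inside the function's size. The following is a minimal user-space sketch of that arithmetic, handy for sanity-checking a row by hand. The `struct frame` type and both helper names are hypothetical and exist only for this illustration; they are not part of any kernel or syzbot tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One symbolized frame, as printed in "function+offset/size" form.
 * Hypothetical type, introduced only for this sketch. */
struct frame {
    uint64_t addr;   /* return address shown in the Address column */
    uint64_t offset; /* offset of addr inside the function */
    uint64_t size;   /* total size of the function */
};

/* Start address of the function that contains the frame. */
static uint64_t frame_base(const struct frame *f)
{
    return f->addr - f->offset;
}

/* A row is self-consistent when the offset lies inside the function. */
static bool frame_is_consistent(const struct frame *f)
{
    return f->offset < f->size;
}
```

For the shmem_undo_range row, `frame_base()` applied to 0xffffffff8250f297 with offset 0x447 gives 0xffffffff8250ee50, which matches the base shown in parentheses in the table.
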
After the BUG fires, the RCU read lock is still held when
rm returns to userspace. Lockdep reports this as a distinct WARNING
in the same report.

The third event is triggered when an APIC timer interrupt fires and the scheduler
detects a context switch while the RCU read lock is still held. RIP:
rcu_note_context_switch+0x859/0x19c0 kernel/rcu/tree_plugin.h:332
(the faulting instruction is ud1, raised by the warning check at that line rather than by BUG()).

At the time of the primary BUG crash, two locks were held by
rm/5904:

| # | Lock name | Flags | Acquired in function | Offset | Source | In backtrace? |
|---|---|---|---|---|---|---|
| 0 | sb_writers#5 | {.+.+}-{0:0} | filename_unlinkat | 0x1ad/0x730 | fs/namei.c:5545 | Yes |
| 1 | rcu_read_lock | {....}-{1:3} | rcu_lock_acquire.constprop.0 | 0x7/0x30 | include/linux/rcupdate.h:300 | No |

Lock #0 (sb_writers#5): acquired at
filename_unlinkat+0x1ad, which resolves to
fs/namei.c:5545, the mnt_want_write(path.mnt)
call. This is the superblock write-access (freeze protection) lock taken
when write access to the mount is requested. The acquiring function is in
the backtrace.

Lock #1 (rcu_read_lock): the acquiring
function rcu_lock_acquire.constprop.0 is not in the
backtrace. Per the lockdep analysis rule, the acquisition site is
recorded but not chased further. Crucially, this lock was acquired
after lock #0 (the list is in acquisition order), meaning it
was taken somewhere between mnt_want_write at line 5545 and
the iput call at line 5572 in
filename_unlinkat. It is never released.

shmem_undo_range: crash site (shmem.c:1150)

mm/shmem.c at commit 2a4c0c11c019

    1104 /*
    1105  * Remove range of pages and swap entries from page cache, and free them.
    1106  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
    1107  */
    1108 static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend,
    1109                              bool unfalloc)
    1110 {
    1111         struct address_space *mapping = inode->i_mapping;
    1112         struct shmem_inode_info *info = SHMEM_I(inode);
    1113         pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
    1114         pgoff_t end = (lend + 1) >> PAGE_SHIFT;
    1115         struct folio_batch fbatch;
    1116         pgoff_t indices[FOLIO_BATCH_SIZE];
    1117         struct folio *folio;
    1118         bool same_folio;
    1119         long nr_swaps_freed = 0;
    1120         pgoff_t index;
    1121         int i;
    1122
    1123         if (lend == -1)
    1124                 end = -1; /* unsigned, so actually very big */
    1125
    1126         if (info->fallocend > start && info->fallocend <= end && !unfalloc)
    1127                 info->fallocend = start;
    1128
    1129         folio_batch_init(&fbatch);
    1130         index = start;
    1131         while (index < end && find_lock_entries(mapping, &index, end - 1,
    1132                         &fbatch, indices)) {
    1133                 for (i = 0; i < folio_batch_count(&fbatch); i++) {
    1134                         folio = fbatch.folios[i];
    1135
    1136                         if (xa_is_value(folio)) {
    1137                                 if (unfalloc)
    1138                                         continue;
    1139                                 nr_swaps_freed += shmem_free_swap(mapping, indices[i],
    1140                                                         end - 1, folio);
    1141                                 continue;
    1142                         }
    1143
    1144                         if (!unfalloc || !folio_test_uptodate(folio))
    1145                                 truncate_inode_folio(mapping, folio);
    1146                         folio_unlock(folio);
    1147                 }
    1148                 folio_batch_remove_exceptionals(&fbatch);
    1149                 folio_batch_release(&fbatch);
    1150                 cond_resched(); // <-- BUG: RCU nest depth 1, expected 0
    1151         }

shmem_evict_inode: callsite (shmem.c:1407)

mm/shmem.c at commit 2a4c0c11c019

    1397 static void shmem_evict_inode(struct inode *inode)
    1398 {
    1399         struct shmem_inode_info *info = SHMEM_I(inode);
    1400         struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
    1401         size_t freed = 0;
    1402
    1403         if (shmem_mapping(inode->i_mapping)) {
    1404                 shmem_unacct_size(info->flags, inode->i_size);
    1405                 inode->i_size = 0;
    1406                 mapping_set_exiting(inode->i_mapping);
    1407                 shmem_truncate_range(inode, 0, (loff_t)-1); // <-- called here
    ...

__might_resched: assert site

kernel/sched/core.c near line 9163

__might_resched() is an assert/precondition function
that checks whether a reschedule point is legal given the current lock
state. On a PREEMPT(full) kernel the RCU nesting depth is
separately tracked in current->rcu_read_lock_nesting.
When that depth is non-zero, calling cond_resched() (which
internally calls __might_resched()) triggers the BUG
message with:

    in_atomic(): 0, irqs_disabled(): 0, non_block: 0
    RCU nest depth: 1, expected: 0

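
The check described above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct toy_task`, `toy_rcu_read_lock()`, `toy_rcu_read_unlock()` and `toy_resched_allowed()` are invented names, and the real __might_resched() inspects more state and prints a report instead of returning a value. The sketch keeps only the comparison between the current RCU nesting depth and the expected depth, plus the two other fields shown in the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-task state the check looks at (hypothetical). */
struct toy_task {
    int preempt_count;         /* non-zero models in_atomic() */
    bool irqs_disabled;        /* models irqs_disabled() */
    int rcu_read_lock_nesting; /* models current->rcu_read_lock_nesting */
};

/* Entering a read-side critical section only bumps the nesting counter. */
static void toy_rcu_read_lock(struct toy_task *t)
{
    t->rcu_read_lock_nesting++;
}

/* Leaving the critical section drops the counter again. */
static void toy_rcu_read_unlock(struct toy_task *t)
{
    t->rcu_read_lock_nesting--;
}

/* Returns true when a reschedule point would be legal in this model,
 * false when the model would report "sleeping function called from
 * invalid context". */
static bool toy_resched_allowed(const struct toy_task *t, int expected_rcu_depth)
{
    if (t->preempt_count != 0 || t->irqs_disabled)
        return false;
    return t->rcu_read_lock_nesting == expected_rcu_depth;
}
```

In this model the reported state (in_atomic 0, irqs_disabled 0, nesting depth 1, expected 0) fails only the final comparison, which mirrors the "RCU nest depth: 1, expected: 0" line in the log.
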
filename_unlinkat: caller (namei.c:5572)

fs/namei.c at commit 2a4c0c11c019

    5526 int filename_unlinkat(int dfd, struct filename *name)
    5527 {
    ...
    5536 retry:
    5537         error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
    5538         if (error)
    5539                 return error;
    ...
    5545         error = mnt_want_write(path.mnt); // <-- lock #0 (sb_writers) acquired here
    5546         if (error)
    5547                 goto exit_path_put;
    5548 retry_deleg:
    5549         dentry = start_dirop(path.dentry, &last, lookup_flags);
    ...
    5563         inode = dentry->d_inode;
    5564         ihold(inode);
    5565         error = security_path_unlink(&path, dentry);
    5566         if (error)
    5567                 goto exit_end_dirop;
    5568         error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
    5569                            dentry, &delegated_inode); // <-- lock #1 (rcu_read_lock) acquired in here?
    5570 exit_end_dirop:
    5571         end_dirop(dentry);
    5572         iput(inode); // <-- truncate the inode here → shmem_evict_inode → crash
    ...
    5579         mnt_drop_write(path.mnt); // <-- lock #0 (sb_writers) released
    ...
    5587 }

At mm/shmem.c:1150, the function
shmem_undo_range() calls cond_resched() inside
a while loop that iterates over the page cache entries of the inode
being evicted. cond_resched() first runs the __might_resched() debug
check, which the backtrace shows being called directly from
shmem_undo_range. On a PREEMPT(full) kernel that check reads
current->rcu_read_lock_nesting, which tracks active RCU
read-side critical sections. At this point the nesting depth is
1: one rcu_read_lock() has not been matched by an
rcu_read_unlock(). This is the
“sleeping function called from invalid context” check firing:

    RCU nest depth: 1, expected: 0

The kernel logs the BUG message, adds taint flag W, and
continues execution; the system does not panic.

The call chain starts with the unlink(2) system call on
a tmpfs/shmem file:

    sys_unlink → filename_unlinkat → iput → iput_final → evict
      → shmem_evict_inode → shmem_truncate_range → shmem_undo_range
      → cond_resched()   ← BUG fires here

The lockdep state at crash time shows two locks held (in acquisition order):

- sb_writers#5, acquired at filename_unlinkat+0x1ad (line 5545,
  mnt_want_write(path.mnt)). This is the normal filesystem
  write lock.
- rcu_read_lock, acquired at rcu_lock_acquire.constprop.0 (not present
  in the backtrace). Because lockdep records locks in acquisition order,
  this lock was taken after mnt_want_write and before iput, i.e.
  somewhere between filename_unlinkat:5545 and filename_unlinkat:5572.
  The candidates in that window are: start_dirop(),
  security_path_unlink(), vfs_unlink() (and its callees:
  fsnotify_unlink(), d_delete_notify()).

The rcu_read_lock() that was acquired in this window was
never matched by an rcu_read_unlock(). It persists through
the entire eviction path and is still held when rm returns
to userspace, as confirmed by the subsequent “lock held when returning to
user space!” warning. A third warning, “Voluntary context switch within
RCU read-side critical section!”, fires from an APIC timer interrupt
when the scheduler tries to switch away from the process while the RCU
lock is still held.

This is a How (Negative How): an rcu_read_unlock() is missing
somewhere in the filename_unlinkat call chain (between lines 5545
and 5572). The exact acquisition site is not directly visible: the
acquiring function (rcu_lock_acquire.constprop.0) does not appear in
the backtrace and, per the lockdep analysis rules, is not
chased further.

The fix needs to identify and add the missing
rcu_read_unlock() in whichever function between
mnt_want_write and iput acquires the RCU lock
without releasing it. The prime suspects (all called from
filename_unlinkat between lines 5549–5571 on the success
path) are:

- vfs_unlink() (fs/namei.c:5472): calls fsnotify_unlink() and
  d_delete_notify() on the success path; either could internally
  acquire an RCU lock and fail to release it on some code path.
- start_dirop() → lookup_one_qstr_excl(): does dentry lookup, which
  may enter RCU walk mode.
- security_path_unlink(): LSM hooks are a less likely source but
  should not be excluded.

Recommended debugging approach: add a
WARN_ON(lock_is_held(&rcu_lock_map)) or
WARN_ON(rcu_read_lock_held()) in iput() (in
addition to the existing might_sleep() check) to narrow
down the acquisition site; both checks are only meaningful on a
lockdep-enabled build such as this one. Alternatively, use dynamic lockdep tracing
(CONFIG_PROVE_RCU) and review the acquisition stack
recorded by lockdep.
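
The narrowing idea behind that approach can be sketched in user space: snapshot the nesting depth before and after each candidate in the suspect window and report the first one that leaves it changed. Everything in the block below is hypothetical (`model_depth`, `struct candidate`, `callee_balanced()`, `callee_leaky()`, `first_unbalanced()`); it models the technique only and makes no claim about which kernel function actually leaks the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for current->rcu_read_lock_nesting. */
static int model_depth;

static void model_lock(void)   { model_depth++; }
static void model_unlock(void) { model_depth--; }

/* One candidate callee from the suspect window, identified by name. */
struct candidate {
    const char *name;
    void (*call)(void);
};

/* Balanced callee: every lock is paired with an unlock. */
static void callee_balanced(void)
{
    model_lock();
    model_unlock();
}

/* Leaky callee: returns with the read lock still held. */
static void callee_leaky(void)
{
    model_lock();
}

/*
 * Run the candidates in order, comparing the nesting depth before and
 * after each call. Returns the name of the first callee that changes
 * the depth, or NULL when all of them are balanced.
 */
static const char *first_unbalanced(const struct candidate *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int before = model_depth;

        c[i].call();
        if (model_depth != before)
            return c[i].name;
    }
    return NULL;
}
```

In a kernel debugging patch the same shape would amount to a depth or held-lock check placed after each suspect call on the path to iput(), so that the first check to fire names the leaking callee rather than the later cond_resched() site.
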

Until the root cause is found, a diagnostic aid (not a fix)
would be to add a WARN_ON(rcu_read_lock_held()) near the
top of shmem_evict_inode() to produce an earlier, more
informative trace on future occurrences.

The exact acquisition site of the leaked rcu_read_lock()
is not visible in the backtrace (the acquiring function is not in the
call trace). Because the How analysis could not identify a specific
root-cause line, the git blame step cannot be targeted. As a result:

Bug introduction commit not identified: the RCU lock acquisition is not visible in the backtrace, preventing targeted git blame analysis of the acquisition site.

Upstream fix search (git log across mm/shmem.c,
fs/namei.c, mm/filemap.c,
fs/inode.c) found no commit within the search budget that
appears to address this specific RCU lock imbalance in the
unlink/eviction path.

An rcu_read_lock() is acquired somewhere in the
filename_unlinkat() call chain (between the
mnt_want_write() and iput() calls) without a
matching rcu_read_unlock(). This leaked lock is carried
through the entire tmpfs inode eviction path until
shmem_undo_range() calls cond_resched(), which
on a PREEMPT(full) kernel checks the RCU nesting depth and
triggers:

    BUG: sleeping function called from invalid context at mm/shmem.c:1150
    RCU nest depth: 1, expected: 0

The same leaked lock is still held on return to userspace (“lock held
when returning to user space!”) and causes a “Voluntary context switch
within RCU read-side critical section!” warning from a timer interrupt.
All three messages are symptoms of a single leaked
rcu_read_lock() call.

Immediate action: The mm/shmem maintainers (akpm,
baolin.wang, hughd) should investigate recent changes to
vfs_unlink() and its fsnotify/dcache callees for an
rcu_read_lock() without a matching
rcu_read_unlock(). A
WARN_ON(rcu_read_lock_held()) check added to
iput() (alongside the existing might_sleep())
would produce a more actionable backtrace on reproduction.