Source: kernel Bugzilla bug #221376 — “AMD RADEON RX 9070 XT - modprobe amdgpu is fail.”
| Field | Value | Implication |
|---|---|---|
| BUG_URL | bugzilla.kernel.org #221376 | |
| UNAME | 6.18.22-x86_64 | Custom kernel compiled by the reporter (build string: #1 SMP PREEMPT_DYNAMIC Fri Apr 17 01:05:45 +07 2026) |
| DISTRO | (none; custom kernel) | No distro debug packages available; vmlinux not obtainable |
| HARDWARE | GIGABYTE MD72-HB1-00 (dual-socket Intel Xeon Silver 4310 @ 2.10 GHz, 48 threads) | |
| BIOS | F40 12/09/2025 | |
| PROCESS | kworker/12:1 | Kernel worker thread on CPU 12 |
| WORKQUEUE | events work_for_cpu_fn | Device probe via local_pci_probe, run on a specific CPU |
| TAINT | (Not tainted) | Clean kernel; no proprietary or out-of-tree code |
| CRASH_TYPE | BUG/BUG_ON | DRM_MM_BUG_ON(start + size <= start) in drm_mm_init() |
| CRASH_SITE | drivers/gpu/drm/drm_mm.c:930 | Assertion inside the DRM memory-range allocator initialiser |
| SOURCEDIR | /sdb1/arjan/git/oops-skill/oops-workdir/linux (tag v6.18.16) | Nearest stable tag; 6.18.22 not present, so the analysis is approximate |
| VMLINUX | (not available; custom kernel) | addr2line mapping not possible; source used directly |
| Module | Flags | Backtrace | Location | Flag Implication |
|---|---|---|---|---|
| amdgpu | + | Y | | Being loaded at time of crash (module_init path likely involved) |
| amdxcp | | | | |
| drm_ttm_helper | | | | |
| ttm | | Y | | |
| drm_exec | | | | |
| drm_panel_backlight_quirks | | | | |
| gpu_sched | | | | |
| drm_suballoc_helper | | | | |
| drm_buddy | | | | |
| drm_display_helper | | | | |
| cec | | | | |
| rc_core | | | | |
| igb | | | | |
| i2c_algo_bit | | | | |
| Address | Function | Offset | Size | Context | Module | Source location |
|---|---|---|---|---|---|---|
| (RIP) | drm_mm_init | 0xc1 | 0xd0 | Task | (built-in) | drm_mm.c:930 |
| | ttm_range_man_init_nocheck | 0x9d | 0x180 | Task | ttm (build-ID: e3a55dbbe0be) | ttm_range_manager.c:198 |
| | amdgpu_ttm_init.cold | 0x45e | 0x5cc | Task | amdgpu (build-ID: 48030e986eac) | amdgpu_ttm.c:2103 |
| | amdgpu_bo_init.cold | 0x5e | 0x77 | Task | amdgpu | amdgpu_object.c:1088 |
| | gmc_v12_0_sw_init | 0x470 | 0x6f0 | Task | amdgpu | gmc_v12_0.c:847 |
| | amdgpu_device_ip_init | 0x8f | 0xb43 | Task | amdgpu | |
| | amdgpu_device_init.cold | 0x1495 | 0x1abe | Task | amdgpu | |
| | amdgpu_driver_load_kms | 0x1a | 0x80 | Task | amdgpu | |
| | amdgpu_pci_probe | 0x28e | 0x760 | Task | amdgpu | |
| | local_pci_probe | 0x51 | 0xc0 | Task | (built-in) | |
| | work_for_cpu_fn | 0x1d | 0x30 | Task | (built-in) | |
| | process_scheduled_works | 0x2bc | 0x680 | Task | (built-in) | |
| | worker_thread | 0x1a6 | 0x4a0 | Task | (built-in) | |
| | kthread | 0x1a4 | 0x3a0 | Task | (built-in) | |
| | ret_from_fork | 0x1f8 | 0x3b0 | Task | (built-in) | |
| | ret_from_fork_asm | 0x1a | 0x30 | Task | (built-in) | |
| Register | Value | Note |
|---|---|---|
| RIP | drm_mm_init+0xc1/0xd0 | Points to the ud2 (BUG) instruction at the crash |
| RSP | ffa000000d723b60 | Valid kernel stack address |
| RAX | 0000000000000000 | Zero |
| RBX | ff11004093c27400 | Same value as R10; 0x80 below RDI, so not the mm argument itself (plausibly the enclosing struct ttm_range_manager, rman) |
| RCX | 000000000000001c | |
| RDX | 0000000000000000 | size argument to drm_mm_init = 0; this is the bad value |
| RSI | 0000000000000000 | start argument to drm_mm_init = 0 |
| RDI | ff11004093c27480 | mm argument to drm_mm_init (&rman->mm) |
| RBP | 0000000000000003 | |
| R08 | 0000000000000dc0 | |
| R09 | 00000000ffffffff | |
| R10 | ff11004093c27400 | |
| R11 | 0000000000000100 | |
| R12 | ff1100408b60f048 | |
| R13 | 0000000000000000 | |
| R14 | 0000000000000000 | |
| R15 | 000000000000000b | |
| CR2 | 00007f15ed7d9f55 | Page-fault address left over from an earlier, unrelated fault |
| CR4 | 0000000000771ef0 | |
| EFLAGS | 00010246 | ZF=1 (zero flag set, consistent with the comparison that triggered the BUG) |
Key observation: RSI = 0 (start) and RDX = 0 (size)
confirm that drm_mm_init was called with
start=0, size=0. The assertion
start + size <= start evaluates as 0 + 0 ≤ 0, which is
true, so the BUG fires.
Code: 83 05 c2 c5 11 04 01 48 c7 83 f0 00 00 00 00 00 00 00 e8 a2 79 cd ff
48 83 05 b2 c5 11 04 01 5b c3 cc cc cc cc 0f 1f 40 00 90
<0f> 0b ← ud2 (BUG()) at drm_mm_init+0xc1
66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
The ud2 (bytes 0f 0b, marked <0f> 0b) sits near the
end of drm_mm_init (offset 0xc1 in a 0xd0-byte function),
after the normal path's pop %rbx; ret (5b c3) and the int3 (cc) and
nop filler that follow it. The failing check branches forward to the
ud2. This out-of-line placement is the usual layout for Linux BUG().
drm_mm_init: crash site (drm_mm.c:930), drivers/gpu/drm/drm_mm.c
at v6.18.16
920 /**
921 * drm_mm_init - initialize a drm-mm allocator
922 * @mm: the drm_mm structure to initialize
923 * @start: start of the range managed by @mm
924 * @size: end of the range managed by @mm
925 *
926 * Note that @mm must be cleared to 0 before calling this function.
927 */
928 void drm_mm_init(struct drm_mm *mm, u64 start, u64 size)
929 {
930 DRM_MM_BUG_ON(start + size <= start); // ← CRASH HERE (size == 0, start == 0 → 0 ≤ 0)
931
932 mm->color_adjust = NULL;
...
950 }
ttm_range_man_init_nocheck: call site
(ttm_range_manager.c:198), drivers/gpu/drm/ttm/ttm_range_manager.c
at v6.18.16
180 int ttm_range_man_init_nocheck(struct ttm_device *bdev,
181 unsigned type, bool use_tt,
182 unsigned long p_size)
183 {
184 struct ttm_resource_manager *man;
185 struct ttm_range_manager *rman;
186
187 rman = kzalloc(sizeof(*rman), GFP_KERNEL);
188 if (!rman)
189 return -ENOMEM;
190
191 man = &rman->manager;
192 man->use_tt = use_tt;
193 man->func = &ttm_range_manager_func;
194
195 ttm_resource_manager_init(man, bdev, p_size);
196
197 drm_mm_init(&rman->mm, 0, p_size); // ← call here; p_size == 0 when GDS/GWS/OA absent
// ← RSI = 0 (start), RDX = 0 (p_size)
198 spin_lock_init(&rman->lock);
199
200 ttm_set_driver_manager(bdev, type, &rman->manager);
201 ttm_resource_manager_set_used(man, true);
202 return 0;
203 }
amdgpu_ttm_init: call site
(amdgpu_ttm.c:2103), drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
at v6.18.16
2095 /* Initialize various on-chip memory pools */
...
2103 r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size); // ← call here; gds_size == 0
2104 if (r) {
2105 dev_err(adev->dev, "Failed initializing GDS heap.\n");
2106 return r;
2107 }
2108
2109 r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size); // ← same issue
2110 if (r) {
2111 dev_err(adev->dev, "Failed initializing gws heap.\n");
2112 return r;
2113 }
2114
2115 r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size); // ← same issue
2116 if (r) {
2117 dev_err(adev->dev, "Failed initializing oa heap.\n");
2118 return r;
2119 }
amdgpu_ttm_init_on_chip (lines 74–80) is a
one-liner:
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
at v6.18.16
74 static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
75 unsigned int type,
76 uint64_t size_in_page)
77 {
78 return ttm_range_man_init(&adev->mman.bdev, type,
79 false, size_in_page); // ← no guard for size_in_page == 0
80 }
gfx_v11_0.c populates adev->gds.gds_size (and the GWS/OA sizes)
in gfx_v11_0_set_gds_init() (lines 7399–7409 of
gfx_v11_0.c); older GFX generations do the same in their own
set_gds_init() functions:
7399 static void gfx_v11_0_set_gds_init(struct amdgpu_device *adev)
7400 {
7401 unsigned total_cu = ...;
7402 adev->gds.gds_size = 0x1000; // ← non-zero for GFX11 and earlier
7403 adev->gds.gds_compute_max_wave_id = total_cu * 32 - 1;
7404 adev->gds.gws_size = 64;
7405 adev->gds.oa_size = 16;
7406 }
gfx_v12_0.c has no equivalent function: GDS/GWS/OA were removed
in RDNA4 (GFX 12), so adev->gds.{gds,gws,oa}_size are never set
and remain zero.
drm_mm_init() fires a DRM_MM_BUG_ON
assertion at drm_mm.c:930:
DRM_MM_BUG_ON(start + size <= start);
The assertion is triggered because both start and
size are 0: 0 + 0 = 0 ≤ 0, so the
condition is true and BUG fires.
This is confirmed by the register dump: RSI = 0 (start)
and RDX = 0 (size), which carry the second and third
arguments of drm_mm_init(mm, start, size) under the x86-64 System
V ABI.
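The assertion is deliberately strict: the DRM memory-range manager
requires a non-empty range, and passing size=0 is a programming
error on the caller's side. DRM_MM_BUG_ON() expands to BUG_ON() only
when the kernel is built with CONFIG_DRM_DEBUG_MM=y (otherwise the
check compiles away), so the fact that it fired implies the reporter's
custom kernel has that option enabled.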
Q1: Why was drm_mm_init called with
size = 0?
A1: ttm_range_man_init_nocheck passes its
p_size parameter directly to drm_mm_init as
the size argument (lines 197–198 of
ttm_range_manager.c). p_size was 0, and no check
is made for this case.
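Q2: Why was ttm_range_man_init_nocheck called
with p_size = 0?
A2: amdgpu_ttm_init_on_chip (lines 78–79 of
amdgpu_ttm.c) passes size_in_page unchanged to
ttm_range_man_init(), an inline wrapper around
ttm_range_man_init_nocheck(). Both the helper and the wrapper are
inlined, which is why amdgpu_ttm_init.cold calls
ttm_range_man_init_nocheck directly in the backtrace. The faulting
call, at amdgpu_ttm.c:2103, passed adev->gds.gds_size.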
Q3: Why is
adev->gds.gds_size == 0?
A3: The AMD GFX 12 / RDNA4 architecture (used by the RX 9070 XT) does
not have GDS (Global Data Share), GWS (Global Wave Sync), or OA
(Ordered Append) hardware. These resources were present on
GFX 11 and earlier, and gfx_v11_0.c explicitly initialises them
in gfx_v11_0_set_gds_init(). gfx_v12_0.c has
no such function, so adev->gds.{gds,gws,oa}_size are left
at their zero-initialised default values.
Root cause (Negative How): The
amdgpu_ttm_init() code that initialises GDS/GWS/OA TTM
memory pools at lines 2103–2119 of amdgpu_ttm.c has existed
since before GFX 12 was introduced. It assumes non-zero sizes, which
held for every previous architecture. When GFX 12
support was added to gfx_v12_0.c, no equivalent of
gfx_v11_0_set_gds_init() was added (correctly, since the
hardware no longer has GDS/GWS/OA), but
amdgpu_ttm_init() was not updated to guard against zero
sizes.
The bug: amdgpu_ttm_init_on_chip does
not guard against size_in_page == 0. All callers that pass
GDS/GWS/OA sizes will pass 0 for RDNA4 and any future architecture that
lacks these resources.
Preferred fix location:
amdgpu_ttm_init_on_chip() in amdgpu_ttm.c.
Adding an early return for zero size is the most defensive approach and
protects all three call sites (GDS, GWS, OA) at once.
Proposed fix (diff form):
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -74,6 +74,9 @@ static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
unsigned int type,
uint64_t size_in_page)
{
+ if (!size_in_page)
+ return 0;
+
return ttm_range_man_init(&adev->mman.bdev, type,
false, size_in_page);
}
An alternative (slightly more explicit) fix is to guard the three
call sites individually inside amdgpu_ttm_init():
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2100,18 +2100,24 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
/* Initialize various on-chip memory pools */
- r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size);
- if (r) {
- dev_err(adev->dev, "Failed initializing GDS heap.\n");
- return r;
+ if (adev->gds.gds_size) {
+ r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size);
+ if (r) {
+ dev_err(adev->dev, "Failed initializing GDS heap.\n");
+ return r;
+ }
}
- r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size);
- if (r) {
- dev_err(adev->dev, "Failed initializing gws heap.\n");
- return r;
+ if (adev->gds.gws_size) {
+ r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size);
+ if (r) {
+ dev_err(adev->dev, "Failed initializing gws heap.\n");
+ return r;
+ }
}
- r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size);
- if (r) {
- dev_err(adev->dev, "Failed initializing oa heap.\n");
- return r;
+ if (adev->gds.oa_size) {
+ r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size);
+ if (r) {
+ dev_err(adev->dev, "Failed initializing oa heap.\n");
+ return r;
+ }
}
The first (single-function) patch is preferred because it is smaller
and protects any future callers of amdgpu_ttm_init_on_chip
that might also receive a zero size.
The bug was introduced in two stages:
Stage 1: The GDS/GWS/OA TTM pool initialisation code was added to
amdgpu_ttm_init() in commit 473633540c2f
(Christian König, 2020-07-23). At the time, all supported architectures
had GDS, so the code was correct.
Stage 2: RDNA4 / GFX 12 support was added to gfx_v12_0.c
without a gfx_v12_0_set_gds_init() equivalent (correctly,
since the hardware removed GDS/GWS/OA), but the
amdgpu_ttm_init() caller was not updated to handle zero
sizes. The RDNA4 bring-up series, including the first commit touching
gmc_v12_0.c, predates v6.18.16 and is the effective origin; the exact
introducing commit was not identified within the search budget.
In short, the bug is the absence of a zero-size guard in
amdgpu_ttm_init_on_chip() combined with the legitimate
non-initialisation of adev->gds.*_size for GFX 12.
Summary: Loading the amdgpu module for
an AMD RX 9070 XT (RDNA4 / GFX 12) immediately crashes the kernel with a
BUG assertion in drm_mm_init(). The root cause is that
amdgpu_ttm_init() unconditionally tries to create TTM
memory pools for GDS, GWS, and OA on-chip resources, passing
size=0 to the DRM range allocator, which rejects a
zero-sized range as invalid.
RDNA4 removed GDS/GWS/OA from the hardware; gfx_v12_0.c
correctly leaves those size fields at zero. The missing piece is a
zero-size guard in amdgpu_ttm_init_on_chip().
Recommendation for the reporter:
Apply the two-line zero-size guard to amdgpu_ttm_init_on_chip() in amdgpu_ttm.c:
static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
unsigned int type,
uint64_t size_in_page)
{
if (!size_in_page) // ← add this guard
return 0;
return ttm_range_man_init(&adev->mman.bdev, type,
false, size_in_page);
}
No upstream fix was found in the git history through v6.19.13 / origin/master within the search budget. This appears to be an unresolved bug that should be reported to the amdgpu mailing list (amd-gfx@lists.freedesktop.org) with this analysis attached.
Confidence: High. The register dump (RSI=0, RDX=0)
directly confirms the zero-size call, and the absence of any GDS
initialisation in gfx_v12_0.c is conclusive.