Linux kernel crash report

Source: kernel Bugzilla bug #221376 — “AMD RADEON RX 9070 XT - modprobe amdgpu is fail.”

Key elements

| Field | Value | Implication |
| --- | --- | --- |
| BUG_URL | bugzilla.kernel.org #221376 | |
| UNAME | 6.18.22-x86_64 | Custom kernel compiled by the reporter (build string: #1 SMP PREEMPT_DYNAMIC Fri Apr 17 01:05:45 +07 2026) |
| DISTRO | (none: custom kernel) | No distro debug packages available; vmlinux not obtainable |
| HARDWARE | GIGABYTE MD72-HB1-00 (dual-socket Intel Xeon Silver 4310 @ 2.10 GHz, 48 threads) | |
| BIOS | F40 12/09/2025 | |
| PROCESS | kworker/12:1 | Kernel worker thread on CPU 12 |
| WORKQUEUE | events work_for_cpu_fn | Device probe via local_pci_probe run on a specific CPU |
| TAINT | (Not tainted) | Clean kernel; no proprietary or out-of-tree code |
| CRASH_TYPE | BUG/BUG_ON | DRM_MM_BUG_ON(start + size <= start) in drm_mm_init() |
| CRASH_SITE | drivers/gpu/drm/drm_mm.c:930 | Assertion inside the DRM memory-range allocator initialiser |
| SOURCEDIR | /sdb1/arjan/git/oops-skill/oops-workdir/linux (tag v6.18.16) | Nearest stable tag; 6.18.22 not present, so the analysis is approximate |
| VMLINUX | (not available: custom kernel) | addr2line mapping not possible; source used directly |

Kernel modules

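| Module | Flags | Backtrace | Location | Flag implication |
| --- | --- | --- | --- | --- |
| amdgpu | + | Y | | Being loaded at time of crash (module_init path likely involved) |
| amdxcp | | | | |
| drm_ttm_helper | | | | |
| ttm | | Y | | |
| drm_exec | | | | |
| drm_panel_backlight_quirks | | | | |
| gpu_sched | | | | |
| drm_suballoc_helper | | | | |
| drm_buddy | | | | |
| drm_display_helper | | | | |
| cec | | | | |
| rc_core | | | | |
| igb | | | | |
| i2c_algo_bit | | | | |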

Backtrace

| Address | Function | Offset | Size | Context | Module | Source location |
| --- | --- | --- | --- | --- | --- | --- |
| (RIP) | drm_mm_init | 0xc1 | 0xd0 | Task | (built-in) | drm_mm.c:930 |
| | ttm_range_man_init_nocheck | 0x9d | 0x180 | Task | ttm (build-ID: e3a55dbbe0be) | ttm_range_manager.c:198 |
| | amdgpu_ttm_init.cold | 0x45e | 0x5cc | Task | amdgpu (build-ID: 48030e986eac) | amdgpu_ttm.c:2103 |
| | amdgpu_bo_init.cold | 0x5e | 0x77 | Task | amdgpu | amdgpu_object.c:1088 |
| | gmc_v12_0_sw_init | 0x470 | 0x6f0 | Task | amdgpu | gmc_v12_0.c:847 |
| | amdgpu_device_ip_init | 0x8f | 0xb43 | Task | amdgpu | |
| | amdgpu_device_init.cold | 0x1495 | 0x1abe | Task | amdgpu | |
| | amdgpu_driver_load_kms | 0x1a | 0x80 | Task | amdgpu | |
| | amdgpu_pci_probe | 0x28e | 0x760 | Task | amdgpu | |
| | local_pci_probe | 0x51 | 0xc0 | Task | (built-in) | |
| | work_for_cpu_fn | 0x1d | 0x30 | Task | (built-in) | |
| | process_scheduled_works | 0x2bc | 0x680 | Task | (built-in) | |
| | worker_thread | 0x1a6 | 0x4a0 | Task | (built-in) | |
| | kthread | 0x1a4 | 0x3a0 | Task | (built-in) | |
| | ret_from_fork | 0x1f8 | 0x3b0 | Task | (built-in) | |
| | ret_from_fork_asm | 0x1a | 0x30 | Task | (built-in) | |

CPU registers

| Register | Value | Note |
| --- | --- | --- |
| RIP | drm_mm_init+0xc1/0xd0 | Points to ud2 (BUG) instruction at crash |
| RSP | ffa000000d723b60 | Valid kernel stack address |
| RAX | 0000000000000000 | Zero |
| RBX | ff11004093c27400 | Most likely the caller's rman (struct ttm_range_manager); RDI = RBX + 0x80, consistent with &rman->mm |
| RCX | 000000000000001c | |
| RDX | 0000000000000000 | size arg to drm_mm_init = 0; this is the bad value |
| RSI | 0000000000000000 | start arg to drm_mm_init = 0 |
| RDI | ff11004093c27480 | mm arg to drm_mm_init |
| RBP | 0000000000000003 | |
| R08 | 0000000000000dc0 | |
| R09 | 00000000ffffffff | |
| R10 | ff11004093c27400 | |
| R11 | 0000000000000100 | |
| R12 | ff1100408b60f048 | |
| R13 | 0000000000000000 | |
| R14 | 0000000000000000 | |
| R15 | 000000000000000b | |
| CR2 | 00007f15ed7d9f55 | Page-fault address (from a previous unrelated fault) |
| CR4 | 0000000000771ef0 | |
| EFLAGS | 00010246 | ZF=1 (zero flag set: result of the failed comparison) |

Key observation: RSI = 0 (start) and RDX = 0 (size) confirm that drm_mm_init was called with start=0, size=0. The assertion condition start + size <= start evaluates to 0 + 0 ≤ 0, which is true, so the BUG fires.
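The predicate behind the assertion can be checked in isolation. The following is a user-space sketch, not kernel code: range_is_invalid() and range_examples() are hypothetical names, and uint64_t stands in for the kernel's u64. It shows that the expression is true both for a zero size and for a range that wraps around the 64-bit address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space mirror of the condition tested by
 * DRM_MM_BUG_ON(start + size <= start). Unsigned arithmetic wraps
 * modulo 2^64, so the expression is true in two cases:
 *   - size == 0 (the sum equals start), and
 *   - start + size overflows (the sum is smaller than start).
 */
bool range_is_invalid(uint64_t start, uint64_t size)
{
	return start + size <= start;
}

/* Usage example covering the crash case and its neighbours. */
void range_examples(void)
{
	assert(range_is_invalid(0, 0));          /* the values from RSI/RDX */
	assert(range_is_invalid(4096, 0));       /* zero size, any start */
	assert(range_is_invalid(UINT64_MAX, 1)); /* wrap-around */
	assert(!range_is_invalid(0, 4096));      /* a normal, non-empty range */
}
```

The single comparison therefore rejects empty ranges and overflowing ranges with the same test.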

Code bytes

Code: 83 05 c2 c5 11 04 01 48 c7 83 f0 00 00 00 00 00 00 00 e8 a2 79 cd ff
      48 83 05 b2 c5 11 04 01 5b c3 cc cc cc cc 0f 1f 40 00 90
      <0f> 0b   ← ud2 (BUG()) at drm_mm_init+0xc1
      66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90

The ud2 at <0f> 0b is placed at the end of drm_mm_init (offset 0xc1 in a 0xd0-byte function). The compiler’s normal path returns via ret (c3) earlier; the BUG path jumps here. This is the classic Linux BUG() layout.
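The layout described above can be reproduced with a small user-space stand-in. This is a sketch under assumptions: SKETCH_BUG_ON and sketch_range_end are hypothetical names, the code relies on the GCC/Clang builtins __builtin_expect and __builtin_trap, and the kernel's real BUG() additionally records file and line information that is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for BUG_ON(), GCC/Clang only.
 * __builtin_expect() marks the failure branch as unlikely and
 * __builtin_trap() emits a trapping instruction (ud2 on x86), so an
 * optimising compiler is free to move the trap out of the hot path.
 */
#define SKETCH_BUG_ON(cond)                        \
	do {                                       \
		if (__builtin_expect(!!(cond), 0)) \
			__builtin_trap();          \
	} while (0)

/* Model of an initialiser that asserts on its arguments first. */
uint64_t sketch_range_end(uint64_t start, uint64_t size)
{
	SKETCH_BUG_ON(start + size <= start);
	return start + size;
}

/* Usage example: only valid ranges, so the trap is never reached. */
void sketch_bug_examples(void)
{
	assert(sketch_range_end(0, 16) == 16);
	assert(sketch_range_end(4096, 4096) == 8192);
}
```

On x86-64, building this with optimisation and inspecting the assembly usually shows the ud2 after the function's ret, though the exact placement is up to the compiler.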

Backtrace source code

1. drm_mm_init — crash site (drm_mm.c:930)

drivers/gpu/drm/drm_mm.c at v6.18.16

920  /**
921   * drm_mm_init - initialize a drm-mm allocator
922   * @mm: the drm_mm structure to initialize
923   * @start: start of the range managed by @mm
924   * @size: end of the range managed by @mm
925   *
926   * Note that @mm must be cleared to 0 before calling this function.
927   */
928  void drm_mm_init(struct drm_mm *mm, u64 start, u64 size)
929  {
930      DRM_MM_BUG_ON(start + size <= start);   // ← CRASH HERE (size == 0, start == 0 → 0 ≤ 0)
931
932      mm->color_adjust = NULL;
     ...
950  }

2. ttm_range_man_init_nocheck — call site (ttm_range_manager.c:198)

drivers/gpu/drm/ttm/ttm_range_manager.c at v6.18.16

180  int ttm_range_man_init_nocheck(struct ttm_device *bdev,
181                         unsigned type, bool use_tt,
182                         unsigned long p_size)
183  {
184      struct ttm_resource_manager *man;
185      struct ttm_range_manager *rman;
186
187      rman = kzalloc(sizeof(*rman), GFP_KERNEL);
188      if (!rman)
189          return -ENOMEM;
190
191      man = &rman->manager;
192      man->use_tt = use_tt;
193      man->func = &ttm_range_manager_func;
194
195      ttm_resource_manager_init(man, bdev, p_size);
196
197      drm_mm_init(&rman->mm, 0, p_size);   // ← call here; p_size == 0 when GDS/GWS/OA absent
     // ← RSI = 0 (start), RDX = 0 (p_size)
198      spin_lock_init(&rman->lock);
199
200      ttm_set_driver_manager(bdev, type, &rman->manager);
201      ttm_resource_manager_set_used(man, true);
202      return 0;
203  }

3. amdgpu_ttm_init — call site (amdgpu_ttm.c:2103)

drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c at v6.18.16

2095      /* Initialize various on-chip memory pools */
2103      r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size);  // ← call here; gds_size == 0
2104      if (r) {
2105          dev_err(adev->dev, "Failed initializing GDS heap.\n");
2106          return r;
2107      }
2108
2109      r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size);  // ← same issue
2110      if (r) {
2111          dev_err(adev->dev, "Failed initializing gws heap.\n");
2112          return r;
2113      }
2114
2115      r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size);   // ← same issue
2116      if (r) {
2117          dev_err(adev->dev, "Failed initializing oa heap.\n");
2118          return r;
2119      }

amdgpu_ttm_init_on_chip (lines 74–80) is a one-liner:

drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c at v6.18.16

74  static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
75                                      unsigned int type,
76                                      uint64_t size_in_page)
77  {
78      return ttm_range_man_init(&adev->mman.bdev, type,
79                                false, size_in_page);  // ← no guard for size_in_page == 0
80  }

gfx_v11_0 (and all older GFX generations) populate adev->gds.gds_size during gfx_v11_0_set_gds_init() (line 7399–7409 of gfx_v11_0.c):

7399  static void gfx_v11_0_set_gds_init(struct amdgpu_device *adev)
7400  {
7401      unsigned total_cu = ...;
7402      adev->gds.gds_size = 0x1000;    // ← non-zero for GFX11 and earlier
7403      adev->gds.gds_compute_max_wave_id = total_cu * 32 - 1;
7404      adev->gds.gws_size = 64;
7405      adev->gds.oa_size = 16;
7406  }

gfx_v12_0.c has no equivalent function — GDS/GWS/OA were removed in RDNA4 (GFX 12). adev->gds.{gds,gws,oa}_size are never set and remain zero.
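The effect of the missing hook can be modelled in user space. The sketch below is an illustration only: every sketch_* name is hypothetical, calloc() stands in for the kernel's zeroing allocation, and the GFX11 values are the ones quoted in the listing above. It handles only the two generations discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_gds {
	uint32_t gds_size;
	uint32_t gws_size;
	uint32_t oa_size;
};

struct sketch_dev {
	int gfx_major;
	struct sketch_gds gds;
};

/* Mirrors the GFX11 values quoted in the listing above. */
static void sketch_gfx11_set_gds_init(struct sketch_dev *dev)
{
	dev->gds.gds_size = 0x1000;
	dev->gds.gws_size = 64;
	dev->gds.oa_size = 16;
}

/*
 * The device starts zero-filled. Only the GFX11 case runs a GDS hook;
 * the GFX12 case has none, so its size fields keep their zero default.
 */
struct sketch_dev *sketch_dev_create(int gfx_major)
{
	struct sketch_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->gfx_major = gfx_major;
	if (gfx_major == 11)
		sketch_gfx11_set_gds_init(dev);
	return dev;
}

/* Usage example: compare the two generations. */
void sketch_gds_examples(void)
{
	struct sketch_dev *gfx11 = sketch_dev_create(11);
	struct sketch_dev *gfx12 = sketch_dev_create(12);

	assert(gfx11 && gfx12);
	assert(gfx11->gds.gds_size == 0x1000);
	assert(gfx12->gds.gds_size == 0);
	assert(gfx12->gds.gws_size == 0 && gfx12->gds.oa_size == 0);
	free(gfx11);
	free(gfx12);
}
```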

What-how-where analysis

What

drm_mm_init() fires a DRM_MM_BUG_ON assertion at drm_mm.c:930:

DRM_MM_BUG_ON(start + size <= start);

The assertion is triggered because both start and size are 0: 0 + 0 = 0 ≤ 0 → condition is true → BUG fires.

This is confirmed by the register dump: RSI = 0 (start) and RDX = 0 (size), which are the second and third arguments to drm_mm_init(mm, start, size) on x86-64 System V ABI.

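The assertion is deliberately strict: the DRM memory-range manager requires a non-empty range, so passing size=0 is a programming error on the caller’s side. Note that DRM_MM_BUG_ON() expands to BUG_ON() only when the kernel is built with CONFIG_DRM_DEBUG_MM; without that option it compiles to a type-check that never fires. The reporter’s custom config presumably enables it, which would explain why the zero-size call crashes here.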
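The two build flavours of DRM_MM_BUG_ON() can be modelled in user space. In the kernel headers the macro maps to BUG_ON() under CONFIG_DRM_DEBUG_MM and to BUILD_BUG_ON_INVALID() otherwise; the SKETCH_* and sketch_* names below are hypothetical stand-ins, a runtime flag replaces the build option, and a bool return value replaces the real BUG(). A sketch under those assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Debug flavour (models BUG_ON): evaluates expr; true means "would BUG". */
#define SKETCH_MM_CHECK_DEBUG(expr) ((bool)(expr))

/*
 * Non-debug flavour (models BUILD_BUG_ON_INVALID): sizeof() only
 * type-checks expr at compile time and never evaluates it, so the
 * check cannot fire at run time.
 */
#define SKETCH_MM_CHECK_NODEBUG(expr) ((void)sizeof((long)(expr)), false)

/* Hypothetical helper: would an init call with these arguments BUG? */
bool sketch_init_would_bug(bool debug_mm, uint64_t start, uint64_t size)
{
	if (debug_mm)
		return SKETCH_MM_CHECK_DEBUG(start + size <= start);
	return SKETCH_MM_CHECK_NODEBUG(start + size <= start);
}

/* Usage example: the same zero-size call under both flavours. */
void sketch_flavour_examples(void)
{
	assert(sketch_init_would_bug(true, 0, 0));   /* debug build: fires */
	assert(!sketch_init_would_bug(false, 0, 0)); /* non-debug: silent */
}
```

If this model matches the reporter's setup, the same zero-size call would pass unnoticed on kernels built without the debug option.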

How

Q1: Why was drm_mm_init called with size = 0?

A1: ttm_range_man_init_nocheck passes its p_size parameter directly to drm_mm_init as the size argument (lines 197–198 of ttm_range_manager.c). p_size was 0, and no check is made for this case.

Q2: Why was ttm_range_man_init_nocheck called with p_size = 0?

A2: amdgpu_ttm_init_on_chip (lines 78–79 of amdgpu_ttm.c) passes size_in_page unchanged to ttm_range_man_init, an inline wrapper around ttm_range_man_init_nocheck. The value passed was adev->gds.gds_size.

Q3: Why is adev->gds.gds_size == 0?

A3: The AMD GFX 12 / RDNA4 architecture (used by the RX 9070 XT) does not have GDS (Global Data Share), GWS (Global Wave Sync), or OA (Ordered Append) hardware. These resources were present on GFX 11 and earlier. gfx_v11_0.c explicitly initialises them in gfx_v11_0_set_gds_init(). gfx_v12_0.c has no such function, so adev->gds.{gds,gws,oa}_size are left at their zero-initialised default values.

Root cause (Negative How): The amdgpu_ttm_init() code that initialises GDS/GWS/OA TTM memory pools at lines 2103–2119 of amdgpu_ttm.c has existed since before GFX 12 was introduced. It correctly assumes non-zero sizes because all previous architectures have these resources. When GFX 12 support was added to gfx_v12_0.c, no equivalent of gfx_v11_0_set_gds_init() was added (correctly, since the hardware no longer has GDS/GWS/OA), but the code in amdgpu_ttm_init() was not updated to guard against zero sizes.
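The three answers above describe a value that travels through three layers without being checked. A minimal user-space model of that chain follows. All sketch_* names are hypothetical, and the innermost layer returns -EINVAL purely so the model can be tested; the real drm_mm_init() returns void and hits BUG() instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Innermost layer: models drm_mm_init(); reports instead of crashing. */
static int sketch_mm_init(uint64_t start, uint64_t size)
{
	return (start + size <= start) ? -EINVAL : 0;
}

/* Middle layer: models the range manager; start is always 0. */
static int sketch_range_man_init(uint64_t p_size)
{
	return sketch_mm_init(0, p_size); /* p_size forwarded unchecked */
}

/* Outer layer: models the unguarded on-chip helper. */
int sketch_init_on_chip(uint64_t size_in_page)
{
	return sketch_range_man_init(size_in_page); /* no zero guard */
}

/* Usage example: GFX11-style and GFX12-style GDS sizes. */
void sketch_chain_examples(void)
{
	assert(sketch_init_on_chip(0x1000) == 0);  /* non-zero size: fine */
	assert(sketch_init_on_chip(0) == -EINVAL); /* zero size reaches the check */
}
```

None of the outer layers is wrong in isolation; the failure only appears once a legitimate zero enters at the top.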

Where

The bug: amdgpu_ttm_init_on_chip does not guard against size_in_page == 0. All callers that pass GDS/GWS/OA sizes will pass 0 for RDNA4 and any future architecture that lacks these resources.

Preferred fix location: amdgpu_ttm_init_on_chip() in amdgpu_ttm.c. Adding an early return for zero size is the most defensive approach and protects all three call sites (GDS, GWS, OA) at once.

Proposed fix (diff form):

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -74,6 +74,9 @@ static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
                        unsigned int type,
                        uint64_t size_in_page)
 {
+   if (!size_in_page)
+       return 0;
+
    return ttm_range_man_init(&adev->mman.bdev, type,
                  false, size_in_page);
 }

An alternative (slightly more explicit) fix is to guard the three call sites individually inside amdgpu_ttm_init():

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2100,18 +2100,24 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 
    /* Initialize various on-chip memory pools */
-   r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size);
-   if (r) {
-       dev_err(adev->dev, "Failed initializing GDS heap.\n");
-       return r;
+   if (adev->gds.gds_size) {
+       r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GDS, adev->gds.gds_size);
+       if (r) {
+           dev_err(adev->dev, "Failed initializing GDS heap.\n");
+           return r;
+       }
    }
 
-   r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size);
-   if (r) {
-       dev_err(adev->dev, "Failed initializing gws heap.\n");
-       return r;
+   if (adev->gds.gws_size) {
+       r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_GWS, adev->gds.gws_size);
+       if (r) {
+           dev_err(adev->dev, "Failed initializing gws heap.\n");
+           return r;
+       }
    }
 
-   r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size);
-   if (r) {
-       dev_err(adev->dev, "Failed initializing oa heap.\n");
-       return r;
+   if (adev->gds.oa_size) {
+       r = amdgpu_ttm_init_on_chip(adev, AMDGPU_PL_OA, adev->gds.oa_size);
+       if (r) {
+           dev_err(adev->dev, "Failed initializing oa heap.\n");
+           return r;
+       }
    }

The first (single-function) patch is preferred because it is smaller and protects any future callers of amdgpu_ttm_init_on_chip that might also receive a zero size.
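The behaviour of the single-function patch can be sketched in user space. This is a model under assumptions, not the driver code: the sketch_* names are hypothetical, a counter stands in for registering a TTM resource manager, and an assert() stands in for the DRM_MM_BUG_ON check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int sketch_pools_created;

/* Models the range manager: a zero size must never arrive here. */
static int sketch_range_man_init(uint64_t p_size)
{
	assert(p_size != 0); /* stands in for DRM_MM_BUG_ON */
	sketch_pools_created++;
	return 0;
}

/* Models the patched helper: a zero-sized pool is skipped, not an error. */
int sketch_init_on_chip_guarded(uint64_t size_in_page)
{
	if (!size_in_page)
		return 0;
	return sketch_range_man_init(size_in_page);
}

/*
 * Runs the three pool initialisations in sequence and returns the
 * number of pools created, or a negative error code.
 */
int sketch_init_pools(uint64_t gds, uint64_t gws, uint64_t oa)
{
	const uint64_t sizes[] = { gds, gws, oa };

	sketch_pools_created = 0;
	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		int r = sketch_init_on_chip_guarded(sizes[i]);

		if (r)
			return r;
	}
	return sketch_pools_created;
}

/* Usage example: GFX12-style zeros and GFX11-style sizes. */
void sketch_guard_examples(void)
{
	assert(sketch_init_pools(0, 0, 0) == 0);
	assert(sketch_init_pools(0x1000, 64, 16) == 3);
}
```

Because the guard lives in the helper, the three call sites stay unchanged. The model says nothing about teardown; whether the fini path tolerates a pool that was never registered would need to be checked against the real source.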

Bug introduction

The bug was introduced in two stages:

  1. The GDS/GWS/OA TTM pool initialisation code was added to amdgpu_ttm_init() in commit 473633540c2f (Christian König, 2020-07-23). At the time, all supported architectures had GDS and the code was correct.

  2. RDNA4 / GFX 12 support was added to gfx_v12_0.c without a gfx_v12_0_set_gds_init() equivalent (correctly, since the hardware removed GDS/GWS/OA), but the amdgpu_ttm_init() caller was not updated to handle zero sizes.

The primary bug introduction is the absence of a zero-size guard in amdgpu_ttm_init_on_chip() combined with the legitimate non-initialisation of adev->gds.*_size for GFX 12.

The specific introducing commit was not identified within the search budget; the RDNA4 bring-up series predates v6.18.16 and is the effective origin.

Analysis, conclusions and recommendations

Summary: Loading the amdgpu module for an AMD RX 9070 XT (RDNA4 / GFX 12) immediately crashes the kernel with a BUG assertion in drm_mm_init(). The root cause is that amdgpu_ttm_init() unconditionally tries to create TTM memory pools for GDS, GWS, and OA on-chip resources, passing size=0 to the DRM range allocator, which rejects a zero-sized range as invalid.

RDNA4 removed GDS/GWS/OA from the hardware; gfx_v12_0.c correctly leaves those size fields at zero. The missing piece is a zero-size guard in amdgpu_ttm_init_on_chip().

Recommendation for the reporter:

Apply the one-liner fix to amdgpu_ttm.c:

static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
                                    unsigned int type,
                                    uint64_t size_in_page)
{
    if (!size_in_page)        // ← add this guard
        return 0;
    return ttm_range_man_init(&adev->mman.bdev, type,
                              false, size_in_page);
}

No upstream fix was found in the git history through v6.19.13 / origin/master within the search budget. This appears to be an unresolved bug that should be reported to the amdgpu mailing list (amd-gfx@lists.freedesktop.org) with this analysis attached.

Confidence: High. The register dump (RSI=0, RDX=0) directly confirms the zero-size call, and the absence of any GDS initialisation in gfx_v12_0.c accounts for the zero value.