[SOLVED] AMDGPU error leads to crash

Post by Homebody » Jul 28th, '21, 03:07

Greetings,

I recently installed Mageia 8 (x86-64) using the Classic DVD installer, wiping my root (/) partition and retaining my /home from Mageia 7.1. Since then, my system (built around an AMD FX-8320e and ASRock 990FX Extreme4 motherboard) has been crashing sporadically, at least once per day. The crashes occur with both the 5.10.16 kernel from the DVD and the current 5.10.48, and they have happened under both Cinnamon and IceWM. My routine is 1) logging in, 2) enabling the Internet connection, 3) connecting to the VPN, 4) opening Firefox or LibreOffice or even Terminal, and then everything on screen except the mouse cursor becomes unresponsive. No clicks register, and pressing Caps Lock no longer toggles the LED. My only recourse has been to hold Alt+SysRq and press R E I S U B to reboot the system.

After using the command
Code: Select all
journalctl --since=today --priority=3
I saw these errors immediately prior to my reboot (with Cinnamon):
Code: Select all
Jul 23 10:42:35 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=14148, emitted seq=14150
Jul 23 10:42:35 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1226 thread Xorg:cs0 pid 1468
Jul 23 10:42:36 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1138, emitted seq=1140
Jul 23 10:42:36 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jul 23 10:43:13 localhost cinnamon-session[1596]: CRITICAL: t+389.69049s: We failed, but the fail whale is dead. Sorry....
Jul 23 10:43:14 localhost systemd[1]: Failed to start Flush Journal to Persistent Storage.
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-caribou\x2dautostart-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-mageia\x2dmgaonline-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-light\x2dlocker-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-org.gnome.Evolution\x2dalarm\x2dnotify-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-tracker\x2dminer\x2drss\x2d3-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-nm\x2dapplet-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-net_applet-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-user\x2ddirs\x2dupdate\x2dgtk-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-mageiawelcome-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-polkit\x2dmate\x2dauthentication\x2dagent\x2d1-autostart.service: File exists
Jul 23 10:43:14 localhost systemd-xdg-autostart-generator[4175]: Failed to create unit file /run/user/500/systemd/generator.late/app-tracker\x2dminer\x2dfs\x2d3-autostart.service: File exists
-- Reboot --


With IceWM, the log simply ends after the last drm:amdgpu_job_timedout entry; nothing else is recorded before the reboot.
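For anyone comparing their own logs against the ones above, here is a minimal sketch that pulls the ring timeout entries out of journal text and shows the gap between the signaled and emitted sequence numbers. The function name `ring_timeouts` is made up for illustration, and it assumes the message format is exactly the one shown in the log above.

```shell
# Summarize amdgpu ring timeouts from journal text on stdin.
# Prints one line per timeout: <ring> <signaled seq> <emitted seq> <gap>
# Assumes the "*ERROR* ring NAME timeout, signaled seq=N, emitted seq=M"
# wording seen in the log above.
ring_timeouts() {
    sed -n 's/.*\*ERROR\* ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
        awk '{ print $1, $2, $3, $3 - $2 }'
}

# Example usage (kernel messages from the previous boot):
#   journalctl -k -b -1 | ring_timeouts
```

Reading from stdin keeps the function usable on a saved log file as well as on live journalctl output.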

My video card is an AMD Radeon R7 250E. Here is the
Code: Select all
lspci -vvk
output:
Code: Select all
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] (prog-if 00 [VGA controller])
   Subsystem: XFX Pine Group Inc. Device 7251
   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
   Latency: 0, Cache Line Size: 64 bytes
   Interrupt: pin A routed to IRQ 39
   NUMA node: 0
   IOMMU group: 18
   Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
   Region 2: Memory at fea00000 (64-bit, non-prefetchable) [size=256K]
   Region 4: I/O ports at e000 [size=256]
   Expansion ROM at 000c0000 [disabled] [size=128K]
   Capabilities: [48] Vendor Specific Information: Len=08 <?>
   Capabilities: [50] Power Management version 3
      Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
      Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
   Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
      DevCap:   MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
      DevCtl:   CorrErr- NonFatalErr- FatalErr- UnsupReq-
         RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
         MaxPayload 128 bytes, MaxReadReq 512 bytes
      DevSta:   CorrErr+ NonFatalErr+ FatalErr- UnsupReq+ AuxPwr- TransPend-
      LnkCap:   Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
      LnkCtl:   ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
      LnkSta:   Speed 5GT/s (downgraded), Width x4 (downgraded)
         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
      DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
          10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
          FRS-
          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
          AtomicOpsCtl: ReqEn-
      LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
      LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
          Compliance De-emphasis: -6dB
      LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
          EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
          Retimer- 2Retimers- CrosslinkRes: unsupported
   Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
      Address: 00000000fee00000  Data: 0000
   Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
   Capabilities: [150 v2] Advanced Error Reporting
      UESta:   DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
      UEMsk:   DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
      UESvrt:   DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
      CESta:   RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
      CEMsk:   RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
      AERCap:   First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
         MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
      HeaderLog: 40001001 00000001 000b8000 00000000
   Capabilities: [200 v1] Physical Resizable BAR
      BAR 0: current size: 256MB, supported: 256MB 512MB 1GB
   Capabilities: [270 v1] Secondary PCI Express
      LnkCtl3: LnkEquIntrruptEn- PerformEqu-
      LaneErrStat: 0
   Kernel driver in use: amdgpu
   Kernel modules: radeon, amdgpu


I think Mageia 7.1 defaulted to the radeon driver for this card, but Mageia 8 now defaults to amdgpu. I did not have issues with this card running Mageia 7.1's 5.10.14 kernel.
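To check quickly which kernel driver is actually bound to the card, the "Kernel driver in use" line from lspci can be picked out with a small filter. This is only a sketch: the function name `vga_driver` is invented here, and it assumes the `lspci -k` output layout shown above (device line at column 0, detail lines indented).

```shell
# Print "<slot> <driver>" for each VGA device found in `lspci -k` output
# read from stdin.
vga_driver() {
    awk '
        # A device line starts at column 0 with the slot address.
        /^[0-9a-f]/ { slot = $1; is_vga = ($0 ~ /VGA compatible controller/) }
        # Indented detail line belonging to the most recent device.
        is_vga && /Kernel driver in use:/ { print slot, $NF }
    '
}

# Example usage:
#   lspci -k | vga_driver
```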

Does anyone think I have a hardware or software problem? What should my next steps be?

Thank you.
Last edited by Homebody on Aug 9th, '21, 16:11, edited 1 time in total.
Homebody
 
Posts: 3
Joined: Sep 21st, '17, 02:33

Re: AMDGPU error leads to crash

Post by doktor5000 » Jul 28th, '21, 18:10

This seems to be an issue with either the amdgpu driver or Mesa. You may need to look for more log entries around the point where the amdgpu driver detects the GPU hang.

There are quite a few different amdgpu issues with similar symptoms, though, so you may have to search around for the one that matches yours.
Have a look at e.g. https://bugs.mageia.org/show_bug.cgi?id=25882 or https://gitlab.freedesktop.org/drm/amd/-/issues/953

But it could also be as simple as disabling automatic power management, as in
https://forum.manjaro.org/t/graphics-gl ... ut/55979/4 and as also mentioned in the Mageia bug report linked above.
doktor5000
 
Posts: 17630
Joined: Jun 4th, '11, 10:10
Location: Leipzig, Germany

Re: AMDGPU error leads to crash

Post by Homebody » Jul 29th, '21, 16:14

Thank you for the guidance, doktor5000. I took your advice to disable automatic power management. Following the wiki's instructions at https://wiki.mageia.org/en/How_to_set_up_kernel_options, I typed the following options at boot:
Code: Select all
amdgpu.aspm=0 amdgpu.bapm=0 amdgpu.runpm=0
and hit Ctrl+X. I have spent about twenty minutes so far without a crash, so [fingers-crossed] that this workaround prevents future headaches.
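Since options typed at the boot menu are easy to mistype, it can help to confirm after booting that they actually reached the kernel. A minimal sketch, assuming the running command line is read from /proc/cmdline; the helper name `has_kernel_option` is made up for this example.

```shell
# Succeed if the exact option (e.g. "amdgpu.runpm=0") appears as a whole
# word in the given kernel command line string.
#   $1 = option to look for, $2 = kernel command line
has_kernel_option() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   for opt in amdgpu.aspm=0 amdgpu.bapm=0 amdgpu.runpm=0; do
#       if has_kernel_option "$opt" "$(cat /proc/cmdline)"; then
#           echo "$opt active"
#       else
#           echo "$opt missing"
#       fi
#   done
```

Padding both strings with spaces makes the match whole-word, so amdgpu.runpm=0 does not match a longer token that merely starts with it.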
Homebody
 
Posts: 3
Joined: Sep 21st, '17, 02:33

Re: AMDGPU error leads to crash

Post by Homebody » Jul 31st, '21, 18:07

Update (July 31): Another crash in IceWM (even with those power management options disabled) sent me back to the drawing board. While I was looking through https://bugs.mageia.org/show_bug.cgi?id=28154 for a different amdgpu bug, comment 17 indicated that one of the boot options (vga=791) should be deleted because it can conflict with DRM. My own boot options included
Code: Select all
vga=794
so I deleted that option and did not turn any power management off this time. Will update again if this solves the issue.

Update (August 9): After more than a week of normal use, I feel comfortable saying that the kernel option
Code: Select all
vga=794
was the culprit: no crashes have occurred since I removed it from the boot options. Thanks to doktor5000 for pointing me in the right direction!
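For others who want to check whether they are carrying the same option, here is a small sketch that takes a kernel command line and prints it with any vga= mode token removed. The function name `strip_vga_option` is hypothetical, and the sketch only transforms text; where the persistent boot options are stored on a given system is not covered here.

```shell
# Read a kernel command line from stdin and print it with any vga=<mode>
# token removed (matched case-insensitively). Other options are kept in
# their original order.
strip_vga_option() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if (tolower($i) !~ /^vga=/)
                out = out (out == "" ? "" : " ") $i
        print out
    }'
}

# Example usage (shows what the current command line would look like
# without the option; it does not change any configuration):
#   strip_vga_option < /proc/cmdline
```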
Homebody
 
Posts: 3
Joined: Sep 21st, '17, 02:33

