cyberspectre Posted May 27, 2020

I've suddenly started having problems whenever I try to launch a VM. My server was rock solid for well over a year until now. When I launch a VM, shortly after the guest OS boots, the web UI shows 6 or 7 CPU threads (random ones, not necessarily the ones pinned to the VM) spiking and staying at redline. Seconds later, the web UI becomes unresponsive, the entire system locks up, and I have no choice but to power-cycle.

I've attached my diagnostics file and a couple of persistent log files that may be helpful. I have noticed these worrying messages:

May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 10 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 192 QID 10 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 193 QID 10 timeout, aborting
May 26 16:05:11 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, reset controller
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Device not ready; aborting reset
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 558795512
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 588290920
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 586574048
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 573818160
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7

Do these messages indicate my cache SSD is failing? The VM domains are on the cache, set to "prefer." Thanks for the assistance!

Attachments: andras4-diagnostics-20200526-2048.zip, syslog-1589603478, syslog-1590534581, syslog-1590535014, syslog-1590541990, syslog-1590542150, syslog-1590546809
JorgeB Posted May 27, 2020

This can sometimes help with those NVMe errors: some NVMe devices have issues with power states on Linux. On the main GUI page, click on Flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option after "append":

nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference.
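For reference, once that's added, the default boot stanza on the flash drive (normally /boot/syslinux/syslinux.cfg) ends up looking roughly like the sketch below. The exact label and initrd lines depend on your Unraid version, so treat it as illustrative rather than something to copy verbatim:

label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot

Setting the maximum APST latency to 0 keeps the drive out of its deeper power-saving states, which is what seems to trip up some controllers. After the reboot you can confirm the parameter actually took effect from a terminal:

# the flag should show up in the running kernel's command line
cat /proc/cmdline

# and the live module parameter should read 0
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us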
cyberspectre (Author) Posted May 28, 2020

19 hours ago, johnnie.black said: "Some NVMe devices have issues with power states on Linux ... add nvme_core.default_ps_max_latency_us=0 ... Reboot and see if it makes a difference."

Thanks, I've done so. It seems to be working normally now. I'll report back if I have more issues.
cyberspectre (Author) Posted May 28, 2020

Nope, it's still happening. Even when things seem stable with a VM running, all I need to do is start a few Docker containers and the system fails. Basically, heavier I/O triggers it.
JorgeB Posted May 28, 2020

Look for a BIOS update, or try a different NVMe device.
cyberspectre (Author) Posted June 5, 2020

On 5/28/2020 at 1:14 AM, johnnie.black said: "Look for a BIOS update, or try a different NVMe device."

Thanks, johnnie.black. Updating the firmware seemed promising at first, but ultimately made no difference. Since it's clear to me now that this is an issue with the disk itself, I'm going to start a new thread in the hardware forum to get some more opinions.
cyberspectre (Author) Posted June 14, 2020

For posterity, I'd like to report that I solved the problem. Using smartctl / nvme-cli, I discovered that even though the disk's main temperature was within the acceptable range, one of its secondary temperature sensors (labeled Temperature Sensor 5) was reading 60-64 °C at idle and 70 °C or higher under load. This is most likely the temperature of the controller. Apparently, most NVMe controllers begin throttling at around 70 °C, so the I/O errors make sense.

I moved the disk from the M.2 slot to a PCIe adapter and installed a metal heatsink. The controller now runs 20 degrees cooler on average, and I haven't experienced any more issues. When it's right behind a graphics card, an M.2 slot is apparently not an ideal thermal environment for an SSD.
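In case it helps anyone hitting the same thing, the per-sensor readings can be pulled from the command line with something like the two commands below. /dev/nvme0 is just my device path, so adjust it for your system, and not every drive exposes the extra sensors:

# full SMART report, including the individual temperature sensors where supported
smartctl -a /dev/nvme0

# or the same health data via nvme-cli
nvme smart-log /dev/nvme0

The composite "Temperature" value is what the GUI normally shows; the "Temperature Sensor N" lines are the extra sensors, and the hot one in my case was most likely the controller.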
JorgeB Posted June 14, 2020

Thanks for reporting back. If you don't mind, I'm going to tag this as solved.
cyberspectre (Author) Posted October 2, 2020

Unfortunately, the issue has resurfaced. It is certainly connected to the temperature of the controller. Moving the SSD away from the graphics cards and installing a heatsink did help, but ultimately did not eliminate the problem. Certain activities can cause the temperature of the controller to spike suddenly, and when it does, it's lights out.

To anyone who reads this in the future: do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat.
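If you want to catch one of these spikes as it happens, a crude logging loop is enough. This is just a sketch: it assumes nvme-cli is available and that the cache drive is nvme0, so adjust both for your system:

# log a timestamped temperature snapshot every 10 seconds
while true; do
  date '+%F %T' >> /tmp/nvme_temps.log
  nvme smart-log /dev/nvme0 | grep -i temperature >> /tmp/nvme_temps.log
  sleep 10
done

Start it before firing up the VM and containers (nohup or a screen session works), and the last entries written before a lockup will show how hot the controller sensor got.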
debit lagos Posted October 16, 2020

On 10/2/2020 at 1:13 AM, cyberspectre said: "To anyone who reads this in the future: do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat."

I honestly wish I had found this about three days ago, when I moved 900 GB worth of content to my P5 drive. I will be replacing it with a 970 Plus. Thank you for posting this.
R3nFoly Posted November 28, 2020

I'm facing the same issue with a Pioneer APS-SE20G-2T_SJ08C21153WL - 2 TB (nvme0n1). This is driving me nuts. I'll probably replace the drive; thank god it's still under warranty. Which NVMe drives are you guys using?
voltbit Posted November 28, 2020

I have the exact same problem with a Kingston A2000, 500 GB, NVMe, M.2. For my use case I create 10-12 containers in quick succession, and the system is guaranteed to crash. If anyone can recommend a drive that definitely does not have this issue, I would be grateful. I will return my drive under warranty and probably just use the HDD until I find a solution.
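For anyone who wants to reproduce that kind of burst on their own hardware, something along these lines approximates it. The image and container count are placeholders (busybox is just an arbitrary small image), so swap in whatever matches your setup:

# spin up a dozen throwaway containers back to back
for i in $(seq 1 12); do
  docker run -d --name nvme-stress-$i busybox sleep 600
done

# tear them down afterwards
docker rm -f $(docker ps -aq --filter name=nvme-stress-)

On my system, that burst of container creation is what reliably triggers the crash.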
onufry Posted January 18, 2021

Hey, same problem here. Initially I had Sabrent 1TB Rocket drives in a pool and the issue happened several times, so I replaced the Sabrents with Samsung 980 PRO 1TB drives. Everything worked fine for about two weeks, and today the same issue hit again: I "lost" one of the drives. I do have the nvme_core.default_ps_max_latency_us=0 line, and I see the cache showing "Not Installed" after reboots. Any suggestions?
xabi Posted January 20, 2021 (edited)

On 11/28/2020 at 8:34 PM, voltbit said: "I have the exact same problem with a Kingston A2000, 500 GB, NVMe, M.2 ..."

Same disk, same issue. Did you find a solution? Thanks in advance.
-jim Posted February 20, 2021

Apparently I am having the same issue with a Crucial P5 2TB PCIe M.2 2280SS SSD. I saw notifications about the device like:

20-02-2021 15:26 Unraid device nvme0n1 message
Notice [TOWER] - device nvme0n1 returned to normal temperature
CT2000P5SSD8_20362A61D26D (nvme0n1)
normal

and

Feb 20 15:25:30 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, aborting
Feb 20 15:25:47 Tower kernel: nvme nvme0: I/O 1 QID 0 timeout, reset controller
Feb 20 15:26:00 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, reset controller
Feb 20 15:26:54 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:26:54 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 0
Feb 20 15:26:54 Tower kernel: nvme nvme0: Abort status: 0x7
Feb 20 15:27:00 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:27:00 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Feb 20 15:27:05 Tower emhttpd: error: ckmbr, 2030: Input/output error (5): read: /dev/nvme0n1
Feb 20 15:27:05 Tower emhttpd: import 30 cache device: (nvme0n1) CT2000P5SSD8_20362A61D26D
Feb 20 15:27:05 Tower emhttpd: import flash device: sdb
Feb 20 15:27:05 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:27:05 Tower kernel: Buffer I/O error on dev nvme0n1, logical block 0, async page read
Feb 20 15:27:05 Tower kernel: nvme nvme0: failed to set APST feature (-19)

So which drives will work? Thanks, -jim
cyberspectre (Author) Posted February 24, 2021

Months after replacing the Crucial drive with a Samsung 970 Evo, I'm pleased to say this issue has not happened since. Not even once. The Samsung doesn't miss a beat.