Tesla3327 Posted June 8, 2022

Hello,

I posted this on the ASRock forum too because I believe the problem is hardware based, but would appreciate any additional insight from people if possible.

I use an ASRock Z690M Phantom Gaming 4 as the MB in my unRAID system. Full specs at the end.

TL;DR: drives plugged into SATA3_0, SATA3_1, & M.2_2 have started to spontaneously disappear and are not recognized by the MB again until the system is completely shut down and powered back up.

Important to note that I have a cache pool consisting of two SSDs in RAID1. 4 days ago I received a notification that one of the drives in my cache pool had gone missing. It was a Western Digital PCIe gen 3 NVMe drive installed in the 2nd M.2 slot of my MB. I did not have anything installed on the SATA port that is shared with that M.2 slot (SATA3_0). I assumed that the drive had just failed, pulled it from the system, and went to order a replacement.

I ended up deciding to replace both cache drives with two Samsung 870 EVO SATA SSDs. I installed these drives into the SATA3_0 & SATA3_1 slots. I still did not have anything in the second M.2 slot. The next day I got another notification that a drive had gone missing.

Through some troubleshooting I have determined that the drives will disappear and will not re-appear until the system is shut down. Rebooting the system via the unRAID web interface does not cause the drives to re-appear. To reiterate that last part: if a drive has disappeared and I hit "Reboot", the drives do not re-appear; however, if I shut down the system completely, and then walk over and turn it back on, the drives re-appear (for a little while).

Also important to note: I have two other SSDs plugged into SATA3_2 & SATA3_3 that have not disappeared at all.

After the brand new drives disappeared I did try upgrading unRAID from 6.10.0 to 6.10.2. No help. Also, my MB was running BIOS V3.01, but I did upgrade it to V7.02 during troubleshooting and it made no difference.

I have a feeling there may be an issue with the MB chipset, but I'm not sure why it would affect M.2_2, SATA3_0, & SATA3_1 only.

System Specs:
CPU: Intel Core i7-12700K 3.6 GHz 12-Core Processor
Motherboard: ASRock Z690M Phantom Gaming 4 Micro ATX LGA1700 Motherboard
Memory: Corsair Vengeance LPX 64GB (4 x 16 GB) DDR4-3600 CL18 Memory
x2 ADATA SU630 480 GB 2.5" SSD in RAID1 used for docker containers and VM drives - plugged into SATA3_2 & SATA3_3
x1 WD Black SN750 1 TB NVMe SSD - 1 of the NVMe drives used in the original cache pool - installed in M.2_1
x1 WD Green 960 GB NVMe SSD - the other NVMe drive, which was the first drive to go missing - previously installed in M.2_2, now uninstalled
x2 Samsung 870 EVO 1 TB 2.5" SSD - in RAID1 as the new cache pool - plugged into SATA3_0 & SATA3_1
x8 various 3.5" HDDs for the array - no issues with any of these drives - installed via LSI card flashed into IT mode

tomservo-diagnostics-20220608-1410.zip
JorgeB Posted June 9, 2022

10 hours ago, Tesla3327 said: I have two other SSDs plugged into SATA3_2 & SATA3_3 that have not disappeared at all.

What if you swap ports between those different pools? That would confirm whether or not the problem is with the ports.
Tesla3327 (Author) Posted June 9, 2022 (edited)

6 hours ago, JorgeB said: What if you swap ports between those different pools, it would confirm the problem is with the ports or not.

Thank you for the suggestion. The cache drives sat idle overnight without any issues, so before trying your suggestion I copied some files to and from those drives to see if that triggered the disks going missing. Sure enough, that did it. So it seems to happen only when the drives experience load.

So I swapped the ports used for the system drives with the ones used for the cache drives, as you suggested, and tried to reproduce. Interestingly enough, while I was copying files (from the array to the system drives), one of the cache drives (which were sitting idle) went missing. So it seems like it's only when the chipset is under load?

Do you think it could be the SATA cables? My gut says no, since the NVMe drive also went missing at some point.

Please let me know if you have any other suggestions, and thanks again for taking the time to read it the first time.

tomservo-diagnostics-20220609-0853.zip

Edited June 9, 2022 by Tesla3327
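For anyone who wants to reproduce the "goes missing under load" trigger described above in a repeatable way, here is a minimal sketch in Python. It is not what was run in this thread (the copies there were done by hand); the function name `write_and_verify` and its parameters are made up for illustration. It writes a temporary file of a chosen size into a directory on the pool under test, forces it to disk, then reads it back and compares checksums.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of random data into a temp file in `directory`,
    fsync it, read it back, and return True if the SHA-256 digests match.

    `directory` is an assumption: any path that lives on the pool under test.
    If the drive drops out mid-run, expect an OSError instead of a return value.
    """
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, prefix="loadtest-")
    try:
        with os.fdopen(fd, "wb") as handle:
            remaining = total_bytes
            while remaining > 0:
                chunk = os.urandom(min(chunk_size, remaining))
                handle.write(chunk)
                written.update(chunk)
                remaining -= len(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # push the data out to the device

        read_back = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                read_back.update(chunk)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        # Clean up the temp file; this can also fail if the drive is gone.
        os.remove(path)
```

Something like `write_and_verify("/mnt/cache", 32 * 1024**3)` would roughly match the 32GB copy mentioned later in the thread; the `/mnt/cache` path is an assumption about where the cache pool is mounted. Note that the read-back may be served from the page cache rather than the drive, so this mainly exercises the write path.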
JorgeB Posted June 9, 2022

36 minutes ago, Tesla3327 said: Do you think it could be the SATA cables?

That's what the log suggests: some kind of power/connection problem. Also note that in my experience Samsung SSDs are very picky about SATA cable quality, so it's still worth trying different cables if you have some. Recent locking cables that come with current motherboards usually work very well.
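As a rough way to look for the kind of power/connection symptoms mentioned above, the sketch below counts link-related kernel messages per ATA port in a syslog file. The keyword list is an assumption based on the usual wording of Linux libata messages, not something taken from the diagnostics attached in this thread, and `count_link_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches a libata-style prefix such as "ata1: ..." or "ata2.00: ...".
ATA_LINE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")

# Assumed keywords: common libata wording for link trouble. Adjust to
# whatever actually shows up in your own syslog.
KEYWORDS = (
    "hard resetting link",
    "SATA link down",
    "failed command",
    "exception Emask",
)


def count_link_events(lines):
    """Return a Counter mapping ATA port name (e.g. 'ata1') to the number
    of lines that mention one of KEYWORDS for that port."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and any(keyword in match.group(2) for keyword in KEYWORDS):
            counts[match.group(1)] += 1
    return counts
```

It can be fed an open file, for example `count_link_events(open(path_to_syslog))`, where `path_to_syslog` is wherever your syslog was extracted to. A port that racks up many events while its neighbours stay at zero points toward that port's cable, power feed, or drive; events spread across several ports would point more toward something shared.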
Tesla3327 (Author) Posted June 9, 2022

58 minutes ago, JorgeB said: recent locking cables that come with current motherboards usually work very well

That's what I was using, actually. I just tried a different set of cables (I used to use these on my array drives before getting the HBA card), and the drive still goes missing under load...

I think I'll just order some brand new ones to try just in case, but until they arrive do you have any other suggestions?

Tesla3327 (Author) Posted June 9, 2022

Just tried different power cables too. No dice.

Tesla3327 (Author) Posted June 9, 2022

2 minutes ago, trurl said: Any splitters?

Thanks for helping out too! Not on these drives. The only splits that happen are the SAS -> SATA breakout cables on the HBA card. All the cache drives are plugged directly into the MB, and the power cables are run directly from the PSU. The two ADATA system drives are on a separate power cable from the Samsung cache drives too.

Tesla3327 (Author) Posted June 9, 2022

For the record, I just tried everything in Safe Mode and still had the same problem.
nomadgeek Posted June 9, 2022 (edited)

I have also had this problem when my NVMe cache disk is under some load (usually in the middle of the night while backups are running/uploading to it). The drive is connected to the port directly on the motherboard, and simply turning off the server and turning it back on solves the issue (but a reboot does not).

I haven't been able to find a long-term solution. Backups occur every night but the failure only happens every few months (but then maybe twice in a few days). Seems pretty random.

Dell Inc. 07WP95, Version A02
Dell Inc., Version 1.7.0

Edited June 9, 2022 by nomadgeek (more detail)
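Since the failure described above tends to happen unattended in the middle of the night, one low-effort aid is a small watcher that records when a block device drops off. The sketch below polls `/sys/block` (where Linux lists block devices by kernel name) and prints a timestamped line when an expected device is absent. The function names and the polling interval are illustrative assumptions, and the device names you pass in have to match your own system.

```python
import os
import time


def missing_devices(expected, sys_block="/sys/block"):
    """Return the sorted list of expected block device names that are
    not currently listed under sys_block."""
    present = set(os.listdir(sys_block))
    return sorted(set(expected) - present)


def watch(expected, interval_seconds=30, sys_block="/sys/block"):
    """Poll forever, printing a timestamped line whenever a device is missing."""
    while True:
        gone = missing_devices(expected, sys_block)
        if gone:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} missing: {', '.join(gone)}", flush=True)
        time.sleep(interval_seconds)
```

Calling `watch(["nvme0n1", "sdb"])` with your own device names gives a timestamp to line up against the backup schedule and the syslog. Keep in mind that `sdX` names are assigned at boot and can change between boots, so the list may need checking after each power cycle.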
Tesla3327 (Author) Posted June 9, 2022

Interesting. Glad I'm not alone at least. For me, though, simply trying to copy a 32GB folder onto the cache drive is enough to do it. Would you mind posting your build's specs if you have the time? I am considering a new MB at this point.
nomadgeek Posted June 9, 2022

PCI Devices and IOMMU Groups

IOMMU group 0:
  [8086:3e1f] 00:00.0 Host bridge: Intel Corporation 8th Gen Core 4-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (rev 08)
IOMMU group 1:
  [8086:1901] 00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 08)
  [8086:1533] 01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
IOMMU group 2:
  [8086:3e91] 00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]
IOMMU group 3:
  [8086:1911] 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
IOMMU group 4:
  [8086:a379] 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
IOMMU group 5:
  [8086:a36d] 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
    Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
    Bus 001 Device 002 Port 1-5 ID 0781:5575 SanDisk Corp. Cruzer Glide
    Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub
  [8086:a36f] 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
IOMMU group 6:
  [8086:a360] 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
IOMMU group 7:
  [8086:a352] 00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
    [1:0:0:0] disk ATA Samsung SSD 860 4B6Q /dev/sdb 1.00TB
    [3:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdc 4.00TB
IOMMU group 8:
  [8086:a33c] 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #5 (rev f0)
IOMMU group 9:
  [8086:a33e] 00:1c.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #7 (rev f0)
IOMMU group 10:
  [8086:a330] 00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
IOMMU group 11:
  [8086:a304] 00:1f.0 ISA bridge: Intel Corporation H370 Chipset LPC/eSPI Controller (rev 10)
  [8086:a348] 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
  [8086:a323] 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
  [8086:a324] 00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
IOMMU group 12:
  [10ec:8168] 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
IOMMU group 13:
  [1b4b:9215] 03:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)
    [6:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdd 4.00TB
IOMMU group 14:
  [10ec:5763] 04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5763 (rev 01)
    [N:0:1:1] disk ADATA SX6000LNP__1 /dev/nvme0n1 1.02TB

CPU Thread Pairings

Single: cpu 0
Single: cpu 1
Single: cpu 2
Single: cpu 3

USB Devices

Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002 Port 1-5 ID 0781:5575 SanDisk Corp. Cruzer Glide
Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

SCSI Devices

[0:0:0:0] disk SanDisk Cruzer Glide 1.00 /dev/sda 31.4GB
[1:0:0:0] disk ATA Samsung SSD 860 4B6Q /dev/sdb 1.00TB
[3:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdc 4.00TB
[6:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdd 4.00TB
[N:0:1:1] disk ADATA SX6000LNP__1 /dev/nvme0n1 1.02TB
nomadgeek Posted June 9, 2022

17 minutes ago, Tesla3327 said: Interesting. Glad I'm not alone at least. For me though simply trying to copy a 32GB folder onto the cache drive is enough to do it. Would you mind posting your build's specs if you have the time? I am considering a new MB at this point.

Pretty frustrating though - it means I have to come in early the next morning to power off the server and turn it back on before any of my users get to the office and start to complain they can't get to any of their files.

trurl Posted June 10, 2022

Have you checked for firmware updates?

Tesla3327 (Author) Posted June 10, 2022

51 minutes ago, trurl said: Have you checked for firmware updates?

For the motherboard? I did upgrade to the latest BIOS version sometime before I initially posted.

trurl Posted June 10, 2022

For the NVMe. I had to update my Samsung to get it to play well on my latest Windows build.

Tesla3327 (Author) Posted June 10, 2022

Thank you for the suggestion, I had not considered that before. I just checked, and unfortunately both of my drives were already up to date with firmware. 😕
Tesla3327 (Author) Posted June 10, 2022

Update: I received my new SATA cables and tried them out. Unbeknownst to me, I swapped the power cables that went to the ADATA system drives with the power cables that were going to the Samsung drives. When powering on the computer I heard a pop and smelled ozone. Both ADATA drives are now dead.

So it seems like there WAS an issue with the power cable heading towards the Samsung drives. I can't say for sure that this was what was causing the drives to disappear, but it's the best theory so far. That doesn't explain to me why the NVMe drive disappeared in the first place (or why nomadgeek was having a similar problem), but I had to move that drive back into my system now that the ADATA drives are dead, so I'll be able to see if the problem comes back.

Thanks again to JorgeB and trurl for shooting ideas at me while I tried to troubleshoot.