
Cache Drives Disappearing Randomly


Hello, I posted this on the ASRock forum too because I believe the problem is hardware-based, but I would appreciate any additional insight from people here.

I use an ASRock Z690M Phantom Gaming 4 as the MB in my unRAID system. Full specs are at the end. TL;DR: drives plugged into SATA3_0, SATA3_1, and M.2_2 have started to spontaneously disappear and are not recognized by the MB again until the system is completely shut down and powered back up.

It is important to note that I have a cache pool consisting of two SSDs in RAID1. Four days ago I received a notification that one of the drives in my cache pool had gone missing. It was a Western Digital PCIe Gen 3 NVMe drive installed in the second M.2 slot of my MB. I did not have anything installed on the SATA port that is shared with that M.2 slot (SATA3_0). I assumed that the drive had just failed, pulled it from the system, and went to order a replacement. I ended up deciding to replace both cache drives with two Samsung 870 EVO SATA SSDs. I installed these drives into the SATA3_0 & SATA3_1 ports. I still did not have anything in the second M.2 slot.

The next day I got another notification that a drive had gone missing. Through some troubleshooting I have determined that the drives will disappear and will not reappear until the system is shut down. Rebooting the system via the unRAID web interface does not cause the drives to reappear.

 

To reiterate that last part: if a drive has disappeared and I hit "Reboot", the drives do not reappear; however, if I shut down the system completely and then walk over and turn it back on, the drives reappear (for a little while).

Also important to note: I have two other SSDs plugged into SATA3_2 & SATA3_3 that have not disappeared at all. After the brand new drives disappeared I tried upgrading unRAID from 6.10.0 to 6.10.2, which did not help. My MB was also running BIOS V3.01; I upgraded it to V7.02 during troubleshooting and it made no difference.

I have a feeling there may be an issue with the MB chipset, but I'm not sure why it would affect M.2_2, SATA3_0, & SATA3_1 only.

System Specs:
 

CPU: Intel Core i7-12700K 3.6 GHz 12-Core Processor
Motherboard: ASRock Z690M Phantom Gaming 4 Micro ATX LGA1700 Motherboard
Memory: Corsair Vengeance LPX 64GB (4 x 16 GB) DDR4-3600 CL18 Memory
x2 ADATA SU630 480 GB 2.5" SSD in RAID1 used for docker containers and VM Drives - plugged into SATA3_2 & SATA3_3
x1 WD Black SN750 1 TB NVME SSD - 1 of the NVMEs used in the original Cache Pool - Installed in M.2_1
x1 WD Green 960 GB NVME SSD - The other NVME that was the first drive to go missing - Previously installed in M.2_2, now uninstalled

x2 Samsung 870 EVO 1 TB 2.5" SSD - in RAID1 as the new cache pool. Plugged into SATA3_0 & SATA3_1
x8 Various 3.5" HDDs for the array - No issues with any of these drives - Installed via LSI Card flashed into IT mode

tomservo-diagnostics-20220608-1410.zip

Link to comment
6 hours ago, JorgeB said:

What if you swap ports between those different pools? It would confirm whether the problem is with the ports or not.

Thank you for the suggestion.

 

The Cache drives sat idle overnight without any issues, so before trying your suggestion I copied some files to and from those drives to see if that triggered the disks going missing. Sure enough that did it. So it seems to happen only when the drives experience load.
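One way to make this kind of check repeatable is to snapshot the kernel's block device list before and after a copy and compare the two. The Python sketch below does that, assuming the system exposes devices under `/sys/block` (standard on Linux). The function names and the example device names are hypothetical and are not taken from the attached diagnostics.

```python
import os


def list_block_devices(sys_block="/sys/block"):
    """Return the set of block device names currently listed in sysfs."""
    try:
        return set(os.listdir(sys_block))
    except OSError:
        # Not on Linux, or sysfs is unavailable: report nothing.
        return set()


def vanished(before, after):
    """Return, sorted, the devices present in `before` but absent in `after`."""
    return sorted(set(before) - set(after))


if __name__ == "__main__":
    # Hypothetical snapshots: sdf drops out during the copy.
    before = {"sda", "sdb", "sdf", "nvme0n1"}
    after = {"sda", "sdb", "nvme0n1"}
    print("Missing after load:", vanished(before, after))
    print("Devices visible right now:", sorted(list_block_devices()))
```

In practice, `list_block_devices()` would be called once before starting the copy and again afterwards, with both results passed to `vanished()`.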

 

So I swapped the ports used for the system drives with the ones used for the cache drives, as you suggested, and tried to reproduce. Interestingly enough, while I was copying files (from the array to the system drives), one of the cache drives (which were sitting idle) went missing. So it seems like it only happens when the chipset is under load?

 

Do you think it could be the SATA cables? My gut says no since the NVME Drive also went missing at some point. Please let me know if you have any other suggestions, and thanks again for taking the time to read it the first time.

tomservo-diagnostics-20220609-0853.zip

Edited by Tesla3327
Link to comment
36 minutes ago, Tesla3327 said:

Do you think it could be the SATA cables?

That's what the log suggests: some kind of power/connection problem. Also note that in my experience Samsung SSDs are very picky about SATA cable quality, so it's still worth trying different cables if you have some; recent locking cables that come with current motherboards usually work very well.
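For anyone who wants to look for the same signs in their own syslog, here is a minimal Python sketch. It assumes the Linux libata driver reports link trouble with phrases such as `hard resetting link`, `SATA link down`, `failed command`, and `exception Emask`; exact wording can differ between kernel versions. The sample lines are illustrative only and are not copied from the diagnostics in this thread.

```python
import re
from collections import Counter

# Phrases assumed to indicate SATA link or command errors in kernel logs.
LINK_ERROR_PATTERNS = (
    "hard resetting link",
    "SATA link down",
    "failed command",
    "exception Emask",
)

# Matches port prefixes such as "ata3:" or "ata3.00:".
ATA_PORT = re.compile(r"\bata(\d+)(?:\.\d+)?:")


def count_link_errors(lines):
    """Count suspicious log lines per ATA port."""
    counts = Counter()
    for line in lines:
        if any(pattern in line for pattern in LINK_ERROR_PATTERNS):
            match = ATA_PORT.search(line)
            port = f"ata{match.group(1)}" if match else "unknown"
            counts[port] += 1
    return counts


# Illustrative sample lines (hypothetical, not from a real log).
SAMPLE = [
    "kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen",
    "kernel: ata3: hard resetting link",
    "kernel: ata3: SATA link down (SStatus 0 SControl 300)",
    "kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
]

if __name__ == "__main__":
    for port, hits in count_link_errors(SAMPLE).most_common():
        print(port, hits)
```

A port that accumulates many such lines while others stay clean points at that port's cable, power feed, or drive rather than at the controller as a whole.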

Link to comment
58 minutes ago, JorgeB said:

recent locking cables that come with current motherboards usually work very well

That's what I was using, actually. I just tried a different set of cables (I used to use these on my array drives before getting the HBA card), and the drive still goes missing under load. I think I'll order some brand new ones to try just in case, but until they arrive do you have any other suggestions?

Link to comment
2 minutes ago, trurl said:

Any splitters?

Thanks for helping out too!

 

Not on these drives. The only splits are the SAS -> SATA breakout cables on the HBA cards. All the cache drives are plugged directly into the MB, and their power cables run directly from the PSU. The two ADATA system drives are also on a separate power cable from the Samsung cache drives.

Link to comment

I have also had this problem when my NVME cache disk is under some load (usually in the middle of the night while backups are running/uploading to it). The drive is connected to the port directly on the motherboard and simply turning off the server and turning it back on solves the issue (but a reboot does not). I haven't been able to find a long-term solution.

 

Backups occur every night, but the failure only happens every few months (and then maybe twice within a few days). It seems pretty random.

 

Dell Inc. 07WP95, Version A02
Dell Inc., Version 1.7.0

Edited by nomadgeek
more detail
Link to comment
Quote

PCI Devices and IOMMU Groups

IOMMU group 0:[8086:3e1f] 00:00.0 Host bridge: Intel Corporation 8th Gen Core 4-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (rev 08)

IOMMU group 1:[8086:1901] 00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 08)

[8086:1533] 01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

IOMMU group 2:[8086:3e91] 00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]

IOMMU group 3:[8086:1911] 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model

IOMMU group 4:[8086:a379] 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)

IOMMU group 5:[8086:a36d] 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)

Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 001 Device 002 Port 1-5 ID 0781:5575 SanDisk Corp. Cruzer Glide

Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

[8086:a36f] 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)

IOMMU group 6:[8086:a360] 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)

IOMMU group 7:[8086:a352] 00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)

[1:0:0:0] disk ATA Samsung SSD 860 4B6Q /dev/sdb 1.00TB

[3:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdc 4.00TB

IOMMU group 8:[8086:a33c] 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #5 (rev f0)

IOMMU group 9:[8086:a33e] 00:1c.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #7 (rev f0)

IOMMU group 10:[8086:a330] 00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)

IOMMU group 11:[8086:a304] 00:1f.0 ISA bridge: Intel Corporation H370 Chipset LPC/eSPI Controller (rev 10)

[8086:a348] 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)

[8086:a323] 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)

[8086:a324] 00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)

IOMMU group 12:[10ec:8168] 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

IOMMU group 13:[1b4b:9215] 03:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)

[6:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdd 4.00TB

IOMMU group 14:[10ec:5763] 04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5763 (rev 01)

[N:0:1:1] disk ADATA SX6000LNP__1 /dev/nvme0n1 1.02TB

 

 

CPU Thread Pairings

Single:cpu 0

Single:cpu 1

Single:cpu 2

Single:cpu 3

 

USB Devices

Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 001 Device 002 Port 1-5 ID 0781:5575 SanDisk Corp. Cruzer Glide

Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

 

SCSI Devices

[0:0:0:0] disk SanDisk Cruzer Glide 1.00 /dev/sda 31.4GB

[1:0:0:0] disk ATA Samsung SSD 860 4B6Q /dev/sdb 1.00TB

[3:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdc 4.00TB

[6:0:0:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdd 4.00TB

[N:0:1:1] disk ADATA SX6000LNP__1 /dev/nvme0n1 1.02TB

 

Link to comment
17 minutes ago, Tesla3327 said:

Interesting. Glad I'm not alone at least. For me though simply trying to copy a 32GB folder onto the cache drive is enough to do it.

 

Would you mind posting your build's specs if you have the time? I am considering a new MB at this point.

 

Pretty frustrating, though: it means I have to come in early the next morning to power off the server and turn it back on before any of my users get to the office and start to complain they can't get to any of their files.

Link to comment

Update: I received my new SATA cables and tried them out. Without realizing it, I swapped the power cables that went to the ADATA system drives with the power cables that were going to the Samsung drives. When powering on the computer I heard a pop and smelled ozone. Both ADATA drives are now dead. So it seems like there WAS an issue with the power cable going to the Samsung drives. I can't say for sure that this was what caused the drives to disappear, but it's the best theory so far.

 

That doesn't explain why the NVMe drive disappeared in the first place (or why nomadgeek was having a similar problem), but I had to move that drive back into my system now that the ADATA drives are dead, so I'll be able to see if the problem comes back.

 

Thanks again to JorgeB and trurl for shooting ideas at me while I tried to troubleshoot.

Link to comment
