Array advice

Skipdog · July 30, 2019

Need some array/disk advice.

System is running 6.7.2. 2 x 8TB parity. 12 x 8TB data drives running btrfs. Supermicro X9DRH-7TF. Dual Xeon E5-2670 v2s.

Adventure started when one of the two parity drives started throwing errors. Is it normal that the drive spins down when it encounters errors? Or does this happen as a result of read errors? I shut the system down and replaced the drive with a spare 8TB I had for this purpose. I started the parity rebuild, and during the rebuild one of the 8TB data drives (Disk 8) ran into 2560 errors and also dismounted (red X, and it says it's spun down).

Should I finish parity rebuild? Should I dismount array, turn off and remove power in order to try to get Disk 8 to spin back up?

Parity rebuild is at 58% with approximate 7 hours remaining.

Skip

JorgeB · July 31, 2019

Please post the diagnostics: Tools -> Diagnostics

Skipdog · July 31, 2019

Parity finished rebuilding. I'm actually not sure if I lost data or not. Drive 8 is available, but no longer in the array configuration. Unraid is prompting me that I can format it. It almost seems like several drives dropped out around the same time. I took one of the three out and have been testing it offline and can find no SMART errors. Is it a Supermicro backplane issue that caused the disconnects, or are the drives really flaking out? Two of the drives are running pre-clear again. One of them failed but I was able to restart the job. None of the Western Digital drives experienced any issues during this.

Another random question - what is a safe temperature for the drives in the Supermicro? I'm not hitting the warning for the most part -- usually 100-108 F.

Attaching diag file.

Thank you kindly.

Skip

JorgeB · July 31, 2019

Diags are from after rebooting, so they're not much help on what happened. Multiple disks failing at the same time is most often a connection/power problem.

As for disk8, most likely the rebuild is corrupt. See if the old disk8 mounts with the UD (Unassigned Devices) plugin.

10 minutes ago, Skipdog said:

usually 100-108 F.

Disks should be below 40C, 45C tops.
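
For comparison, the reported 100-108 F range can be converted to Celsius. The snippet below is only a small sketch of that arithmetic; the function name `fahrenheit_to_celsius` is made up for illustration.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# The two ends of the range quoted in the thread.
for temp_f in (100, 108):
    print(f"{temp_f} F = {fahrenheit_to_celsius(temp_f):.1f} C")
```

By this conversion 100 F is about 37.8 C and 108 F is about 42.2 C, so the low end of the range is under 40C and the high end sits between 40C and the 45C ceiling.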

Skipdog · July 31, 2019

OK, so I think yes I made some mistakes on this one.

When something like this happens, should I generate diags and then power down completely, and see what comes up after the reboot? When this happened I did notice that disks were spun down. I've got two disks pre-clearing right now. Thanks for the pointer to UD.

JorgeB · July 31, 2019

You should grab the diags before rebooting, but no need to power down.

Skipdog · August 1, 2019

OK, so pre-clear on two drives failed at different times. One looks like it failed during read. The other I'm not sure about. It looks like I/O errors. Drives are in sleds in the Supermicro. Diags uploaded.

Skip

tower-diagnostics-20190801-1256.zip

JorgeB · August 1, 2019

Looks like a connection problem: one disk disconnected and then reconnected, the other disconnected and stayed disconnected. Try the disks in different slots. If there are still issues, try replacing the miniSAS cable(s). If all that doesn't help, it could be an expander, HBA or PSU problem.

Skipdog · August 12, 2019

Still struggling with this, as only the Seagate 8TB EXOS drives continue to have problems.

I have another Supermicro and I'm seeing these Seagate drives produce problems in that one too. Completely different chassis, HBA, cables.

My system failed to rebuild parity on one of the drives just now, and then after a reboot, the drive that did rebuild correctly decided to drop out. I didn't reboot this time and captured diags.

Here is a snippet of the logs from the drive when it happened:

Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 4096-byte physical blocks
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write Protect is off
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Mode Sense: 7f 00 10 08
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 12 10:43:12 Tower kernel: sdk: sdk1
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Attached SCSI disk
Aug 12 10:43:12 Tower kernel: BTRFS: device fsid e2a97d07-5b8c-4f40-9733-01c1060204f8 devid 1 transid 387385 /dev/sdk1
Aug 12 10:43:38 Tower emhttpd: ST8000NM0055-1RM112_ZA105CRR (sdk) 512 15628053168
Aug 12 10:43:38 Tower kernel: mdcmd (1): import 0 sdk 64 7814026532 0 ST8000NM0055-1RM112_ZA105CRR
Aug 12 10:43:38 Tower kernel: md: import disk0: (sdk) ST8000NM0055-1RM112_ZA105CRR size: 7814026532
Aug 12 10:43:41 Tower emhttpd: shcmd (25): /usr/local/sbin/set_ncq sdk 1
Aug 12 10:43:41 Tower root: set_ncq: setting sdk queue_depth to 1
Aug 12 10:43:41 Tower emhttpd: shcmd (26): echo 128 > /sys/block/sdk/queue/nr_requests
Aug 12 10:46:54 Tower kernel: sd 7:0:7:0: [sdk] tag#7 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:57 Tower kernel: sd 7:0:7:0: [sdk] Synchronizing SCSI cache
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 22135488
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 70 27 f9 48 00 00 00 08 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 1881667912
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
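
The `CDB:` lines in the log above can be read by hand. In SCSI, opcode `0x8a` is WRITE(16), whose 16-byte command block carries the starting LBA in bytes 2-9 and the transfer length in bytes 10-13, both big-endian. The sketch below decodes those fields from a log line; the names `CDB_PATTERN` and `decode_cdb16` are hypothetical helpers written for this illustration, not part of Unraid or the kernel.

```python
import re

# Captures the hex bytes that follow "CDB: opcode=0x.." in a kernel log line.
CDB_PATTERN = re.compile(r"CDB: opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)")


def decode_cdb16(line: str) -> dict:
    """Decode opcode, starting LBA and block count from a 16-byte SCSI CDB."""
    match = CDB_PATTERN.search(line)
    if match is None:
        raise ValueError("no CDB found in line")
    cdb = bytes.fromhex(match.group(1).replace(" ", ""))
    if len(cdb) != 16:
        raise ValueError(f"expected a 16-byte CDB, got {len(cdb)} bytes")
    return {
        "opcode": cdb[0],                           # 0x8a = WRITE(16)
        "lba": int.from_bytes(cdb[2:10], "big"),    # starting logical block
        "blocks": int.from_bytes(cdb[10:14], "big"),  # transfer length
    }


line = ("Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: "
        "opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00")
print(decode_cdb16(line))
```

For that line the decoded LBA is 22135488, which matches the sector reported by the following `print_req_error` message, so the failed requests were writes. Separately, `hostbyte=0x01` corresponds to `DID_NO_CONNECT` in the Linux SCSI layer, which would be consistent with the device dropping off the bus rather than reporting a media error, though the log alone does not say why.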

Skip

tower-diagnostics-20190812-1452.zip

JorgeB · August 12, 2019

Try connecting the disk to the onboard SATA controller, to rule out any HBA compatibility problem.

Skipdog · August 12, 2019

I should also add I've got 7x WDC_WD80EMAZ that have given me zero problems since they were installed.

On the spare Supermicro, they were indeed going into the onboard SATA. There are no HBAs installed in that server.

The primary Supermicro is using a SAS 9207-8i type card.

JorgeB · August 12, 2019

So the same disk fails on both servers, with different controllers? If so that's likely the disk.

Skipdog · August 12, 2019

Yes, I moved the Seagates to another Supermicro that effectively uses the onboard SATA. They are producing the errors in the text I pasted. The drives aren't throwing bad sectors, but maybe the problems are mechanical in nature.

Skipdog · August 12, 2019

Talked to a friend who has the same drives in an enterprise QNAP. He started to have the same problem at the same time. We are wondering if something in the Linux kernel/driver set was updated that is causing issues with the Seagate Constellation series. Also checking firmware to see if that has any impact.

JorgeB · August 13, 2019

It's a possibility. You can always downgrade to an earlier Unraid release to confirm.
