Skipdog Posted July 30, 2019
Need some array/disk advice. System is running Unraid 6.7.2: 2 x 8TB parity, 12 x 8TB data drives running btrfs, Supermicro X9DRH-7TF, dual Xeon E5-2670 v2s.
The adventure started when one of the two parity drives started throwing errors. Is it normal for a drive to spin down when it encounters errors? Or does this happen as a result of read errors?
I shut the system down and replaced the drive with a spare 8TB I had for this purpose. I started the parity rebuild, and during the rebuild one of the 8TB data drives (Disk 8) ran into 2560 errors and also dropped out of the array (red x, and it says it's spun down).
Should I finish the parity rebuild? Should I stop the array, turn off and remove power in order to try to get Disk 8 to spin back up? Parity rebuild is at 58% with approximately 7 hours remaining.
Skip
JorgeB Posted July 31, 2019
Please post the diagnostics: Tools -> Diagnostics
Skipdog Posted July 31, 2019
Parity finished rebuilding. I'm actually not sure if I lost data or not. Drive 8 is available, but no longer in the array configuration. Unraid is prompting me that I can format it.
It almost seems like several drives dropped out around the same time. I took one of the three out and have been testing it offline and can find no SMART errors. Is it a Supermicro backplane issue that caused the disconnects, or are the drives really flaking out?
Two of the drives are running pre-clear again. One of them failed but I was able to restart the job. None of the Western Digital drives experienced any issues during this.
Another random question: what is a safe temperature for the drives in the Supermicro? I'm not hitting the warning for the most part; they usually run at 100-108 F.
Attaching diag file. Thank you kindly.
Skip
JorgeB Posted July 31, 2019
The diags were taken after rebooting, so they are not much help in showing what happened. Multiple disks failing at the same time is most often a connection/power problem. As for disk8, most likely the rebuild is corrupt; see if the old disk8 mounts with the UD (Unassigned Devices) plugin.
Skipdog said: "usually 100-108 F."
Disks should be below 40C, 45C tops.
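Since the question was asked in Fahrenheit and the guidance was given in Celsius, a quick conversion helps compare the two. The following is only a minimal sketch: the function names and the labels it returns are made up for illustration, and the 40C/45C limits are simply the figures quoted in the reply above, not a manufacturer specification.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def classify_disk_temp(temp_c: float, target_c: float = 40.0, max_c: float = 45.0) -> str:
    """Label a disk temperature against the thread's guidance (hypothetical labels).

    target_c and max_c default to the 40C / 45C figures quoted in the thread.
    """
    if temp_c < target_c:
        return "below target"
    if temp_c <= max_c:
        return "above target, within upper limit"
    return "over upper limit"


if __name__ == "__main__":
    # The reported range was 100-108 F.
    for temp_f in (100.0, 108.0):
        temp_c = fahrenheit_to_celsius(temp_f)
        print(f"{temp_f:.0f} F = {temp_c:.1f} C -> {classify_disk_temp(temp_c)}")
```

Under these assumptions the reported range works out to roughly 37.8C to 42.2C, so the upper end is above the 40C target but still under the 45C ceiling.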
Skipdog Posted July 31, 2019
OK, so yes, I think I made some mistakes on this one. When something like this happens, should I generate the diags and then power down completely, and see what comes up after the reboot? When this happened I did notice that the disks were spun down.
I've got two disks pre-clearing right now. Thanks for the pointer to UD.
JorgeB Posted July 31, 2019
You should grab the diags before rebooting, but no need to power down.
Skipdog Posted August 1, 2019
OK, so pre-clear failed on two drives at different times. One looks like it failed during the read phase. The other I'm not sure about; it looks like I/O errors. The drives are in sleds in the Supermicro. Diags uploaded.
Skip
tower-diagnostics-20190801-1256.zip
JorgeB Posted August 1, 2019
Looks like a connection problem: one disk disconnected and then reconnected, the other disconnected and stayed disconnected. Try the disks in different slots. If there are still issues, try replacing the miniSAS cable(s). If all that doesn't help, it could be an expander, HBA or PSU problem.
Skipdog Posted August 12, 2019
Still struggling with this, as only the Seagate 8TB EXOS drives continue to have problems. I have another Supermicro and I'm seeing these Seagate drives produce problems in that one too: completely different chassis, HBA and cables.
My system failed to rebuild parity on one of the drives just now, and then after a reboot, the drive that did rebuild correctly decided to drop out. I didn't reboot this time and captured diags.
Here is a snippet of the logs from the drive when it happened:
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 4096-byte physical blocks
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write Protect is off
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Mode Sense: 7f 00 10 08
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 12 10:43:12 Tower kernel: sdk: sdk1
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Attached SCSI disk
Aug 12 10:43:12 Tower kernel: BTRFS: device fsid e2a97d07-5b8c-4f40-9733-01c1060204f8 devid 1 transid 387385 /dev/sdk1
Aug 12 10:43:38 Tower emhttpd: ST8000NM0055-1RM112_ZA105CRR (sdk) 512 15628053168
Aug 12 10:43:38 Tower kernel: mdcmd (1): import 0 sdk 64 7814026532 0 ST8000NM0055-1RM112_ZA105CRR
Aug 12 10:43:38 Tower kernel: md: import disk0: (sdk) ST8000NM0055-1RM112_ZA105CRR size: 7814026532
Aug 12 10:43:41 Tower emhttpd: shcmd (25): /usr/local/sbin/set_ncq sdk 1
Aug 12 10:43:41 Tower root: set_ncq: setting sdk queue_depth to 1
Aug 12 10:43:41 Tower emhttpd: shcmd (26): echo 128 > /sys/block/sdk/queue/nr_requests
Aug 12 10:46:54 Tower kernel: sd 7:0:7:0: [sdk] tag#7 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:57 Tower kernel: sd 7:0:7:0: [sdk] Synchronizing SCSI cache
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 22135488
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 70 27 f9 48 00 00 00 08 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 1881667912
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Skip
tower-diagnostics-20190812-1452.zip
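For anyone reading the log lines above, the CDB bytes can be decoded by hand to see which operation failed and where. The sketch below is illustrative only and assumes the standard 16-byte SCSI READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2-9, transfer length in bytes 10-13). The function name and the small lookup tables are made up for this example; the host byte names are the ones used by the Linux SCSI layer, where 0x01 is DID_NO_CONNECT.

```python
# Opcodes for the two 16-byte read/write commands; only these are handled here.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}

# Partial table of Linux SCSI host byte values, just enough for this log.
HOST_BYTES = {0x00: "DID_OK", 0x01: "DID_NO_CONNECT"}


def decode_rw16(cdb_hex: str) -> dict:
    """Decode a READ(16)/WRITE(16) CDB given as space-separated hex bytes."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16:
        raise ValueError(f"expected 16 CDB bytes, got {len(raw)}")
    if raw[0] not in OPCODES:
        raise ValueError(f"unsupported opcode 0x{raw[0]:02x}")
    return {
        "command": OPCODES[raw[0]],
        "lba": int.from_bytes(raw[2:10], "big"),      # starting logical block
        "blocks": int.from_bytes(raw[10:14], "big"),  # transfer length in blocks
    }


if __name__ == "__main__":
    # The two CDBs copied from the log snippet above.
    for cdb in (
        "8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00",
        "8a 00 00 00 00 00 70 27 f9 48 00 00 00 08 00 00",
    ):
        print(decode_rw16(cdb))
    print("hostbyte 0x01:", HOST_BYTES[0x01])
```

Both commands decode as writes, and the decoded LBAs (22135488 and 1881667912) match the sectors in the print_req_error lines, which is expected because the drive reports 512-byte logical blocks. A DID_NO_CONNECT host byte means the device was unreachable when the command was issued, rather than the drive reporting a media error.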
JorgeB Posted August 12, 2019
Try connecting the disk to the onboard SATA controller, to rule out any HBA compatibility problem.
Skipdog Posted August 12, 2019
I should also add that I've got 7x WDC_WD80EMAZ that have given me zero problems since they were installed.
On the spare Supermicro, the Seagates were indeed connected to the onboard SATA. There are no HBAs installed in that server. The primary Supermicro is using a SAS 9207-8i type card.
JorgeB Posted August 12, 2019
So the same disk fails on both servers, with different controllers? If so that's likely the disk.
Skipdog Posted August 12, 2019
Yes, I moved the Seagates to another Supermicro that effectively uses the onboard SATA. They are producing the errors in the text I pasted. The drives aren't throwing bad sectors, but maybe the problems are mechanical in nature.
Skipdog Posted August 12, 2019
Talked to a friend who has the same drives in an enterprise QNAP. He started to have the same problem at the same time. We are wondering if something in the Linux kernel/driver set was updated that is causing issues with the Seagate Constellation series. Also checking firmware to see if that has any impact.
JorgeB Posted August 13, 2019
It's a possibility. You can always downgrade to an earlier Unraid release to confirm.