Array advice


Need some array/disk advice.

 

System is running 6.7.2.  2 x 8TB parity.  12 x 8TB data drives running btrfs.  Supermicro X9DRH-7TF.  Dual Xeon E5-2670 v2s.

 

The adventure started when one of the two parity drives began throwing errors.  Is it normal for a drive to spin down when it encounters errors, or does this happen as a result of read errors?  I shut the system down and replaced the drive with a spare 8TB I had for this purpose.  I started the parity rebuild, and during the rebuild one of the 8TB data drives hit 2560 errors and also dropped out (red X, and it shows as spun down).

 

Should I finish parity rebuild?  Should I dismount array, turn off and remove power in order to try to get Disk 8 to spin back up?

 

Parity rebuild is at 58% with approximately 7 hours remaining.

 

Skip

Edited by Skipdog
Link to comment

Parity finished rebuilding.  I'm actually not sure whether I lost data.  Drive 8 is available but is no longer in the array configuration, and Unraid is prompting me to format it.  It almost seems like several drives dropped out around the same time.  I took one of the three out and have been testing it offline, and I can find no SMART errors.  Is a Supermicro backplane issue causing the disconnects, or are the drives really flaking out?  Two of the drives are running pre-clear again; one of them failed, but I was able to restart the job.  None of the Western Digital drives had any issues during this.

 

Another random question: what is a safe temperature for the drives in the Supermicro?  For the most part I'm not hitting the warning; usually 100-108 F.

 

Attaching diag file.

 

Thank you kindly.

Skip

 

Edited by Skipdog
Link to comment

The diags were taken after rebooting, so they are not much help in showing what happened.  Multiple disks failing at the same time is most often a connection or power problem.

 

As for disk8, the rebuild is most likely corrupt.  See if the old disk8 mounts with the UD (Unassigned Devices) plugin.

 

10 minutes ago, Skipdog said:

usually 100-108 F.

Disks should be below 40C, 45C tops.
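
For comparing the Fahrenheit readings above against the Celsius guideline, here is a minimal conversion sketch. The function name is made up for illustration; the 40 and 45 thresholds are just the numbers quoted in this reply, not an official limit.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Thresholds taken from the advice in this thread (assumption: whole degrees C).
TARGET_C = 40
CEILING_C = 45

for temp_f in (100, 108):
    temp_c = fahrenheit_to_celsius(temp_f)
    if temp_c < TARGET_C:
        status = "below target"
    elif temp_c <= CEILING_C:
        status = "above target, under ceiling"
    else:
        status = "too hot"
    print(f"{temp_f} F = {temp_c:.1f} C ({status})")
```

So the reported range works out to roughly 38-42 C: around the 40C target at the top end, but still under the 45C ceiling.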

Link to comment

OK, so yes, I think I made some mistakes on this one.

 

When something like this happens, should I generate diags and then power down completely, then see what comes up after the reboot?  When this happened I did notice that the disks were spun down.  I've got two disks pre-clearing right now.  Thanks for the pointer to UD.

Link to comment

Looks like a connection problem: one disk disconnected and then reconnected, the other disconnected and stayed disconnected.  Try the disks in different slots.  If there are still issues, try replacing the miniSAS cable(s).  If none of that helps, it could be an expander, HBA, or PSU problem.

Link to comment
  • 2 weeks later...

Still struggling with this, as only the Seagate 8TB Exos drives continue to have problems.

 

I have another Supermicro, and I'm seeing these Seagate drives cause problems in that one too.  Completely different chassis, HBA, cables.

 

My system failed to rebuild parity on one of the drives just now, and then after a reboot, the drive that did rebuild correctly decided to drop out.  I didn't reboot this time and captured diags.  

 

Here is a snippet of the logs from the drive when it happened:

 

Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] 4096-byte physical blocks
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write Protect is off
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Mode Sense: 7f 00 10 08
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 12 10:43:12 Tower kernel: sdk: sdk1
Aug 12 10:43:12 Tower kernel: sd 7:0:7:0: [sdk] Attached SCSI disk
Aug 12 10:43:12 Tower kernel: BTRFS: device fsid e2a97d07-5b8c-4f40-9733-01c1060204f8 devid 1 transid 387385 /dev/sdk1
Aug 12 10:43:38 Tower emhttpd: ST8000NM0055-1RM112_ZA105CRR (sdk) 512 15628053168
Aug 12 10:43:38 Tower kernel: mdcmd (1): import 0 sdk 64 7814026532 0 ST8000NM0055-1RM112_ZA105CRR
Aug 12 10:43:38 Tower kernel: md: import disk0: (sdk) ST8000NM0055-1RM112_ZA105CRR size: 7814026532
Aug 12 10:43:41 Tower emhttpd: shcmd (25): /usr/local/sbin/set_ncq sdk 1
Aug 12 10:43:41 Tower root: set_ncq: setting sdk queue_depth to 1
Aug 12 10:43:41 Tower emhttpd: shcmd (26): echo 128 > /sys/block/sdk/queue/nr_requests
Aug 12 10:46:54 Tower kernel: sd 7:0:7:0: [sdk] tag#7 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:57 Tower kernel: sd 7:0:7:0: [sdk] Synchronizing SCSI cache
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 22135488
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 70 27 f9 48 00 00 00 08 00 00
Aug 12 10:46:58 Tower kernel: print_req_error: I/O error, dev sdk, sector 1881667912
Aug 12 10:46:58 Tower kernel: sd 7:0:7:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
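
For anyone reading the log lines above, this is a rough sketch of how the 16-byte CDBs and the host byte can be decoded. The lookup tables are deliberately tiny and only cover values that appear in this snippet (opcode 0x8a is WRITE(16) in the SCSI command set; host byte 0x01 is DID_NO_CONNECT in the Linux SCSI layer). The function and table names are made up for illustration.

```python
# Only the codes needed for this log; not a complete table.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
HOST_BYTES = {0x00: "DID_OK", 0x01: "DID_NO_CONNECT"}


def decode_cdb16(cdb_hex: str) -> dict:
    """Decode a 16-byte READ(16)/WRITE(16) CDB given as space-separated hex."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return {
        "opcode": OPCODES.get(raw[0], hex(raw[0])),
        # Bytes 2-9 hold the starting logical block address (big-endian).
        "lba": int.from_bytes(raw[2:10], "big"),
        # Bytes 10-13 hold the transfer length in logical blocks.
        "blocks": int.from_bytes(raw[10:14], "big"),
    }


# The two CDBs from the log above.
for cdb in (
    "8a 00 00 00 00 00 01 51 c2 c0 00 00 00 40 00 00",
    "8a 00 00 00 00 00 70 27 f9 48 00 00 00 08 00 00",
):
    print(decode_cdb16(cdb))

print("hostbyte 0x01 ->", HOST_BYTES[0x01])
```

The decoded LBAs (22135488 and 1881667912) match the sector numbers in the `print_req_error` lines, and both commands are writes. A host byte of DID_NO_CONNECT means the host could no longer reach the device, which fits a disk dropping off the bus rather than reporting a media error.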

 

Skip

seagate-exos-error-parity.JPG

tower-diagnostics-20190812-1452.zip

Edited by Skipdog
Link to comment

I should also add that I've got 7x WDC_WD80EMAZ drives that have given me zero problems since they were installed.

 

On the spare Supermicro, they were indeed going into the onboard SATA ports.  There are no HBAs installed in that server.

 

The primary Supermicro uses a SAS 9207-8i type card.

Link to comment

I talked to a friend who has the same drives in an enterprise QNAP.  He started having the same problem at the same time.  We are wondering whether something in the Linux kernel/driver set was updated that is causing issues with the Seagate Constellation series.  We are also checking firmware to see if that has any impact.

Link to comment
