Unraid crashed during parity check - disk failed?



I ran my monthly parity check last night and woke up to an unresponsive unRAID server.  Around 1am I got two alerts: "Disk 4 in error state" and "Array has 2 disks with read errors".

 

I rebooted and was greeted with a red X over disk 4: "Disk is disabled, Contents emulated".

 

I ran a SMART test on disk 4 and it reported no errors.  What are my next steps?  Diagnostics attached.

 

Thanks in advance. 

nas-diagnostics-20170301-0929.zip


You should have taken the diagnostic before rebooting. Do you know whether it was a correcting parity check?

 

Also, the version of unRAID you are running has a bug that prevents us from getting the SMART reports for any of your disks in the diagnostics. Go to Settings - Disk Settings and turn off Autostart. Then upgrade and get another diagnostic.


Thanks for your response.  My machine was completely locked up, so unfortunately I couldn't grab diagnostics.

 

It was NOT a correcting parity check, just a default monthly.  

 

I am running an extended SMART test on disk 4 right now.  It has been running for 4 hours and is at 50%.  Should I let it finish, or cancel it, upgrade, and run diagnostics again?

 

Thanks


So, I got anxious and went out and purchased a new 2TB drive to replace disk 4.  Things locked up after I started the rebuild, but I copied off the syslog.  See attached.

 

 Any guidance would be appreciated.  

 

Edit: unRAID became responsive again and the drive seems to be rebuilding; however, I am all of a sudden missing my TV and Movies shares.  Not sure what is going on.

 

About 2 minutes into the rebuild I got a few alerts:

 

Event: unRAID array errors
Subject: Warning [NAS] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Disk 3 - WDC_WD20EARS-00MVWB0_WD-WMAZA3690347 (sdj) (errors 48)

 

Event: unRAID array errors
Subject: Warning [NAS] - array has errors
Description: Array has 2 disks with read errors
Importance: warning

Disk 1 - WDC_WD20EARS-00MVWB0_WD-WMAZA3638502 (sdh) (errors 1)
Disk 3 - WDC_WD20EARS-00MVWB0_WD-WMAZA3690347 (sdj) (errors 52)

syslog.txt

Edited by pickthenimp
Some shares are now missing
1 hour ago, bjp999 said:

Sounds like bad cabling 

I double-checked all my SATA cables and they are snug.

 

Here is the latest: I upgraded to 6.3.2 and rebuilt disk 4 with a new drive.  After a reboot, the missing shares finally came back.

 

I started a parity sync (without writing corrections) and it started out painfully slow.  My error count kept going up.  I finally stopped it after finding 666670 errors.

 

Latest diagnostics attached after stopping parity sync.

 

Bad mobo?

nas-diagnostics-20170302-0652.zip


Your syslog is full of read errors on these disks. Maybe they are working intermittently but something needs to be fixed, and I'm not sure I would trust the data on the rebuilt disk4 either. Best if you avoid correcting parity until this is cleared up.

 

9 minutes ago, pickthenimp said:

Seems odd that I can access all of those disks fine?

How are you accessing the disks?

Via a UNC path directly to the disk from another machine.





It looks like the last three drives (attached to the LSISAS1068 controller) have dropped offline.

 

Quote

Mar  2 06:34:21 nas kernel: scsi host10: ioc1: LSISAS1068 B1, FwRev=01160100h, Ports=1, MaxQ=286, IRQ=36
Mar  2 06:34:21 nas kernel: mptsas: ioc1: attaching sata device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x455f3d3ee1bfa990
Mar  2 06:34:21 nas kernel: scsi 10:0:0:0: Direct-Access     ATA      WDC WD20EARS-00M AB51 PQ: 0 ANSI: 5
Mar  2 06:34:21 nas kernel: sd 10:0:0:0: Attached scsi generic sg7 type 0
Mar  2 06:34:21 nas kernel: sd 10:0:0:0: [sdh] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Mar  2 06:34:21 nas kernel: mptsas: ioc1: attaching sata device: fw_channel 0, fw_id 2, phy 2, sas_addr 0x3c296c3680865a58
Mar  2 06:34:21 nas kernel: scsi 10:0:1:0: Direct-Access     ATA      ST2000DM006-2DM1 CC26 PQ: 0 ANSI: 5
Mar  2 06:34:21 nas kernel: sd 10:0:0:0: [sdh] Write Protect is off
Mar  2 06:34:21 nas kernel: sd 10:0:0:0: [sdh] Mode Sense: 73 00 00 08
Mar  2 06:34:21 nas kernel: sd 10:0:1:0: Attached scsi generic sg8 type 0
Mar  2 06:34:21 nas kernel: sd 10:0:1:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Mar  2 06:34:21 nas kernel: mptsas: ioc1: attaching sata device: fw_channel 0, fw_id 3, phy 3, sas_addr 0x455f3d44d9bdad95
Mar  2 06:34:21 nas kernel: sd 10:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar  2 06:34:21 nas kernel: scsi 10:0:2:0: Direct-Access     ATA      WDC WD20EARS-00M AB51 PQ: 0 ANSI: 5

 

These are disks 1, 3, and 4. All of those disks are reporting problems.
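
For anyone who wants to check this against their own syslog, here is a minimal Python sketch that maps each SCSI address to its sdX device name. It assumes kernel lines in the same format as the excerpt above; the function name `map_scsi_to_device` and the regex are my own illustration, not anything shipped with unRAID.

```python
import re

# Matches kernel lines such as:
#   "... kernel: sd 10:0:0:0: [sdh] Write Protect is off"
# Group 1 is the SCSI address (host:channel:id:lun), group 2 the device name.
SD_LINE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def map_scsi_to_device(lines):
    """Return {scsi_address: device_name} for every 'sd H:C:I:L: [sdX]' line."""
    mapping = {}
    for line in lines:
        match = SD_LINE.search(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


# A few lines copied from the excerpt above.
sample = [
    "Mar  2 06:34:21 nas kernel: sd 10:0:0:0: Attached scsi generic sg7 type 0",
    "Mar  2 06:34:21 nas kernel: sd 10:0:0:0: [sdh] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)",
    "Mar  2 06:34:21 nas kernel: sd 10:0:1:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)",
]

for address, device in sorted(map_scsi_to_device(sample).items()):
    print(address, "->", device)
```

On the lines quoted above this pairs 10:0:0:0 with sdh and 10:0:1:0 with sdi. The line naming the device at 10:0:2:0 is not in the excerpt, so it will only show up if you run this over the full syslog.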

 

Quote

...

Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30376
Mar  2 06:51:06 nas kernel: md: disk3 read error, sector=30376
Mar  2 06:51:06 nas kernel: md: disk4 read error, sector=30376
Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30384
Mar  2 06:51:06 nas kernel: md: disk3 read error, sector=30384
Mar  2 06:51:06 nas kernel: md: disk4 read error, sector=30384
Mar  2 06:51:06 nas kernel: sd 10:0:0:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:0:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:0:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:1:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:1:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:1:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:2:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:2:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: sd 10:0:2:0: rejecting I/O to offline device
Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30392
Mar  2 06:51:06 nas kernel: md: disk3 read error, sector=30392
Mar  2 06:51:06 nas kernel: md: disk4 read error, sector=30392
Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30400
Mar  2 06:51:06 nas kernel: md: disk3 read error, sector=30400
Mar  2 06:51:06 nas kernel: md: disk4 read error, sector=30400
Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30408

...

 

My guess is that the controller either dropped offline or is failing.
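
If you want to tally this yourself rather than scroll through the log, here is a rough Python sketch. It assumes syslog lines in the same format as the excerpt above; the names `summarize`, `MD_ERR` and `OFFLINE` are made up for this example. It counts md read errors per disk, counts "rejecting I/O to offline device" messages per SCSI address, and records which disks failed on each sector.

```python
import re
from collections import Counter, defaultdict

# "md: disk1 read error, sector=30376"
MD_ERR = re.compile(r"kernel: md: (disk\d+) read error, sector=(\d+)")
# "sd 10:0:0:0: rejecting I/O to offline device"
OFFLINE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device")


def summarize(lines):
    """Return (read errors per disk, offline rejections per address, disks per sector)."""
    read_errors = Counter()
    offline = Counter()
    disks_by_sector = defaultdict(set)
    for line in lines:
        match = MD_ERR.search(line)
        if match:
            disk, sector = match.group(1), int(match.group(2))
            read_errors[disk] += 1
            disks_by_sector[sector].add(disk)
            continue
        match = OFFLINE.search(line)
        if match:
            offline[match.group(1)] += 1
    return read_errors, offline, disks_by_sector


# A few lines copied from the excerpt above.
sample = [
    "Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30376",
    "Mar  2 06:51:06 nas kernel: md: disk3 read error, sector=30376",
    "Mar  2 06:51:06 nas kernel: md: disk4 read error, sector=30376",
    "Mar  2 06:51:06 nas kernel: sd 10:0:0:0: rejecting I/O to offline device",
    "Mar  2 06:51:06 nas kernel: sd 10:0:1:0: rejecting I/O to offline device",
    "Mar  2 06:51:06 nas kernel: sd 10:0:2:0: rejecting I/O to offline device",
    "Mar  2 06:51:06 nas kernel: md: disk1 read error, sector=30384",
]

read_errors, offline, disks_by_sector = summarize(sample)
print(dict(read_errors))
print(dict(offline))
for sector in sorted(disks_by_sector):
    print(sector, sorted(disks_by_sector[sector]))
```

When several disks report a read error on the same sector at the same moment, as they do here, that fits a shared component such as the controller better than three drives failing independently, though the log alone cannot prove it.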

 

johnnie.black may be a good person to weigh in here.

1 minute ago, johnnie.black said:

Agreed, it looks like a controller issue. Whether it's failing or there's some other issue, I can't say. Replacing it should solve your issues, but try to avoid Marvell-based controllers, as these are known to have issues with unRAID for some users.

Thanks for the reply.  Do you have a better controller you recommend?


Hopefully you haven't run into the same problem I did as explained here: 

Even after buying a new SATA controller card, my problems still occurred.  I had to upgrade my entire MOBO/CPU/RAM to resolve the issue.  Of course, I had no issues using that hardware outside of unRAID.
