ChristerB

Several disks got read errors after rebuilding one data disk (SOLVED)


Hi!

I have an Unraid setup consisting of 1 parity disk, 1 cache disk and 12 data disks.

One of the data disks (4 TB WD Red) had read errors and I decided to remove it and add a new disk (6 TB HGST) and rebuild.

Everything looked good for 10 hours, but then I started to get error messages saying that several disks had read errors.

When the rebuild was done I had 4 disks with read errors, and when I got home from work one of them (8 TB WD Red) was disabled and its contents emulated.

 

I have used Unraid for several years, changed hardware several times and changed disks before without problems.

 

What do I do now? Any suggestions would be appreciated.

tower-diagnostics-20190415-1818.zip

Unraid disk errors_small.PNG


I can't really say; I need to shut it down to be able to see the disks. Of course, it could be one of the ports on one of the controllers.

Is it safe to shut down the server?

 


Problem with the HBA:

Apr 15 11:02:59 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!

Make sure it's well seated and sufficiently cooled, and update the firmware to the latest release: you're on p15, and the current one is p20.00.07.00. Then you'll need to force a rebuild of disk7, since the current one will be corrupt. You can do that with the invalid slot command, assuming the rest of the disks are OK.
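
For anyone checking their own syslog for the same HBA fault, here is a minimal Python sketch that picks out this kind of kernel message. The function name `find_hba_faults` and the second sample line are made up for illustration; only the first sample line comes from the log quoted above, and the pattern assumes the driver instance is always named `mpt2sas_cm` followed by a number.

```python
import re

# Pattern based on the single kernel message quoted above.
# Assumption: other controllers show up as mpt2sas_cm0, mpt2sas_cm1, ...
HBA_FAULT = re.compile(r"kernel: (mpt2sas_cm\d+): SAS host is non-operational")


def find_hba_faults(lines):
    """Return a (timestamp, controller) pair for each 'non-operational' message."""
    hits = []
    for line in lines:
        match = HBA_FAULT.search(line)
        if match:
            # Classic syslog lines start with a 15-character 'Mon DD HH:MM:SS' stamp.
            hits.append((line[:15], match.group(1)))
    return hits


sample = [
    # Taken from the diagnostics quoted in this thread.
    "Apr 15 11:02:59 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!",
    # Hypothetical unrelated line, included only to show that it is ignored.
    "Apr 15 11:03:00 Tower kernel: some unrelated message",
]

print(find_hba_faults(sample))
```

To use it on a real log, pass the lines of the syslog file from the diagnostics zip instead of `sample`.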


Thanks a lot.

I forgot to close the side of the chassis, so it probably overheated while doing the rebuild.

So the safe thing would be:

1. Shut down Unraid

2. Make sure it's cooled down

3. Make sure controller and cables are well seated

4. Update firmware

Reboot and ask IT Gods for mercy?

 


After rebooting you will need to force a disk7 rebuild (and bring disk2 back online), but post new diags first so we can check SMART for the dropped disks.


I didn't find a good guide to upgrade firmware (after a bit more looking I did, but still) and I don't feel confident enough to do it (I know I suck at this). So I decided to cool down the HBA and then start it again.

Attached are the new diagnostics after reboot. To my limited knowledge SMART seems fine. No read errors or reallocated sectors.

Disk 2 is still disabled though. How would I go about bringing it online again?

And how should I force a rebuild of disk7? Would the guide I found here work?

 

 

tower-diagnostics-20190415-2130.zip

Edited by ChristerB
Added SMART info

11 hours ago, ChristerB said:

To my limited knowledge SMART seems fine.

It does.

 

To force a rebuild of disk7 see below, but it might be a good idea to use a new spare disk. Though the previous rebuild is a little (or a lot) corrupt, it did run for several hours before the errors, so some of its data will be fine in case something else goes wrong with this try.

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s) including a new disk7 (or use the old one if you want)
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 7 29

-Back on the GUI and without refreshing the page, just start the array. Do not check the "parity is already valid" box. The GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as it doesn't account for the invalid slot command, and they won't be overwritten as long as the procedure was done correctly. Disk7 will start rebuilding. The disk should mount immediately, but if it's unmountable don't format it; wait for the rebuild to finish and then run a filesystem check.

 

11 hours ago, ChristerB said:

I didn't find a good guide to upgrade firmware

It's very simple: boot with a DOS flash drive, or use a Windows PC, download the firmware package, and then to update:

 

using DOS

sas2flsh.exe -o -f firmware.bin

using Windows

sas2flash.exe -o -f firmware.bin

 


@johnnie.black, you are my HERO.

 

Thanks for your fast and understandable answers. I will look into this when I get home from work.

 

Edited by ChristerB


Stupid question. If all goes well, everything on the raid will be intact?

 

I will go for a new disk7 as you suggest, johnnie.black. I just precleared an 8TB disk intended for my raid.

14 minutes ago, ChristerB said:

If all goes well, everything on the raid will be intact?

Yes, if all goes well, i.e., no errors during the rebuild, data on the array should be fine.


@johnnie.black You are truly a Hero. Just got home and my Unraid is back to normal.

Received the confirmation email 45 min ago. 

Event: Unraid Parity sync / Data rebuild
Subject: Notice [TOWER] - Parity sync / Data rebuild finished (0 errors)
Description: Duration: 21 hours, 40 seconds. Average speed: 105,8 MB/s
Importance: normal
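
As a rough sanity check on those numbers, duration times average speed should come out close to the size of the disk being rebuilt. The Python sketch below does that arithmetic; the helper name `rebuilt_terabytes` is made up for illustration, and it assumes the notification's "MB" means decimal megabytes (10^6 bytes), which the email itself does not state.

```python
def rebuilt_terabytes(hours, seconds, mb_per_s):
    """Approximate amount processed, in decimal TB, from duration and average speed."""
    duration_s = hours * 3600 + seconds
    # Assumption: 1 TB = 1,000,000 MB (decimal units).
    return duration_s * mb_per_s / 1_000_000


# "Duration: 21 hours, 40 seconds. Average speed: 105,8 MB/s" (decimal comma)
total_tb = rebuilt_terabytes(hours=21, seconds=40, mb_per_s=105.8)
print(f"{total_tb:.2f} TB")  # prints "8.00 TB"
```

That lines up with the 8TB disks mentioned earlier in the thread, which suggests the rebuild ran across the full disk at the reported speed.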

 

Thank you for your help and support in this troubled time.

I am forever thankful to you.

