v5.0.4 One Disk Redballed, Another Disk Reporting Tons of Errors (while green-balled)



Motherboard: SUPERMICRO MBD-X9SCM-F-O

CPU: Intel Core i3-2120 3.3GHz

RAM: Kingston 4GB DDR3 1333

SATA Controllers: LSI 9211-8i (x2) (these two just replaced two SUPERMICRO AOC-SASLP-MV8s, see below)

Case: Norco 4220

Unraid: v5.0.4 (largest disk size 6TB)

 

Hello,

 

Last week I attempted a parity check. No reason other than it had been nearly a year since my last one. During the parity check, eight of my disks went offline and when I rebooted, they came back but one was redballed (disk16, a 6TB drive). I replaced the red-balled drive and attempted to rebuild but again the same eight disks went offline and the rebuild failed. All of the offline disks were connected to one of my SATA controllers (a Supermicro SASLP). I tried to replace the bad card with a SAS2LP, but the speed of the rebuild was atrocious. So I did some reading and decided to replace both my SATA cards with the LSI cards that are listed above.

 

My rebuild from that point went great, tooling around at 100 MB/s. However, at approximately the 3% mark, another one of my disks started reporting a ton of errors, disk14, a 1.5TB drive (also the temperature of this drive was replaced with an asterisk). The rebuild progress still seemingly continued, but the new drive (the one that replaced the redballed drive) was no longer accumulating "writes" and the "errors" drive was accumulating errors. I hope this makes sense.

 

I stopped and tried again. Same thing. Right around the 3% mark, the redballed drive stopped accumulating writes and the other drive's temp was replaced with an asterisk and it was reporting errors. When I refresh the GUI, the errors keep mounting, and the redballed drive is no longer being written to.

 

At this point I realized that a lot of my drives were running really hot. So I looked at my fan situation and tried to improve it (I had moved recently and some of the fans in my case had come detached from their anchors). I tried my rebuild again, and overall temps were reduced by about 8 degrees per disk. That was an improvement, but unfortunately, the same thing happened again.

 

I have attached a screenshot of my GUI and also my syslog. I am going to stop the current rebuild. It is currently at 405GB (7%), but the red-balled drive isn't being written to and the only thing that seems to be happening is the errors for the green-but-troubled drive are ticking upwards. 

 

I need a plan. What I would give to get out of this without losing data! Please help if you have any ideas.

ServerRebuild.jpg

Serversyslog.txt


Ah, didn't realize about my syslog being so short. This will be better. Since my syslog is full of individual sector errors for disk14, it's rather long and (probably) redundant. So I offer three sizes. The full one is really unwieldy. Just to be clear: all three start from the beginning, the difference is in how "far" they go. NOTE: I have to zip them in order to upload them because of their size, sorry.

 

I am going to run the extended SMART now.

 

Thanks so much for the 

syslog_three-lengths.zip

Edited by lungnut

And, finally, the results of the extended SMART test (attached). Looks like no errors? Is there any way to get this drive up and running successfully? Because if I am reading the situation correctly, if I can't get this drive back up, I might lose all the data on both disk14 and disk16, right? If there is ANY way to avoid that, I cannot tell you how appreciative I'd be.

SMART_Extended.txt


johnnie: I swapped disk14 and disk10's positions in my chassis and started another rebuild. So far, this one has been going okay, I think? There were a handful of errors on yet another disk, disk13, but I think those are sorted? I actually have no idea. I am 31% into the rebuild. I'm attaching a jpg and my syslog. Am I okay to allow this to continue? It has slowed down considerably and another odd thing is that even though my other 1.5TB drives have gone to sleep (because my rebuild is beyond their size at this point), the "trouble" drive disk14 hasn't yet gone to sleep. I thought that was odd. (Note: I'm about to go into work and won't be able to check in on things for about 11 hours or so).

 

And to jowi: My plan after this is to go up to a 12TB parity and then replace one of my 1.5TB with another 12TB, and then copy all of my other smaller, older drives' data onto that new disk. I am not in a position where I can easily dispose of the data that are on these drives. Yes, I should have paid better attention and should have replaced those drives earlier, but I am where I'm at right now trying to make the best of it. Do you really think I'm in an un-salvageable state with my data?

syslog_180328AM.txt

GUI_180328.jpg

48 minutes ago, lungnut said:

There were a handful of errors on yet another disk, disk13, but I think those are sorted?

These will result in some corruption on the rebuilt disk, but since it's only a few errors, let it finish. It will possibly affect only a single file, and cause only a small glitch if it's, e.g., a movie.
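To illustrate why read errors on another disk end up as corruption on the rebuilt one: with single parity, each rebuilt byte is computed from the matching bytes on the parity disk and every other data disk. Below is a minimal Python sketch, assuming plain XOR parity and made-up 4-byte "disks" (the names `xor_parity` and `rebuild_missing` are hypothetical, and this is not Unraid's actual code):

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving_blocks):
    """Rebuild the single missing block from parity plus all surviving blocks."""
    return xor_parity([parity, *surviving_blocks])


# Made-up 4-byte stripes standing in for three data disks (illustrative only).
disk_a = bytes([0x10, 0x20, 0x30, 0x40])
disk_b = bytes([0x01, 0x02, 0x03, 0x04])
disk_c = bytes([0xAA, 0xBB, 0xCC, 0xDD])  # the disk that failed and is being rebuilt
parity = xor_parity([disk_a, disk_b, disk_c])

# Clean rebuild: every surviving disk reads back correctly.
clean = rebuild_missing(parity, [disk_a, disk_b])

# Rebuild with a read error: one byte of disk_b comes back wrong.
bad_b = bytes([0x01, 0xFF, 0x03, 0x04])
damaged = rebuild_missing(parity, [disk_a, bad_b])

# Positions where the rebuilt data differs from what the failed disk held.
mismatched = [i for i, (x, y) in enumerate(zip(disk_c, damaged)) if x != y]
```

In this sketch only the position that hit the read error differs in the rebuilt data, which is why a handful of errors should damage only a small part of the rebuilt disk.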

16 hours ago, lungnut said:

And to jowi: My plan after this is to go up to a 12TB parity and then replace one of my 1.5TB with another 12TB, and then copy all of my other smaller, older drives' data onto that new disk[...]. Do you really think I'm in an un-salvageable state with my data?

If I were you, I would build a new v6-based Unraid server, put the new 12TB parity and 12TB data drive in it, and copy everything over from the old server to the new one.

 

The latest picture shows that disk 14 is OK again? Disk 12 is disabled, disk 13 has errors and disk 16 is redballed... damn.

 

I would never have swapped disks and performed a rebuild... if things go wrong now, you are done. Your server is a disaster waiting to happen: it's got very old drives and you never maintained it properly. I run a parity check every month and check the SMART status on a regular basis. I'm using 9x 4TB drives; it's been running for about 5 years now, and I have had to replace 2 disks already.

Edited by jowi

johnnie: All right! Finally--I am back up. I'm so thankful. I've attached my syslog in case you have the time to give it a quick glance. I think I'm good?

 

Moving forward, I thought I would move to 12TB drives and retire all of my 1.5TB, 2TB, and 3TB drives, but after weighing the cost of 12TB drives vs. 6TB drives, I think I'll stick with 6TB drives and do the same at way less cost. It will take six of them to retire all 15 of my smaller drives.

 

Question(s): Am I right in thinking that I should upgrade to 6.5 first before doing this? Also, I believe Unraid now supports dual parity, so it seems a good idea to add a second parity drive as soon as I have an empty slot in my Norco case, yeah?

 

Oh, and johnnie--in looking at your profile, you have me beat as a forum member by four days (December 2007). But it's amazing how much more you know than I do; I feel ashamed, hahaha. Again, thanks so much for helping me out. I can't tell you how much I freaking appreciate it.

syslog_180329.txt

2 hours ago, lungnut said:

Question(s): Am I right in thinking that I should upgrade to 6.5 first before doing this? Also, I believe Unraid now supports dual parity, so it seems a good idea to add a second parity drive as soon as I have an empty slot in my Norco case, yeah?

Yes and yes.

 

During the rebuild, besides the disk13 issues, there were a few recoverable ATA errors on both disks 17 and 18, possibly from connection issues; you may want to check all cabling, the power supply, etc.

 

 
