lungnut Posted March 24, 2018

Motherboard: SUPERMICRO MBD-X9SCM-F-O
CPU: Intel Core i3-2120 3.3GHz
RAM: Kingston 4GB DDR3 1333
SATA Controllers: LSI 9211-8i (X 2) (these two just replaced two SUPERMICRO AOC-SASLP-MV8's, see below)
Case: Norco 4220
Unraid: v5.0.4 (largest disk size 6TB)

Hello,

Last week I attempted a parity check. No reason other than it had been nearly a year since my last one. During the parity check, eight of my disks went offline and when I rebooted, they came back but one was redballed (disk16, a 6TB drive). I replaced the red-balled drive and attempted to rebuild but again the same eight disks went offline and the rebuild failed. All of the offline disks were connected to one of my SATA controllers (a Supermicro SASLP). I tried to replace the bad card with a SAS2LP, but the speed of the rebuild was atrocious. So I did some reading and decided to replace both my SATA cards with the LSI cards that are listed above.

My rebuild from that point went great, tooling around at 100 MB/s. However, at approximately the 3% mark, another one of my disks started reporting a ton of errors, disk14, a 1.5TB drive (also the temperature of this drive was replaced with an asterisk). The rebuild progress still seemingly continued, but the new drive (the one that replaced the redballed drive) was no longer accumulating "writes" and the "errors" drive was accumulating errors. I hope this makes sense.

I stopped and tried again. Same thing. Right around the 3% mark, the redballed drive stopped accumulating writes and the other drive's temp was replaced with an asterisk and it was reporting errors. When I refresh the GUI, the errors keep mounting, and the redballed drive is no longer being written to.

At this point I realized that a lot of my drives were running really hot. So I looked at my fan situation and tried to improve them (I had moved recently and some of the fans in my case had become unattached to their anchors). I tried my rebuild again, and overall temps were reduced by about 8 degrees per disk. That was an improvement, but unfortunately, the same thing happened again.

I have attached a screenshot of my GUI and also my syslog. I am going to stop the current rebuild. It is currently at 405GB (7%), but the red-balled drive isn't being written to and the only thing that seems to be happening is the errors for the green-but-troubled drive are ticking upwards.

I need a plan. What I would give to get out of this without losing data! Please help if you have any ideas.

Serversyslog.txt
Squid Posted March 24, 2018 The attached syslog is from December 1st. Repost the current one.
JorgeB Posted March 24, 2018 Also post a SMART report for disk14; you may need to reboot or power cycle if it dropped offline. PS: there's no point in continuing the rebuild, as it's rebuilding garbage.
lungnut Posted March 24, 2018 I really need to learn how to read a syslog. Sorry to waste your time. I was under the impression that the syslog file on my thumb drive was always updated to the last restart. I am attaching the real one now. I will search on how to pull a SMART report for disk14 and post it once I've figured it out. Serversyslog.txt
lungnut Posted March 24, 2018 And here is the SMART report for disk14. smart.txt
lungnut Posted March 24, 2018 Not sure if this is the same info I just posted as smart.txt in my previous post, but just in case this is different information or just plain easier to read, here's the SMART report from my telnet session. Disk14_SMART.txt
JorgeB Posted March 25, 2018 SMART looks fine; run an extended SMART test and post a new report. The syslog stops 2 seconds after the disk rebuild starts, so none of the errors are visible. Can you post one that covers the errors?
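
For readers following along, here is a minimal sketch of the two commands that a request like this usually maps to. It assumes the smartmontools `smartctl` utility is available on the server's console or telnet session; the thread does not say which tool was used, and the device path `/dev/sdX` is a placeholder, not disk14's real device name. The helper name `smartctl_commands` is hypothetical, and the sketch only builds and prints the commands rather than running them.

```python
def smartctl_commands(device: str) -> dict:
    """Build (but do not run) smartctl invocations for one drive.

    Assumption: smartmontools is installed and `device` is the drive's
    block device path, e.g. the placeholder "/dev/sdX".
    """
    return {
        # -t long starts the drive's extended (long) self-test in the background
        "start_extended_test": ["smartctl", "-t", "long", device],
        # -a prints all SMART information, including the self-test log
        "full_report": ["smartctl", "-a", device],
    }


if __name__ == "__main__":
    for name, cmd in smartctl_commands("/dev/sdX").items():
        print(f"{name}: {' '.join(cmd)}")
```

The extended test runs inside the drive and can take hours on a large disk, so the full report is normally pulled again once the test has finished.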
lungnut Posted March 25, 2018 Ah, didn't realize about my syslog being so short. This will be better. Since my syslog is full of individual sector errors for disk14, it's rather long and (probably) redundant. So I offer three sizes. The full one is really unwieldy. Just to be clear: all three start from the beginning; the difference is in how "far" they go. NOTE: I have to zip them in order to upload them because of their size, sorry. I am going to run the extended SMART now. Thanks so much for the help. syslog_three-lengths.zip
lungnut Posted March 26, 2018 And, finally, the results of the extended SMART test (attached). Looks like no errors? Is there any way to get this drive up and running successfully? Because if I am reading the situation correctly, if I can't get this drive back up, I might lose all the data on both disk14 and disk16, right? If there is ANY way to avoid that, I cannot tell you how appreciative I'd be. SMART_Extended.txt
JorgeB Posted March 26, 2018 Swap both cables (or backplane slot) for disk14 with another disk and try the rebuild again.
lungnut Posted March 27, 2018 Swapped the cable for the backplane of disk14 and tried to rebuild. Unfortunately, the errors started accumulating around 7% thru the rebuild. Syslog attached. Any other ideas? I really appreciate the help. SMART_180326.txt
JorgeB Posted March 27, 2018 If you're using backplanes instead of cables, it would be best to swap backplane slots with another disk to rule that out.
jowi Posted March 27, 2018 65292 power on hours... that is 7.5 years full on... no wonder that drive is going bad. The drive itself must be over 10 years old, I assume. Just replace it, don't bother trying to repair it.
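
The conversion behind that figure can be sketched as below. This is only the arithmetic, assuming the 65292 value is the drive's SMART power-on-hours counter as quoted in the post; the function name `power_on_years` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap days


def power_on_years(hours: float) -> float:
    """Convert a power-on-hours count into years of continuous running."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 65292 is the power-on-hours value quoted in the post above
    print(f"{power_on_years(65292):.2f} years")  # about 7.45 years
```

That comes out slightly under the rounded "7.5 years" in the post, and it counts only time spent powered on, so the drive's calendar age is at least that long.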
lungnut Posted March 28, 2018

johnnie: I swapped disk14 and disk10's positions in my chassis and started another rebuild. So far, this one has been going okay, I think? There were a handful of errors on yet another disk, disk13, but I think those are sorted? I actually have no idea. I am 31% into the rebuild. I'm attaching a jpg and my syslog. Am I okay to allow this to continue? It has slowed down considerably and another odd thing is that even though my other 1.5TB drives have gone to sleep (because my rebuild is beyond their size at this point), the "trouble" drive disk14 hasn't yet gone to sleep. I thought that was odd. (Note: I'm about to go into work and won't be able to check in on things for about 11 hours or so).

And to jowi: My plan after this is to go up to a 12TB parity and then replace one of my 1.5TB with another 12TB, and then copy all of my other smaller, older drives' data onto that new disk. I am not in a position where I can easily dispose of the data that are on these drives. Yes, I should have paid better attention and should have replaced those drives earlier, but I am where I'm at right now trying to make the best of it. Do you really think I'm in an un-salvageable state with my data?

syslog_180328AM.txt
JorgeB Posted March 28, 2018
48 minutes ago, lungnut said: "There were a handful of errors on yet another disk, disk13, but I think those are sorted?"
These will result in some corruption on the rebuilt disk, but since it's only a few errors, let it finish. Possibly it will only affect a single file, and it would only be a small glitch if it's e.g. a movie.
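
Why read errors on a different disk end up as corruption on the rebuilt one can be shown with a toy model. The sketch below assumes single parity computed as a byte-wise XOR across all data disks, which is the usual single-parity scheme; the tiny byte strings and the function names are made up for illustration and are not taken from this server.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity_block):
    """Reconstruct the one missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Three toy "disks" of 8 bytes each, plus their parity block.
data = [b"disk-one", b"disk-two", b"disk-3!!"]
parity = xor_blocks(data)

# Clean case: the third disk is lost and every other read succeeds.
rebuilt = rebuild_missing([data[0], data[1]], parity)

# Faulty case: one byte of a surviving disk is read back wrong.
bad_read = bytearray(data[1])
bad_read[0] ^= 0xFF
corrupted = rebuild_missing([data[0], bytes(bad_read)], parity)

if __name__ == "__main__":
    print(rebuilt == data[2])             # True: clean reads give a clean rebuild
    print(corrupted == data[2])           # False: the bad read leaks into the rebuild
    print(corrupted[1:] == data[2][1:])   # True: only the affected position is wrong
```

In this model the damage is confined to the positions where a read went wrong, which matches the advice above that a handful of errors is likely to touch only a small part of the rebuilt disk.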
jowi Posted March 29, 2018
16 hours ago, lungnut said: "And to jowi: My plan after this is to go up to a 12TB parity and then replace one of my 1.5TB with another 12TB, and then copy all of my other smaller, older drives' data onto that new disk[...]. Do you really think I'm in an un-salvageable state with my data?"
If I were you, I would build a new V6-based Unraid server, place the new 12TB parity + 12TB drive in that, and copy over everything from the old server to the new one. The latest picture shows that disk 14 is OK again? Disk 12 is disabled, disk 13 has errors and disk 16 is redballed... damn. I would never have swapped disks and performed a rebuild... if things go wrong now, you are done. Your server is a disaster waiting to happen: it's got very old drives, and you never maintained it properly. I run a parity check every month and check the SMART status on a regular basis. I'm using 9x 4TB drives; it's been running for about 5 years now, and I've had to replace 2 disks already.
lungnut Posted March 30, 2018

johnnie: All right! Finally--I am back up. I'm so thankful. I've attached my syslog in case you have the time to give it a quick glance. I think I'm good?

Moving forward, I thought I would move to 12TB drives and retire all of my 1.5TB, 2TB, and 3TB drives, but after weighing the cost of 12TB drives vs. 6TB drives, I think I'll stick with 6TB drives and do the same at way less cost. It will take six of them to retire all 15 of my smaller drives.

Question(s): I am right in thinking that I should upgrade to 6.5 first before doing this, right? Also I believe Unraid also now supports dual parity, so it also seems a good idea to add another parity as soon as I have an empty slot in my Norco case, yeah?

Oh, and johnnie--in looking at your profile, you have me beat as a forum member by four days (December 2007). But it's amazing how much more you know than I do; I feel ashamed, hahaha. Again, thanks so much for helping me out. I can't tell you how much I freaking appreciate it.

syslog_180329.txt
JorgeB Posted March 30, 2018
2 hours ago, lungnut said: "Question(s): I am right in thinking that I should upgrade to 6.5 first before doing this, right? Also I believe Unraid also now supports dual parity, so it also seems a good idea to add another parity as soon as I have an empty slot in my Norco case, yeah?"
Yes and yes. During the rebuild, besides the disk13 issues, there were a few recoverable ATA errors on both disks 17 and 18, possibly connection issues. You may want to check all cabling, power supply, etc.