OK, this looks bad. What do I do next?

Karyudo · December 20, 2016

I am having a terrible time with unRAID. In the six months I've been running my server, I've had a new motherboard fail (just over a month ago), a new cache SSD fail (last couple of weeks), and now a new drive—which I conscientiously pre-cleared—look like it's failing.

I just got the SSD straightened out yesterday, and then in the evening, there seemed to be a power anomaly (which I have never noticed happening before in the five years we've lived here). My desktop stayed on, but the server reset. Seemed to boot fine, all the drives came up, but of course it hadn't been shut down properly, so a parity check was recommended. So I started that, and then went to bed.

This morning, one drive is marked with an 'x' and "device is disabled, contents emulated." But the parity check kept running until just now (14 hours or so later). The Main screen looks... not good. (Screenshot attached.)

I'm getting very, very, very, very tired of the stress associated with wondering if I'm going to lose data, and wondering what to do in what order to make sure I don't screw anything up. Could somebody please talk me through this?

(Oh, and before you think it: the UPS has been ordered.)

JorgeB · December 20, 2016

You didn't post your diagnostics, but there are errors on multiple disks, so it's almost certainly a controller issue.

Karyudo · December 20, 2016

Oh, yeah: diagnostics. Forgot about that. File's attached....

I guess I'm using at least three controllers: two PCI-E cards (two different brands), plus at least one on the mobo (might be two?). Are you suggesting that if a controller is bad, I could move drives to another controller, and all could be well? That would be pretty great.

unRAID-diagnostics-20161220-0745.zip

JorgeB · December 20, 2016

Besides a couple of CRC errors at the beginning, the log is full of Community Applications backup entries, so I can't see what happened.

Karyudo · December 20, 2016

Huh. My order of operations was CA backup (December 18, when I had finished resetting everything on the new SSD), then parity check (late December 18 through almost all day December 19). Does the parity check really not write anything interesting to what turns up in the diagnostics? That's surprising; I'd have thought it would.

What can I try next that will definitely not screw anything up? I'd like to move forward, but I don't have a good idea of what will get done in the background to fix or ruin my array.

I could, for example, sort out which drives are connected to which controller cards. But I don't want to waste time gathering information on red herrings, if that's not useful.

JorgeB · December 20, 2016

These are the syslog's last entries:

Server/Media/localhost/0/d595daea9917d359763fe010abf7d7cf3bafa4c.bundle/
Dec 18 23:30:00 Shinagawa rsyncd[11388]: cd+++++++++ Plex/config/Library/Application Support/Plex Media Server/Media/localhost/0/d595daea9917d359763fe010abf7d7cf3bafa4c.bundle/Contents/
Dec 18 23:30:00 Shinagawa rsyncd[11388]: >f+++++++++ Plex/config/Library/Application Support/Plex Media Server/Media/localhost/0/d

JorgeB · December 20, 2016

It filled up and there's nothing there from the parity check:

tmpfs           128M  128M     0 100% /var/log
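A quick way to catch this before it hides the next problem is to check how full the log filesystem is. This is only a minimal sketch: the function names and the 90% threshold are my own assumptions, not anything unRAID provides.

```python
import os
import shutil


def usage_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True when usage is at or above `threshold` percent (threshold is an assumed default)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    # /var/log is the tmpfs shown above; fall back to the current directory elsewhere.
    target = "/var/log" if os.path.isdir("/var/log") else "."
    print(f"{target}: {usage_percent(target):.0f}% used, nearly full: {is_nearly_full(target)}")
```

The percentage here is used/total, so it can differ slightly from what `df` prints.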

Squid · December 20, 2016

Besides a couple of CRC errors at the beginning, the log is full of Community Applications backup entries, so I can't see what happened.

If the docker.img has errors or is completely full, rsync decides on its own to log to the syslog instead of my designated log file.


Karyudo · December 20, 2016

My main focus is to get back up and running with minimal (i.e. zero, if possible) data loss. Looks like the usual diagnostics are not available, but I hope you (Squid and johnnie.black) have some suggestions based on experience, nonetheless.

Can I take the array offline, reboot, and try again (as long as I'm no worse off than I am now by doing so)?

Looks like my controller topology is:

- all four 8TB data drives on a Vantec controller

- both 8TB parity drives on an Iocrest controller

- both 4TB WD drives on the mobo controller

- unassigned cache drive also on mobo controller

If a reboot doesn't change anything, then should I perhaps move drives off the Vantec controller, and onto something else (I have another Iocrest controller, plus space on the mobo and existing Iocrest controllers)?

What else can I provide to help make good decisions?

JorgeB · December 20, 2016

- all four 8TB data drives on a Vantec controller

Looks like this one is the problem, then. If it's a Marvell controller and you don't need VT-d, disable VT-d.

Rebooting will clear the read errors and should bring all your data online, but disk6 will need to be rebuilt.
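The reasoning here (errors on several disks at once point at what they share) can be sketched roughly as follows. The controller layout comes from the post above, but the disk labels and the set of disks showing errors are hypothetical placeholders, since the screenshot isn't reproduced here.

```python
from collections import Counter

# Controller layout as described in the thread; the disk labels are hypothetical.
TOPOLOGY = {
    "data-8TB-1": "Vantec",
    "data-8TB-2": "Vantec",
    "data-8TB-3": "Vantec",
    "data-8TB-4": "Vantec",
    "parity-8TB-1": "Iocrest",
    "parity-8TB-2": "Iocrest",
    "wd-4TB-1": "mobo",
    "wd-4TB-2": "mobo",
    "cache": "mobo",
}


def suspect_controllers(topology, errored_disks, min_disks=2):
    """Return controllers hosting at least `min_disks` errored disks, most affected first."""
    counts = Counter(topology[d] for d in errored_disks if d in topology)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, n in ranked if n >= min_disks]


if __name__ == "__main__":
    # Hypothetical input: three of the 8TB data drives showing errors.
    print(suspect_controllers(TOPOLOGY, {"data-8TB-1", "data-8TB-2", "data-8TB-3"}))
```

A single errored disk would return nothing, which matches the intuition that one bad disk points at the disk or its cable rather than the controller.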

Karyudo · December 20, 2016

OK, here's what I'm going to do:

- power down, remove Vantec controller, move drives to other controller(s) -- I have enough SATA ports elsewhere

- power up; hopefully everything looks good

- rebuild Disk 5 (the one with the red 'x')

You did mean Disk 5, right?

JorgeB · December 20, 2016

Yes, disk5.

Karyudo · December 20, 2016

OK: progress. Powered up, and all drives (except for Disk 5) come up green on the new controller ports. Array started. Diagnostic file attached.

As predicted, Disk 5 is "unmountable." How do I go about rebuilding it?

Does Disk 5 need to be replaced? (At CA$250 per, I sure hope not. But I do have another new-but-not-precleared 8TB drive, if necessary.)

unRAID-diagnostics-20161220-1304.zip

JorgeB · December 20, 2016

Before I forget, this is unrelated, but you need to replace disk3's SATA cable; there are still CRC errors.

Disk5 needing a rebuild was expected, being unmountable was not, and it's formatted BTRFS, so it can be harder to fix.

So I would recommend first unassigning it and trying to mount it using the UD plugin so you can check if the actual disk is also unmountable.

JorgeB · December 20, 2016

You also have another problem, parity1 needs replacing, and this is an actual disk problem:

Model Family:     Seagate Archive HDD
Device Model:     ST8000AS0002-1NA17Z
Serial Number:    Z840EA9H

197 Current_Pending_Sector  0x0012   093   093   000    Old_age   Always       -       2600
198 Offline_Uncorrectable   0x0010   093   093   000    Old_age   Offline      -       2600

John_M · December 20, 2016

Disk5's SMART report is not good either - it has 3888 pending sectors.

JorgeB · December 20, 2016

Disk5 is also not good. Without the syslog it's just guessing, but maybe its errors crashed the controller. It's still good to use different ports if available.

Model Family:     Seagate Archive HDD
Device Model:     ST8000AS0002-1NA17Z
Serial Number:    Z840P64M
197 Current_Pending_Sector  0x0012   089   089   000    Old_age   Always       -       3888
198 Offline_Uncorrectable   0x0010   089   089   000    Old_age   Offline      -       3888
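For anyone who wants to check these values without reading the whole report, here is a minimal sketch that pulls the raw values out of attribute lines like the ones above. It assumes the usual ten-column attribute table; the function names and the list of watched attributes are my own choices.

```python
# Sample lines copied from the report above.
SAMPLE = """\
197 Current_Pending_Sector  0x0012   089   089   000    Old_age   Always       -       3888
198 Offline_Uncorrectable   0x0010   089   089   000    Old_age   Offline      -       3888
"""

# Attributes where any non-zero raw value deserves attention (an assumed shortlist).
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable", "Reallocated_Sector_Ct")


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for each well-formed attribute line."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            attrs[parts[1]] = int(parts[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip it
    return attrs


def worrying_attributes(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    print(worrying_attributes(parse_smart_attributes(SAMPLE)))
```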

JorgeB · December 20, 2016

So let me revise my previous advice:

Get a spare to replace disk5, disable parity1, rebuild disk5 to the spare and try to fix the filesystem.

Keep old disk5 intact, most data should still be salvageable if the rebuild or filesystem repair goes bad.

Karyudo · December 20, 2016

OK: have replaced Disk 3 SATA cable. Diagnostic attached.

Next, I will:

- install new 8TB drive to replace Disk 5 (and keep Disk 5 intact, just in case)

- disable Parity 1, by unplugging and disconnecting it

- see if I can stumble my way to rebuilding Disk 5 from Parity 2

Do I need to pre-clear the replacement Disk 5? If not, I imagine it would still be a good idea...?

(Mostly unrelated note: it would be nice if parity drives were explicitly labelled "Parity 1" and "Parity 2," instead of just "Parity" and "Parity 2.")

unRAID-diagnostics-20161220-1533.zip

John_M · December 21, 2016

You can just unassign Parity 1 in the GUI - no need to remove it physically. You don't have to pre-clear the replacement for Disk 5, but you might want to test it first. With two bad disks your array is unprotected at the moment and pre-clearing the new disk would take a long time. Personally, I'd want to get another disk in there as soon as possible but at the very least I'd run a short SMART self-test on it. The rebuild process is going to write every single bit of 8 TB to it.

Karyudo · December 21, 2016

Took out both Disk 5 and Parity drives. Put two new 8TB drives in.

Assigned one drive (serial ending LCHC) as Disk 5. Ran Short SMART test; completed without errors; SMART report attached.

I've left the other drive (ending MZHK) unassigned. I see it in the choices for Parity, and also under the Unassigned Devices plugin section.

Now, how exactly do I go about rebuilding Disk 5? I don't see anything obvious. unRAID says I can't start the array: "Invalid configuration: Too many wrong and/or missing disks!"

It would be nice to get the rebuild started before about 11 PM (PST), so it can be working while I'm asleep tonight....

unRAID-smart-20161220-1658.zip

John_M · December 21, 2016

That SMART report looks OK. The trouble with the short self-test is that it doesn't have time to test much of the drive but at least it confirms that the electronics are working and it hasn't been catastrophically damaged in transit.

Strange that the array won't start, as you're allowed to have two of the original disks missing. Just checking that Parity 2, Disks 1 to 4 and 6 are all present and showing a green status. Disk 5 is a new disk with blue status and Parity 1 is unassigned? And that you powered the system down to replace the drives (i.e. no hot-swapping)?

Perhaps the aborted parity check following the unsafe shutdown has upset it and it considers Parity 2 to be invalid too, which is not unreasonable as there are very likely to be some errors.

What I would do is seek johnnie.black's advice, but failing that I'd do Tools -> New Config and, using the screen grab in your first post as a guide, assign the correct drives to Parity 2, Disks 1 to 4 and 6. Then I'd assign the new drive to Disk 5 and leave Parity 1 unassigned (in other words, just as it currently is). Important: check the box labelled "Parity is already valid". Double check the assignments and then start the array. This time it should start and Disk 5 should rebuild.

This won't affect any of the data on your other disks and you still have the original Disk 5 as insurance.

If all goes to plan you'll still need to do a file system check/repair on Disk 5 once it has finished rebuilding.

If the array still won't start, post your diagnostics.

Karyudo · December 21, 2016

Just checking that Parity 2, Disks 1 to 4 and 6 are all present and showing a green status. Disk 5 is a new disk with blue status and Parity 1 is unassigned? And that you powered the system down to replace the drives (ie. no hot-swapping)?

Yup. Just like that: no hot-swapping, blue status on Disk 5, green on everything else, Parity 1 unassigned. Screenshot attached. (Man, that 320kb maximum is unnecessarily stringent: I can't even get a full screenshot at original resolution into that size!)

Perhaps the aborted parity check following the unsafe shutdown has upset it and it considers Parity 2 to be invalid too, which is not unreasonable as there are very likely to be some errors.

The parity check after the unsafe shutdown wasn't aborted, but it did complete with what seems to have been a SATA controller issue. The Main tab didn't seem to have any issues with the Parity drive (although I guess the logs did). To remove the Parity drive, I didn't do anything ahead of time: I just shut down, and re-started without the drive in place. (And yeah, I'm quite sure I got the right drive, by confirming the serial number on the drive with the snippet johnnie.black included in his post.) Should I have instead done something in the GUI while the drive was spun up, to make sure it was more gracefully removed from the array?

What I would do is seek johnnie.black's advice...

Yeah, I'm hoping he pokes his head in again soon!

...but failing that I'd do Tools -> New Config [...] This won't affect any of the data on your other disks and you still have the original Disk 5 as insurance.

Thanks for the outline of what to try next, and a big THANK YOU for adding some confirmation that such a move won't affect the other disks!

John_M · December 21, 2016

I would just have unassigned Parity 1 in the GUI and left it physically connected with the aim of investigating it further once my data was safe.

Pending sectors are bad news in unRAID because they can't be read, but a cycle of pre-clear might convert them to remapped sectors, which makes the disk usable again, with the proviso that you keep a close eye on it and reject it if it gets worse.

JorgeB · December 21, 2016

Just checking that Parity 2, Disks 1 to 4 and 6 are all present and showing a green status. Disk 5 is a new disk with blue status and Parity 1 is unassigned? And that you powered the system down to replace the drives (ie. no hot-swapping)?

Yup. Just like that: no hot-swapping, blue status on Disk 5, green on everything else, Parity 1 unassigned. Screenshot attached. (Man, that 320kb maximum is unnecessarily stringent: I can't even get a full screenshot at original resolution into that size!)

That's probably a bug. Try this:

- set disk5 to unassigned

- leave parity1 unassigned

- you should now be able to start the array

- stop the array

- reassign the new disk5

- start the array to begin the rebuild
