[HELP] 2-3 drive fail + 1 drive read error

Xili · November 4, 2022

Hello

I'm asking for help because I'm in a delicate situation and I don't want to make things worse than I already have.

Apologies in advance for the Google translation.

To explain the situation: I have an array of 18 HDDs (3 TB to 6 TB) and two 14 TB parity disks (I had started renewing the disks), plus an NVMe cache and tmp on an SSD.

I had the brilliant idea of removing the small, old HDDs from the array, namely one 2 TB and two 3 TB drives. I replaced two of them with two 6 TB disks I had on hand; the rebuild went fine, followed by 2 days without problems.
I then replaced the last one with another 6 TB disk, and during its rebuild one of the two previous replacements failed and stopped the rebuild.

That is when I realized that I had reinstalled disks that Unraid had kicked out of the array a few months ago...
A big blunder on my part.

I no longer remember exactly what I tried with these 3 disks, but I now have a system with 2 failed HDDs (disk 5 and disk 6) and one with read errors (disk 2), which makes the system very slow and could fail on me at any time.

So I urgently ordered two 8 TB disks and shut down the machine while waiting.

Today, after checking the integrity of the new disks, I installed them in place of the two failed ones to attempt a rebuild despite the read problems on a third disk.

I hoped it would go through, but the rebuild, while in progress, is extremely slow (I won't let it run to the end as it is) and apparently has many errors. In addition, one disk (17) shows "Unmountable: wrong or no file system".

Concretely, what do I have on my array?

A lot (too much) of media, which is recoverable; losing it all would be annoying but not dramatic.
On the other hand, I have 1.5 TB of Nextcloud data (obviously without a backup; I never took the time to deal with my Google Drive ban), with personal/family documents and photos, which I'm much more worried about.
All my appdata is on the NVMe cache, so it is safe.

At the moment the rebuild is in progress, but it is too slow for me to let it run. Unless it speeds up, this approach will not work (especially since it seems to produce a lot of errors). I don't know what to think, or whether I should let it run or not.

I am considering stopping the machine and making a new config without the 3 disks, telling Unraid that the config is valid (with no certainty that it works), then taking the 3 disks out and mounting them on another machine to recover their contents. I don't know whether Unraid splits files across disks or not; if it does, this is not feasible.

The last possibility: start the machine with the disks mounted and recover my Nextcloud data. Even if it is very slow and takes me a month, that would be manageable, unlike the current rebuild time.

Thank you for your help ….

tower-diagnostics-20221104-1845.zip

Edited November 4, 2022 by Xili
screen

JorgeB · November 4, 2022

Adaptec controller crashed:

Nov  4 13:16:18 Tower kernel: aacraid: Host bus reset request. SCSI hang ?
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Adapter health - -3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: midlevel-8
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: lowlevel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: error handler-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: firmware-1
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: kernel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Controller reset type is 3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Issuing IOP reset
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: IOP reset failed
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: ARC Reset attempt failed

Start by rebooting and post new diags after array start.
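
For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch. It assumes the line layout seen in the excerpt above (timestamp, host name, `kernel:`, then `aacraid` with an optional PCI address); the function name `summarize_aacraid` and the `SAMPLE` list are made up for illustration and are not part of Unraid or the aacraid driver.

```python
import re

# Layout assumed from the excerpt above:
# "<Mon> <day> <hh:mm:ss> <host> kernel: aacraid[ <pci address>]: <message>"
AACRAID_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"aacraid(?:\s+(?P<pci>[0-9a-f:.]+))?:\s+(?P<msg>.*)$"
)


def summarize_aacraid(lines):
    """Count aacraid kernel messages and flag reset requests and failed resets."""
    messages = []
    for line in lines:
        match = AACRAID_LINE.match(line.strip())
        if match:
            messages.append(match.group("msg").lower())
    return {
        "events": len(messages),
        "reset_requested": any("reset request" in m for m in messages),
        "reset_failed": any("reset" in m and "failed" in m for m in messages),
    }


# A few lines copied from the log excerpt above.
SAMPLE = [
    "Nov  4 13:16:18 Tower kernel: aacraid: Host bus reset request. SCSI hang ?",
    "Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Adapter health - -3",
    "Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: IOP reset failed",
    "Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: ARC Reset attempt failed",
]

print(summarize_aacraid(SAMPLE))
```

A failed reset in the summary only says the controller stopped responding; as the rest of the thread shows, the root cause can still be a disk behind it.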

Xili · November 4, 2022

51 minutes ago, JorgeB said:

Adaptec controller crashed:

Nov  4 13:16:18 Tower kernel: aacraid: Host bus reset request. SCSI hang ?
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Adapter health - -3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: midlevel-8
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: lowlevel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: error handler-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: firmware-1
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: kernel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Controller reset type is 3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Issuing IOP reset
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: IOP reset failed
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: ARC Reset attempt failed

Start by rebooting and post new diags after array start.

Thanks for the reply.

With or without the new 8 TB HDDs?

JorgeB · November 4, 2022

For now with the array as it is.

Xili · November 4, 2022

array is starting ..... very slowly

Array Starting•Mounting disks...

tower-diagnostics-20221104-2103.zip

Xili · November 4, 2022

Started.

Edit: screenshot added 🙄

tower-diagnostics-20221104-2120.zip

Edited November 4, 2022 by Xili

Xili · November 4, 2022

Well

Unraid attempted a rebuild with the two 8 TB drives after the array started. The problem: one of the two disks is in error, and the other ran to completion (a strangely fast rebuild with a lot of errors) but does not seem to be usable. It shows "Unmountable", surely because it was not forced at the beginning of the process. And in the middle of all that, disk 1 was flagged with errors, although it should be good.

It doesn't look good to me.

tower-diagnostics-20221105-0039.zip

JorgeB · November 5, 2022

Still constant problems with the controller, make sure it's well seated or try a different slot if available, at least for now also disable IOMMU since it appears to be causing issues, then post new diags after array start.

Xili · November 6, 2022

On 11/5/2022 at 10:56 AM, JorgeB said:

Still constant problems with the controller, make sure it's well seated or try a different slot if available, at least for now also disable IOMMU since it appears to be causing issues, then post new diags after array start.

OK,

I disabled IOMMU and removed everything that was on PCI. Only the Adaptec card, which I moved to a different slot and which is firmly seated, and the NVMe cache remain.

I did not dare to start the array (even in maintenance), for fear of making things worse, because I have:

- disk 1, which appears disabled by Unraid

- disk 5, which seems OK but is not formatted and, I guess, empty

- the new disk 6, tested OK, which appears disabled by Unraid

I await your instructions.

Thank you for your help

JorgeB · November 6, 2022

Post new diags after array start.

Xili · November 6, 2022

42 minutes ago, JorgeB said:

Post new diags after array start.

Very fast start this time. I started in maintenance mode.

tower-diagnostics-20221106-1425.zip

JorgeB · November 6, 2022

Still seeing the same HBA issues, do you have a different HBA you could use? Ideally an LSI.

Xili · November 6, 2022

No, I currently have no other card. I have had this one for several years without problems (or without noticing any). It used to manage my RAID 6, but since my switch to Unraid it only serves to expose the disks individually. I had also put it in a particular mode; maybe that's the problem? I'm not sure I understood the problem: what's wrong with the card? What are the errors?

JorgeB · November 6, 2022

I'm not very familiar with those controllers, but after taking a second look it looks more like disk2 is causing the HBA-related timeouts. What disk(s) did you last replace before you got read errors from disk2? Do you still have those old disks intact? Was anything written to the array after those disks were removed?

Xili · November 6, 2022

I initially replaced disks 2 (2 TB), 5 (3 TB), and 6 (3 TB) with 6 TB drives. The replaced drives were functional but had been in operation for 7 years, so I wanted to take them out as a precaution... funny to write this now.

The rebuilds of disks 2 and 5 went fine; during the rebuild of disk 6 there was an error, and disk 2 failed.
The 3 original disks, 2 (2 TB), 5 (3 TB), and 6 (3 TB), have not been touched; they are set aside, and I know which is which (normally).

During the disk 6 rebuild error (or after it), I got a read error message for disk 2. It is this disk that seems to be causing the array's stability problems and, I guess, the latency with the controller.

Since the disks were changed, not much has been done on the array. The containers were running during the rebuilds, but everything is in /appdata on the NVMe cache. Only my Nextcloud data folder may have changed (I'm not sure).

The only big operations launched since then were a parity check that I canceled because it was very, very slow,
attempts to rebuild the new disks, with many errors found,
and, at the beginning, starting the array to try to recover some files, but that was read-only.

I guess you're going to suggest creating a new configuration with the old disks in their original slots?

Edited November 6, 2022 by Xili

JorgeB · November 7, 2022

17 hours ago, Xili said:

I guess you're going to suggest creating a new configuration with the old disks in their original slots?

Correct, if you have all 3 replaced disks you can do a new config with them, since parity should be mostly valid you can check "parity is already valid" before array start but should then run a correcting check.

Xili · November 7, 2022

2 hours ago, JorgeB said:

Correct, if you have all 3 replaced disks you can do a new config with them, since parity should be mostly valid you can check "parity is already valid" before array start but should then run a correcting check.

When you say before starting the array, that is to say that I start in maintenance? With "parity is already valid" checked, and then I run a parity check?

JorgeB · November 7, 2022

46 minutes ago, Xili said:

that is to say that I start in maintenance?

No, there's no need to start in maintenance mode, after doing a new config there will be a checkbox with "parity is already valid" next to the array start button, just check that before first array start.

Xili · November 7, 2022

5 hours ago, JorgeB said:

No, there's no need to start in maintenance mode, after doing a new config there will be a checkbox with "parity is already valid" next to the array start button, just check that before first array start.

OK. Would it be a problem if I replace my two 14 TB parity disks with 16 TB ones, so that I can add the 14 TB disks to the array very soon? Or is it better to start with the original parity disks first?

JorgeB · November 7, 2022

If there's no reason to suspect issues with any of the data drives you can replace the parity drives now, but in that case do not check "parity is already valid", parity will be synced after first array start.

itimpi · November 7, 2022

8 minutes ago, Xili said:

OK. Would it be a problem if I replace my two 14 TB parity disks with 16 TB ones, so that I can add the 14 TB disks to the array very soon? Or is it better to start with the original parity disks first?

If you do this you must NOT tick the Parity is Valid checkbox as you want Unraid to build parity onto these drives.
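
The two replies above boil down to a simple rule for the first array start after a New Config. Below is a small Python sketch of that rule, purely as a reading aid: the function name `new_config_plan` and its return format are invented here and have nothing to do with Unraid's own code. It also assumes, as JorgeB notes, that there is no reason to suspect the data drives.

```python
def new_config_plan(parity_disks_replaced: bool) -> dict:
    """Summarize the thread's advice for the first array start after a New Config.

    Assumption: there is no reason to suspect any of the data drives.
    """
    if parity_disks_replaced:
        # New parity disks hold no valid parity, so Unraid must build it.
        return {
            "tick_parity_is_already_valid": False,
            "follow_up": "parity is synced after the first array start",
        }
    # Original parity disks kept: parity should be mostly valid.
    return {
        "tick_parity_is_already_valid": True,
        "follow_up": "run a correcting parity check",
    }


print(new_config_plan(parity_disks_replaced=True))
print(new_config_plan(parity_disks_replaced=False))
```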

Xili · November 7, 2022

OK, strange, because no matter whether I check the box or not, I get this message on my parity disks: "All existing data on this device will be OVERWRITTEN when array is Started". Given what you tell me, it might be safer to start with the original parity disks and then replace them one by one.

I admit that I am a bit cautious, and I would like to take as little risk as possible

JorgeB · November 7, 2022

34 minutes ago, Xili said:

because no matter whether I check the box or not, I get this message on my parity disks: "All existing data on this device will be OVERWRITTEN when array is Started"

The GUI does not reflect the checkbox, but parity won't be overwritten if it is checked.

Xili · November 7, 2022

OK, so I started the array with the new config. Before restarting anything else, I am recovering what I didn't want to lose. It's quietly copying over the network to another machine at 90 MB/s; I'm hoping there are no errors or corrupted files.

As for the rest, do you think I can resume normal use as before, or do I need to run some checks first?
Running a parity check seems necessary, I think.

Thank you for this first step
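
On the worry about corrupted files in the copy: one way to check that the copy matches what the array serves is to compare checksums on both sides. The following Python sketch does that with SHA-256 over two directory trees. The function names are made up for illustration, and the paths in the usage comment are hypothetical. Note that this only shows whether the copy is faithful; it cannot tell whether a file was already damaged on the array before it was copied.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """List files that are missing from the copy or whose contents differ."""
    source_root, copy_root = Path(source_root), Path(copy_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = copy_root / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((str(rel), "mismatch"))
    return problems


# Usage (hypothetical paths, adjust to your own shares):
# for name, issue in compare_trees("/mnt/user/nextcloud", "/mnt/backup/nextcloud"):
#     print(issue, name)
```

An empty result means every source file exists in the copy with identical contents.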

JorgeB · November 7, 2022

You can do a parity check after the sync, but if the sync completes at a normal speed without any errors all should be OK.
