[HELP] 2-3 failed drives + 1 drive with read errors


Xili


Hello

 

I'm asking for help because I'm in a delicate situation and I don't want to make things any worse than I already have.

 

Apologies in advance for the Google translation.

 

To explain the situation: I have an array of 18 HDDs (3 TB to 6 TB) and two 14 TB parity disks (I had started renewing the disks), plus an NVMe cache and a tmp on SSD.

 

I had the brilliant idea of removing the small, old HDDs from the array, namely one 2 TB and two 3 TB disks. I replaced two of them with two 6 TB disks I had on hand; the rebuild went fine, followed by 2 days without problems.
I then replaced the last one with another 6 TB disk, and during that rebuild one of the two previous replacements failed and stopped the rebuild.


That is when I realized that I had put back disks that Unraid had kicked out of the array a few months ago...
A big blunder on my part.

 

 

I no longer remember exactly what I tried with these 3 disks, but I now have a system with 2 failed HDDs (disk 5 and disk 6) and one with read errors (disk 2), which makes the system very slow and which may well fail on me too.

 

So I urgently ordered two 8 TB disks and shut the machine down while waiting.

Today, after checking the integrity of the new disks, I installed them in place of the two failed ones to try a rebuild despite the read problems on a third disk.

I hoped it would go through, but the rebuild, while in progress, is extremely slow (I won't let it run to the end as it is) and apparently full of errors. In addition, one disk (disk 17) shows "Unmountable: wrong or no file system".

 

Concretely, here is what I have on my array:


A lot (too much) of media, which is recoverable; losing it all would be annoying but not dramatic.
On the other hand, I have 1.5 TB of Nextcloud data (obviously without a backup; I never took the time to sort that out after my Google Drive ban), with personal/family documents and photos, which I'm much more worried about.
All my appdata is on the NVMe cache, so it is safe.

 

 

At the moment the rebuild is in progress, but it's too slow for me to let it run. Unless it speeds up, this approach won't work (especially since it seems to be producing a lot of errors). I don't know what to think: should I let it run or not?

 

I'm considering stopping the machine and making a new config without the 3 disks, telling Unraid that the config is valid (with no certainty that this works?), then taking the 3 disks aside and mounting them on another machine to recover their contents. I don't know whether Unraid splits files across disks or not; if it does, this won't be feasible.

 

Last possibility: start the machine with the disks mounted and copy off my Nextcloud data. Even if it's very slow and takes me a month, that would be manageable, unlike the current rebuild time.

 

Thank you for your help.

2022-11-04 17_49_08-Tower_Main - Vivaldi.png

2022-11-04 17_49_17-Tower_Main - Vivaldi.png

tower-diagnostics-20221104-1845.zip

Link to comment

Adaptec controller crashed:

 

Nov  4 13:16:18 Tower kernel: aacraid: Host bus reset request. SCSI hang ?
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Adapter health - -3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: midlevel-8
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: lowlevel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: error handler-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: firmware-1
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: outstanding cmd: kernel-0
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Controller reset type is 3
Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: Issuing IOP reset
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: IOP reset failed
Nov  4 13:17:33 Tower kernel: aacraid 0000:09:00.0: ARC Reset attempt failed

 

Start by rebooting and post new diags after array start.
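
For anyone who wants to check their own syslog for the same symptom, here is a minimal sketch that picks out the reset-related aacraid lines. It assumes the log lines have the same layout as the excerpt above; the function name `find_controller_resets` and the list of marker strings are illustrative choices, not part of Unraid or of the aacraid driver.

```python
import re

# Matches kernel lines emitted by the aacraid driver, for example:
# "Nov  4 13:16:18 Tower kernel: aacraid 0000:09:00.0: IOP reset failed"
# The PCI address is optional because the first line of the excerpt has none.
AACRAID_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"aacraid(?:\s+(?P<pci>[0-9a-f:.]+))?:\s+(?P<msg>.*)$"
)

# Substrings copied from the excerpt above that point at a reset attempt
# or a failed reset (an assumption: other driver versions may word these differently).
RESET_MARKERS = (
    "Host bus reset request",
    "Issuing IOP reset",
    "reset failed",
    "Reset attempt failed",
)


def find_controller_resets(lines):
    """Return (timestamp, message) pairs for reset-related aacraid lines."""
    hits = []
    for line in lines:
        match = AACRAID_LINE.match(line.strip())
        if match and any(marker in match["msg"] for marker in RESET_MARKERS):
            hits.append((match["stamp"], match["msg"]))
    return hits
```

Feeding it the lines of a saved syslog file (for example via `open(path).read().splitlines()`) gives a short timeline of when the controller tried to reset and whether the reset failed.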

Link to comment
51 minutes ago, JorgeB said:

Start by rebooting and post new diags after array start.

Thanks for the reply.

 

With or without the new 8 TB HDDs?

 

Link to comment

Well,

Unraid attempted a rebuild onto the two 8 TB drives after the array started. The trouble: one of the two disks is in error, and the other ran to the end (a strangely fast rebuild, with a lot of errors) but doesn't seem usable. It shows "Unmountable", probably because it was not forced at the beginning of the process. And during all that, disk 1 was put in error, although it should be good.

 

This doesn't look good to me.

 

2022-11-05 00_56_05-Window.png

2022-11-05 00_55_36-Window.png

tower-diagnostics-20221105-0039.zip

Link to comment
On 11/5/2022 at 10:56 AM, JorgeB said:

Still constant problems with the controller, make sure it's well seated or try a different slot if available, at least for now also disable IOMMU since it appears to be causing issues, then post new diags after array start.

OK,

I disabled IOMMU and removed everything else that was on PCI. Only the Adaptec card remains, which I moved to a different slot and which is firmly seated, plus the NVMe cache.

 

I did not dare start the array (even in maintenance mode), for fear of making things worse, because I have:

- disk 1, which appears as disabled by Unraid

- disk 5, which seems OK but is not formatted and, I guess, empty

- new disk 6, tested OK, which appears as disabled by Unraid.

 

I await your instructions.

Thank you for your help

Link to comment

No, I currently have no other card. I've had this one for several years without problems (or without noticing any). It used to manage my RAID 6, but since I switched to Unraid it only serves to present the disks individually. I had also put it in a particular mode; maybe that's the problem? I'm not sure I understood the issue: what's wrong with the card? What are the errors?

Link to comment

I'm not very familiar with those controllers, but after taking a second look it looks more like disk2 is causing the HBA-related timeouts. What disk(s) did you last replace before you got read errors from disk2? Do you still have those old disks intact? Was anything written to the array after those disks were removed?

Link to comment

I initially replaced disks 2 (2 TB), 5 (3 TB) and 6 (3 TB) with 6 TB drives. The replaced drives were functional but had 7 years of operation behind them, so I wanted to retire them for safety reasons... funny to write this now.

The rebuilds of disks 2 and 5 went OK; during the rebuild of disk 6 there was an error, and disk 2 failed.
The 3 original disks, 2 (2 TB), 5 (3 TB) and 6 (3 TB), have not been touched. They are set aside, and I know which is which (normally).

During the disk 6 rebuild error (or just after), I got a read error message for disk 2. It is this disk that seems to be destabilizing the array and, I guess, causing latency on the controller.

Since the disks were swapped, not much has been done on the array. The containers were running during the rebuilds, but everything is in /appdata on the NVMe cache. Only my Nextcloud data folder may have changed (I'm not sure).

The only big operations run since then were:
- a parity check, which I canceled because it was very, very slow
- attempts to rebuild the new disks, with many errors found
- at the beginning, starting the array to try to recover some files, but that was read-only

I guess you're going to suggest that I create a new configuration with the old disks in their original slots?

Link to comment
17 hours ago, Xili said:

I guess you're going to suggest that I create a new configuration with the old disks in their original slots?

Correct, if you have all 3 replaced disks you can do a new config with them, since parity should be mostly valid you can check "parity is already valid" before array start but should then run a correcting check.
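
To illustrate why parity is "mostly valid" after putting the old disks back, and what a correcting check then does, here is a toy sketch. It assumes a simplified single-parity model in which parity is the byte-wise XOR of the data disks; the second parity disk uses a different calculation that is not modelled here, and the function names are hypothetical, not Unraid commands.

```python
from functools import reduce


def xor_parity(disks):
    """Byte-wise XOR of equally sized disk images (simplified single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild_missing(surviving_disks, parity):
    """Reconstruct one missing disk from the remaining disks plus parity."""
    return xor_parity(list(surviving_disks) + [parity])


def correcting_check(disks, parity):
    """Recompute parity from the data disks.

    Returns (corrected_parity, mismatches), where mismatches counts the
    bytes at which the stored parity disagreed with the data disks.
    """
    expected = xor_parity(disks)
    mismatches = sum(1 for old, new in zip(parity, expected) if old != new)
    return expected, mismatches


if __name__ == "__main__":
    # Three tiny "disks"; old_d2 stands in for a disk that was later replaced.
    d1, old_d2, d3 = b"\x01\x02\x03", b"\x10\x20\x30", b"\x0f\x0f\x0f"
    parity = xor_parity([d1, old_d2, d3])

    # A rebuild reproduces the old disk's contents on its replacement.
    rebuilt_d2 = rebuild_missing([d1, d3], parity)
    assert rebuilt_d2 == old_d2

    # One byte is written to the replacement afterwards; parity follows it.
    new_d2 = b"\x10\x21\x30"
    parity = xor_parity([d1, new_d2, d3])

    # Putting the old disk back: parity only disagrees where that write landed.
    parity, mismatches = correcting_check([d1, old_d2, d3], parity)
    print("parity bytes corrected:", mismatches)
```

In this model the only parity errors come from writes made after the swap, which is why it matters whether anything was written to the array after the old disks were removed. Note that the correction brings parity back in line with the old disks, so those later writes are not recovered.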

Link to comment
2 hours ago, JorgeB said:

Correct, if you have all 3 replaced disks you can do a new config with them, since parity should be mostly valid you can check "parity is already valid" before array start but should then run a correcting check.

When you say before starting the array, do you mean I should start in maintenance mode with "parity is already valid" checked, and then run a parity check?

Link to comment
5 hours ago, JorgeB said:

No, there's no need to start in maintenance mode, after doing a new config there will be a checkbox with "parity is already valid" next to the array start button, just check that before first array start.

OK. Would it be a problem if I replaced my two 14 TB parity disks with 16 TB ones, so that I can reuse the 14 TB disks in the array very soon? Or is it better to start with the original parity disks first?

Link to comment
8 minutes ago, Xili said:

OK. Would it be a problem if I replaced my two 14 TB parity disks with 16 TB ones, so that I can reuse the 14 TB disks in the array very soon? Or is it better to start with the original parity disks first?

If you do this you must NOT tick the Parity is Valid checkbox as you want Unraid to build parity onto these drives.

Link to comment

OK. Strange, because whether or not I check the box, this message is shown on my parity disks: "All existing data on this device will be OVERWRITTEN when array is Started". Given what you're telling me, it might be safer to start with the original parity disks and then replace them one by one.

 

I admit that I'm being a bit cautious, and I'd like to take as little risk as possible.

Link to comment

OK, so I started the array with the new config. Before restarting anything else, I'm recovering what I didn't want to lose. It's quietly copying over the network to another machine at 90 MB/s; I hope there are no errors or corrupted files.

As for the rest, do you think I can resume normal use as before, or do I need to run some checks first?
A parity check seems necessary, I think.

Thank you for this first step
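
One way to check a copy like this for errors is to compare checksums between the source and the destination once the transfer finishes. The following is a minimal sketch, assuming both trees are reachable as local or mounted paths; the function names and any paths passed in are hypothetical. It only shows that the copy matches what was read from the array, not that the files on the array were intact to begin with.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """Return (missing, different): relative paths absent from the copy,
    and relative paths whose contents differ between source and copy."""
    source_root, copy_root = Path(source_root), Path(copy_root)
    missing, different = [], []
    for source_file in sorted(p for p in source_root.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_root)
        copied_file = copy_root / relative
        if not copied_file.is_file():
            missing.append(str(relative))
        elif file_sha256(source_file) != file_sha256(copied_file):
            different.append(str(relative))
    return missing, different
```

Calling `compare_trees(source_dir, copy_dir)` with the two directory roots returns two lists; both being empty means every source file exists in the copy with identical contents. Keep in mind that this reads every source file a second time, which puts extra load on disks that may already be struggling.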

Link to comment
