Multiple disks failing (all Toshiba disks) and multiple disks with errors


abhi.ko


Just now, trurl said:

Just reviewed the thread. Did you ever get your hardware problems resolved? Probably better to not attempt any rebuild if you are still having power problems.

Thanks for reading through.

 

To answer your question, YES, I did switch to a higher single rail amperage rated power supply (Seasonic GS-1300W) and have distributed load as efficiently as currently possible.

  • Old wiring: all of the backplanes were connected to one molex cable.
  • Current scenario: the load is distributed between 2 cables back to the PSU. One cable has 4 backplanes (15 disks: 14 HDDs and one cache SSD) and the other has 2 (8 disks), since only 2 cables came with the PSU.
  • Future (ideal) scenario: a 1:1 connection from each backplane to the PSU. I have requested more molex cables from Seasonic, since I don't want to mix cables even from my other Seasonic PSUs, and will re-wire as soon as I have them.

Normally you could rebuild 2 disks at once with dual parity, but I think it would be better to take them one at a time since we aren't happy with disk19 yet.

 

And while it would probably be OK to rebuild disk12 to the same disk, it's always safer to rebuild to a spare and keep the original as it is in case of problems.

 

After disk12 is rebuilt we can reconsider what to do about disk19.

10 minutes ago, trurl said:

Normally you could rebuild 2 disks at once with dual parity, but I think it would be better to take them one at a time since we aren't happy with disk19 yet.

 

And while it would probably be OK to rebuild disk12 to the same disk, it's always safer to rebuild to a spare and keep the original as it is in case of problems.

 

After disk12 is rebuilt we can reconsider what to do about disk19.

Thank you. I don't have a spare hot-swap drive available in the case currently, and I would rather not wait another day or two to pre-clear the drive and then rebuild onto that with 2 disabled disks, unless you think that is the way to go.

 

Just curious, what do we do with the lost+found on Disk 19?

Edited by abhi.ko
44 minutes ago, abhi.ko said:

OK, so just to clarify next steps:

 

For Disk 19: Do I just rename the numbered directories to the original share names? How do I change the disk from the current disabled status?

 

For Disk 12: what do I do?

Yes for disk19.
 

As @trurl mentioned you need to work out if you have any residual hardware problems.

 

The process for recovering disabled disks is covered in the online documentation, accessible via the ‘Manual’ link at the bottom of the GUI.


Rebuild is usually the correct way to proceed instead of resyncing parity. That way if anything was written to the emulated disks those writes can be recovered. Failed writes are the reason the disks were disabled to begin with, and those failed writes and any subsequent writes to the disks were emulated by updating parity. The reason Unraid must disable (kick out of the array) the disks is because they are no longer in sync.
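To make the emulation point concrete, here is a minimal sketch using plain XOR parity on tiny made-up "disks". It is a simplification: it models only a single XOR parity, not Unraid's second parity or its actual driver, and every name and byte value in it is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks"; real disks are just much longer byte strings.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])  # this one gets disabled
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
parity = xor_blocks([disk1, disk2, disk3])

# With disk2 disabled, its contents are emulated from parity plus the other disks.
emulated = xor_blocks([parity, disk1, disk3])

# A write to the emulated disk never reaches the physical disk2.
# It is recorded by updating parity only: remove the old data, add the new.
new_data = bytes([0x01, 0x02, 0xFF, 0x04])
parity = xor_blocks([parity, emulated, new_data])

# A rebuild reproduces the emulated contents, including the later write.
rebuilt = xor_blocks([parity, disk1, disk3])
```

In this sketch `rebuilt` equals `new_data`, while the physical `disk2` still holds the stale bytes. Resyncing parity from the physical disks instead would discard that emulated write.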

 

Rebuild will be a good test of the new disk and a good test of your hardware changes.

 

If that goes well we can see what physical disk19 looks like as an Unassigned Device and compare that to the repaired emulated disk19.

 

 

 

 


Thank you both. So I plan to do this for now:

  1. Shutdown server
  2. Add the new 10TB drive
  3. Preclear the 10TB drive
  4. Once done, unassign disk 19 from current drive and assign new precleared drive instead
  5. Start array and rebuild

Does that sound good? Or should I rebuild disk 12 to the new one?

Can I re-assign without preclearing and start rebuilding immediately?

16 minutes ago, abhi.ko said:

Does that sound good? Or should I rebuild disk 12 to the new one?

It's really up to you. As you are using a new drive, the good thing is that you will still have the old drive available just in case anything goes wrong with the rebuild, so keep it intact until you are happy the rebuild worked as expected.

 

18 minutes ago, abhi.ko said:

Can I re-assign without preclearing and start rebuilding immediately?

Preclear is NEVER required with Unraid.    The only real reason to use it is to act as a stress test of the drive before adding it to the array.  

38 minutes ago, trurl said:

Whichever disk you rebuild, the expected result is exactly what you have with the emulated disks. So, rebuilding disk19 won't change the fact that all of that is in lost+found.

Ah...thanks for clarifying that. I was incorrectly assuming that the rebuild was going to restore the disk as it should be and a physical move of the lost+found folders won't be necessary.

 

51 minutes ago, itimpi said:

Preclear is NEVER required with Unraid.    The only real reason to use it is to act as a stress test of the drive before adding it to the array.  

Yes, but would skipping preclear and starting the rebuild directly save time? I always understood that Unraid does its own version of stress testing on a new drive that has not been precleared, even though I have never added one without preclearing.

1 hour ago, abhi.ko said:

I always understood that Unraid does its own version of stress testing on a new drive, which is not precleared, even though I have never added one without pre-clearing ever.

Don't know where you got that idea.

 

If you don't preclear a disk, Unraid will clear a disk when a clear disk is required. The only time a clear disk is required is when adding a disk to a new data slot into an array that already has valid parity. The disk is cleared (all zeros) so parity remains valid. A clear disk is not required in your situation since you aren't adding a disk to a new slot.
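The "all zeros so parity remains valid" point follows from how XOR works: XOR-ing in a zero changes nothing. A small sketch, assuming simple single XOR parity and made-up two-byte disks (all names and values here are hypothetical, not Unraid internals):

```python
def xor_parity(disks):
    """Compute byte-wise XOR parity across equal-length disks."""
    parity = bytearray(len(disks[0]))
    for disk in disks:
        for i, value in enumerate(disk):
            parity[i] ^= value
    return bytes(parity)


# Hypothetical existing array with valid parity.
existing = [bytes([0x12, 0x34]), bytes([0xAB, 0xCD])]
parity_before = xor_parity(existing)

# Adding a cleared (all-zero) disk in a new slot leaves parity unchanged.
cleared = bytes(2)
parity_with_cleared = xor_parity(existing + [cleared])

# Adding a disk that still holds any non-zero data would invalidate parity.
uncleared = bytes([0x00, 0x01])
parity_with_uncleared = xor_parity(existing + [uncleared])
```

Here `parity_with_cleared` matches `parity_before`, and `parity_with_uncleared` does not. A rebuild into an existing slot overwrites the whole disk anyway, which is why no clear is needed in that case.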

1 hour ago, trurl said:

Rebuild can only build what is already there emulated by the rest of the array. How could it be otherwise? 

Got it. Was probably a brain fart moment, not sure what I was thinking that the lost+found was not written to parity somehow. Clear now. 

1 hour ago, trurl said:

Don't know where you got that idea.

 

If you don't preclear a disk, Unraid will clear a disk when a clear disk is required. The only time a clear disk is required is when adding a disk to a new data slot into an array that already has valid parity. The disk is cleared (all zeros) so parity remains valid. A clear disk is not required in your situation since you aren't adding a disk to a new slot.

Oh okay, that probably was what I was thinking about.

On 2/13/2022 at 2:44 PM, itimpi said:

It's really up to you. As you are using a new drive, the good thing is that you will still have the old drive available just in case anything goes wrong with the rebuild, so keep it intact until you are happy the rebuild worked as expected.

Thank you @itimpi and @trurl.

 

I started rebuilding the data disk (disk 19) with the new drive yesterday (2/15), and everything is going well, except that I noticed this morning that the log is filling up. It looks like it is mostly filled with error messages from 2/14, when the array was not even started but the preclear was running. To me it looks like a backup trying to access a share that was not live at that time: the client IP in the error message below is that of my main workstation, which I back up to the server using Paragon Backup, so I think that is what it is about. I might be wrong.

 

There are some other call trace errors in there as well and I am not sure if there is still a hardware issue that is causing errors in the log since the log is not being written to now, hence this post here to seek some expert help.

 

Feb 14 02:32:51 Tower nginx: 2022/02/14 02:32:51 [error] 11687#11687: *323602 limiting requests, excess: 20.842 by zone "authlimit", client: 10.0.0.232, server: , request: "PROPFIND /login HTTP/1.1", host: "tower"

Feb 15 21:12:58 Tower kernel: Call Trace:
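One way to check which source is actually filling the log is to tally lines per day and per program. The sketch below assumes the usual "month day time host program: message" syslog layout seen in the two lines above; the regex and function name are my own illustration, not an Unraid tool.

```python
import re
from collections import Counter

# Assumed layout: "Feb 14 02:32:51 Tower nginx: message" (optional "[pid]" after program).
SYSLOG_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+(?P<program>[^:\[\s]+)(?:\[\d+\])?:"
)


def tally_sources(lines):
    """Count log lines per (date, program); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match:
            key = (f"{match['month']} {match['day']}", match["program"])
            counts[key] += 1
    return counts


# The two lines quoted in this post, used as sample input.
sample = [
    'Feb 14 02:32:51 Tower nginx: 2022/02/14 02:32:51 [error] 11687#11687: '
    '*323602 limiting requests, excess: 20.842 by zone "authlimit", '
    'client: 10.0.0.232, server: , request: "PROPFIND /login HTTP/1.1", host: "tower"',
    "Feb 15 21:12:58 Tower kernel: Call Trace:",
]
counts = tally_sources(sample)
```

Fed the full log instead of the sample, `counts.most_common(5)` would show whether the nginx "limiting requests" lines or the kernel call traces account for most of the space.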

I have posted the diags here, though I am not sure how helpful they are, since it looks like the log has been full and not written to since last night.

Is there any way to clear the log file without restarting in the middle of the rebuild? Should I even try to?

2022-02-16 07_07_16-Tower_Dashboard.png

tower-diagnostics-20220216-0643.zip

Edited by abhi.ko
5 hours ago, trurl said:

You would probably have to restart syslog in addition to deleting some of it. The logs are at /var/log.

 

Haven't tried this myself, but see here:

 

Thank you. I think I'm going to wait till this disk finishes rebuild. I am not seeing any other issues in the array (outside of the log) as I did before. Like disks with reallocated sector counts going up etc. 

Will rename the top folders in lost+found in the disk after it is done and then restart the array before I start rebuilding disk 12. That should clear the log file and will monitor it closely after that. 

Should I rebuild disk 12 onto itself, since the file system check did not create a lost+found in that disk, I am assuming that the disk itself is okay and it was the voltage fluctuations causing the issues with the disk dropping off? 

Is that a good plan? Or should I rebuild it into the old disk from disk 19 slot? 

 

52 minutes ago, abhi.ko said:

Thank you. I think I'm going to wait till this disk finishes rebuild. I am not seeing any other issues in the array (outside of the log) as I did before. Like disks with reallocated sector counts going up etc. 

Will rename the top folders in lost+found in the disk after it is done and then restart the array before I start rebuilding disk 12. That should clear the log file and will monitor it closely after that. 

Should I rebuild disk 12 onto itself, since the file system check did not create a lost+found in that disk, I am assuming that the disk itself is okay and it was the voltage fluctuations causing the issues with the disk dropping off? 

Is that a good plan? Or should I rebuild it into the old disk from disk 19 slot? 

 

 

 

If I were in the same situation I would try to reuse the old disk, because your problem wasn't the disk itself; it was your power supply.

If I really wanted that many drives I would go straight for 12-16TB drives. xD

I just got a Corsair RMx 550 with 12 disks attached to two HBAs. No problems at all, so maybe its rails are better? xD

Anyway, try rebuilding onto your "old" disk12. It should work. A friend had the same trouble with a faulty motherboard, and even then we were able to rebuild onto the old disks without a problem.


Just posting an update and asking for help on another issue - so disk 19 got rebuilt to a new drive without any errors, and all the lost and found items were moved to right directories, everything looks good. Thanks @JorgeB @itimpi @trurl

I got the additional cables and now all the backplanes are connected 1 tray to one molex connector on the PSU, so 4 drives to one connector, thanks @Michael_P @Vr2Io

I am currently rebuilding Disk 12 onto the old Disk 19 drive, but my log is filling up with call trace errors and I am not sure why. The latest diags are posted below; could someone take a look and tell me what the issue seems to be? I have not noticed any other consequences of the errors yet, other than the log filling up.

tower-diagnostics-20220217-2159.zip

Edited by abhi.ko

Check the CPU (10980XE) temperature. Was it throttling before, or is this new?

 

Feb 17 21:35:13 Tower kernel: CPU29: Core temperature is above threshold, cpu clock is throttled (total events = 4)
Feb 17 21:35:13 Tower kernel: CPU11: Core temperature is above threshold, cpu clock is throttled (total events = 4)
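To see how widespread the throttling is, the kernel messages can be summarised per core. This is only a sketch that matches the exact message wording quoted above; the function name is hypothetical.

```python
import re

# Matches the kernel message wording quoted in this post.
THROTTLE = re.compile(
    r"kernel: CPU(\d+): Core temperature is above threshold, "
    r"cpu clock is throttled \(total events = (\d+)\)"
)


def throttle_events(lines):
    """Return {cpu_number: highest 'total events' value seen for that core}."""
    events = {}
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            cpu, total = int(match.group(1)), int(match.group(2))
            events[cpu] = max(total, events.get(cpu, 0))
    return events


# The two lines quoted above, used as sample input.
sample = [
    "Feb 17 21:35:13 Tower kernel: CPU29: Core temperature is above threshold, "
    "cpu clock is throttled (total events = 4)",
    "Feb 17 21:35:13 Tower kernel: CPU11: Core temperature is above threshold, "
    "cpu clock is throttled (total events = 4)",
]
summary = throttle_events(sample)
```

A count that keeps climbing across many cores during the rebuild would point at cooling rather than at the disks.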

Edited by Vr2Io
  • 1 month later...

A BIG thank you to everyone who helped. The issue is fixed now, I wanted to wait a month to see if anything got messed up again. Luckily no issues so far and hence I hope this is resolved and I will mark it as closed.

 

The log filling up issue stopped happening after another reboot, and the system is running much cooler. I was going to replace the existing CPU cooler with something beefier but did not need to. Even while the log was at 100% the system was running pretty solidly, so I decided to wait and see, and there have been no issues since that reboot. I upgraded to rc3 and, to the best of my knowledge, everything is still running smoothly.

 

Thanks guys again for the assist and the education.
