Drives dropping out of array into UD (split from Preclear Results)


TODDLT


I just came home and everything looked normal (other than one preclear being "failed").

 

I clicked the "stop preclear" button in Unassigned Devices for both drives.  Then everything went haywire again.  Drives are bouncing back and forth between the array and Unassigned Devices.  However, Plex is still running and recording with the DVR.  That recording will be over by the time I'm home and I'll stop everything, restart, and try to run the extended self tests.

 

1.  Once the drive is stopped, I can't open the drive from Unassigned Devices.  The link to do so is dead.  If the drive is in a pre-clear, then I can access the drive window but can't run the test.  How do I get into a device that is not assigned and run an extended self test?

 

2.  Is there any chance this is a software issue?  It seems completely wacky for a bad drive to put the unRAID Main page into a tailspin with drive assignments bouncing around like this, especially considering the whole array and Docker are still functioning.  Is it possibly some conflict between Unassigned Devices, preclear, and unRAID 6.6.6?
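
On question 1, one route that does not depend on the Unassigned Devices page is to start the extended self-test from a terminal with smartctl. The sketch below only builds the two commands involved (`smartctl -t long` to start the test, `smartctl -a` to read the report afterwards). The device path `/dev/sdX` is a placeholder, not a real device from this server, and the helper names are made up for illustration.

```python
import subprocess
from typing import List


def smart_long_test_cmd(device: str) -> List[str]:
    """Build the smartctl command that starts an extended (long) self-test."""
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/... path, got {device!r}")
    return ["smartctl", "-t", "long", device]


def smart_report_cmd(device: str) -> List[str]:
    """Build the smartctl command that prints the full SMART report."""
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/... path, got {device!r}")
    return ["smartctl", "-a", device]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout (needs root and smartctl installed)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    dev = "/dev/sdX"  # placeholder: substitute the unassigned drive's device node
    # Only print the commands here; call run(...) on the server itself.
    print(" ".join(smart_long_test_cmd(dev)))
    print(" ".join(smart_report_cmd(dev)))
```

The test runs inside the drive's own firmware, so the command returns immediately; the report command can be used later to see whether the test finished and what it logged.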

Edited by TODDLT
Link to comment

3 minutes ago, trurl said:

Sure you aren't having a problem with flash?

I can replace the flash; it is old.  I ran chkdsk on it at the start of this process and had no errors.  I know you can only do this every so often.  Is there any other way to be sure?

 

 

Also, I did try to play a movie and music from Plex, and they both still play.  Below are the Main and Dashboard pages.  The drives have stopped bouncing around now; they are parked in Unassigned Devices while everything still works.

 

Dashboard.JPG

Main.JPG

Edited by TODDLT
Link to comment

The inexplicable behavior continues.  I went to bed last night with 2 extended self tests running.  Everything looked normal.  This morning, the Main page would not populate at all, and the Dashboard would populate only enough to say "parity not present".  The "reboot" button would not respond, but the power button on the server put it through a safe shutdown process.

 

This time no parity check was running, just these two drives in a self-test scan, and the server went crazy.

 

It does look like the scans completed, and the reports are attached.  I have to run out the door but will swing back through the house in an hour or two and can look more closely at the reports myself.  I'm really struggling to see how the server going crazy is connected to the drives, but so far the only time this happens is when some process is running on one of these two drives via Unassigned Devices.  The server has had zero issues outside of that.

 

I'm stumped on this one, so any/all next steps are appreciated.

todd-svr-smart-20190227-0727.zip

todd-svr-smart-20190227-0728.zip

Edited by TODDLT
Link to comment
4 hours ago, johnnie.black said:

At least you know the disks are fine. Did you grab the diags (or the syslog) before shutting down?

No, when things were barely responsive I just shut down.  I was in a rush this morning and didn't stop to think; I should have tried.

 

Glad you didn't see anything in the report either.  I'm not an expert on those reports, but nothing jumped out at me.

 

4 hours ago, trurl said:

I would rather see the whole diagnostics.

 

Have you tried another port for flash? Preferably USB2.

Full diagnostics as it currently stands attached.

It's been in this specific port for years now.  Is there any reason to think the port would suddenly stop working?

 

 

Here is my plan, please critique or advise of a different pathway.

1. I'm going to initiate pre-clears from the utilities menu, the way the preclear plug-in runs on its own, i.e. not using Unassigned Devices.

2. If everything goes haywire/fails again, I'll attempt to pull logs and post here.

3. If the pre-clears pass, I'll re-run them with Unassigned Devices and pull logs, or I'll do another extended test and do the same.

todd-svr-diagnostics-20190227-1123.zip

 

 

Should I move this to the general support thread/page?

Edited by TODDLT
Link to comment

This is a relocated topic that started in the pre-clear report thread but has moved on from the original question.  I am going to try and summarize what is going on in the process.

 

A week(ish) ago I started trying to pre-clear 2 new drives in the same manner I've been doing it without issue for several years.  These are two drives attached externally via eSATA ports, which are in turn extended from 2 of the main MB SATA ports.  The first time through, one of the two drives failed and one completed.  On a subsequent try the reverse occurred.  At this point both drives have passed an extended SMART self-test, and those reports were reviewed under a separate thread.  (Hopefully they will get moved to this thread, or I will post them here later today.)

 

What also occurs after one of the drives fails is that the Main and Dashboard pages go into chaos.  When I first go to look, the drives bounce between the "array" and "unassigned devices", and both pages are slow to respond and populate.  Eventually all drives park themselves in UD and stop moving around.  Even more peculiar, in most instances the server is still functional.  With all my drives showing up in UD, Plex is still running, I'm recording on the DVR, and I can play movies, music, etc. via Plex.

 

The only fix for the drive assignments has been to reboot the server, after which everything returns to normal.  Here are some additional facts / actions taken along the way:

1. Running preclear fails on either or both of the drives.  The failure has occurred during the initial read or during zeroing.

2. Running the extended SMART test via the UD menu also locked up the Main page and required a reboot, but not until the test was over 50% complete, 3 or more hours in.

3. After starting the preclear or SMART test, everything stays normal for a period of hours; in the case of the pre-clears, the trouble started over 12 hours later.

4. I ran chkdsk on my flash drive, which is pretty old now, and it showed no errors.

5. In at least one occurrence of a failed pre-clear, a temp file had filled up.  I can't say this was the case every time.

6. I have replaced the PSU (which was 7 years old), as it seemed it might be a root cause, but that didn't solve the trouble.  The new PSU is an HX850 and should be more than sufficient.

7. I have started a pre-clear using the disk utilities menu instead of going through UD to see if it errors out again.  I'm trying to see if UD is the common thread, considering both the pre-clears and the extended SMART test were initiated in UD.

 

If the pre-clear fails again, I'll try to pull full diagnostics, but I'm not sure whether the server will be responsive on that page.  If it completes, I'll run another pre-clear from the UD menu and then try to pull diagnostics when/if it fails again.

 

I'm looking for any advice or troubleshooting ideas.  

 

Thanks!

Edited by TODDLT
Link to comment
Could you unplug both LSI HBAs (best to keep all drives powered), then connect one or two of the newly bought spare disks to the mainboard SATA ports, run a preclear, and check whether everything behaves normally?

Link to comment
7 minutes ago, Benson said:

Could you unplug both LSI HBAs (best to keep all drives powered), then connect one or two of the newly bought spare disks to the mainboard SATA ports, run a preclear, and check whether everything behaves normally?

I'll try anything to resolve this issue, but this would take down my server for a couple of days to run a preclear of 6TB drives.  Can you please help me understand what we are troubleshooting here?

The new drives are plugged into the main MB ports now.  No change there, so are we only talking about something going on with the LSI drivers?

I have precleared a number of drives with an LSI controller installed.  Previously I may have had a single LSI and a Supermicro SAS2LP.  The SAS2LP was replaced with a 2nd LSI in the past 6 months.  I could get a date and see if I have cleared a drive with both LSIs in the past (will confirm later tonight).

 

Please let me know what the thought process is here and I can make it a next step.  Each of these troubleshooting steps takes 1-2 days due to the length of a preclear, so I just want to be as productive as possible.  Another option would be to run an extended SMART Test, as it is a shorter process than the preclear.  Thoughts?

Edited by TODDLT
Link to comment

This looks like a hardware issue, since there was no major software or hardware change (just the addition of two new 6TB disks).

 

Unplugging both HBAs first lets you verify whether the basic components are working normally; if it still fails, the problem is already in the basic components. BTW, I notice the firmware on both HBAs is also quite old, so alternatively you could try updating both HBAs first.

 

My troubleshooting method is to "first" find a quick way to reproduce the problem rather than try any solution, because this shortens the troubleshooting time. After that, I do whatever is needed to isolate the problem.

 

Your problem does not seem to be in the preclear process itself, so it would be good if you could first find a quick way to trigger the problem.

 

40 minutes ago, TODDLT said:

The new drives are plugged into the main MB ports now. 

If the previous failures also happened in this setup, I would suggest skipping the HBA firmware update and just isolating both HBAs to troubleshoot.

 

40 minutes ago, TODDLT said:

Another option would be to run an extended SMART Test, as it is a shorter process than the preclear.   Thoughts?

From your troubleshooting history, I don't think the problem is related to any disk.

 

Edited by Benson
Link to comment
1 hour ago, Benson said:

This looks like a hardware issue, since there was no major software or hardware change (just the addition of two new 6TB disks).

 

There have been several updates to both Preclear and UD as well as unRAID itself since my last pre-clear was done.

 

1 hour ago, Benson said:

BTW, I notice the firmware on both HBAs is also quite old, so alternatively you could try updating both HBAs first.

 

I'll go look at their website.  It's been a while since I did this type of firmware update, I think I have to create another bootable USB with the firmware updates, disconnect all my drives and run the updates from the USB, correct?

 

1 hour ago, Benson said:

My troubleshooting method is to "first" find a quick way to reproduce the problem rather than try any solution, because this shortens the troubleshooting time. After that, I do whatever is needed to isolate the problem.

 

Your problem does not seem to be in the preclear process itself, so it would be good if you could first find a quick way to trigger the problem.

TOTALLY AGREE, but I haven't found anything else that triggers the problem and it even happens deep into the process.  Any ideas on how to find this?

 

I have already started a pre-clear from outside UD and will see how that goes tonight.  It will hit the zeroing stage in about 3 hours.

 

1 hour ago, Benson said:

From your troubleshooting history, I don't think the problem is related to any disk.

 

Agreed, unless it's caused by some combination of hardware/drivers.  I think there have been a few quirky issues discussed in this community over the years that were caused by a combination of things, not one specific hardware conflict.

My new drives are Toshiba N300's (NAS drive line) and the first of that series I have.

Edited by TODDLT
Link to comment

A bit more information:

 

The last set of pre-clears I ran was before I swapped the SAS2LP card for my 2nd LSI controller, so this is a first time for this exact configuration.  However, I did have one LSI in the rig at the time.

 

So far 30% into zeroing and no issues.  I'll know if they both make it to post read by morning.

 

I have noticed that since some recent unRAID upgrade the CPU seems to stay busier even at rest, i.e. bouncing up to 20-30%.  2 cores peg at 100% and that activity bounces from core to core.  Now with 2 pre-clears running the CPU usage is bouncing between 30 and 50%, and as many as 3 cores peg at a time.  I shut down all my Dockers and nothing changed.

 

It seems like something is going on in the background.  I really am not sure whether it was like this before this preclear issue started.  Is there some way to look at processes by CPU usage, like Task Manager on Windows?
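
For the Task Manager question: from a terminal, `top` gives an interactive view, and `ps -eo pid,pcpu,comm` gives a one-shot listing that can be sorted. Below is a rough sketch that parses that `ps` output and returns the busiest processes. The sample text and its process names are invented for illustration only. Note that the `%CPU` figure from `ps` is averaged over each process's lifetime, so it will not match a momentary spike the way `top` does.

```python
import subprocess
from typing import List, Tuple

Row = Tuple[int, float, str]  # (pid, %cpu, command)


def parse_ps(output: str) -> List[Row]:
    """Parse the output of `ps -eo pid,pcpu,comm` into (pid, cpu, command) rows."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 2)  # command names may contain spaces
        if len(parts) < 3:
            continue
        pid, pcpu, comm = parts
        rows.append((int(pid), float(pcpu), comm.strip()))
    return rows


def busiest(output: str, n: int = 5) -> List[Row]:
    """Return the n rows with the highest %CPU, highest first."""
    return sorted(parse_ps(output), key=lambda r: r[1], reverse=True)[:n]


def live_ps() -> str:
    """Capture a live listing (requires a Linux system with ps available)."""
    return subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        check=True, capture_output=True, text=True,
    ).stdout


# Invented sample output, used only to show the parsing.
SAMPLE = """\
  PID %CPU COMMAND
    1  0.0 init
 1234 98.7 example_worker
 2345 12.5 media server
"""

if __name__ == "__main__":
    for pid, cpu, comm in busiest(SAMPLE, 2):
        print(f"{pid:>6} {cpu:5.1f} {comm}")
```

On the server itself, `busiest(live_ps())` would replace the sample text.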

 

 

Link to comment
7 hours ago, TODDLT said:

I'll go look at their website.  It's been a while since I did this type of firmware update, I think I have to create another bootable USB with the firmware updates, disconnect all my drives and run the updates from the USB, correct?

Yes, copy the files to a USB stick. I like the UEFI method: there is no need to make the stick bootable, just format it as FAT32 and copy the necessary files:

 

2118it.bin

mptsas2.rom

sas2flash.efi

Shellx64.efi
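
Before rebooting into the UEFI shell, it can save a trip to confirm that all four files actually made it onto the stick. This is only a convenience sketch: the file names come from the list above, while the function name and the mount path in the usage note are made up. Names are compared case-insensitively because FAT32 does not distinguish case.

```python
from pathlib import Path
from typing import Iterable, List

# File names as listed above.
REQUIRED = ("2118it.bin", "mptsas2.rom", "sas2flash.efi", "Shellx64.efi")


def missing_files(stick_root: str, required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required file names that are not in the stick's top-level folder."""
    present = {p.name.lower() for p in Path(stick_root).iterdir() if p.is_file()}
    return [name for name in required if name.lower() not in present]
```

For example, `missing_files("/mnt/usbstick")` (a hypothetical mount point) returns an empty list when everything is in place.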

 

7 hours ago, TODDLT said:

My new drives are Toshiba N300's (NAS drive line) and the first of that series I have.

I have about 10 Toshiba hard disks, all 6TB (MD04/MG04 series), and they are working well. To be honest, though, I haven't used preclear in a long time; my experience dates from when there were two different branches of the preclear plugin.

Edited by Benson
Link to comment
7 minutes ago, TODDLT said:

I have noticed that since some recent unRAID upgrade the CPU seems to stay busier even at rest, i.e. bouncing up to 20-30%.  2 cores peg at 100% and that activity bounces from core to core.  Now with 2 pre-clears running the CPU usage is bouncing between 30 and 50%, and as many as 3 cores peg at a time.  I shut down all my Dockers and nothing changed.

 

It seems like something is going on in the background.  I really am not sure whether it was like this before this preclear issue started.  Is there some way to look at processes by CPU usage, like Task Manager on Windows?

I will add to this, the activity level stays up even with all the drives spun down.

Link to comment
8 hours ago, TODDLT said:

TOTALLY AGREE, but I haven't found anything else that triggers the problem and it even happens deep into the process.  Any ideas on how to find this?

Do you run periodic parity checks (which keep the disks and controllers under load for a long period)? Do they complete without problems?

 

Info: I run a monthly parity check, plus an extended SMART test on newly added disks, on 2 Unraid systems each equipped with a single LSI HBA (with UD but no preclear plugin), without problems.

Edited by Benson
Link to comment

Morning Report:

 

I never got a "preclear failed" email, but this morning everything was in chaos again.

- No drives showing up as assigned, all in UD.  Sometimes it changes.

- bottom left hand corner status shows the array bouncing between "started" and "undefined"

- I CAN access shares

- Plex works

- When I click the diagnostics button, nothing happens. (screen blinks)

- Main page and Dashboard info populates intermittently: sometimes it is very slow to respond, sometimes it shows only partial info, and it is not consistent.

 

Is there anywhere to go manually grab the log from the flash before rebooting?
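
If a terminal (SSH or local console) still responds when the web page does not, one option is to copy the running log onto the flash drive by hand before rebooting. The sketch below assumes the usual Unraid layout, where the live log sits in RAM at `/var/log/syslog` and the flash drive is mounted at `/boot`; both paths are assumptions and are left as parameters, and the function name is made up.

```python
import shutil
import time
from pathlib import Path


def save_syslog(src: str = "/var/log/syslog", dest_dir: str = "/boot/logs") -> Path:
    """Copy the live syslog to a timestamped file in dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest / f"syslog-{stamp}.txt"
    # copyfile copies contents only, which avoids permission metadata
    # that a FAT-formatted flash drive cannot store.
    shutil.copyfile(src, target)
    return target
```

Calling `save_syslog()` with the defaults would leave a copy under `logs` on the flash drive that survives the reboot and can be attached to a post afterwards.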

Link to comment
