
Red Ball (X) Issue - Please help


TODDLT


OK, seems this is my month for red-ball drives.  I've had two this month, but considering these are the only two drives that have ever done this to me, it's a pretty rare event.  Hopefully this is it.

 

There are a few peculiarities going on here that may or may not be relevant to the red-ball drive but I'm going to just list out points that I do know and then I'm looking for a "what now."

 

  • Earlier tonight I watched a movie remotely on Plex from unRAID, no issues.
  • Tonight I was re-installing software on a PC from files housed on my unRAID server.  Looking back now, it seems at one point the response time to run anything increased, and I recall one particularly long hesitation where I heard all the drives spin up.   I should have looked, but I didn't.  Everything kept working, so on I went.  Note: I don't actually know this is when the drive red-balled; it's just me looking back and thinking it might have been.
  • Now all that done I went to log into Plex from my PC and got a no response on the IP address.
  • I went to the unRAID main page and saw a red-ball on the only SSD in my array.  Samsung_SSD_850_EVO_500GB_S21HNXAG454345K - 500 GB (sdf).  This is the drive that housed the software I was installing earlier.  The only connection to Plex is that my music collection, which is cataloged by Plex, is stored on the SSD.
  • I went to the Dashboard page in Plex and found the "Apps" section didn't even exist.  Nothing was shown for the 3 dockers I have installed; the whole section of the page was gone, as if I didn't have any dockers installed or running.
  • I tried to click on the Dockers Tab.  Nothing would open, nothing happened.  All other tabs seemed fine.
  • Then I noticed the parity check run 2 days ago resulted in 5 errors being found.  This may be the first, but no more than the 2nd parity error found in 6 years.
  • I can't seem to pull any SMART reports from the SSD.  It says it's not spun up, but we know SSDs don't spin up.  I have checked cables and didn't see any obvious problems, but unRAID has now disabled the drive so I can't test that out.

 

A reboot of the server seemed to fully restore dockers and Plex.  The red-ball, as you would expect, remains.  Stupid me, I didn't pull a syslog before rebooting.

 

The data on the SSD is particularly important to me.  It's all of our family photos, family videos, my music collection, and important documents in addition to various software applications.  It is backed up in 2 other places so if corrupted I can recover, but the question is how do I know?  The backups are current in just about everything except some very recent changes to software, and possibly a few recent music additions.  That could all be fixed.

 

Ok so a few questions:

  1. Would unRAID have automatically corrected for the parity check errors?  If it has, is this now possibly corrupt parity?
  2. Is there any way to test the SSD?
  3. Should I just un-assign and re-assign the drive to see if it will work?  What steps should I follow at this point?
  4. Is there any way to figure out if the parity check errors are due to the SSD going bad, and would that now mean my parity is corrupt?  If that's the case, then what now?

 

What steps would you take right now?  

 

One thought would be to either re-assign this SSD (see if it's OK), or just buy a new one.  Recover the data from parity, but then copy over everything from my backup.  This would take a while, but I think it would ensure none of what is backed up is left corrupted by the potentially corrupted parity.  This would cover nearly everything currently on the SSD, save a few more recent additions that would simply be "at risk" for this corruption.

 

Thoughts?   Please help, I don't want to screw this up.

 

 

Thanks much to all.

 

 

 

 


Unless you have checksums, there's no way of knowing whether the rebuilt disk is going to be corrupt.  Since SSDs rarely fail, and if you don't have checksums, I would assume the SSD is OK: test it and make sure it mounts, and if all is well, re-sync parity instead of rebuilding.
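For anyone reading later who wants checksums in place before the next red-ball, here is a minimal sketch of the idea in Python. It is not an unRAID feature or plugin; the function names (`sha256_of`, `build_manifest`, `changed_files`) are made up for illustration, and it assumes the disk or share is mounted somewhere Python can read it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Files present in both manifests whose digest no longer matches."""
    return sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
```

The idea is to save a manifest while the data is known to be good, then build a fresh one from the rebuilt (or emulated) disk and pass both to `changed_files`. Files that were legitimately edited in between will also show up, so the list is a set of candidates to inspect rather than proof of corruption.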

15 minutes ago, johnnie.black said:

Unless you have checksums, there's no way of knowing whether the rebuilt disk is going to be corrupt.  Since SSDs rarely fail, and if you don't have checksums, I would assume the SSD is OK: test it and make sure it mounts, and if all is well, re-sync parity instead of rebuilding.

 

This makes sense, and appreciate the logic.  

  1. What would cause an SSD to redball if it's still good (other than a cable issue)?
  2. How do I test the SSD if I can't run smart diagnostics on it while red-balled?
  3. What is the process to reassign a drive that is redballed without triggering a data re-build and telling it to re-sync parity instead?

Thanks.

3 hours ago, TODDLT said:

 

  1. How do I test the SSD if I can't run smart diagnostics on it while red-balled?
  2. What is the process to reassign a drive that is redballed without triggering a data re-build and telling it to re-sync parity instead?

You CAN get the SMART info for a red-balled disk as long as it is online.   If it has dropped offline then you need a reboot.  If that works and the SMART data looks good then maybe it was just a loose cable (power or SATA) that caused the original red-ball.

 

if you do not want to rebuild the data disk then you can

- do a New Config via Tools menu and select the option to keep the current assignments

- on the Main tab tick the Parity is Good option to start the array in its ‘as is’ state

- run a parity check to correct any parity errors

note that this will lose any writes that happened after the SSD red-balled so be sure you are happy with that.

3 hours ago, TODDLT said:

What would cause an SSD to redball if it's still good (other than a cable issue)?

 

Usually cable or controller

 

3 hours ago, TODDLT said:

How do I test the SSD if I can't run smart diagnostics on it while red-balled?

 

SMART tests are not always useful on SSDs; the SMART report usually shows any problems.

1 hour ago, itimpi said:

You CAN get the SMART info for a red-balled disk as long as it is online.   If it has dropped offline then you need a reboot.  If that works and the SMART data looks good then maybe it was just a loose cable (power or SATA) that caused the original red-ball.

 

if you do not want to rebuild the data disk then you can

- do a New Config via Tools menu and select the option to keep the current assignments

- on the Main tab tick the Parity is Good option to start the array in its ‘as is’ state

- run a parity check to correct any parity errors

note that this will lose any writes that happened after the SSD red-balled so be sure you are happy with that.

 

Thanks, I must not have tried after the reboot because it did let me run the test.  Test attached but it pretty much shows nothing of use or consequence.  (please correct me if I'm missing something)

 

1 hour ago, johnnie.black said:

 

Usually cable or controller

 

SMART tests are not always useful on SSDs; the SMART report usually shows any problems.

 

The cables all seemed tight and didn't budge.  They do not lock due to the way this drive is mounted and I'm working on solving that, but there was no movement in the cables when I checked them.  Is there any way to make sure the drive isn't somehow failing (rare as I know that would be for an SSD)?  Would you just trust it?  The combination of the drive dropping out yesterday and parity errors 2-3 days ago has me concerned.  

 

Samsung_SSD_850_EVO_500GB_S21HNXAG454345K-20170705-0830.txt


SMART looks fine.  In my experience the 850 EVO is extremely picky about cables: for a while I had 8 in RAID10 and it was very rare to go a day without 1 or 2 ATA errors.  Replacing the cables sometimes helped, other times not, and I ended up removing them from the pool.

48 minutes ago, johnnie.black said:

SMART looks fine.  In my experience the 850 EVO is extremely picky about cables: for a while I had 8 in RAID10 and it was very rare to go a day without 1 or 2 ATA errors.  Replacing the cables sometimes helped, other times not, and I ended up removing them from the pool.

 

Wow, seems odd for a drive type to be plagued with that problem.  I have a new bracket to mount the SSD drives in the case that will allow me to connect the cable direct to the drive (cache drive is an SSD too).  I THINK the SATA data will at least lock on the drive unlike the current configuration.

 

This drive has been in the pool for probably the 1.5 yrs it's shown power on and I have not seen a single issue to date.

 

That being said, is there any way to verify anything on this drive before telling the array I fully trust it and re-create the parity?  Given the scenario above and the parity check errors 2 days ago, would you roll those dice?  Would you trust the parity to not be corrupt given the 5 errors found or the drive that you can't currently read to not have a problem?

 

Did the scheduled parity check automatically update the parity or just flag the errors?  Am I correct in remembering this is an option when you use the scheduler to set up the monthly check?  If so I probably had it unchecked.  IE not to correct parity.  It's been a while since I set this up and not in front of the box now.

1 hour ago, TODDLT said:

Did the scheduled parity check automatically update the parity or just flag the errors?

 

Depends on whether the scheduled check is set to correcting or non-correcting.

 

For the rest, you're asking questions that are impossible to answer without at least seeing the diags from when the errors happened.

2 minutes ago, johnnie.black said:

 

Depends on whether the scheduled check is set to correcting or non-correcting.

 

For the rest, you're asking questions that are impossible to answer without at least seeing the diags from when the errors happened.

 

Yes pretty irritated with myself for not thinking to pull diagnostics before the reboot.

 

Last night I copied the files which had been added to the array since my last backup ran over into the backup system.  At this point if there is anything not in the backup, it's minor and I'm not overly concerned about it.

 

I'm going to look at lunch and see if the scheduled parity is correcting or not.  I think not.  

 

3 hours ago, johnnie.black said:

 

Depends on whether the scheduled check is set to correcting or non-correcting.

 

For the rest, you're asking questions that are impossible to answer without at least seeing the diags from when the errors happened.

 

I checked at lunch and I was correct: the monthly parity check is non-correcting.

 

I guess I have two options:

 

  1. Restore the disk from parity, which would reflect everything on the drive up until the point it failed, presumably yesterday.  The wild card here is that I don't know what caused those 5 errors, whether it is something on the SSD, or how it might affect the restored data.   I can copy back literally 98% or more of what is on that SSD using the backups that I have.  Since the copy would only overwrite the files the backup contains, this would substantially reduce the scope of what could potentially be corrupted to the files added since the last backup last month (not much has changed since then).
  2. Use the process outlined above by itimpi to trust the data on the SSD and re-create the parity.  I have the same ability to copy over from the backup once the parity is complete.  The potential benefit here is that if the 5 parity errors were due to something on the SSD, then the copy would hopefully correct it, or at least narrow the scope of a possible problem to a smaller group of files.
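To make the trade-off between the two options concrete, here is a toy model in Python. It assumes plain single parity, where each parity byte is the XOR of the bytes at the same offset on every data disk; the three "disks" are just short byte strings, and all names are made up for the example. It says nothing about which disk the 5 real errors relate to.

```python
from functools import reduce


def xor_parity(disks: list[bytes]) -> bytes:
    """Byte-wise XOR across equally sized disks (simplified single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild_missing(parity: bytes, surviving: list[bytes]) -> bytes:
    """Reconstruct the one missing disk from parity plus every surviving disk."""
    return xor_parity([parity, *surviving])


def differing_offsets(a: bytes, b: bytes) -> list[int]:
    """Offsets at which two equally sized byte strings disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy 13-byte "disks"; disk2 stands in for the red-balled SSD.
disk1 = b"family photos"
disk2 = b"music library"
disk3 = b"documents 123"

good_parity = xor_parity([disk1, disk2, disk3])

# Hypothetical fault: one parity byte is wrong at offset 4.
bad_parity = bytearray(good_parity)
bad_parity[4] ^= 0x01

rebuilt = rebuild_missing(bytes(bad_parity), [disk1, disk3])
print(differing_offsets(rebuilt, disk2))
```

In this model a wrong parity byte damages the rebuilt disk at the same offset and nowhere else, which is why a handful of sync errors would be expected to touch only a handful of locations on a rebuilt disk. Re-creating parity from the data disks instead bakes in whatever the data disks currently hold, right or wrong.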

Any thoughts on one option over the other?

 

 

Parity Check.PNG


Difficult to say which option is better.  If you had checksums you could just scrub/check the emulated disk and it would be an easy answer.

 

If you have a spare SSD (or HDD) you could rebuild to that and then use a file binary compare utility to find any different files, all equal files would be OK for sure.
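As a rough illustration of what such a binary compare does, here is a sketch using Python's standard `filecmp` module with `shallow=False`, which reads and compares file contents rather than trusting size and timestamp. The function name `compare_trees` and the report keys are invented for this example; the two arguments would be wherever the original disk and the rebuilt disk happen to be mounted.

```python
import filecmp
from pathlib import Path


def compare_trees(left: Path, right: Path) -> dict[str, list[str]]:
    """Compare two directory trees byte for byte and group the results."""

    def relative_files(root: Path) -> set[str]:
        return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

    left_files, right_files = relative_files(left), relative_files(right)
    report: dict[str, list[str]] = {
        "identical": [],
        "different": [],
        "only_left": sorted(left_files - right_files),
        "only_right": sorted(right_files - left_files),
    }
    for name in sorted(left_files & right_files):
        # shallow=False forces a content comparison instead of a stat() check.
        same = filecmp.cmp(left / name, right / name, shallow=False)
        report["identical" if same else "different"].append(name)
    return report
```

Everything under `identical` is byte-for-byte equal on both disks. Entries under `different`, `only_left`, or `only_right` still need a human decision, since a file may simply have been written after the disk dropped out.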

16 minutes ago, johnnie.black said:

Difficult to say which option is better.  If you had checksums you could just scrub/check the emulated disk and it would be an easy answer.

 

If you have a spare SSD (or HDD) you could rebuild to that and then use a file binary compare utility to find any different files, all equal files would be OK for sure.

I might have a way to do that.  I have an old 2 TB drive I had taken out of service but it was still functional.  

 

Is there a way to roll the 2TB back to a 500 GB to swap it back later?  I think I'd have to go through a lot of gyrations to pull that off.  IE copy everything off the drive, remove it from the pool, recreate the parity with one less drive, then add it back in at 500 GB? 

 

As another option is there a way to mount that drive outside of the array to do the comparison?

 

If you have any suggestions for a utility to compare files at a binary level, please let me know.  I'll see what I can do on another 500 GB hard drive.  

 

thanks again for all the help and time.

6 minutes ago, TODDLT said:

Is there a way to roll the 2TB back to a 500 GB to swap it back later?

 

Yes, new config followed by a parity sync

 

7 minutes ago, TODDLT said:

As another option is there a way to mount that drive outside of the array to do the comparison?

 

The Unassigned Devices plugin will do.  You'll just need to change the UUID of one of the disks, since the UUIDs will be identical and only one disk will mount like that.

 

8 minutes ago, TODDLT said:

If you have any suggestions for a utility to compare files at a binary level, please let me know.

 

Rsync on unRAID, or SyncBackFree if you prefer to use a Windows util.

3 hours ago, johnnie.black said:

 

Yes, new config followed by a parity sync

 

Rsync on unRAID, or SyncBackFree if you prefer to use a Windows util.

 

So the process would be:

  1. Replace the SSD with my 2TB HDD.  Rebuild from Parity.
  2. Do a comparison between the newly installed HDD and my backup drive.
  3. Use the Unassigned Devices plugin and do the same comparison between the SSD and the backup drive.  (This would still be available as a "share", correct?)
  4. Find whatever files are different, and get the SSD updated to the correct, un-corrupted files.  There can't or shouldn't be any large differences, unless there is more corruption than 5 parity sync errors would lead me to believe.
  5. Do a new config, un-assign the HDD, assign the SSD, and re-build parity.

I think the benefit here is I can recover whatever was written since the drive red-balled as well.

 

If I'm comparing to the backup drive, that is installed in a Windows machine, so don't I need to use a Windows-based utility and access the HDD/SSD by network share?  I've never used rsync, but I'm assuming both drives would need to be inside unRAID?

 

 


You can also compare with your backup, but I meant compare the files from the SSD with the rebuilt disk.  All files that match on both will be OK; for files that differ, you'll need to check whether one is a more recent version or whether one of them is corrupt (for example by checking only those files against the backup).
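That triage can be sketched in Python as follows. This assumes you already have a mapping of relative path to checksum for each of the three copies (how you produce those is up to you); the function name `triage` and the bucket names are invented for the example.

```python
def triage(
    ssd: dict[str, str],
    rebuilt: dict[str, str],
    backup: dict[str, str],
) -> dict[str, list[str]]:
    """Sort files found on both the SSD and the rebuilt disk into buckets.

    Each argument maps a relative path to a checksum of that file's contents.
    """
    buckets: dict[str, list[str]] = {
        "match_on_both": [],          # same on SSD and rebuilt disk: OK
        "ssd_matches_backup": [],     # rebuilt copy is the odd one out
        "rebuilt_matches_backup": [], # SSD copy is the odd one out
        "check_by_hand": [],          # not in backup, or all three differ
    }
    for name in sorted(ssd.keys() & rebuilt.keys()):
        if ssd[name] == rebuilt[name]:
            buckets["match_on_both"].append(name)
        elif backup.get(name) == ssd[name]:
            buckets["ssd_matches_backup"].append(name)
        elif backup.get(name) == rebuilt[name]:
            buckets["rebuilt_matches_backup"].append(name)
        else:
            buckets["check_by_hand"].append(name)
    return buckets
```

Only the files that differ between the SSD and the rebuilt disk ever need to be looked up in the backup, which keeps the slow network comparison small. A copy that is the "odd one out" is not automatically corrupt; it may simply be the newer version, so those buckets are a to-do list rather than a verdict.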

15 minutes ago, johnnie.black said:

You can also compare with your backup, but I meant compare the files from the SSD with the rebuilt disk.  All files that match on both will be OK; for files that differ, you'll need to check whether one is a more recent version or whether one of them is corrupt (for example by checking only those files against the backup).

 

OK I have unassigned devices installed and I will restore to the HDD tonight.

 

Comparing the two drives to each other would tell me what is missing since the SSD went down.  

 

However, the only way to know if any data is actually corrupted (5 errors) is to compare to the backup attached to the Windows machine.  That will leave me a narrow margin of the newer files added since the last backup that could possibly have been corrupted, but it's a pretty narrow window all things considered.

 

I downloaded SyncBackFree.  I can't find how to tell it to compare only.  It seems to just sync or back up.   I also don't see any settings that confirm it's doing a bit-level comparison, just file level, by date changed, file size, etc.

 

Any advice on how to compare the backup files from the windows box to unRAID is appreciated.

 

Thanks again.

52 minutes ago, TODDLT said:

 I can't find how to tell it to compare only.  It seems to just sync or backup.

 

Sync is going to compare; to check hashes you just need to activate "Use slower but more reliable method of file change detection" on Compare Options.

7 hours ago, johnnie.black said:

 

Sync is going to compare; to check hashes you just need to activate "Use slower but more reliable method of file change detection" on Compare Options.

 

Is there a way to tell it to compare only?  I.e., I don't want it to overwrite anything until I see what is different.


 

18 hours ago, johnnie.black said:

 

Sync is going to compare; to check hashes you just need to activate "Use slower but more reliable method of file change detection" on Compare Options.

 

I enabled that setting and did a "test run".  What came up is a much larger number of files that were on one side or the other, but I think this is because I deleted some folders completely, and some were relocated, so they show up as differences on both ends.  That being said, there are 32 "changed" files.  Are these the ones it noted as different at the bit level?  Some look like files that I know were just updated but still have the same name.   What would a "bit level difference" show up as?

 

Now I want to compare the SSD and I've installed Unassigned Devices.  I turned on the mount and sharing but I can't get the share to show up in my browser or in the list of shares on unRAID.  See attached two screenshots.   Any thoughts?  Based on what I read I thought this was all that would be required?   I do note the gray ball, looking like the disk is "spun down", which can't be for an SSD, but I don't see a way to spin up any of those drives.

 

 

Unassigned Devices.PNG

disk shares.PNG


OK well this came from the UD Log:

 

Jul 06 20:33:32 Disk with serial 'ST3000VN000-1HJ166_Z6A039JQ' is not set to auto mount and will not be mounted...
Jul 06 20:33:32 Disk with serial 'ST3000VN000-1HJ166_Z6A03928' is not set to auto mount and will not be mounted...
Jul 06 20:33:32 Adding disk '/dev/sdf1'...
Jul 06 20:33:32 Mount drive command: /sbin/mount -t xfs -o rw,noatime,nodiratime,discard '/dev/sdf1' '/mnt/disks/SSD'
Jul 06 20:33:32 Mount of '/dev/sdf1' failed. Error message: mount: wrong fs type, bad option, bad superblock on /dev/sdf1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Jul 06 20:33:32 Partition 'Samsung SSD 850 EVO 500GB S21HNXAG454345K' could not be mounted, aborting...
Jul 06 21:39:58 Adding disk '/dev/sdf1'...
Jul 06 21:39:58 Mount drive command: /sbin/mount -t xfs -o rw,noatime,nodiratime,discard '/dev/sdf1' '/mnt/disks/SSD'
Jul 06 21:39:58 Mount of '/dev/sdf1' failed. Error message: mount: wrong fs type, bad option, bad superblock on /dev/sdf1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Jul 06 21:39:58 Partition 'Samsung SSD 850 EVO 500GB S21HNXAG454345K' could not be mounted, aborting...
 


SMART report on the drive attached.  

 

See this at the bottom:  not sure what this means but doesn't look good.

 

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Samsung_SSD_850_EVO_500GB_S21HNXAG454345K-20170706-2331.txt

5 hours ago, TODDLT said:

What would a "bit level difference" show up as?

 

Any file with the same name but a different modified time/date stamp or size will appear as different.  A bit-level difference is any file with the same name and size but a different checksum.
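Here is a small Python sketch of that distinction. It is not how SyncBackFree works internally; the function name `classify_pair` and the labels it returns are made up for illustration, and it reads each file fully into memory, which is fine for a sketch but not ideal for large videos.

```python
import hashlib
from pathlib import Path


def classify_pair(a: Path, b: Path) -> str:
    """Label how two same-named files differ, cheapest check first."""
    stat_a, stat_b = a.stat(), b.stat()
    if stat_a.st_size != stat_b.st_size:
        return "size differs"
    # Same size: only the contents can tell the two copies apart.
    hash_a = hashlib.sha256(a.read_bytes()).hexdigest()
    hash_b = hashlib.sha256(b.read_bytes()).hexdigest()
    if hash_a != hash_b:
        return "bit-level difference"
    # Contents are equal; compare timestamps at whole-second resolution.
    if int(stat_a.st_mtime) != int(stat_b.st_mtime):
        return "timestamp only"
    return "identical"
```

A file you knowingly updated will usually land in "size differs" or show a newer timestamp. The suspicious case is "bit-level difference" on a file whose size matches and which you do not remember changing.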

 

3 hours ago, TODDLT said:

Mount of '/dev/sdf1' failed. Error message: mount: wrong fs type, bad option, bad superblock on /dev/sdf1,
       missing codepage or helper program, or other error

 

Post the syslog after an attempted mount but like I posted earlier you'll need to change the UUID so it's different from the rebuilt disk or you won't be able to mount both at the same time.

18 minutes ago, johnnie.black said:

 

 

Post the syslog after an attempted mount but like I posted earlier you'll need to change the UUID so it's different from the rebuilt disk or you won't be able to mount both at the same time.

 

Sorry, I didn't catch that.  How do you change the UUID?  

 

The errors I posted were from the Unassigned Devices log after an attempted mount.  I have attached the full diagnostics zip taken at the same time.

 

Your time is much appreciated.  I'm learning as I go.

 

todd-svr-diagnostics-20170706-2252.zip


