Have I lost a drive?



Hubby reported not being able to copy files to my unRAID 5.0.5 server; the message was "you do not have permission".  I checked all the shares via the unRAID GUI and they seem fine.

 

Looking at the disk status I can see 1 disk is in red status, and when I try to run a SMART report on it I get a "Smartctl open device: /dev/sdf failed: No such device" message.  I have tried checking the syslog but all I get is these few entries:

 

Jun  4 04:40:01 Tardis syslogd 1.4.1: restart.

Jun  4 08:47:35 Tardis dhcpcd[1041]: eth0: renewing lease of 192.168.1.16 (Network)

Jun  4 08:47:35 Tardis dhcpcd[1041]: eth0: acknowledged 192.168.1.16 from 192.168.1.1 (Network)

Jun  4 08:47:35 Tardis dhcpcd[1041]: eth0: leased 192.168.1.16 for 86400 seconds (Network)

Jun  4 18:49:47 Tardis unmenu[1127]: cat: /sys/block/sdf/stat: No such file or directory (Drive related)

Jun  4 18:49:52 Tardis kernel: NTFS driver 2.1.30 [Flags: R/W MODULE]. (System)

Jun  4 18:49:52 Tardis unmenu[1127]: cat: /sys/block/sdf/stat: No such file or directory (Drive related)

Jun  4 18:50:37 Tardis unmenu[1127]: cat: /sys/block/sdf/stat: No such file or directory (Drive related)

Jun  4 18:50:37 Tardis kernel: mdcmd (797): spinup 3 (Routine)

Jun  4 18:50:37 Tardis kernel:  (Routine)

Jun  4 18:51:11 Tardis unmenu[1127]: cat: /sys/block/sdf/stat: No such file or directory (Drive related)

Jun  4 18:51:11 Tardis kernel: mdcmd (798): spinup 1 (Routine)

Jun  4 18:51:11 Tardis kernel:  (Routine)

Jun  4 18:52:03 Tardis unmenu[1127]: cat: /sys/block/sdf/stat: No such file or directory (Drive related)

Jun  4 18:52:03 Tardis kernel: mdcmd (799): spinup 3 (Routine)

Jun  4 18:52:03 Tardis kernel:  (Routine)

Jun  4 19:04:30 Tardis kernel: mdcmd (800): spindown 0 (Routine)

Jun  4 19:04:30 Tardis kernel: mdcmd (801): spindown 2 (Routine)

 

From this it would seem the disk has died, but how best to verify?  I do have disks here I can replace the drive with if needed, but they are larger, so I would have to replace my parity drive first and then the failed disk.  What is the best process to achieve this?
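For the record, these are the checks I know how to run from the console (the smartctl command is the one that produced the error above; the other two should show whether the kernel can still see sdf at all):

smartctl -a /dev/sdf      # full SMART report; currently fails with "No such device"
cat /proc/partitions      # block devices the kernel currently knows about
ls -l /dev/sd*            # device nodes present for the attached disks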

 

Thanks for reading.

 

Link to comment

 

Thanks for the link.  Not sure if I have understood correctly, but it would seem I can replace the parity drive first and then replace the failed data drive.  Is this correct?

 

The replacement drives I have available are both new and larger than the existing parity drive and the failed data drive.  I realise I can use the old parity drive to replace the failed data drive once I have swapped the parity drive out; it just seems I may as well kill 2 birds with 1 stone, as it were.

 

It is possible the disk is fine but has dropped offline for some reason.  Have you tried rebooting the server to see if the disk can then be seen?  If it can, post its SMART attributes.

 

Not yet, wasn't sure of next steps.  I thought SMART would work even if the disk was offline?

Link to comment

 


Be very careful with the parity swap procedure. In fact, be very careful when you rebuild any disk. In particular, there is no way to "kill 2 birds with 1 stone". You must use 2 stones to kill the 2 birds. In other words, you cannot change 2 drives at the same time.

 

Parity swap works by copying parity to the new drive, then rebuilding the failed drive onto the old parity drive. Once that rebuild has completed, everything is good, and you have done a parity check, you can then rebuild the data drive (the old parity drive) onto a new, larger drive.

 

unRAID won't even let you proceed if you remove more than one drive since parity plus all other drives are required to rebuild a drive.
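A toy illustration of why (plain XOR arithmetic, as used for single parity; not actual unRAID code, and the byte values are made up): each parity bit is the XOR of the corresponding bit on every data drive, so rebuilding one drive needs parity plus every other drive intact.

# Three one-byte "drives" and their parity
d1=0xA5; d2=0x3C; d3=0x0F
parity=$(( d1 ^ d2 ^ d3 ))
# If d2 is lost, it can only be rebuilt when parity, d1 AND d3 are all readable
rebuilt=$(( parity ^ d1 ^ d3 ))
printf 'd2=%#x  rebuilt=%#x\n' "$d2" "$rebuilt"   # both print 0x3c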

 

itimpi has a good point about whether the drive is actually bad though. In any case you are going to have to rebuild if the drive has redballed, but you might be able to reuse the drive later if it is OK.

Link to comment

I thought SMART would work even if the disk was offline?

 

SMART handling is actually on the drive itself; all we do is ask the drive for reports and tests.  So if the drive is offline, there's no SMART.  A reboot usually recovers the drive, unless it has catastrophically failed and won't even spin up.
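For example, once the drive is visible again, you can ask it to run its own built-in self-test and then read the result back off the drive (standard smartctl usage, nothing unRAID-specific):

smartctl -t short /dev/sdf      # tell the drive to start its short self-test
smartctl -l selftest /dev/sdf   # read the self-test log back from the drive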

 

If you can give us the whole syslog (see my sig, Troubleshooting link), and a SMART report for it (after reboot), then we may be able to determine what went wrong, and whether the drive is fine, or needs a little TLC, or needs replacement.  The little syslog excerpt above was from some time after the issues happened to it.  Typically we need to see the very first errors it produced.
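One tip, since unRAID keeps the syslog in RAM: copy it onto the flash drive before you reboot, or it's gone.  Something like this (stock v5 paths; adjust if yours differ):

cp /var/log/syslog "/boot/syslog-$(date +%Y%m%d).txt"   # the flash survives the reboot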

Link to comment

unRAID won't even let you proceed if you remove more than one drive since parity plus all other drives are required to rebuild a drive.

 

itimpi has a good point about whether the drive is actually bad though. In any case you are going to have to rebuild if the drive has redballed, but you might be able to reuse the drive later if it is OK.

 

Sorry, I wasn't clear.  It was never my intention to replace both drives at the same time.  I was planning to replace the parity drive with a new, larger drive and, once that was rebuilt, replace the failed data drive with another new, larger drive.  I'm happy to follow the parity swap procedure if that is the way to go.

 

If you can give us the whole syslog (see my sig, Troubleshooting link), and a SMART report for it (after reboot), then we may be able to determine what went wrong, and whether the drive is fine, or needs a little TLC, or needs replacement.  The little syslog excerpt above was from some time after the issues happened to it.  Typically we need to see the very first errors it produced.

 

The syslog excerpt was all I could get at the time.  I did try to download it but there was no more data.  I have attached a syslog report and a SMART report for the failed drive.  Unfortunately I did these dumps after the reboot, so the previous syslog is lost.  I'm such a goose, I forgot about the contents being wiped.

 

syslog_05062016.txt

Smart_Report_SDF_2016_06_05.txt

Link to comment

There is no way to replace and rebuild parity and then still have a chance to rebuild the disabled drive. In order to rebuild parity, all other drives must be enabled. Parity swap takes care of this by copying parity instead of rebuilding it, then rebuilding the disabled disk to the old parity.

 

SMART for the disabled drive looks OK though, so another possibility is to rebuild the drive onto itself. If that succeeds, you can then upgrade parity in the normal way, and afterwards upgrade the data drive in the normal way, if you want.

 

Re-enable the drive

Link to comment

SMART for the disabled drive looks OK though, so another possibility is to rebuild the drive onto itself. If that succeeds, you can then upgrade parity in the normal way, and afterwards upgrade the data drive in the normal way, if you want.

 

Do you think the parity swap would be the best option, given I want to upgrade my drives anyway?  Will it have less risk than rebuilding onto the disabled drive?

Link to comment


Rebuilding onto the disabled drive should not entail any risk, just take up some time.  If it works, you are in a good state for going forward; if it fails, you are in the same state as at present.
Link to comment


 

OK, I will give it a try and then report back.  I will use the un-trusted drive procedure, since I do not know what caused the drive to go offline.

 

One other question: when I started looking at the problem I noticed one of my shares, "TV Shows", was in amber status.  I checked the wiki but I cannot find what this means, only the status colours of the drive itself.

Link to comment


Probably been nearly 2 years since I used v5. Does the Help button tell you anything? With v6 it would mean some of the share is still on cache.
Link to comment

 

Do you mean from the console?  I have the array stopped at the moment so I can't see the share status; I will see what it does when I get it back online.  I do use a cache drive, but this was a long-standing share, not a newly created one.

Link to comment
  • 2 weeks later...

OK, so I have finally had some time to work on this again.  I have rebuilt the drive and it seems to be OK: the status console showed an error count of 0, just a large number of writes, as expected.  I have not yet kicked off a parity check, as I was not sure whether to include the "Correct any parity errors" option.  I have attached the syslog and SMART report for the drive.

 

Also I note that the share which was in amber status is still the same following the rebuild; I have not yet found an explanation for this.

 

Edit: Found this thread that may point to an explanation for the amber share; it seems to indicate there are files in cache waiting to be written to it.

 

http://lime-technology.com/forum/index.php?topic=9071.0

 

From memory my Mover script runs automatically at 4:00 am each morning.  Should I stop it or force it to run sooner?
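For reference, this is how it appears to be set up on my box (the /usr/local/sbin/mover path is what I see here; it may differ on other builds):

crontab -l | grep -i mover      # shows when the mover is scheduled
/usr/local/sbin/mover           # runs the mover right now, in the foreground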

syslog-2016-06-18.txt

SDF_device_SMART_report_20160617.txt

Link to comment


 

1. Do NOT tick "Correct any parity errors" unless you are 100% sure you have zero issues. If you DO have an issue, parity will be updated to match the error, so there's no way back.

 

2. Stop the mover, unless you are 100% sure you have zero issues. The last thing you want is for data that's safe and sound on the cache to be corrupted when moved to a problematic disk.
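If you're not sure whether it is running at any point, something like pgrep will tell you (better to let a run finish than to kill it mid-move):

pgrep -lf mover                 # prints the mover's PID and command line if it is running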

Link to comment

1. Do NOT tick "Correct any parity errors" unless you are 100% sure you have zero issues. If you DO have an issue, parity will be updated to match the error, so there's no way back.

 

I have not kicked it off yet, once I do I'll leave the box un-ticked.

 

2. Stop the mover, unless you are 100% sure you have zero issues. The last thing you want is for data that's safe and sound on the cache to be corrupted when moved to a problematic disk.

 

I did not see your reply in time, and it ran; I didn't know how to stop it.  It looks like I still have problems: lots of reiserfs errors and mover-related entries in the syslog, which I have attached.  The problem drive is in green status and the share is still amber, so I'm not sure what to do next.

 

Edit: I just noticed that the reiserfs and other errors are logged against device sdb1 (my cache drive), and the drive that dropped offline was sdf.  Also, my drives are all getting pretty full; the emptiest is at 96%.

 

Edit 2: Have also added the SMART report for sdb1; it has 2 current pending sectors, so it looks like it might be a bit sick.  Being my cache drive, how can I recover from this?
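For reference, the pending-sector figure comes straight from the drive's SMART attributes, e.g.:

smartctl -A /dev/sdb | grep -i pending   # attribute 197, Current_Pending_Sector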

syslog-2016-06-181.zip

SMART_Report_SDB1_20160618.txt

Link to comment

The cache drive has bad sectors AND a corrupted file system: 2 serious problems with very different solutions.  At this point, the best plan would be to make sure any files you want saved have been copied off, unassign it as the cache drive, then preclear it for several cycles.  You just ran the mover, so hopefully there's little left.  If there is anything left that you want, it may or may not be corrupted, so restoring from backups of those files would be safer.  If there is no backup, make an attempt to get what you can off the drive before you unassign and preclear it.
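If you want to gauge how bad the file system damage is before wiping it, reiserfsck can do a read-only check while the drive is unmounted (stop the array first; /dev/sdb1 taken from your earlier post):

reiserfsck --check /dev/sdb1    # read-only check; reports what a repair would involve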

 

As to the parity check: you have just rebuilt Disk 3, so any wrong parity bits have effectively been transferred to Disk 3, making parity perfectly correct now.  When Disk 3's bits are being reconstructed, the bits of the parity drive plus all other drives are used, right or wrong.  If a wrong bit exists, then the wrong bit is written to Disk 3, but after that, the parity bit is now correct!  You can prove that to yourself by doing a parity check, and you'll find that it's perfect, even if it wasn't before.

There's no way to know if there are wrong bits on Disk 3, or where they might be located, so if you suspect some, this would be a good time to check the files on Disk 3 for corruption, with any method you may have.  The best method is to compare with backups, or just restore the files from the backups.  I haven't read the previous posts to see how likely it is you have any corruption.  Often there isn't any real damage, just fear that there might be.
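A toy demonstration of that effect (plain XOR arithmetic with made-up byte values, not actual unRAID code): flip one parity bit, rebuild "Disk 3" from the bad parity, and the subsequent parity check comes back clean, because the wrong bit now lives on Disk 3.

d1=0xA5; d2=0x3C; d3=0x0F
parity=$(( d1 ^ d2 ^ d3 ))
bad_parity=$(( parity ^ 0x01 ))             # one parity bit is wrong
rebuilt_d3=$(( bad_parity ^ d1 ^ d2 ))      # rebuild "Disk 3" from the bad parity
echo $(( bad_parity ^ d1 ^ d2 ^ rebuilt_d3 ))   # 0 = parity check passes
echo $(( d3 ^ rebuilt_d3 ))                     # 1 = the flipped bit moved to Disk 3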

Link to comment

Thanks RobJ.  I'm pretty sure we can recover whatever is on cache now from other sources (at least hubby says he has it elsewhere but needs to check), so it's no massive drama.

 

I think Disk 3 will be OK, as it appears to have just dropped offline rather than having an actual hardware problem.  I will get hubby to check if the contents of that drive are still held elsewhere.  He has a lot of temporary storage areas on other PCs, so there is a good chance another copy of most of the data exists.

Link to comment
