New 10TB HDD failure after 342 hours and other concerns



I'm new to Unraid and a few weeks into building my first server on a trial key.

I'm using four brand-new 10TB Seagate Enterprise ST10000NM0086-2AA101 drives plus an old 3TB Seagate and a 4TB Hitachi: two parity drives and 27TB of storage.

 

I just had one of the 10TB data drives fail right around the time my first parity build completed, after 342 hours clocked on the drive.  The UDMA CRC error count has a raw value of 1.

The other 10TB data drive is reporting a reallocated sector count of 16 after 342 hours.
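
For reference, here's roughly how I've been reading those raw values from the console (a sketch only; /dev/sdX is a placeholder, not one of my actual device assignments). Attribute 5 is Reallocated_Sector_Ct and 199 is UDMA_CRC_Error_Count:

smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|UDMA_CRC_Error_Count'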

 

I am returning both of those drives to Amazon for replacements.

 

I'm running an extended SMART test on all 6 drives now with the array offline.
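
For what it's worth, the command-line equivalent is roughly this (again with /dev/sdX as a placeholder for each drive):

smartctl -t long /dev/sdX        # start the extended self-test
smartctl -l selftest /dev/sdX    # check progress and results once it finishes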

 

I also had four 4-year-old drives, two of which failed a preclear test: one from a pair of 3TB Seagates (25,000 hours) and one from a pair of 4TB Hitachis (33,000 hours).  Am I right that I should just dispose of the drives that failed the preclear test and not try to breathe any new life into them?

 

The old 4TB Hitachi drive that did pass the preclear test now has a reallocated sector count of 95.  I forget what that count was when I first added it to the array but it definitely went up over the course of a week.  Should I replace this drive or just keep an eye on it for now?


It's a little discouraging, and presumably quite unlikely, to run into issues like this so soon, particularly with 2 of the 4 new drives.  I'm ready to write it off as bad luck and forge ahead.  But should I check my system or configuration for something that could have contributed to these issues?

 

My Thermaltake Level 10 case has excellent cooling.  All HDDs are at 20°C during the SMART test.

I believe my 1000 watt power supply is up to the load.

Motherboard: Asus Sabertooth Z77.

RAM: 16GB.

Link to comment

I wouldn't trust any of those suspect disks in my array. The parity disks alone don't provide parity protection; all the other disks are involved too.

 

Single parity requires that every bit of parity plus all other disks be reliably read in order to reconstruct the data of a failed disk.

 

Dual parity just makes it possible to lose 2 disks at once and still rebuild, but every bit of all the other disks must be reliably read to reconstruct.
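
A toy illustration of the single-parity case (Unraid's single parity is essentially a bitwise XOR across the data disks): say a given byte is 0xA5 on disk1 and 0x3C on disk2, so parity stores 0xA5 ^ 0x3C = 0x99. Rebuilding a lost disk1 needs the parity byte and a good read of disk2:

echo $(( 0x99 ^ 0x3C ))   # prints 165, i.e. 0xA5, the reconstructed disk1 byte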

Link to comment

The 10TB disk with the CRC error is now reporting "a mandatory SMART command failed" and no SMART results are available.  Both 10TB replacements arrived and are being precleared now.  Here's hoping I complete the replacement before losing a third drive... if so, dual parity will have saved my bacon (though the most important data is backed up elsewhere).

 

What can I do to make sure the problem isn't being caused by other hardware issues?

 

I did try swapping connections between a failed drive and a good one, but the failed drive stayed failed, suggesting it wasn't the SATA cable.  I only tried that once, with one pair; I haven't tried it with the second drive that went belly up.

Link to comment

*I might have just lost 10TB of data.  Or perhaps 20TB.*

 

[Setting aside the probably bad SATA cable for now (no drives connected to it)]

 

After preclearing both replacement 10TB drives I had the system run a parity rebuild.  During the rebuild I noticed that one drive was listed as formatted and the other wasn't.  I hadn't formatted either, I believe, but I hoped that the parity rebuild would take care of any potential issues or, at worst, that I'd have to reformat the unformatted one and redo the parity rebuild.

 

When the parity rebuild was complete, both replacement 10TB data drives were listed as unmountable because they hadn't been formatted, although both listed "xfs" as the file system.  I figured that at worst I'd just wasted the rebuild time and would simply have to format them and repeat the rebuild.  To be on the safe side, I powered down and disconnected one of the drives (disk 2), formatted the other (disk 1), and restarted the array, having it remove the missing drive (disk 2) from the array.

 

The array status then said parity is valid and that disks 1 and 2 each have 9.99TB free.  It looks as if no data was recovered by the parity rebuild.

 

I stopped the array, reconnected disk 2, restarted the array, and discovered that disk 2 would not mount when I clicked "mount".

 

Occam's razor says it's user error, due to ignorance.

 

Any information on what happened and, more importantly, any steps to recover up to 20TB of data would be appreciated.

 

I've attached system log and diagnostics.

 

Below is part of the most recent log info.

Disk 2 is "ZA27FQQF".

I assume I should "run xfs_repair (>= v4.3)" on disk 2, though I'm not at all familiar with that process and hope it can help with that disk's data.

Any chance of data recovery for disk 1 (ZA28QW6X), the one I quick-formatted?  No data has been written to it since the format.


Feb 4 22:12:22 YountNAS unassigned.devices: Adding disk '/dev/sdg1'...
Feb 4 22:12:22 YountNAS unassigned.devices: Mount drive command: /sbin/mount -t xfs -o rw,noatime,nodiratime '/dev/sdg1' '/mnt/disks/ST10000NM0086-2AA101_ZA27FQQF'
Feb 4 22:12:22 YountNAS kernel: XFS (sdg1): Mounting V5 Filesystem
Feb 4 22:12:22 YountNAS kernel: XFS (sdg1): totally zeroed log
Feb 4 22:12:22 YountNAS kernel: XFS (sdg1): Corruption warning: Metadata has LSN (1:1041896) ahead of current LSN (1:0). Please unmount and run xfs_repair (>= v4.3) to resolve.
Feb 4 22:12:22 YountNAS kernel: XFS (sdg1): log mount/recovery failed: error -22
Feb 4 22:12:22 YountNAS kernel: XFS (sdg1): log mount failed
Feb 4 22:12:22 YountNAS unassigned.devices: Mount of '/dev/sdg1' failed. Error message: mount: /mnt/disks/ST10000NM0086-2AA101_ZA27FQQF: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error. 
Feb 4 22:12:22 YountNAS unassigned.devices: Partition 'ST10000NM0086-2AA101_ZA27FQQF' could not be mounted...

yountnas-diagnostics-20190204-2219.zip

yountnas-syslog-20190204-2230.zip

Link to comment
4 hours ago, johnnie.black said:

Formatting is never part of a rebuild; there's a warning that all data will be lost:

 

[Screenshot: the format warning dialog]

 

The only option for trying to recover the data from the formatted disk is a file recovery utility like UFS Explorer.

 

As for the other disk, there's filesystem corruption; run xfs_repair:

https://wiki.unraid.net/Check_Disk_Filesystems#Drives_formatted_with_XFS

 

I don't hold out much hope for this helping, but maybe, just possibly, change the wording of the message?

 

Current: Format will create a file system in all Unmountable disks, discarding all data currently on those disks.

Proposed: Format will create a file system in all Unmountable disks, discarding all data emulated by parity or on the physical disks.

Link to comment
I don't hold out much hope for this helping, but maybe, just possibly, change the wording of the message?

There's already a change for v6.7, where it mentions "format is never part of a rebuild", but I bet it will still keep happening. Your proposed change is also good, though I'm not sure it would be better if a user doesn't get that formatting a disk updates parity... 

 

 

 

Link to comment

I honestly still think my idea of removing the format option from the main GUI altogether is the best idea.

 

If the only place to format the disk is on the disk page itself, it can be placed AFTER the file system check tools, and offered on a per disk basis instead of shotgunning the entire array and formatting everything that's unmountable.

 

Tracking the mountable state of a disk slot from one session to the next would help too; if the disk slot was mountable in a previous boot, maybe don't even offer a format option without passing through a disk check page.

 

If the format is a small sub option on the disk properties page, it can be linked with a more comprehensive explanation, and maybe require typing something in a confirmation box.

 

In my opinion, this "feature" of unraid is responsible for more data loss than any other, and deserves more attention.

Link to comment

Obviously formatting a disk was a mistake as you've no doubt realized now. Format writes an empty filesystem to the disk, and that write operation updates parity just like any other write operation does.

 

But I'm a little unclear about the rest of what you did.

 

Did you format both disks? Or is that other disk just unmountable and might possibly be repaired?

Link to comment
2 hours ago, jonathanm said:

I honestly still think my idea of removing the format option from the main GUI altogether is the best idea.

 

If the only place to format the disk is on the disk page itself, it can be placed AFTER the file system check tools, and offered on a per disk basis instead of shotgunning the entire array and formatting everything that's unmountable.

 

Tracking the mountable state of a disk slot from one session to the next would help too; if the disk slot was mountable in a previous boot, maybe don't even offer a format option without passing through a disk check page.

 

If the format is a small sub option on the disk properties page, it can be linked with a more comprehensive explanation, and maybe require typing something in a confirmation box.

 

In my opinion, this "feature" of unraid is responsible for more data loss than any other, and deserves more attention.

IMO these are all good ideas. Is there a feature request for this? If not, you should create one so we can discuss it further and it will be more visible for other users to add their suggestions.

Link to comment
10 hours ago, johnnie.black said:

The only option for trying to recover the data from the formatted disk is a file recovery utility like UFS Explorer.

 

As for the other disk, there's filesystem corruption; run xfs_repair:

https://wiki.unraid.net/Check_Disk_Filesystems#Drives_formatted_with_XFS

 

4 hours ago, trurl said:

Obviously formatting a disk was a mistake as you've no doubt realized now. Format writes an empty filesystem to the disk, and that write operation updates parity just like any other write operation does.

 

But I'm a little unclear about the rest of what you did.

 

Did you format both disks? Or is that other disk just unmountable and might possibly be repaired?

 

Thanks everyone so far.

[I'm now mainly using serial numbers to refer to the specific disks, since disk numbers change when drives are swapped out.]

 

DISK ZA27FQQF RECOVERY

I now understand that formatting replacement disk 1 (ZA28QW6X) updated parity so my two parity drives can't help me with recovery.

 

Disk 2 (ZA27FQQF) was physically removed from the computer when I formatted disk 1 (ZA28QW6X), and I'm following johnnie.black's lead to recover from filesystem corruption.

 

The "file system check" option in the GUI runs this command on disk ZA27FQQF,: 

/sbin/xfs_repair -n /dev/sdg1 2>&1

 

From that and the xfs_repair instructions I understand that my command line command should read:

xfs_repair -v /dev/sdg1

 

It's currently in Phase 1 trying to find secondary superblocks since it couldn't find the primary.  15 minutes later, it's still writing only dots on the screen.  I don't know how bad a sign that is.
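
For reference, in case it comes to that: from what I've read, if the repair eventually reports that the log is corrupt and refuses to continue, the usual last-resort step is to rerun with -L, which zeroes the log (and discards whatever was in it):

xfs_repair -v -L /dev/sdg1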

 

DISK ZA28QW6X RECOVERY

This is the new replacement disk that I formatted when, immediately after the parity rebuild, it showed 9.99TB free instead of being nearly full.  It was obviously a quick format, so I'm taking it to my local shop to see what equivalent of UFS Explorer they have.

 

NEXT STEPS

If I'm lucky, the original 10TB disk with the CRC error is indeed just fine and only needs a new SATA cable.  I'll pick one up today.

 

If I'm successful with both disk recoveries (ZA27FQQF and ZA28QW6X), then I'll have to see what drive I can add to the array temporarily to hold the recovered data before wiping the 10TB drives and returning them to the array.

Edited by Wolfe
Link to comment
1 minute ago, Wolfe said:

From that and the xfs_repair instructions I understand that my command line command should read:

xfs_repair -v /dev/sdg1

That's correct, if the disk identifier currently really is sdg.
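
One way to double-check which /dev/sdX a given serial currently maps to before running anything destructive (generic Linux commands, nothing Unraid-specific):

lsblk -o NAME,SIZE,SERIAL                 # list block devices with serial numbers
ls -l /dev/disk/by-id/ | grep ZA27FQQF    # or match the serial to its sdX node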

 

1 minute ago, Wolfe said:

It's currently in Phase 1 trying to find secondary superblocks since it couldn't find the primary.  15 minutes later, it's still writing only dots on the screen.  I don't know how bad a sign that is.

If it's running on the correct disk, that's a bad sign, but let it look for a backup superblock; that can take a while since it scans the whole disk.

Link to comment

It's definitely the correct disk.  Starting with the two replacement disks, I've been keeping track of them by serial number rather than as "disk 1" and "disk 2".

 

If xfs_repair fails on ZA27FQQF, can I assume the disk itself is just fine and put it back to use in the array?  That raises the question of why the file system got messed up during the parity rebuild, of course.

 

Was it a mistake to replace both disk 1 and disk 2 at the same time before rebuilding parity?  Could that have been the cause of the file system errors?  (This was a dual-parity system with two 10TB parity drives reporting no errors.)

 

If it's OK to use disk ZA27FQQF again, how important is it to preclear it again, given that it passed the first time?  Dunno if Amazon will give me an extension on the Feb 6th return shipping deadline.

Link to comment
1 minute ago, Wolfe said:

If xfs_repair fails on ZA27FQQF, can I assume the disk itself is just fine and put it back to use in the array?

Yes, the disk looks fine.

 

2 minutes ago, Wolfe said:

Was it a mistake to replace both disk 1 and disk 2 at the same time before rebuilding parity? 

Not quite sure what you mean by "before rebuilding parity"; there's no problem replacing two disks at the same time with dual parity, as long as both parity disks are in sync before starting the rebuilds.

 

 

Link to comment

Update:

I was unable to recover any data from either of the 10TB replacement disks, but it looks like I can get most of the data from the two disks they were replacing.  Krusader is about half done transferring the data to the new disks.  The reallocated sector count is climbing on disk ZA26ESJ8 while I'm copying from it, but not quickly; it's at 40 so far.

 

I suspended the parity sync while copying the data from the failing disks, to speed things up.  As soon as the copy is done, I'll run the parity sync.

 

Assuming that goes well, my next step will be figuring out whether my Asus Sabertooth Z77 supports VT-d.  I already know my i7-3770K cannot (but HVM works fine).  I'll probably only be running one VM: Windows 10, mainly for gaming.  I'm 90% sure I need VT-d for that.
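
If it helps anyone checking the same thing, a quick way to confirm whether VT-d/IOMMU is actually active once it's enabled in the BIOS (a generic Linux check, not specific to my board):

dmesg | grep -i -e dmar -e iommu    # DMAR/IOMMU messages indicate VT-d is enabled and detected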

 

After that, it's setting up a mirrored SSD cache pool and configuring data and media serving for our laptops, Roku, etc.

 

I'll update after parity sync is complete.  Successfully, I hope!

Link to comment

Yes, definitely.

But if the motherboard doesn't support VT-d (and it looks like it probably doesn't), I won't rush to replace it.  I have very little time for gaming these days anyway.  NAS was the main point of switching to Unraid, followed by media serving.  I hadn't even booted my gaming desktop for over a year!

Link to comment
