[SOLVED] Disk Disabled After Parity Check. Drive Dying?

April 1, 2014

Hello all.

I just got home from work and opened the UnRAID GUI, as I do every day, only to notice that one of my disks was marked as both Disabled and Unformatted. On closer inspection, the files and folders were still listed when I browsed the drive from Windows, but the files themselves could not be opened.

I have a parity check scheduled for the first of every month, and it appears that something was detected during that check. UnMenu tells me that the parity was updated 454778778 times to correct sync errors. I looked in the syslog file (see attached) and found what appear to be numerous parity errors.

I am unsure of what happened, but my main concerns right now are:

- Is my hard drive dying? I'm wondering if that's why this happened.
- Is it still possible to recover the data after the parity was updated?

I really would appreciate any help that I can get as this has me fairly concerned. Thank you so much in advance!

BTW, I am running unRAID 6.0 Beta 4 if that matters.

syslog-20140401-181429.zip

April 2, 2014

Looks like a bad or loose SATA cable. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F

April 2, 2014

Author

Thank you for the suggestion! I tried wiggling the cable and it did not fix it, so I'll try another SATA cable tomorrow. I did run a short SMART test which pulled a RAW_VALUE of 1 for "UDMA_CRC_Error_Count". That does seem to indicate it is possibly a SATA cable issue according to your link.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   162   161   021    Pre-fail  Always       -       6866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       535
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26911
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8282
194 Temperature_Celsius     0x0022   127   117   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
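
For anyone who wants to keep an eye on that one attribute, here is a minimal sketch that pulls the RAW_VALUE of attribute 199 out of a `smartctl -A` style table. It assumes the ten-column layout shown above; the function name `crc_error_count` and the device name in the usage comment are made up for illustration.

```shell
# Print the RAW_VALUE (10th column) of SMART attribute 199,
# UDMA_CRC_Error_Count, from a table supplied on standard input.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Usage on a live system (the device name is an assumption):
#   smartctl -A /dev/sdb | crc_error_count
```

A count that stays flat after the cable is swapped would fit the cable theory; a count that keeps climbing would suggest the link is still unreliable.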

My question is, why would the drive come up as Unformatted? Does it have anything to do with UnRAID removing the disk as seen in my syslog?

Apr  1 18:12:51 Tower emhttp: Device inventory:
Apr  1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZ20251692 (sdb) 1953514584
Apr  1 18:12:51 Tower emhttp: WDC_WD20EARS-00S8B1_WD-WCAVY3007783 (sdc) 1953514584
Apr  1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZA0659735 (sdd) 1953514584
Apr  1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZA1691542 (sde) 1953514584
Apr  1 18:12:51 Tower emhttp: ST3000DM001-1CH166_W1F1N7ZW (sdf) 2930266584
Apr  1 18:12:51 Tower emhttp: TOSHIBA_DT01ACA300_X3T7M0HKS (sdg) 2930266584
Apr  1 18:12:51 Tower emhttp: TOSHIBA_DT01ACA300_X3S933VGS (sdh) 2930266584
Apr  1 18:12:51 Tower emhttp: ST3000DM001-1CH166_W1F1N823 (sdi) 2930266584
Apr  1 18:12:51 Tower kernel: mdcmd (1): import 0 8,128 2930266532 ST3000DM001-1CH166_W1F1N823
Apr  1 18:12:51 Tower kernel: md: import disk0: [8,128] (sdi) ST3000DM001-1CH166_W1F1N823 size: 2930266532
Apr  1 18:12:51 Tower kernel: mdcmd (2): import 1 8,80 2930266532 ST3000DM001-1CH166_W1F1N7ZW
Apr  1 18:12:51 Tower kernel: md: import disk1: [8,80] (sdf) ST3000DM001-1CH166_W1F1N7ZW size: 2930266532
Apr  1 18:12:51 Tower emhttp: ckmbr: read: Input/output error
Apr  1 18:12:51 Tower kernel: mdcmd (3): import 2 8,96 2930266532 TOSHIBA_DT01ACA300_X3T7M0HKS
Apr  1 18:12:51 Tower kernel: md: import disk2: [8,96] (sdg) TOSHIBA_DT01ACA300_X3T7M0HKS size: 2930266532
Apr  1 18:12:51 Tower kernel: mdcmd (4): import 3 8,32 1953514552 WDC_WD20EARS-00S8B1_WD-WCAVY3007783
Apr  1 18:12:51 Tower kernel: md: import disk3: [8,32] (sdc) WDC_WD20EARS-00S8B1_WD-WCAVY3007783 size: 1953514552
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] Unhandled error code
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb]  
Apr  1 18:12:51 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] CDB: 
Apr  1 18:12:51 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 20 00
Apr  1 18:12:51 Tower kernel: end_request: I/O error, dev sdb, sector 0
Apr  1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 0
Apr  1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 1
Apr  1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 2
Apr  1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 3
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] Unhandled error code
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb]  
Apr  1 18:12:51 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Apr  1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] CDB: 
Apr  1 18:12:51 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 08 00
Apr  1 18:12:51 Tower kernel: end_request: I/O error, dev sdb, sector 0
Apr  1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 0
Apr  1 18:12:51 Tower kernel: mdcmd (5): import 4 0,0
Apr  1 18:12:51 Tower kernel: md: disk4 removed
Apr  1 18:12:51 Tower kernel: mdcmd (6): import 5 8,48 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA0659735
Apr  1 18:12:51 Tower kernel: md: import disk5: [8,48] (sdd) WDC_WD20EARS-00MVWB0_WD-WMAZA0659735 size: 1953514552
Apr  1 18:12:51 Tower kernel: mdcmd (7): import 6 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA1691542
Apr  1 18:12:51 Tower kernel: md: import disk6: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA1691542 size: 1953514552
Apr  1 18:12:51 Tower emhttp: shcmd (92): /usr/local/sbin/emhttp_event driver_loaded
Apr  1 18:12:51 Tower kernel: mdcmd (8): import 7 8,112 2930266532 TOSHIBA_DT01ACA300_X3S933VGS
Apr  1 18:12:51 Tower kernel: md: import disk7: [8,112] (sdh) TOSHIBA_DT01ACA300_X3S933VGS size: 2930266532
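
The excerpt above already contains the mapping between array slots and sdX device names. As a rough sketch (the function name `slot_map` is hypothetical, and the patterns assume the exact `md: import diskN` and `md: diskN removed` wording seen in this log), the mapping can be listed like this:

```shell
# Read a syslog on standard input and print "diskN sdX" for every slot
# that was imported, or "diskN missing" for every slot reported removed.
slot_map() {
    sed -n \
        -e 's/.*md: import \(disk[0-9]*\): \[[0-9,]*] (\([a-z]*\)).*/\1 \2/p' \
        -e 's/.*md: \(disk[0-9]*\) removed.*/\1 missing/p'
}

# Usage (the syslog path is an assumption):
#   slot_map < /var/log/syslog
```

In this excerpt, sdb is the only device in the inventory with no import line, and it is also the one throwing I/O errors, which suggests it is the drive behind the removed disk4 slot. Device letters can change between boots, so the serial number is the safer identifier.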

smart.txt

April 2, 2014

My question is, why would the drive come up as Unformatted? Does it have anything to do with UnRAID removing the disk as seen in my syslog?

This can simply mean that unRAID was unable to mount it, not that it is really unformatted. If you have had a write failure (which a disk being disabled suggests has happened), then there can be a corrupted file system on the disk.

Typically this can be fixed by running reiserfsck against the drive in question. This should be done by putting the array into maintenance mode and then running a command from a console/telnet session of the form

reiserfsck --check /dev/md??

where ?? corresponds to the disk number in the array. The output from the check run will indicate if any problems are found and what is the suggested course of action if problems are found. If you are not sure you should check back here before taking any action to correct issues reported.
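
To make the "??" substitution concrete, here is a small sketch that only prints the read-only check command for a given disk number instead of running it. The helper name `check_command` is hypothetical, and the `/dev/mdN` naming is the one used in this thread.

```shell
# Print (do not run) the read-only check command for an array slot.
# Rejects anything that is not a plain disk number.
check_command() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: check_command DISK_NUMBER" >&2
            return 1
            ;;
    esac
    echo "reiserfsck --check /dev/md$1"
}

# Usage: check_command 4
```

Printing the command first gives a chance to confirm the disk number before anything touches the array.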

April 2, 2014

Any disk which cannot be mounted is reported as unformatted.

April 2, 2014

Author

Thank you both for your replies! You've been very helpful.

I ended up running the reiserfsck --check /dev/md? command in a telnet session while the array was in maintenance mode, and it appears that there is indeed a problem detected. I'm hopeful that it is just a simple fix, but as itimpi suggested, I feel the need to solicit advice before proceeding as I would love to not lose any data at all in this process.

See below a copy of the reiserfsck report:

root@Tower:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Apr  2 18:56:08 2014
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \block 59292607: The level of the node (54065) is not correct, (4) expected
the problem in the internal node occured (59292607), whole subtree is skipped finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Wed Apr  2 18:59:25 2014
###########

I see where it says "1 found corruptions can be fixed only when running with --rebuild-tree", so I'm assuming that is what I will need to do. Before I proceed, however, I wanted to post this and get some advice so as not to screw anything up. Thank you so much in advance!

April 3, 2014

Yes, run with rebuild-tree.

April 3, 2014

Author

Great, thanks again for your assistance! I ran reiserfsck with the --rebuild-tree option last night, which took several hours. I am not sure if it creates a log file (I think it writes it to the disk itself), but I was able to copy some of the report from my telnet session. Rather than paste all of it here, I have included just the last part of the report, which hopefully gives most of the details. I also attached a txt file with more of the report.

vpf-10260: The file we are inserting the new item (12 13 0x94b40001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) into has no StatData, insertion was skipped
100%                         left 0, 444 /sec
Flushing..finished
        Leaves inserted item by item 222
Pass 3 (semantic):
####### Pass 3 #########
Flushing..finished
        Files found: 0
        Directories found: 2
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finishede 0, 0 /sec
Pass 4 - finished       done 1, 0 /sec
        Deleted unreachable items 2106
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Thu Apr  3 07:05:23 2014
###########

I'm unsure what step to take next. My limited knowledge is keeping me from moving forward, as I don't want to break anything. I do see the line "Deleted unreachable items 2106", but I'm not sure what those items actually are, which is another reason I'm afraid to do anything further at this point. I hate to keep asking, but I just want to make sure I do everything right before proceeding. Thank you so much!

April 4, 2014

1. You're correct to be cautious. Great damage can be done with Reiserfsck....and it would be quick and easy to do.

2. Ask away, it may seem frustrating and slow to await a reply on the forums, but folks will help. They don't mind helping. And their advice is excellent.

3. For where you're at right now, you need one of the real Reiserfsck gurus to weigh in. What's repaired is repaired...but there may be some 'lost' files. Advice is coming on what can be done.

MEANTIME: Can you POST THE FULL TELNET OR SYSLOG of your entire session? Not just the last screen full?

Open the Telnet window and select ALL, even the stuff that scrolled off the screen? If your Telnet tool hasn't kept the data, can you try getting the full SYSLOG...see here for ideas:

Include your VERSION and SYSTEM LOG for support issues
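
One way to avoid losing a long report when the terminal scrolls is to save a copy while it runs. The sketch below wraps any command with `tee`; the helper name `run_logged` and the log path in the usage comment are assumptions, and it has not been verified here that reiserfsck's confirmation prompt displays promptly when its output is piped.

```shell
# Run a command, show its output, and keep a copy in a log file.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    # Merge stderr into stdout so warnings land in the log as well.
    "$@" 2>&1 | tee "$log"
}

# Example (the log path is an assumption):
#   run_logged /boot/reiserfsck-md4.log reiserfsck --check /dev/md4
```

The saved file can then be attached to a post in full instead of copying whatever is left on screen.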

April 4, 2014

Author

Crap! I closed my telnet session because I figured I could only get what was on the screen, not knowing that a select all would have grabbed everything. I don't see anything about Reiserfsck in the syslog either... I'm not sure there is any way I can get that information to post now.

I did go in and pull a full syslog though and attached it to here. I'm very hopeful that someone can help me out! I'm really boggled as to why this drive became disabled after a parity check, and I'm keeping my fingers crossed that it is possible to rebuild the drive if any data is missing.

Thank you and everyone else for all the assistance!!

syslog.txt

April 4, 2014

Run reiserfsck check again.

April 4, 2014

Author

Thank you dgaschk, I will run reiserfsck check again once I get home from work.

Thank you as well to everyone in this entire community for being so helpful. I'm not sure if this is a common issue, but I really appreciate the time everyone has taken to try and help out. I am not the most familiar with Linux, but thankfully there are people here that have been more than helpful!

April 4, 2014

Author

Okay, so I ran the reiserfsck again on the disk, and my limited experience with the information in the report makes me believe that there is no longer anything on the drive other than 2 directories, which has me a bit scared. See the info from the report below:

root@Tower:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Fri Apr  4 18:54:22 2014
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 1
        Internal nodes 0
        Directories 2
        Other files 0
        Data block pointers 0 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Fri Apr  4 18:57:22 2014
###########

If the data is gone, then I guess there isn't a lot I can do unless I can rebuild the drive from the parity (provided that data wasn't removed after the parity check that led to this situation). Any advice would again be incredibly welcome and greatly appreciated :-\

April 4, 2014

root@Tower:~# reiserfsck --check /dev/md4

If the data is gone, then I guess there isn't a lot I can do unless I can rebuild the drive from the parity

The md devices are in the parity set, so any changes made to the md devices are immediately committed to parity. However... from reading through what's transpired, I don't think you had valid parity to begin with. This is a very confused situation, where data from the bad disk was out of sync with the parity disk, as you indicated by all the parity errors in the monthly check. Is your monthly check a correcting or non-correcting check? If non-correcting, you may stand a chance of getting more data back by physically removing disk4 and running a reiserfsck check on the virtual md4 device.

April 4, 2014

Author

It's very strange, right? I've never had anything like this happen over the past 4 years of using this server. Honestly, I'm wondering if it has anything to do with UnRAID 6.0 Beta 4 as I upgraded literally right before the end of the night on the 31st, which would have been shortly before the scheduled monthly parity check. I was previously using 5.0.4 before upgrading.

To answer your question about my monthly parity check, I'm 99% positive it's a non-correcting check. By looking at my log file from April 1st, the first line for the day says:

Apr  1 00:00:01 Tower kernel: mdcmd (52): check NOCORRECT
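
Rather than scrolling through the log by eye, the same line can be pulled out with a short filter. This is only a sketch: the function name `last_check_mode` is made up, and it assumes the mode is the last word on the `mdcmd (N): check ...` line, as in the entry above.

```shell
# Read a syslog on standard input and print the last word of the most
# recent "mdcmd (N): check ..." line, e.g. NOCORRECT.
last_check_mode() {
    awk '/mdcmd \([0-9]+\): check/ { last = $NF }
         END { if (last != "") print last }'
}

# Usage (the syslog path is an assumption):
#   last_check_mode < /var/log/syslog
```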

If that's the case, running a reiserfsck check on the virtual md4 device as you suggested may be the best thing to do. I just want to do whatever I can so I don't lose anything, or at least lose as little as possible. I just did a search on how to do this, and I'm having a tough time figuring it out. I'm thinking it would be to do the check on the drive letter and not the number? Ex: Instead of /dev/md4, it would be /dev/sdd?

Thank you again for your assistance! Best community ever

April 5, 2014

Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk.

Try this:

reiserfsck --rebuild-tree --scan-whole-partition /dev/md4

April 5, 2014

Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk.

But only where writes were made to the disk. Parity was out of sync, as evidenced by all the non-corrected errors. Until a correcting parity check, I'm pretty sure the parity disk would emulate different content than was actually physically on disk4. Reiserfsck was run on the md device, which means any writes would be sent to the parity disk, but anything not written would still be out of sync with the physical disk. Since there were thousands of incorrect parity locations, it's conceivable that the content may be more recoverable from the emulated disk. I think it's worth a shot anyway.
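
The point about emulation can be shown with a toy calculation. unRAID's single parity is a bitwise XOR across the data disks, so an emulated disk is computed as parity XOR every other data disk. The sketch below uses one made-up byte per "disk"; all values and names are hypothetical and exist only to illustrate the arithmetic.

```shell
# XOR all arguments together (each argument is a small integer).
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

# Made-up one-byte contents of three data disks.
d1=165
d2=60
d4_physical=15

# Parity that is in sync is the XOR of every data disk.
parity_good=$(xor_all "$d1" "$d2" "$d4_physical")

# The emulated disk4 is rebuilt from parity plus the surviving disks,
# so with good parity it equals what is physically on disk4.
emulated=$(xor_all "$parity_good" "$d1" "$d2")

# If parity is stale at this location, the emulated value differs
# from the physical one.
parity_stale=$(( parity_good ^ 255 ))
emulated_stale=$(xor_all "$parity_stale" "$d1" "$d2")

echo "physical=$d4_physical emulated=$emulated emulated_with_stale_parity=$emulated_stale"
```

Wherever parity was never corrected, the emulated disk and the physical disk can therefore hold different data, which is why checking both is being discussed.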

April 5, 2014

If that's the case, running a reiserfsck check on the virtual md4 device as you suggested may be the best thing to do. I just want to do whatever I can so I don't lose anything, or at least lose as little as possible. I just did a search on how to do this, and I'm having a tough time figuring it out. I'm thinking it would be to do the check on the drive letter and not the number? Ex: Instead of /dev/md4, it would be /dev/sdd?

Whatever you do, do NOT run a check against /dev/sdd, as that would definitely mess things up. When running against raw devices you have to include the partition number (e.g. /dev/sdd1). Using the /dev/md?? type devices means unRAID handles the partition for you, but you are probably running against an emulated device rather than the real device.
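
As a guard against that mistake, here is a minimal sketch of a check that accepts only an md device or a numbered partition and rejects a whole-disk name. The function name `is_filesystem_target` is hypothetical, and the patterns cover only the `/dev/mdN` and `/dev/sdXN` naming used in this thread.

```shell
# Succeed only when the argument names an md device or a partition.
is_filesystem_target() {
    case "$1" in
        /dev/md[0-9]*)       return 0 ;;  # array device, e.g. /dev/md4
        /dev/sd[a-z]*[0-9])  return 0 ;;  # partition, e.g. /dev/sdd1
        *)                   return 1 ;;  # whole disk or anything else
    esac
}

# Usage:
#   is_filesystem_target /dev/sdd1 && echo "ok to check"
```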

At this point I must admit I am not sure whether you would be better off running against the physical disk, or the /dev/md?? device which may be emulated. Someone else may be able to recommend what to do.

April 5, 2014

Author

Thanks to everyone for all the suggestions and help! Being fairly unfamiliar with this, I'm hopeful that there is something that I can do to recover at least some of my data.

Is there an easy way to figure out if a device is being emulated?

Also, is it possible to access the disk in say midnight commander or something while it's disabled to see if there is data on the drive?

I'm probably talking nonsense, but I'm just hoping to contribute to the solution of my own problem. Everything that everyone has said has been greatly appreciated, and I'm thankful for the suggestions!

Before I move forward with anything, I'm going to see if anyone else has a recommendation as well, since this seems to be a very strange and isolated incident.

You guys are the best! Thanks again for your assistance!

April 5, 2014

Is there an easy way to figure out if a device is being emulated?

Yes. If the disk is red balled, then all operations are actually being done on the emulated disk; the physical disk is not being used at all. If that is the case, then perhaps one way to attempt a better recovery would be to assign that disk as a cache drive and see if it mounts and is readable. Or pull it completely out of the box and attempt to read it using one of the Windows reiserfs utilities.

April 5, 2014

Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk.

You are totally right, I missed the HUGE GLARING second word in the title, "Disabled". That means the physical drive hasn't been messed with yet, so maybe there is a chance to recover files off of it.

April 5, 2014

You can do both. Run the rebuild/scan-whole on md4, and you can also run it on sdd1. First put the physical disk in a PC or Linux box. Run SystemRescueCD to make a copy image of the disk and run recovery on the copy image. At the same time, run reiserfsck on md4.
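
For the imaging step, a plain `dd` copy is one possible approach. The sketch below is not specific to SystemRescueCD; the helper name `image_device` and the paths in the usage comment are assumptions. A raw image is as large as the source partition regardless of how much data it held, and on a drive with read errors a tool such as GNU ddrescue, if available, is usually a better fit than plain `dd`.

```shell
# Copy SOURCE to IMAGE in 1 MiB blocks, then confirm the copy is
# byte-for-byte identical. The compare re-reads the whole source.
# Usage: image_device SOURCE IMAGE
image_device() {
    dd if="$1" of="$2" bs=1048576 2>/dev/null && cmp -s "$1" "$2"
}

# Example (device and destination are assumptions):
#   image_device /dev/sdX1 /mnt/spare/disk4.img
```

Recovery attempts can then be run against the image file, leaving the original disk untouched.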

April 6, 2014

Author

Awesome! Is there a reason to do both, or is it just that it's a possibility that you can? I was just going to do it on md4, but if I need to run it on the sdd# that's no problem.

Before I do that, I'll put the disk in my PC so I can create an image of the disk. I'll need to go buy a spare 2TB drive just to write the image to since I'm assuming the image will be roughly as large as the amount of data I had on there, and I don't have enough free space on any drive in my PC currently. I've never worked with SystemRescueCD, but I'm sure it will let me create the drive image to another disk.

I'll also run the reiserfsck on md4 while doing all that. Just to triple clarify, I'd be doing the rebuild as you suggested in your previous post (also quoted below)?

reiserfsck --rebuild-tree --scan-whole-partition /dev/md4

Thanks again so much for all of your help, it's really appreciated more than you can imagine.

April 6, 2014

Yes. That is the command. You can run it while the disk is missing. Do both in case one method fails.

April 6, 2014

Author

Nice, this is what my plan will be. I didn't get a chance to go buy another 2TB drive to create the image to with SystemRescueCD, so I'm going to try and go there after work tomorrow. I'll then create the image and then run recovery on it, then reiserfsck rebuild md4 and sdd1, or whatever the number is for that disk on my server. Is there an easy way to find out the actual partition number for that drive in UnRAID?

I am hoping this works as I've been too afraid to even start the array and use any of my other disks while this issue is going on as I don't want to mess anything up more than it might be.

Thanks again for the help, and I'll be sure to report back once I've tried those things!
