April 1, 2014

Hello all. I just got home from work and jumped into the UnRAID GUI, as I normally do every day, only to notice that one of my disks was marked as Disabled and also as Unformatted. On closer inspection, I could see that the files and folders were still listed when accessing the drive through Windows; however, the files were not accessible.

I have a parity check scheduled for the first of every month. It appears that something was detected during that parity check. UnMenu tells me that the parity was updated 454778778 times to correct sync errors. I looked into the syslog file (see attached) and found what appear to be numerous parity errors.

I am unsure of what happened, but my main concerns right now are:

- Is my hard drive dying? I'm wondering if that's why this happened.
- Is it still possible to recover that data after the parity was updated?

I really would appreciate any help that I can get, as this has me fairly concerned. Thank you so much in advance!

BTW, I am running unRAID 6.0 Beta 4 if that matters.

syslog-20140401-181429.zip
April 2, 2014

Looks like a bad or loose SATA cable. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F
April 2, 2014 (Author)

Quote: Looks like a bad or loose SATA cable. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F

Thank you for the suggestion! I tried wiggling the cable and it did not fix it, so I'll try another SATA cable tomorrow. I did run a short SMART test, which reported a RAW_VALUE of 1 for "UDMA_CRC_Error_Count". According to your link, that does seem to indicate a possible SATA cable issue.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   162   161   021    Pre-fail  Always       -       6866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       535
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26911
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8282
194 Temperature_Celsius     0x0022   127   117   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

My question is, why would the drive come up as Unformatted? Does it have anything to do with UnRAID removing the disk, as seen in my syslog?
Apr 1 18:12:51 Tower emhttp: Device inventory:
Apr 1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZ20251692 (sdb) 1953514584
Apr 1 18:12:51 Tower emhttp: WDC_WD20EARS-00S8B1_WD-WCAVY3007783 (sdc) 1953514584
Apr 1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZA0659735 (sdd) 1953514584
Apr 1 18:12:51 Tower emhttp: WDC_WD20EARS-00MVWB0_WD-WMAZA1691542 (sde) 1953514584
Apr 1 18:12:51 Tower emhttp: ST3000DM001-1CH166_W1F1N7ZW (sdf) 2930266584
Apr 1 18:12:51 Tower emhttp: TOSHIBA_DT01ACA300_X3T7M0HKS (sdg) 2930266584
Apr 1 18:12:51 Tower emhttp: TOSHIBA_DT01ACA300_X3S933VGS (sdh) 2930266584
Apr 1 18:12:51 Tower emhttp: ST3000DM001-1CH166_W1F1N823 (sdi) 2930266584
Apr 1 18:12:51 Tower kernel: mdcmd (1): import 0 8,128 2930266532 ST3000DM001-1CH166_W1F1N823
Apr 1 18:12:51 Tower kernel: md: import disk0: [8,128] (sdi) ST3000DM001-1CH166_W1F1N823 size: 2930266532
Apr 1 18:12:51 Tower kernel: mdcmd (2): import 1 8,80 2930266532 ST3000DM001-1CH166_W1F1N7ZW
Apr 1 18:12:51 Tower kernel: md: import disk1: [8,80] (sdf) ST3000DM001-1CH166_W1F1N7ZW size: 2930266532
Apr 1 18:12:51 Tower emhttp: ckmbr: read: Input/output error
Apr 1 18:12:51 Tower kernel: mdcmd (3): import 2 8,96 2930266532 TOSHIBA_DT01ACA300_X3T7M0HKS
Apr 1 18:12:51 Tower kernel: md: import disk2: [8,96] (sdg) TOSHIBA_DT01ACA300_X3T7M0HKS size: 2930266532
Apr 1 18:12:51 Tower kernel: mdcmd (4): import 3 8,32 1953514552 WDC_WD20EARS-00S8B1_WD-WCAVY3007783
Apr 1 18:12:51 Tower kernel: md: import disk3: [8,32] (sdc) WDC_WD20EARS-00S8B1_WD-WCAVY3007783 size: 1953514552
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] Unhandled error code
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb]
Apr 1 18:12:51 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] CDB:
Apr 1 18:12:51 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 20 00
Apr 1 18:12:51 Tower kernel: end_request: I/O error, dev sdb, sector 0
Apr 1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 0
Apr 1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 1
Apr 1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 2
Apr 1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 3
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] Unhandled error code
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb]
Apr 1 18:12:51 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Apr 1 18:12:51 Tower kernel: sd 1:0:0:0: [sdb] CDB:
Apr 1 18:12:51 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 08 00
Apr 1 18:12:51 Tower kernel: end_request: I/O error, dev sdb, sector 0
Apr 1 18:12:51 Tower kernel: Buffer I/O error on device sdb, logical block 0
Apr 1 18:12:51 Tower kernel: mdcmd (5): import 4 0,0
Apr 1 18:12:51 Tower kernel: md: disk4 removed
Apr 1 18:12:51 Tower kernel: mdcmd (6): import 5 8,48 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA0659735
Apr 1 18:12:51 Tower kernel: md: import disk5: [8,48] (sdd) WDC_WD20EARS-00MVWB0_WD-WMAZA0659735 size: 1953514552
Apr 1 18:12:51 Tower kernel: mdcmd (7): import 6 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA1691542
Apr 1 18:12:51 Tower kernel: md: import disk6: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA1691542 size: 1953514552
Apr 1 18:12:51 Tower emhttp: shcmd (92): /usr/local/sbin/emhttp_event driver_loaded
Apr 1 18:12:51 Tower kernel: mdcmd (8): import 7 8,112 2930266532 TOSHIBA_DT01ACA300_X3S933VGS
Apr 1 18:12:51 Tower kernel: md: import disk7: [8,112] (sdh) TOSHIBA_DT01ACA300_X3S933VGS size: 2930266532

smart.txt
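A syslog excerpt like the one above can be hard to read by eye. The following is a minimal sketch, assuming the log lines keep the exact `md: import diskN: [maj,min] (sdX)` and `md: diskN removed` shapes seen in this post, that prints which array slot maps to which device and which slots came up missing. The function name `summarize_md_import` is made up for illustration.

```shell
#!/bin/sh
# Read syslog text on stdin and print one "slot device" pair per imported disk,
# plus "slot MISSING" for every slot the md driver reported as removed.
summarize_md_import() {
  awk '
    # e.g. "... kernel: md: import disk3: [8,32] (sdc) WDC_... size: 1953514552"
    /md: import disk[0-9]+:/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "import") { slot = $(i + 1); dev = $(i + 3) }
      }
      sub(/:$/, "", slot)      # "disk3:" -> "disk3"
      gsub(/[()]/, "", dev)    # "(sdc)"  -> "sdc"
      print slot, dev
    }
    # e.g. "... kernel: md: disk4 removed"
    /md: disk[0-9]+ removed/ { print $(NF - 1), "MISSING" }
  '
}

# Usage (the file name is a placeholder):
#   summarize_md_import < syslog.txt
```

Read this way, the excerpt shows disk4 as the missing slot and sdb as the only device reporting I/O errors, while sdd is imported as disk5.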
April 2, 2014

Quote: My question is, why would the drive come up as Unformatted? Does it have anything to do with UnRAID removing the disk as seen in my syslog?

This can simply mean that unRAID was unable to mount it, not that it is really unformatted. If you have had a write failure (which a disk being disabled suggests has happened), the file system on the disk can be corrupted. Typically this can be fixed by running reiserfsck against the drive in question. To do this, put the array into maintenance mode and then run a command of the following form from a console/telnet session:

reiserfsck --check /dev/md??

where ?? corresponds to the disk number in the array. The output from the check will indicate whether any problems were found and what the suggested course of action is. If you are not sure, check back here before taking any action to correct the issues reported.
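The `??` placeholder in the command above is easy to get wrong, so here is a small sketch that builds the read-only check command from an array disk number and only prints it. It assumes data disks appear as `/dev/md<N>`, as they do in this thread; `reiserfsck_check_cmd` is a hypothetical helper name, not an unRAID tool.

```shell
#!/bin/sh
# Print (do not run) the read-only check command for a given array disk number.
reiserfsck_check_cmd() {
  case "$1" in
    ''|0|*[!0-9]*)
      # Reject empty input, zero, and anything that is not a plain number.
      echo "usage: reiserfsck_check_cmd <disk number, e.g. 4>" >&2
      return 1
      ;;
  esac
  printf 'reiserfsck --check /dev/md%s\n' "$1"
}

# Usage, with the array started in maintenance mode:
#   reiserfsck_check_cmd 4     # prints the command for disk4; run it by hand
```

Printing rather than executing keeps the final decision with whoever is at the console, which matters once options that modify the disk come into play.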
April 2, 2014 (Author)

Thank you both for your replies! You've been very helpful. I ended up running the reiserfsck --check /dev/md? command in a telnet session while the array was in maintenance mode, and it appears that a problem was indeed detected. I'm hopeful that it is just a simple fix, but as itimpi suggested, I feel the need to solicit advice before proceeding, as I would love to not lose any data at all in this process. Below is a copy of the reiserfsck report:

root@Tower:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24
Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
########### reiserfsck --check started at Wed Apr 2 18:56:08 2014 ###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \block 59292607: The level of the node (54065) is not correct, (4) expected
the problem in the internal node occured (59292607), whole subtree is skipped
finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
########### reiserfsck finished at Wed Apr 2 18:59:25 2014 ###########

I see where it says "1 found corruptions can be fixed only when running with --rebuild-tree", so I'm assuming that is what I will need to do. Before I proceed, however, I just wanted to post this to get some advice so as not to screw anything up. Thank you so much in advance!
April 3, 2014 (Author)

Quote: Yes, run with rebuild-tree.

Great, thanks again for your assistance! I ran reiserfsck with the rebuild-tree option last night, which took several hours. I am not sure if it creates a log file... I think it writes it to the disk itself, but anyway, I was able to copy some of the report from my telnet session. Instead of pasting all of it here, I have just added the last bit of the report, which hopefully gives most of the details. I also attached a txt file with more of the report itself.

vpf-10260: The file we are inserting the new item (12 13 0x94b40001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) into has no StatData, insertion was skipped
100% left 0, 444 /sec
Flushing..finished
Leaves inserted item by item 222
Pass 3 (semantic):
####### Pass 3 #########
Flushing..finished
Files found: 0
Directories found: 2
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finishede 0, 0 /sec
Pass 4 - finished done 1, 0 /sec
Deleted unreachable items 2106
Flushing..finished
Syncing..finished
########### reiserfsck finished at Thu Apr 3 07:05:23 2014 ###########

I'm unsure of what step to take next. My limited knowledge prevents me from moving forward, as I don't want to break anything. I do see where it says "Deleted unreachable items 2106", but I am not sure what those items actually are, which is again why I'm afraid to do anything further at this point. I hate to keep asking, but I just want to make sure I do everything right before proceeding. Thank you so much!
April 4, 2014

1. You're correct to be cautious. Great damage can be done with reiserfsck, and it would be quick and easy to do.
2. Ask away. It may seem frustrating and slow to await a reply on the forums, but folks will help. They don't mind helping, and their advice is excellent.
3. For where you're at right now, you need one of the real reiserfsck gurus to weigh in. What's repaired is repaired, but there may be some 'lost' files. Advice is coming on what can be done.

MEANTIME: Can you POST THE FULL TELNET OR SYSLOG of your entire session, not just the last screenful? Open the Telnet window and select ALL, even the stuff that scrolled off the screen. If your Telnet tool hasn't kept the data, can you try getting the full SYSLOG? See here for ideas: Include your VERSION and SYSTEM LOG for support issues
April 4, 2014 (Author)

Quote:
1. You're correct to be cautious. Great damage can be done with reiserfsck, and it would be quick and easy to do.
2. Ask away. It may seem frustrating and slow to await a reply on the forums, but folks will help. They don't mind helping, and their advice is excellent.
3. For where you're at right now, you need one of the real reiserfsck gurus to weigh in. What's repaired is repaired, but there may be some 'lost' files. Advice is coming on what can be done.
MEANTIME: Can you POST THE FULL TELNET OR SYSLOG of your entire session, not just the last screenful? Open the Telnet window and select ALL, even the stuff that scrolled off the screen. If your Telnet tool hasn't kept the data, can you try getting the full SYSLOG? See here for ideas: Include your VERSION and SYSTEM LOG for support issues

Crap! I closed out of my telnet session because I figured I could only get what was on the screen, not knowing a select all would have snagged everything. I don't even see anything about reiserfsck in the syslog either... I'm not sure there is any way I can get that information to post now. I did go in and pull a full syslog, though, and attached it here.

I'm very hopeful that someone can help me out! I'm really boggled as to why this drive became disabled after a parity check, and I'm keeping my fingers crossed that it is possible to rebuild the drive if any data is missing. Thank you and everyone else for all the assistance!!

syslog.txt
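Losing the session output, as happened here, can be avoided by copying everything a command prints into a file while it runs. This is a minimal sketch using the standard `tee` utility; the wrapper name `run_logged` and the log path in the usage line are assumptions for illustration. Whether reiserfsck's interactive "Yes" prompt displays cleanly through a pipe has not been checked here, so a harmless trial with `--check` is worth doing first.

```shell
#!/bin/sh
# Run a command, show its output on screen, and append a copy to a log file.
# Usage: run_logged <log file> <command> [args...]
run_logged() {
  log_file="$1"
  shift
  # Record when and what was run, then stream stdout and stderr to both places.
  printf '=== %s: %s\n' "$(date)" "$*" >> "$log_file"
  "$@" 2>&1 | tee -a "$log_file"
}

# Usage (hypothetical path; on unRAID the flash drive under /boot survives a reboot,
# while the root filesystem lives in RAM):
#   run_logged /boot/reiserfsck-md4.log reiserfsck --check /dev/md4
```

Note that the exit status seen by the caller is that of `tee`, not of the wrapped command, so the log itself is the place to look for errors.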
April 4, 2014 (Author)

Quote: Run reiserfsck check again.

Thank you dgaschk, I will run reiserfsck check again once I get home from work. Thank you as well to everyone in this entire community for being so helpful. I'm not sure if this is a common issue, but I really appreciate the time everyone has taken to try and help out. I am not the most familiar with Linux, but thankfully there are people here that have been more than helpful!
April 4, 2014 (Author)

Okay, so I ran reiserfsck again on the disk, and my limited experience with the information in the report makes me believe that there is no longer anything on the drive other than 2 directories, which has me a bit scared. See the report below:

root@Tower:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24
Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
########### reiserfsck --check started at Fri Apr 4 18:54:22 2014 ###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 1
        Internal nodes 0
        Directories 2
        Other files 0
        Data block pointers 0 (0 of them are zero)
        Safe links 0
########### reiserfsck finished at Fri Apr 4 18:57:22 2014 ###########

If the data is gone, then I guess there isn't a lot I can do unless I can rebuild the drive from the parity (provided that data wasn't removed by the parity check that led to this situation). Any advice would again be incredibly welcome and greatly appreciated.
April 4, 2014

Quote: root@Tower:~# reiserfsck --check /dev/md4

Quote: If the data is gone, then I guess there isn't a lot I can do unless I can rebuild the drive from the parity

The md devices are in the parity set, so any changes made to the md devices are immediately committed to parity. However... from reading through what's transpired, I don't think you had valid parity to begin with. This is a very confused situation, where data from the bad disk was out of sync with the parity disk, as indicated by all the parity errors in the monthly check.

Is your monthly check a correcting or non-correcting check? If non-correcting, you may stand a chance of getting more data back by physically removing disk4 and doing a reiserfsck check on the virtual md4 device.
April 4, 2014 (Author)

Quote: root@Tower:~# reiserfsck --check /dev/md4

Quote: If the data is gone, then I guess there isn't a lot I can do unless I can rebuild the drive from the parity

Quote: The md devices are in the parity set, so any changes made to the md devices are immediately committed to parity. However... from reading through what's transpired, I don't think you had valid parity to begin with. This is a very confused situation, where data from the bad disk was out of sync with the parity disk, as indicated by all the parity errors in the monthly check. Is your monthly check a correcting or non-correcting check? If non-correcting, you may stand a chance of getting more data back by physically removing disk4 and doing a reiserfsck check on the virtual md4 device.

It's very strange, right? I've never had anything like this happen in the past 4 years of using this server. Honestly, I'm wondering if it has anything to do with UnRAID 6.0 Beta 4, as I upgraded literally right before the end of the night on the 31st, which would have been shortly before the scheduled monthly parity check. I was using 5.0.4 before upgrading.

To answer your question about my monthly parity check, I'm 99% positive it's a non-correcting check. In my log file from April 1st, the first line for the day says:

Apr 1 00:00:01 Tower kernel: mdcmd (52): check NOCORRECT

If that's the case, then running a reiserfsck check on the virtual md4 device as you suggested may be the best thing to do. I just want to do whatever I can so I don't lose anything, or at least lose as little as I possibly can. I just did a search on how to do this, and I am having a tough time figuring it out. I'm thinking it would be to do the check on the drive letter and not the number? Ex: Instead of /dev/md4, it would be /dev/sdd?

Thank you again for your assistance! Best community ever
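The syslog line quoted in this post is the evidence for a non-correcting check, and the same lookup can be repeated across a whole log. The sketch below assumes the `mdcmd (N): check <MODE>` line format shown above and simply reports the last word of each such line; what a correcting check logs is not shown in this thread, so the function does not try to interpret the mode. `parity_check_modes` is a made-up name.

```shell
#!/bin/sh
# Read syslog text on stdin; for every parity-check command, print the
# timestamp (first three fields) and the mode word that follows "check".
parity_check_modes() {
  awk '/mdcmd \([0-9]+\): check/ { print $1, $2, $3, $NF }'
}

# Usage (the file name is a placeholder):
#   parity_check_modes < syslog.txt
```

A line ending in NOCORRECT would match what the author reports for the April 1st check.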
April 5, 2014

Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk. Try this:

reiserfsck --rebuild-tree --scan-whole-partition /dev/md4
April 5, 2014

Quote: Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk.

But only where writes were made to the disk. Parity was out of sync, as evidenced by all the non-corrected errors. Until a correcting parity check is run, I'm pretty sure the parity disk would emulate different content than what was physically on disk4. Reiserfsck was run on the md device, which means any writes would be sent to the parity disk, but anything not written would still be out of sync with the physical disk. Since there were thousands of incorrect parity locations, it's conceivable that the content may be more recoverable from the emulated disk. I think it's worth a shot anyway.
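The point about emulated content differing from the physical disk can be illustrated with a toy calculation. This is a sketch under the assumption that single parity is a plain bitwise XOR across the data disks at each position; the byte values and the three-disk layout are invented, and `xor_bytes` is a hypothetical helper.

```shell
#!/bin/sh
# XOR together any number of small integers (stand-ins for one byte per disk).
xor_bytes() {
  acc=0
  for b in "$@"; do
    acc=$(( acc ^ b ))
  done
  echo "$acc"
}

# One byte at the same offset on three data disks (made-up values).
d1=165; d2_old=60; d3=15
parity=$(xor_bytes "$d1" "$d2_old" "$d3")

# If disk 2 is dropped, its content is emulated from parity and the other disks.
emulated_d2=$(xor_bytes "$parity" "$d1" "$d3")

# Suppose the physical disk 2 held a different byte that never reached parity.
d2_physical=61
echo "emulated=$emulated_d2 physical=$d2_physical"
```

Wherever parity was never corrected, the emulated byte reflects what parity last recorded rather than what is physically on the disk, which is why the two copies may be recoverable to different degrees.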
April 5, 2014

Quote: If that's the case, then running a reiserfsck check on the virtual md4 device as you suggested may be the best thing to do. I just want to do whatever I can so I don't lose anything, or at least lose as little as I possibly can. I just did a search on how to do this, and I am having a tough time figuring it out. I'm thinking it would be to do the check on the drive letter and not the number? Ex: Instead of /dev/md4, it would be /dev/sdd?

Whatever you do, do NOT run a check against /dev/sdd, as that would definitely mess things up. When running against raw devices you have to include the partition number (e.g. /dev/sdd1). Using the /dev/md?? type devices means unRAID handles the partition for you, but you are probably running against an emulated device rather than the real device.

At this point I must admit I am not sure whether you would be better off running against the physical disk or the /dev/md?? device, which may be emulated. Someone else may be able to recommend what to do.
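The rule in this reply (never the bare disk, always a partition or an md device) can be turned into a small guard to run before any filesystem tool. This is a sketch only: `is_fsck_target` is a hypothetical helper, and it checks the shape of the name, not whether the name belongs to the right disk. On that second point, the syslog excerpt earlier in the thread lists sdd as disk5 and sdb as the device with I/O errors, so the letter for disk4 should be confirmed against the serial number before anything is run, and letters may change between boots.

```shell
#!/bin/sh
# Succeed only for names that look like an md device or a disk partition.
is_fsck_target() {
  case "$1" in
    /dev/md[0-9]*)       return 0 ;;  # e.g. /dev/md4
    /dev/sd[a-z]*[0-9])  return 0 ;;  # e.g. /dev/sdd1 (ends in a partition number)
    *)                   return 1 ;;  # e.g. /dev/sdd (whole disk) is refused
  esac
}

# Usage:
#   is_fsck_target /dev/sdd || echo "refusing: not a partition or md device"
```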
April 5, 2014 (Author)

Thanks to everyone for all the suggestions and help! Being fairly unfamiliar with this, I'm hopeful that there is something I can do to recover at least some of my data.

Is there an easy way to figure out if a device is being emulated? Also, is it possible to access the disk in, say, Midnight Commander while it's disabled to see if there is data on the drive? I'm probably talking nonsense, but I'm just hoping to contribute to the solution of my own problem. Everything that everyone has said has been greatly appreciated, and I'm thankful for the suggestions!

Before I move forward with anything, I'm going to see if anyone else has a recommendation as well, since this seems to be a very strange and isolated incident.

You guys are the best! Thanks again for your assistance!
April 5, 2014

Quote: Is there an easy way to figure out if a device is being emulated?

Yes, if the disk is red balled, then all operations are actually being done on the emulated disk; the physical disk is not being used at all. If that is the case, then perhaps one way to attempt a better recovery would be to assign that disk as a cache drive and see if it mounts and is readable. Or pull it completely out of the box and attempt to read it using one of the Windows reiserfs utilities.
April 5, 2014

Quote: Reiserfsck was run on /dev/md4. This means parity reflects what is currently on the disk.

You are totally right, I missed the HUGE GLARING second word in the title, "Disabled". That means the physical drive hasn't been messed with yet, so maybe there is a chance to recover files off of it.
April 5, 2014

You can do both. Run the rebuild/scan-whole on md4, and you can run it on sdd1. First put the physical disk in a PC or Linux box. Run SystemRescueCD to make a copy image of the disk and run recovery on the copy image. At the same time, run reiserfsck on md4.
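For the imaging step, one common choice is GNU ddrescue, which takes an input device, an output file and an optional map file for resuming. The sketch below only prints such a command; it assumes ddrescue is available in the rescue environment, and the device name `/dev/sdX1`, the mount point `/mnt/spare` and the helper name `image_cmd` are all placeholders. A raw image is as large as the partition being copied, not as large as the files on it, so the destination needs at least that much free space.

```shell
#!/bin/sh
# Print (do not run) a ddrescue command that images a partition into a directory.
# Usage: image_cmd <source partition> <destination directory>
image_cmd() {
  src="$1"
  dest_dir="$2"
  name="$(basename "$src")"
  # Form: ddrescue <infile> <outfile> <mapfile>
  printf 'ddrescue %s %s/%s.img %s/%s.map\n' \
    "$src" "$dest_dir" "$name" "$dest_dir" "$name"
}

# Usage (placeholders; confirm the source by serial number first):
#   image_cmd /dev/sdX1 /mnt/spare
```

Recovery attempts would then target the copy rather than the original disk; how to point reiserfsck at an image is not covered in this thread and is left out here.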
April 6, 2014 (Author)

Quote: You can do both. Run the rebuild/scan-whole on md4, and you can run it on sdd1. First put the physical disk in a PC or Linux box. Run SystemRescueCD to make a copy image of the disk and run recovery on the copy image. At the same time, run reiserfsck on md4.

Awesome! Is there a reason to do both, or is it just that it's possible? I was just going to do it on md4, but if I need to run it on the sdd# that's no problem. Before I do that, I'll put the disk in my PC so I can create an image of it. I'll need to go buy a spare 2TB drive to write the image to, since I'm assuming the image will be roughly as large as the amount of data I had on there, and I don't currently have enough free space on any drive in my PC. I've never worked with SystemRescueCD, but I'm sure it will let me create the drive image on another disk. I'll also run reiserfsck on md4 while doing all that.

Just to triple-check: I'd be doing the rebuild as you suggested in this post and your previous post (also quoted below)?

reiserfsck --rebuild-tree --scan-whole-partition /dev/md4

Thanks again so much for all of your help, it's really appreciated more than you can imagine.
April 6, 2014

Yes. That is the command. You can run it while the disk is missing. Do both in case one method fails.
April 6, 2014 (Author)

Quote: Yes. That is the command. You can run it while the disk is missing. Do both in case one method fails.

Nice, this is what my plan will be. I didn't get a chance to go buy another 2TB drive to write the image to with SystemRescueCD, so I'm going to try to go after work tomorrow. I'll then create the image and run recovery on it, then run the reiserfsck rebuild on md4 and sdd1, or whatever the number is for that disk on my server. Is there an easy way to find out the actual partition number for that drive in UnRAID?

I am hoping this works, as I've been too afraid to even start the array and use any of my other disks while this issue is going on; I don't want to mess anything up more than it already might be. Thanks again for the help, and I'll be sure to report back once I've tried those things!