August 27, 2013
hi all, i had a disk (disk1) redball on me earlier and attempted a rebuild by removing it from the array, starting, stopping, and re-adding. during the rebuild, disk4 started throwing tons of errors and i lost my telnet connection to the box (connection refused when trying to reconnect). i cancelled the rebuild and stopped the array, and now unraid is showing disk4 as missing. i've tried repeating this process after rebooting/swapping ports, and the same thing happens.

i've attached a smartctl report of disk4 (ran the short test a few times) and here's the syslog from the last reboot (cut off since it just repeats the io errors after a while). i'm not sure why the Device Inventory/other bootup stuff keeps repeating, never seen that before. i believe i have the syslog from the initial problem, i'll need to go through it later though as it's over a gig with the repeating read errors.

obviously i'd like to recover all my data if possible, but if this isn't an option, so be it. i have another drive on order that should be here by Wednesday, and until then the array can stay down as it's nothing critical. open to any and all suggestions.

smartctl_possibleBadDrive.txt
August 27, 2013
If you have two failed disks, there's nothing you can do as far as rebuilding one of them goes. First, however, I'd be sure to re-seat all of your cables ... both SATA and power ... and see if the system then "sees" the drives differently. With any luck, you simply loosened the connection to disk #4 while replacing disk #1 ... and may be able to get back to the state where disk #1 is red-balled, but all the others are okay.

Is the parity on the system good? ... how long ago was your last check?

If you can't get to the state where you can do a rebuild, you'll need to replace the failed drives, do a new config, and copy the missing data from your backups.

In the event you don't have backups, then what you can do (aside from instituting a good backup process so that doesn't happen again) is to connect the failed drives to one of your Windows machines; install a utility that can read the ReiserFS disks; and copy any recoverable data back to your array. The free LinuxReader will provide you with that capability: http://www.diskinternals.com/linux-reader/
August 27, 2013 Author
last parity check was ~80 days ago i think; i've never had an issue with parity checks so i'd figure it was still good.

upon boot disk4 is fine (green ball), it's just that at a certain point in the data rebuild of disk1 (i assume when it gets to the bad block referred to in disk4's smart report) i get the read errors from disk4, and when i stop the data rebuild and array, disk4 is missing.

i don't think the redballed disk1 is bad as smartctl reports look fine for it - i think it may have redballed due to an OOM issue when testing a new plugin. however, now that i've started/cancelled a data rebuild on the drive, the data on it is toast, correct?

at this point i'm in maintenance mode (cancelled the automatic data rebuild for disk1) running reiserfsck against disk4 (after re-seating the connectors). depending on the outcome, i suppose i'll just try to put it in another linux box and see if i can view/copy files off it, or try another reiserfsck when connected to the motherboard rather than the sata controller.
August 27, 2013 Author
reiserfsck just errored out (near finish?) on disk4:

reiserfsck --check on disk4 (AOC-SASLP-MV8)

root@archive:~# reiserfsck --check /dev/md4
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Tue Aug 27 08:49:55 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/ 22 (of 22//125 (of 125|/ 34 (of 127-
The problem has occurred looks like a hardware problem. If you have bad blocks, we advise you to get a new hard drive, because once you get one bad block that the disk drive internals cannot hide from your sight,the chances of getting more are generally said to become much higher (precise statistics are unknown to us), and this disk drive is probably not expensive enough for you to you to risk your time and data on it. If you don't want to follow that follow that advice then if you have just a few bad blocks, try writing to the bad blocks and see if the drive remaps the bad blocks (that means it takes a block it has in reserve and allocates it for use for of that block number). If it cannot remap the block, use badblock option (-B) with reiserfs utils to handle this block correctly.

bread: Cannot read the block (487036335): (Input/output error).

Aborted

syslog snippet from the failure:

Aug 27 09:54:00 archive kernel: sd 0:0:3:0: [sdg] command f70870c0 timed out
Aug 27 09:54:00 archive kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 27 09:54:00 archive kernel: sas: trying to find task 0xef23ac80
Aug 27 09:54:00 archive kernel: sas: sas_scsi_find_task: aborting task 0xef23ac80
Aug 27 09:54:00 archive kernel: sas: sas_scsi_find_task: task 0xef23ac80 is aborted
Aug 27 09:54:00 archive kernel: sas: sas_eh_handle_sas_errors: task 0xef23ac80 is aborted
Aug 27 09:54:00 archive kernel: sas: ata12: end_device-0:3: cmd error handler
Aug 27 09:54:00 archive kernel: sas: ata9: end_device-0:0: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata10: end_device-0:1: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata11: end_device-0:2: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata12: end_device-0:3: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata13: end_device-0:4: dev error handler
Aug 27 09:54:00 archive kernel: ata12.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 27 09:54:00 archive kernel: ata12.00: failed command: READ FPDMA QUEUED
Aug 27 09:54:00 archive kernel: ata12.00: cmd 60/08:00:b8:ad:3c/00:00:e8:00:00/40 tag 0 ncq 4096 in
Aug 27 09:54:00 archive kernel: res 40/00:04:b8:df:30/00:00:e8:00:00/40 Emask 0x4 (timeout)
Aug 27 09:54:00 archive kernel: sas: ata14: end_device-0:5: dev error handler
Aug 27 09:54:00 archive kernel: ata12.00: status: { DRDY }
Aug 27 09:54:00 archive kernel: sas: ata15: end_device-0:6: dev error handler
Aug 27 09:54:00 archive kernel: ata12: hard resetting link
Aug 27 09:54:00 archive kernel: sas: ata16: end_device-0:7: dev error handler
Aug 27 09:54:00 archive in.telnetd[21163]: connect from 192.168.58.205 (192.168.58.205)
Aug 27 09:54:00 archive login[21164]: ROOT LOGIN on '/dev/pts/1' from 'neuromancer.straylight.lan'
Aug 27 09:54:02 archive kernel: mvsas 0000:02:00.0: Phy3 : No sig fis
Aug 27 09:54:02 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:02 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:08 archive kernel: ata12.00: qc timeout (cmd 0x27)
Aug 27 09:54:08 archive kernel: ata12.00: failed to read native max address (err_mask=0x4)
Aug 27 09:54:08 archive kernel: ata12.00: HPA support seems broken, skipping HPA handling
Aug 27 09:54:08 archive kernel: ata12.00: revalidation failed (errno=-5)
Aug 27 09:54:08 archive kernel: ata12: hard resetting link
Aug 27 09:54:10 archive kernel: mvsas 0000:02:00.0: Phy3 : No sig fis
Aug 27 09:54:10 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:14 archive kernel: drivers/scsi/mvsas/mv_sas.c 1951:Release slot [0] tag[0], task [ee9ab680]:
Aug 27 09:54:14 archive kernel: sas: sas_ata_task_done: SAS error 8a
Aug 27 09:54:14 archive kernel: ata12.00: failed to set xfermode (err_mask=0x11)
Aug 27 09:54:14 archive kernel: ata12.00: limiting speed to UDMA/133:PIO3
Aug 27 09:54:14 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:16 archive kernel: ata12: hard resetting link
Aug 27 09:54:22 archive kernel: ata12.00: qc timeout (cmd 0xec)
Aug 27 09:54:22 archive kernel: ata12.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Aug 27 09:54:22 archive kernel: ata12.00: revalidation failed (errno=-5)
Aug 27 09:54:22 archive kernel: ata12.00: disabled
Aug 27 09:54:22 archive kernel: ata12: hard resetting link
Aug 27 09:54:22 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:25 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:25 archive kernel: ata12: EH complete
Aug 27 09:54:25 archive kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Unhandled error code
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] CDB: cdb[0]=0x28: 28 00 e8 3c ad b8 00 00 08 00
Aug 27 09:54:25 archive kernel: end_request: I/O error, dev sdg, sector 3896290744
Aug 27 09:54:25 archive kernel: md: disk4 read error, sector=3896290680
Aug 27 09:54:25 archive kernel: Buffer I/O error on device md4, logical block 487036335
Aug 27 09:54:25 archive kernel: Buffer I/O error on device md4, logical block 487036335
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] READ CAPACITY(16) failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Sense not available.
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] READ CAPACITY failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Sense not available.
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Asking for cache data failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Assuming drive cache: write through
Aug 27 09:54:25 archive kernel: sdg: detected capacity change from 2000398934016 to 0
Aug 27 09:54:25 archive kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 27 09:54:57 archive last message repeated 54 times
Aug 27 09:56:00 archive last message repeated 98 times
Aug 27 09:57:02 archive last message repeated 92 times
Aug 27 09:57:42 archive last message repeated 33 times
Aug 27 09:57:45 archive in.telnetd[22013]: connect from 192.168.58.205 (192.168.58.205)
Aug 27 09:57:45 archive login[22014]: ROOT LOGIN on '/dev/pts/2' from 'neuromancer.straylight.lan'
Aug 27 09:57:48 archive kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 27 09:58:24 archive last message repeated 24 times
Aug 27 09:59:30 archive last message repeated 44 times
Aug 27 10:00:36 archive last message repeated 44 times
Aug 27 10:01:42 archive last message repeated 42 times
Aug 27 10:02:48 archive last message repeated 44 times
Aug 27 10:03:54 archive last message repeated 44 times
Aug 27 10:05:00 archive last message repeated 46 times
Aug 27 10:06:06 archive last message repeated 44 times
Aug 27 10:07:12 archive last message repeated 42 times
Aug 27 10:08:18 archive last message repeated 44 times
Aug 27 10:09:24 archive last message repeated 44 times
Aug 27 10:09:48 archive last message repeated 19 times

i can't mount disk1 or disk4, getting 'can't read superblock'. it looks like any time a bad block is attempted to be read from disk4, the drive 'drops out' of unraid and goes missing in the web ui when i stop the array, which is what is preventing the Data Rebuild for disk1 from completing. perhaps if there was a way to skip/fix these bad blocks on disk4, i could rebuild disk1 successfully? the weird thing is on a clean boot, disk4 has a green ball.

currently trying another reiserfsck on disk4 with a motherboard sata port to take the controller out of the mix
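The three error locations in the syslog snippet above (the sdg sector, the md sector, and the md4 logical block) all describe the same spot on the drive. A minimal sketch of the arithmetic, assuming 512-byte sectors and a 4096-byte filesystem block; the 64-sector difference between sdg and md4 is read off the numbers and would be consistent with a partition starting at sector 64, which the log itself does not state:

```python
# Values copied from the syslog snippet in the post above.
cdb = bytes.fromhex("28 00 e8 3c ad b8 00 00 08 00")  # "[sdg] CDB: cdb[0]=0x28: ..."
sd_sector = 3896290744   # "end_request: I/O error, dev sdg, sector ..."
md_sector = 3896290680   # "md: disk4 read error, sector=..."
fs_block = 487036335     # "Buffer I/O error on device md4, logical block ..."

SECTOR_SIZE = 512        # assumption: 512-byte logical sectors
FS_BLOCK_SIZE = 4096     # assumption: 4 KiB filesystem blocks


def read10_lba(cdb: bytes) -> int:
    """READ(10): opcode 0x28, LBA in bytes 2..5 (big-endian)."""
    return int.from_bytes(cdb[2:6], "big")


def read10_length(cdb: bytes) -> int:
    """READ(10): transfer length in sectors, bytes 7..8 (big-endian)."""
    return int.from_bytes(cdb[7:9], "big")


lba = read10_lba(cdb)
offset = sd_sector - md_sector                  # whole-disk sector minus md sector
sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE
block_from_md = md_sector // sectors_per_block  # md sector expressed in fs blocks

print("CDB LBA:", lba, "length:", read10_length(cdb), "sectors")
print("sdg - md4 offset:", offset, "sectors")
print("md sector as fs block:", block_from_md, "matches log:", block_from_md == fs_block)
```

The block number that falls out is the same one reiserfsck reported in "bread: Cannot read the block (487036335)", so the filesystem check and the kernel are complaining about one and the same 4 KiB read.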
August 27, 2013 Author
interesting... the reiserfsck --check on disk4 did not error out during 'Checking Internal Tree' while the drive was connected to the motherboard, as it did when connected to the AOC-SASLP-MV8 - the error seems to have been handled somehow? reiserfsck is still running, so we'll see how it turns out.

syslog error snippet while running reiserfsck --check when connected to the motherboard:

Aug 27 12:00:01 archive kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 27 12:00:01 archive kernel: ata5.00: irq_stat 0x40000001
Aug 27 12:00:01 archive kernel: ata5.00: failed command: READ DMA EXT
Aug 27 12:00:01 archive kernel: ata5.00: cmd 25/00:08:a0:a9:e8/00:00:df:00:00/e0 tag 0 dma 4096 in
Aug 27 12:00:01 archive kernel: res 51/40:08:a0:a9:e8/00:00:df:00:00/e0 Emask 0x9 (media error)
Aug 27 12:00:01 archive kernel: ata5.00: status: { DRDY ERR }
Aug 27 12:00:01 archive kernel: ata5.00: error: { UNC }
Aug 27 12:00:01 archive kernel: ata5.00: configured for UDMA/133
Aug 27 12:00:01 archive kernel: ata5: EH complete
Aug 27 12:01:00 archive kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 27 12:01:00 archive kernel: ata5.00: irq_stat 0x40000001
Aug 27 12:01:00 archive kernel: ata5.00: failed command: READ DMA EXT
Aug 27 12:01:00 archive kernel: ata5.00: cmd 25/00:08:b8:ad:3c/00:00:e8:00:00/e0 tag 0 dma 4096 in
Aug 27 12:01:00 archive kernel: res 51/40:08:b8:ad:3c/00:00:e8:00:00/e0 Emask 0x9 (media error)
Aug 27 12:01:00 archive kernel: ata5.00: status: { DRDY ERR }
Aug 27 12:01:00 archive kernel: ata5.00: error: { UNC }
Aug 27 12:01:00 archive kernel: ata5.00: configured for UDMA/133
Aug 27 12:01:00 archive kernel: ata5: EH complete

i'm thinking of attempting another Data Rebuild of disk1 after reiserfsck is done running on disk4, since it seems to be working better when connected to the motherboard. since i've already started/stopped a Data Rebuild on disk1 several times because of the read errors from disk4 (while it was rebuilding disk1 and connected to the AOC-SASLP-MV8), it's not like i'm losing anything, correct? if it does succeed, the best case scenario would be a data rebuild of disk4 onto a brand new drive? thoughts?
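The `cmd` lines in the motherboard syslog above can be decoded to see which sectors returned the media error. A rough sketch, assuming the usual libata layout for that field (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); `decode_lba48` is a hypothetical helper name, not part of any tool:

```python
def decode_lba48(cmd_field: str) -> dict:
    """Decode a libata 'cmd' taskfile string for an LBA48 command such as READ DMA EXT.

    Assumed field order:
      command / feature:nsect:lbal:lbam:lbah /
      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    command, low, high, device = cmd_field.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hfeature, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))

    # The low three LBA bytes come from the first group, the high three from the "hob" group.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    sectors = (hnsect << 8) | nsect
    return {
        "command": int(command, 16),
        "lba": lba,
        "sectors": sectors,
        "device": int(device, 16),
    }


# The two failed READ DMA EXT commands from the snippet above.
for field in ("25/00:08:a0:a9:e8/00:00:df:00:00/e0",
              "25/00:08:b8:ad:3c/00:00:e8:00:00/e0"):
    info = decode_lba48(field)
    print(f"cmd 0x{info['command']:02x}: LBA {info['lba']}, {info['sectors']} sectors")
```

Under that assumed layout, the second command decodes to LBA 3896290744, the same sector the kernel reported for sdg in the earlier failure on the AOC-SASLP-MV8. That would suggest the same bad spot was hit on both controllers, with the onboard port reporting a media error (UNC) instead of timing out and dropping the drive.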
August 27, 2013
There are a lot of bad sectors on Disk 4, and they take time for the disk to deal with. I suspect the SAS card was not patient enough to wait, and wrote the drive off when it was too busy to respond. Because of all the bad sectors, though, I'm very pessimistic about attempting a rebuild with Disk 4 in the system: it currently reports Reallocated_Sector_Ct=96, Current_Pending_Sector=1182, and Offline_Uncorrectable=258, with other bad attributes as well.

I'm going to take a look at your syslog, but the one thing I do recommend now is to try to copy off all important files, if possible, from both Disk 1 and Disk 4.
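As an illustration of how those three counters might be screened, here is a small sketch. The "any non-zero raw value deserves attention" rule is a common rule of thumb used here as an assumption, not a threshold taken from smartctl or from the drive vendor, and `nonzero_counters` is a hypothetical helper:

```python
# Raw values quoted in the post above for Disk 4.
disk4_raw = {
    "Reallocated_Sector_Ct": 96,
    "Current_Pending_Sector": 1182,
    "Offline_Uncorrectable": 258,
}

# Attributes where a non-zero raw count is treated as a warning sign
# (assumption: rule of thumb, not a vendor-defined threshold).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def nonzero_counters(raw_values):
    """Return the watched attributes whose raw value is above zero."""
    return {name: raw_values[name] for name in WATCHED if raw_values.get(name, 0) > 0}


flagged = nonzero_counters(disk4_raw)
for name, count in flagged.items():
    print(f"{name}: {count}")

# Pending sectors are ones the drive could not read and has not yet remapped,
# so each one is a potential read error during a rebuild.
pending_bytes = disk4_raw["Current_Pending_Sector"] * SECTOR_SIZE
print(f"pending sectors cover about {pending_bytes / 1024:.0f} KiB of the disk")
```

All three counters are flagged for this drive, and the pending count in particular lines up with the repeated read errors seen during the rebuild attempts.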
August 27, 2013 Author
thanks for the feedback. after reiserfsck completed, i can mount and read both disk4 and disk1, and with a few quick glances it looks like they still have all their data - this surprises me for disk1, as i mentioned there were multiple started/cancelled Data Rebuilds against disk1 when i initially noticed all the read errors from disk4.

i'm clearing some space elsewhere and i'll try to recover data from those disks on a separate linux machine. from there, i suppose i'll run some more tests on the rest of my disks for peace of mind, replace disk4 with a new disk, then re-add any recovered data and Parity Sync, at which point i should be good for the future. does this sound like a sane course of action?
August 27, 2013
"... this surprises me for disk1, as i mentioned there were multiple started/cancelled Data Rebuilds against disk1 when i initially noticed all the read errors from disk4."

No need for surprise => that simply means the rebuild worked correctly. Note that any rebuilt sector SHOULD have exactly the same data that was there before ... and apparently that's the way it worked.
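A toy model of why a rebuilt sector should come back identical, assuming single XOR parity across the data disks. The four-byte "sectors" and disk contents are made up for illustration, and this is a simplification rather than a description of unRAID's md driver:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up sector contents for three data disks (illustration only).
disks = {
    "disk1": b"\x11\x22\x33\x44",
    "disk2": b"\xa0\xb1\xc2\xd3",
    "disk4": b"\x0f\x1e\x2d\x3c",
}
parity = xor_blocks(list(disks.values()))

# Rebuild disk1 from parity plus every other disk.
others = [data for name, data in disks.items() if name != "disk1"]
rebuilt = xor_blocks(others + [parity])
print("rebuilt disk1 matches original:", rebuilt == disks["disk1"])

# If disk4 hands back wrong data for one byte, only that byte of the rebuild is wrong.
bad_disk4 = bytearray(disks["disk4"])
bad_disk4[2] ^= 0xFF
rebuilt_bad = xor_blocks([disks["disk2"], bytes(bad_disk4), parity])
wrong_offsets = [i for i, (x, y) in enumerate(zip(rebuilt_bad, disks["disk1"])) if x != y]
print("offsets damaged by the bad read:", wrong_offsets)
```

In this model, writing the rebuilt data over a healthy disk1 changes nothing wherever the other disks and parity were read correctly, and any damage is confined to the positions where another disk returned bad data.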
August 27, 2013 Author
"No need for surprise => that simply means the rebuild worked correctly. Note that any rebuilt sector SHOULD have exactly the same data that was there before ... and apparently that's the way it worked."

i suppose so - i figured the read errors while rebuilding the disk and/or the cancelling of the rebuild would have a negative effect. i'm backing up the data from disk1 now, so i'll test a few files later on to make sure they're still good.
August 27, 2013
Syslog shows basically what you already knew. At 1 hour and 5 minutes into the rebuild, Disk 4 took too long to respond for the SAS card's error handling, and the drive was marked as disabled, so all of the subsequent errors can be ignored. The drive was no longer considered present.

It's great that you can see and read from both Disk 1 and Disk 4, although I cannot guarantee total file integrity on either, especially Disk 4. Trying to rebuild a good Disk 1 with a bad Disk 4, though, could possibly have caused a bit of corruption on Disk 1. I'd leave Disk 1 alone, and test the files as much as you can. I'm hopeful that the damage is very small, if any. Once you feel you can trust the rest of the array, consider attempting to rebuild Disk 4 on a replacement.

"i'm not sure why the Device Inventory/other bootup stuff keeps repeating, never seen that before."

That happens with v5 releases now, when the web interface is refreshed. If you refresh that page, Tom assumes that you *might* have attached or detached a drive, so he unloads UnRAID, then reloads it and checks the inventory of drives for any changes. You normally would only see a few of those cycles, but you are running some plugin (possibly a certain version of SimpleFeatures) that refreshes the screen every 5 seconds, so that is why there are so many of them in your syslog.
August 27, 2013 Author
i'm currently copying the data off disk1/disk4, but i'd rather review it for a while first. can i just unassign disk1/disk4 from the array, reinitialize (webGui > Utils > New Config) and parity sync to shrink the array and keep the data on the other disks? that way i can add new drives as i get them and re-add the data after reviewing it.