[SOLVED] drive redballed / rebuilding shows many read errors from another drive

August 27, 2013

hi all,

i had a disk (disk1) redball on me earlier and attempted a rebuild by removing it from the array, starting, stopping, and re-adding. during the rebuild, disk4 started throwing tons of errors and i lost my telnet connection to the box (connection refused trying to reconnect). i cancelled the rebuild and stopped the array, and now unraid is showing disk4 as missing. i've tried repeating this process after rebooting/swapping ports, same thing happens.

i've attached a smartctl report of disk4 (ran the short test a few times) and here's the syslog from the last reboot (cut off since it just repeats the io errors after a while). i'm not sure why the Device Inventory/other bootup stuff keeps repeating, never seen that before. i believe i have the syslog from the initial problem, i'll need to go through it later though as it's over a gig with the repeating read errors.

obviously i'd like to recover all my data if possible, but if this isn't an option, so be it. i have another drive on order that should be here by Wednesday, and until then the array can stay down as it's nothing critical. open to any and all suggestions.

smartctl_possibleBadDrive.txt

August 27, 2013

If you have two failed disks, there's nothing you can do as far as rebuilding either one.

First, however, I'd be sure to re-seat all of your cables ... both SATA and power ... and see if the system then "sees" the drives differently. With any luck, you simply loosened the connection to disk #4 while replacing disk #1 ... and may be able to get back to the state where disk #1 is red-balled, but all the others are okay.

Is the parity on the system good? ... how long ago was your last check?

If you can't get to the state where you can do a rebuild, you'll need to replace the failed drives, do a new config, and copy the missing data from your backups.

In the event you don't have backups, then what you can do (aside from instituting a good backup process so that doesn't happen again) is to connect the failed drives to one of your Windows machines; install a file system driver that can read the ReiserFS disks; and copy any recoverable data back to your array. The free LinuxReader will provide you with that capability: http://www.diskinternals.com/linux-reader/

August 27, 2013

Author

last parity check was ~80 days ago i think; i've never had an issue with parity checks so i'd figure it was still good.

upon boot disk4 is fine (green ball), it's just that at a certain point in the data rebuild of disk1 (i assume when it gets to the bad block referred to in disk4's smart report) i get the read errors from disk4 and when i stop the data rebuild and array, disk4 is missing.

i don't think the redballed disk1 is bad as smartctl reports look fine for it - i think it may have redballed due to an OOM issue when testing a new plugin, however now that i've started/cancelled a data rebuild on the drive the data on it is toast, correct?

at this point i'm in maintenance mode (cancelled the automatic data rebuild for disk1) running reiserfsck against disk4 (after re-seating the connectors). depending on the outcome, i suppose i'll just try to put it in another linux box and see if i can view/copy files off it, or try another reiserfsck when connected to the motherboard rather than the sata controller.

August 27, 2013

Author

reiserfsck just errored out (near finish?) on disk4:

reiserfsck --check on disk4 (AOC-SASLP-MV8)

root@archive:~# reiserfsck --check /dev/md4
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Tue Aug 27 08:49:55 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed



Checking internal tree.. \/ 22 (of  22//125 (of 125|/ 34 (of 127-
The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (487036335): (Input/output error).

Aborted

syslog snippet from the failure:

Aug 27 09:54:00 archive kernel: sd 0:0:3:0: [sdg] command f70870c0 timed out
Aug 27 09:54:00 archive kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 27 09:54:00 archive kernel: sas: trying to find task 0xef23ac80
Aug 27 09:54:00 archive kernel: sas: sas_scsi_find_task: aborting task 0xef23ac80
Aug 27 09:54:00 archive kernel: sas: sas_scsi_find_task: task 0xef23ac80 is aborted
Aug 27 09:54:00 archive kernel: sas: sas_eh_handle_sas_errors: task 0xef23ac80 is aborted
Aug 27 09:54:00 archive kernel: sas: ata12: end_device-0:3: cmd error handler
Aug 27 09:54:00 archive kernel: sas: ata9: end_device-0:0: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata10: end_device-0:1: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata11: end_device-0:2: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata12: end_device-0:3: dev error handler
Aug 27 09:54:00 archive kernel: sas: ata13: end_device-0:4: dev error handler
Aug 27 09:54:00 archive kernel: ata12.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 27 09:54:00 archive kernel: ata12.00: failed command: READ FPDMA QUEUED
Aug 27 09:54:00 archive kernel: ata12.00: cmd 60/08:00:b8:ad:3c/00:00:e8:00:00/40 tag 0 ncq 4096 in
Aug 27 09:54:00 archive kernel:          res 40/00:04:b8:df:30/00:00:e8:00:00/40 Emask 0x4 (timeout)
Aug 27 09:54:00 archive kernel: sas: ata14: end_device-0:5: dev error handler
Aug 27 09:54:00 archive kernel: ata12.00: status: { DRDY }
Aug 27 09:54:00 archive kernel: sas: ata15: end_device-0:6: dev error handler
Aug 27 09:54:00 archive kernel: ata12: hard resetting link
Aug 27 09:54:00 archive kernel: sas: ata16: end_device-0:7: dev error handler
Aug 27 09:54:00 archive in.telnetd[21163]: connect from 192.168.58.205 (192.168.58.205)
Aug 27 09:54:00 archive login[21164]: ROOT LOGIN  on '/dev/pts/1' from 'neuromancer.straylight.lan'
Aug 27 09:54:02 archive kernel: mvsas 0000:02:00.0: Phy3 : No sig fis
Aug 27 09:54:02 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:02 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:08 archive kernel: ata12.00: qc timeout (cmd 0x27)
Aug 27 09:54:08 archive kernel: ata12.00: failed to read native max address (err_mask=0x4)
Aug 27 09:54:08 archive kernel: ata12.00: HPA support seems broken, skipping HPA handling
Aug 27 09:54:08 archive kernel: ata12.00: revalidation failed (errno=-5)
Aug 27 09:54:08 archive kernel: ata12: hard resetting link
Aug 27 09:54:10 archive kernel: mvsas 0000:02:00.0: Phy3 : No sig fis
Aug 27 09:54:10 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:14 archive kernel: drivers/scsi/mvsas/mv_sas.c 1951:Release slot [0] tag[0], task [ee9ab680]:
Aug 27 09:54:14 archive kernel: sas: sas_ata_task_done: SAS error 8a
Aug 27 09:54:14 archive kernel: ata12.00: failed to set xfermode (err_mask=0x11)
Aug 27 09:54:14 archive kernel: ata12.00: limiting speed to UDMA/133:PIO3
Aug 27 09:54:14 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:16 archive kernel: ata12: hard resetting link
Aug 27 09:54:22 archive kernel: ata12.00: qc timeout (cmd 0xec)
Aug 27 09:54:22 archive kernel: ata12.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Aug 27 09:54:22 archive kernel: ata12.00: revalidation failed (errno=-5)
Aug 27 09:54:22 archive kernel: ata12.00: disabled
Aug 27 09:54:22 archive kernel: ata12: hard resetting link
Aug 27 09:54:22 archive kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Aug 27 09:54:25 archive kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Aug 27 09:54:25 archive kernel: ata12: EH complete
Aug 27 09:54:25 archive kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Unhandled error code
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] CDB: cdb[0]=0x28: 28 00 e8 3c ad b8 00 00 08 00
Aug 27 09:54:25 archive kernel: end_request: I/O error, dev sdg, sector 3896290744
Aug 27 09:54:25 archive kernel: md: disk4 read error, sector=3896290680
Aug 27 09:54:25 archive kernel: Buffer I/O error on device md4, logical block 487036335
Aug 27 09:54:25 archive kernel: Buffer I/O error on device md4, logical block 487036335
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] READ CAPACITY(16) failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Sense not available.
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] READ CAPACITY failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Sense not available.
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Asking for cache data failed
Aug 27 09:54:25 archive kernel: sd 0:0:3:0: [sdg] Assuming drive cache: write through
Aug 27 09:54:25 archive kernel: sdg: detected capacity change from 2000398934016 to 0
Aug 27 09:54:25 archive kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 27 09:54:57 archive last message repeated 54 times
Aug 27 09:56:00 archive last message repeated 98 times
Aug 27 09:57:02 archive last message repeated 92 times
Aug 27 09:57:42 archive last message repeated 33 times
Aug 27 09:57:45 archive in.telnetd[22013]: connect from 192.168.58.205 (192.168.58.205)
Aug 27 09:57:45 archive login[22014]: ROOT LOGIN  on '/dev/pts/2' from 'neuromancer.straylight.lan'
Aug 27 09:57:48 archive kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Aug 27 09:58:24 archive last message repeated 24 times
Aug 27 09:59:30 archive last message repeated 44 times
Aug 27 10:00:36 archive last message repeated 44 times
Aug 27 10:01:42 archive last message repeated 42 times
Aug 27 10:02:48 archive last message repeated 44 times
Aug 27 10:03:54 archive last message repeated 44 times
Aug 27 10:05:00 archive last message repeated 46 times
Aug 27 10:06:06 archive last message repeated 44 times
Aug 27 10:07:12 archive last message repeated 42 times
Aug 27 10:08:18 archive last message repeated 44 times
Aug 27 10:09:24 archive last message repeated 44 times
Aug 27 10:09:48 archive last message repeated 19 times

i can't mount disk1 or disk4, getting 'can't read superblock'. it looks like any time a read hits a bad block on disk4, the drive 'drops out' of unraid and shows as missing in the web ui once i stop the array, which is what keeps the Data Rebuild for disk1 from completing. perhaps if there was a way to skip/fix these bad blocks on disk4, i could rebuild disk1 successfully? the weird thing is that on a clean boot, disk4 has a green ball.

currently trying another reiserfsck on disk4 with a motherboard sata port to take the controller out of the mix
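
The sector and block numbers in the syslog snippet above appear to describe the same spot on the disk. A minimal Python sketch, assuming 512-byte logical sectors and 4 KiB filesystem blocks (the 4096-byte request and the 8-sector transfer length suggest this, but it is an assumption), decodes the READ(10) CDB from the log and relates it to the md sector and to the block reiserfsck could not read:

```python
# Values copied from the syslog snippet and the reiserfsck output.
CDB = "28 00 e8 3c ad b8 00 00 08 00"  # "[sdg] CDB: cdb[0]=0x28: ..."
MD_SECTOR = 3896290680                  # "md: disk4 read error, sector=..."
FS_BLOCK = 487036335                    # "bread: Cannot read the block (...)"

SECTOR_BYTES = 512      # assumption: 512-byte logical sectors
FS_BLOCK_BYTES = 4096   # assumption: 4 KiB filesystem blocks


def read10_lba_and_count(cdb_hex):
    """Return (starting LBA, sector count) from a SCSI READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, count


lba, count = read10_lba_and_count(CDB)
sectors_per_block = FS_BLOCK_BYTES // SECTOR_BYTES
partition_offset = lba - MD_SECTOR           # sdg sector minus md4 sector
md_sector_from_block = FS_BLOCK * sectors_per_block

print("device sector:", lba)
print("sectors requested:", count)
print("offset between sdg and md4 (sectors):", partition_offset)
print("md sector computed from fs block:", md_sector_from_block)
```

Under those assumptions the CDB decodes to device sector 3896290744, which sits 64 sectors above the md sector, and block 487036335 times 8 gives md sector 3896290680. In other words, the kernel, md, and reiserfsck messages all point at one unreadable 4 KiB block rather than several unrelated failures.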

August 27, 2013

Author

interesting... the reiserfsck --check on disk4 did not error out during 'Checking Internal Tree' while the drive was connected to the motherboard as it did when connected to the AOC-SASLP-MV8 - the error seems to have been handled somehow? reiserfsck is still running, so we'll see how it turns out.

syslog error snippet while running reiserfsck --check when connected to the motherboard:

Aug 27 12:00:01 archive kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 27 12:00:01 archive kernel: ata5.00: irq_stat 0x40000001
Aug 27 12:00:01 archive kernel: ata5.00: failed command: READ DMA EXT
Aug 27 12:00:01 archive kernel: ata5.00: cmd 25/00:08:a0:a9:e8/00:00:df:00:00/e0 tag 0 dma 4096 in
Aug 27 12:00:01 archive kernel:          res 51/40:08:a0:a9:e8/00:00:df:00:00/e0 Emask 0x9 (media error)
Aug 27 12:00:01 archive kernel: ata5.00: status: { DRDY ERR }
Aug 27 12:00:01 archive kernel: ata5.00: error: { UNC }
Aug 27 12:00:01 archive kernel: ata5.00: configured for UDMA/133
Aug 27 12:00:01 archive kernel: ata5: EH complete
Aug 27 12:01:00 archive kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 27 12:01:00 archive kernel: ata5.00: irq_stat 0x40000001
Aug 27 12:01:00 archive kernel: ata5.00: failed command: READ DMA EXT
Aug 27 12:01:00 archive kernel: ata5.00: cmd 25/00:08:b8:ad:3c/00:00:e8:00:00/e0 tag 0 dma 4096 in
Aug 27 12:01:00 archive kernel:          res 51/40:08:b8:ad:3c/00:00:e8:00:00/e0 Emask 0x9 (media error)
Aug 27 12:01:00 archive kernel: ata5.00: status: { DRDY ERR }
Aug 27 12:01:00 archive kernel: ata5.00: error: { UNC }
Aug 27 12:01:00 archive kernel: ata5.00: configured for UDMA/133
Aug 27 12:01:00 archive kernel: ata5: EH complete

i'm thinking of attempting another Data Rebuild of disk1 after reiserfsck is done running on disk4, since it seems to be working better when connected to the motherboard. since i've already started/stopped a Data Rebuild on disk1 several times because of the read errors from disk4 (while it was rebuilding disk1 and connected to the AOC-SASLP-MV8), it's not like i'm losing anything, correct? if it does succeed, the best case scenario would be a data rebuild of disk4 onto a brand new drive?

thoughts?
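
The failing sectors can be read out of the "cmd" lines in the motherboard syslog snippet. A rough Python sketch follows, assuming the field order libata prints for a 48-bit command is command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device (that layout is my understanding of the kernel's error report, not something stated in this thread); the function name is made up for illustration:

```python
def lba48_from_cmd_field(cmd_field):
    """Decode the 48-bit LBA from a libata 'cmd' field such as
    '25/00:08:b8:ad:3c/00:00:e8:00:00/e0'.

    Assumed layout:
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    parts = cmd_field.split("/")
    low = parts[1].split(":")    # feature, nsect, lbal, lbam, lbah
    high = parts[2].split(":")   # hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah
    # Most significant byte first: hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal
    hex_lba = "".join([high[4], high[3], high[2], low[4], low[3], low[2]])
    return int(hex_lba, 16)


# The two failed READ DMA EXT commands from the snippet above.
first = lba48_from_cmd_field("25/00:08:a0:a9:e8/00:00:df:00:00/e0")
second = lba48_from_cmd_field("25/00:08:b8:ad:3c/00:00:e8:00:00/e0")

print("first failed LBA:", first)
print("second failed LBA:", second)
```

Decoded this way, the second command targets sector 3896290744, the same sector that timed out while the drive was on the AOC-SASLP-MV8. On the motherboard port the drive reports it as an uncorrectable media error (UNC) and error handling completes, instead of the drive dropping off the bus.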

August 27, 2013

There are a lot of bad sectors on Disk 4, and they do take time for the disk to deal with. I suspect the SAS card was not patient enough to wait, and wrote the drive off when it was too busy to respond. Because of all the bad sectors though, I'm very pessimistic that a rebuild can succeed with Disk 4 in the system: it currently shows Reallocated_Sector_Ct=96, Current_Pending_Sector=1182, and Offline_Uncorrectable=258, with other bad attributes as well. I'm going to take a look at your syslog, but the one thing I do recommend now is to try and copy off all important files, if possible, from both Disk 1 and Disk 4.
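
For anyone wanting to pull those three counters out of a saved report, here is a minimal Python sketch. It assumes the usual `smartctl -A` table layout, where each attribute row starts with a numeric ID and the raw value is the tenth column. The sample rows are placeholders: only the three raw values come from this thread, and every other column is shown as "-" because the attached report is not reproduced here.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_raw_values(smartctl_text):
    """Map attribute name -> raw value for rows that look like attribute rows."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID first, raw value in column 10 (index 9).
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def nonzero_watched(values):
    """Return the watched attributes whose raw value is above zero."""
    return {name: values[name] for name in WATCHED if values.get(name, 0) > 0}


# Placeholder rows: raw values from the thread, other columns unknown ("-").
SAMPLE = """\
  5 Reallocated_Sector_Ct   - - - - - - - 96
197 Current_Pending_Sector  - - - - - - - 1182
198 Offline_Uncorrectable   - - - - - - - 258
"""

flagged = nonzero_watched(parse_raw_values(SAMPLE))
for name, raw in flagged.items():
    print(f"{name}: raw value {raw}")
```

On a healthy drive all three raw values are normally zero, so any nonzero value here is worth a closer look; what counts as "too many" is a judgment call and not something this sketch decides.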

August 27, 2013

Author

thanks for the feedback.

after reiserfsck completed, i can mount and read both disk4 and disk1, and with a few quick glances it looks like they still have all their data - this surprises me for disk1, as i mentioned there were multiple started/cancelled Data Rebuilds against disk1 when i initially noticed all the read errors from disk4. i'm clearing some space elsewhere and i'll try to recover data from those disks on a separate linux machine.

from there, i suppose i'll run some more tests on the rest of my disks for peace of mind, replace disk4 with a new disk, then re-add any recovered data and Parity Sync, at which point i should be good for the future.

does this sound like a sane course of action?

August 27, 2013

... this surprises me for disk1, as i mentioned there were multiple started/cancelled Data Rebuilds against disk1 when i initially noticed all the read errors from disk4.

No need for surprise => that simply means the rebuild worked correctly. Note that any rebuilt sector SHOULD have exactly the same data that was there before ... and apparently that's the way it worked.
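
A toy illustration of why that is, assuming single XOR parity (which is how unRAID's parity is generally described). The disk contents are made up and each "disk" is a single 4-byte block:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents for three data disks.
disk1 = b"AAAA"
disk2 = b"BBBB"
disk4 = b"DDDD"
parity = xor_blocks([disk1, disk2, disk4])

# Rebuilding disk1 means XOR-ing parity with every other data disk.
rebuilt_disk1 = xor_blocks([parity, disk2, disk4])
print(rebuilt_disk1 == disk1)   # True: the rebuild rewrites the same bytes

# If disk4 hands back wrong data for this block, the result no longer matches.
bad_disk4 = b"D?DD"
damaged_disk1 = xor_blocks([parity, disk2, bad_disk4])
print(damaged_disk1 == disk1)   # False: that block of disk1 would be corrupted
```

So a rebuild onto a disk whose data was already intact is harmless wherever every other disk reads correctly; only the blocks that line up with unreadable or wrong data on another disk are at risk.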

August 27, 2013

Author

No need for surprise => that simply means the rebuild worked correctly. Note that any rebuilt sector SHOULD have exactly the same data that was there before ... and apparently that's the way it worked.

i suppose so - i figured the read errors while rebuilding the disk and/or the cancelling of the rebuild would have a negative effect. i'm backing up the data from disk1 now, so i'll test a few files later on to make sure they're still good.
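
One way to do that kind of spot check, sketched in Python: hash each recovered file and compare it with a known-good copy. This only helps for files that have a trustworthy copy somewhere else, and the directory arguments are hypothetical placeholders, not paths from this system.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(recovered_root, reference_root):
    """List relative paths under recovered_root that are missing from
    reference_root or whose contents differ from the reference copy."""
    recovered_root = Path(recovered_root)
    reference_root = Path(reference_root)
    mismatches = []
    for path in sorted(recovered_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(recovered_root)
        reference = reference_root / relative
        if not reference.is_file() or file_sha256(path) != file_sha256(reference):
            mismatches.append(str(relative))
    return mismatches
```

Calling `compare_trees("/path/to/recovered", "/path/to/known_good")` (both paths hypothetical) returns the files worth inspecting by hand. Files with no reference copy show up as mismatches too, so an empty list only means that everything present on both sides is identical.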

August 27, 2013

Syslog shows basically what you already knew. At 1 hour and 5 minutes into the rebuild, Disk 4 timed out for too long for the SAS card's error handling, and the drive was marked as disabled, so all of the subsequent errors can be ignored. The drive was no longer considered present.

It's great that you can see and read from both Disk 1 and Disk 4, although I cannot guarantee total file integrity on either, especially Disk 4. Trying to rebuild a good Disk 1 with a bad Disk 4 though could possibly have caused a bit of corruption on Disk 1. I'd leave Disk 1 alone, and test the files as much as you can. I'm hopeful that damage should be very small, if any. Once you feel you can trust the rest of the array, consider attempting to rebuild Disk 4 on a replacement.

i'm not sure why the Device Inventory/other bootup stuff keeps repeating, never seen that before.

That happens with v5 releases now, when the web interface is refreshed. If you refresh that page, Tom assumes that you *might* have attached or detached a drive, so he unloads UnRAID then reloads it and checks the inventory of drives for any changes. You normally would only see a few of those cycles, but you are running some plugin (possibly a certain version of SimpleFeatures) that refreshes the screen every 5 seconds, so that is why there are so many of them in your syslog.

August 27, 2013

Author

i'm currently copying the data off disk1/disk4, but i'd rather review it for a while first. can i just unassign disk1/disk4 from the array, reinitialize (webGui > Utils > New Config) and parity sync to shrink the array and keep the data on the other disks? that way i can add new drives as i get them and re-add the data after reviewing it.

August 28, 2013

Author

thanks for the info and advice guys.
