READONLY mounting? parity bad, SAS fail during CHECK after UPGRADE_DISK stopped



While upgrading a 750GB IDE drive to a 3TB SATA (UPGRADE_DISK) my SAS card went offline along with eight drives. A spontaneous parity CHECK then corrupted the parity drive. I'd like to rebuild parity, but with the data disks mounted read-only. Is this possible?

 

--------------------------------------

 

The 3TB drive was precleared beforehand and was attached to the SAS controller; the parity drive was not on that controller. My unRAID version is 5.0-rc16c.

 

It appears that the upgrade terminated way too early, possibly because unmenu restarted, and then a CHECK started and wrote what must be garbage to the parity drive. I wasn't watching the system or interacting with the web GUIs while these things occurred.

 

unRAID made about 17,000 writes to the parity drive, so parity can no longer be trusted to rebuild the old drive's data (750GB) onto the new drive (3TB).

 

The rebuild log looks like this:

Jan 21 18:45:53 Tower emhttp: Spinning up all drives... (Other emhttp)
Jan 21 18:45:53 Tower kernel: mdcmd (46): spinup 0 (Routine)
Jan 21 18:45:53 Tower kernel: mdcmd (47): spinup 1 (Routine)
Jan 21 18:45:53 Tower kernel: mdcmd (48): spinup 2 (Routine)
<snip>
Jan 21 18:45:53 Tower kernel: mdcmd (62): spinup 16 (Routine)
Jan 21 18:45:53 Tower kernel: mdcmd (63): spinup 17 (Routine)
Jan 21 18:45:53 Tower emhttp: writing GPT on disk (sdo), with partition 1 offset 64, erased: 0 (Drive related)
Jan 21 18:45:53 Tower emhttp: shcmd (75): sgdisk -Z /dev/sdo $stuff$> /dev/null (Drive related)
Jan 21 18:45:54 Tower emhttp: shcmd (76): sgdisk -o -a 64 -n 1:64:0 /dev/sdo |$stuff$ logger (Drive related)
Jan 21 18:45:54 Tower kernel:  sdo: sdo1 (Drive related)
Jan 21 18:45:56 Tower logger: Creating new GPT entries.
Jan 21 18:45:56 Tower logger: The operation has completed successfully.
Jan 21 18:45:56 Tower emhttp: shcmd (77): udevadm settle (Other emhttp)
Jan 21 18:45:56 Tower kernel:  sdo: sdo1 (Drive related)
Jan 21 18:45:56 Tower emhttp: Start array... (Other emhttp)
Jan 21 18:45:56 Tower kernel: mdcmd (64): start UPGRADE_DISK (unRAID engine)
Jan 21 18:45:56 Tower kernel: unraid: allocating 95460K for 1280 stripes (18 disks)
Jan 21 18:45:56 Tower kernel: md1: running, size: 1465138552 blocks (Drive related)
Jan 21 18:45:56 Tower kernel: md2: running, size: 1465138552 blocks (Drive related)
<snip>
Jan 21 18:46:00 Tower kernel: REISERFS (device md17): journal params: device md17, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 (Routine)
Jan 21 18:46:00 Tower kernel: REISERFS (device md17): checking transaction log (md17) (Routine)
Jan 21 18:46:01 Tower kernel: REISERFS (device md17): Using r5 hash to sort names (Routine)
Jan 21 18:53:22 Tower unmenu-status: Exiting unmenu web-server, exit status code = 141
Jan 21 18:53:22 Tower unmenu-status: Starting unmenu web-server
Jan 21 18:54:42 Tower emhttp: resized: /mnt/disk17 (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (146): chmod 777 '/mnt/disk17' (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (147): chown nobody:users '/mnt/disk17' (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (148): mkdir /mnt/user (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (149): /usr/local/sbin/shfs /mnt/user -disks 16777214 -o noatime,big_writes,allow_other -o remember=0  |$stuff$ logger (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (150): crontab -c /etc/cron.d -d $stuff$> /dev/null (Other emhttp)
Jan 21 18:54:42 Tower emhttp: shcmd (151): /usr/local/sbin/emhttp_event disks_mounted (Other emhttp)
Jan 21 18:54:42 Tower emhttp_event: disks_mounted (Other emhttp)

 

It can't possibly have rebuilt 750GB of data in under 9 minutes. Perhaps unmenu restarting about 1 minute before the rebuild ended caused the rebuild to stop prematurely.
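As a rough sanity check on that, here is a small sketch of the throughput a full rebuild would have needed. The two times are copied from the log above (the `start UPGRADE_DISK` line and the `disks_mounted` line); treating the old drive as a nominal 750 * 10^9 bytes is my assumption.

```python
from datetime import datetime

# Times of day copied from the log above.
start = datetime.strptime("18:45:56", "%H:%M:%S")  # mdcmd start UPGRADE_DISK
end = datetime.strptime("18:54:42", "%H:%M:%S")    # emhttp_event disks_mounted

# Assumption: nominal drive capacity in decimal units.
OLD_DISK_BYTES = 750 * 10**9

elapsed_s = (end - start).total_seconds()
required_mb_per_s = OLD_DISK_BYTES / elapsed_s / 10**6

print(f"elapsed: {elapsed_s:.0f} s")
print(f"required rate: {required_mb_per_s:.0f} MB/s")
```

That works out to roughly 1.4 GB/s sustained, far beyond what a single hard drive can write.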

 

The next log entries show a parity check commencing, then an hour later the SAS card stops working and parity is trashed:

 

Jan 21 18:54:42 Tower kernel: mdcmd (65): check CORRECT (unRAID engine)
Jan 21 18:54:42 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Jan 21 18:54:42 Tower kernel: md: recovery thread rebuilding disk17 ... (unRAID engine)
Jan 21 18:54:42 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks. (unRAID engine)
Jan 21 18:54:43 Tower emhttp: shcmd (152): :>/etc/samba/smb-shares.conf (Other emhttp)
Jan 21 18:54:44 Tower emhttp: Restart SMB... (Other emhttp)
Jan 21 18:54:44 Tower emhttp: shcmd (153): killall -HUP smbd (Minor Issues)
Jan 21 18:54:44 Tower emhttp: shcmd (154): ps axc | grep -q rpc.mountd (Other emhttp)
Jan 21 18:54:44 Tower emhttp: _shcmd: shcmd (154): exit status: 1 (Other emhttp)
Jan 21 18:54:44 Tower emhttp: shcmd (155): /usr/local/sbin/emhttp_event svcs_restarted (Other emhttp)
Jan 21 18:54:44 Tower emhttp_event: svcs_restarted (Other emhttp)
Jan 21 19:42:01 Tower crond[1239]: failed parsing crontab for user root: cron=""  (Minor Issues)
Jan 21 19:55:45 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1957:Release slot [2] tag[2], task [f746ee00]: (System)
Jan 21 19:55:45 Tower kernel: sas: sas_ata_task_done: SAS error 8a (Errors)
Jan 21 19:55:45 Tower kernel: sd 1:0:0:0: [sdl] command f7601cc0 timed out (Drive related)
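To put the block counts in those log lines into more familiar units, here is a minimal sketch. It assumes the md driver reports sizes in 1 KiB blocks, which I have not confirmed; the two counts are copied from the `md1: running` and `using 1536k window` lines.

```python
# Assumption: the md driver reports sizes in 1 KiB blocks.
BLOCK_BYTES = 1024


def blocks_to_tb(blocks: int) -> float:
    """Convert a block count to decimal terabytes (10**12 bytes)."""
    return blocks * BLOCK_BYTES / 10**12


recovery_blocks = 2930266532  # "over a total of 2930266532 blocks"
md1_blocks = 1465138552       # "md1: running, size: 1465138552 blocks"

print(f"recovery span: {blocks_to_tb(recovery_blocks):.2f} TB")
print(f"md1 size:      {blocks_to_tb(md1_blocks):.2f} TB")
```

If the 1 KiB assumption holds, the recovery thread's total matches the size of the new 3TB drive.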

 

By the time I actually checked on the rebuild progress, the log was full of errors like the ones in this snippet:

 


Jan 21 19:56:14 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf746e500 is aborted (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata11: end_device-1:0: cmd error handler (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata16: end_device-1:5: cmd error handler (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata18: end_device-1:7: cmd error handler (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata17: end_device-1:6: cmd error handler (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata14: end_device-1:3: cmd error handler (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata11: end_device-1:0: dev error handler (Drive related)
Jan 21 19:56:14 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 (Errors)
Jan 21 19:56:14 Tower kernel: ata11.00: failed command: READ DMA EXT (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata11.00: cmd 25/00:00:78:e8:17/00:02:17:00:00/e0 tag 0 dma 262144 in (Drive related)
Jan 21 19:56:14 Tower kernel:          res 01/04:00:77:e8:17/00:00:17:00:00/e0 Emask 0x12 (ATA bus error) (Errors)
Jan 21 19:56:14 Tower kernel: ata11.00: status: { ERR } (Drive related)
Jan 21 19:56:14 Tower kernel: ata11.00: error: { ABRT } (Errors)
Jan 21 19:56:14 Tower kernel: sas: ata12: end_device-1:1: dev error handler (Drive related)
Jan 21 19:56:14 Tower kernel: ata11: hard resetting link (Minor Issues)
Jan 21 19:56:14 Tower kernel: sas: ata13: end_device-1:2: dev error handler (Drive related)
Jan 21 19:56:14 Tower kernel: sas: ata14: end_device-1:3: dev error handler (Drive related)
Jan 21 19:56:14 Tower kernel: ata14.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen (Errors)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: cmd 61/00:00:d0:de:17/02:00:17:00:00/40 tag 0 ncq 262144 out (Drive related)
Jan 21 19:56:14 Tower kernel:          res 40/00:10:10:27:11/00:00:17:00:00/40 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:14 Tower kernel: ata14.00: status: { DRDY } (Drive related)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: cmd 61/00:00:d0:e0:17/02:00:17:00:00/40 tag 1 ncq 262144 out (Drive related)
Jan 21 19:56:14 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:14 Tower kernel: ata14.00: status: { DRDY } (Drive related)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: cmd 61/00:00:d0:e2:17/02:00:17:00:00/40 tag 2 ncq 262144 out (Drive related)
Jan 21 19:56:14 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:14 Tower kernel: ata14.00: status: { DRDY } (Drive related)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: cmd 61/a8:00:d0:e4:17/01:00:17:00:00/40 tag 3 ncq 217088 out (Drive related)
Jan 21 19:56:14 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:14 Tower kernel: ata14.00: status: { DRDY } (Drive related)
Jan 21 19:56:14 Tower kernel: ata14: hard resetting link (Minor Issues)
Jan 21 19:56:16 Tower kernel: sas: ata15: end_device-1:4: dev error handler (Drive related)
Jan 21 19:56:16 Tower kernel: sas: ata16: end_device-1:5: dev error handler (Drive related)
Jan 21 19:56:16 Tower kernel: ata16.00: exception Emask 0x0 SAct 0x6 SErr 0x0 action 0x6 frozen (Errors)
Jan 21 19:56:16 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED (Minor Issues)
Jan 21 19:56:16 Tower kernel: ata16.00: cmd 60/00:00:77:e8:17/02:00:17:00:00/40 tag 1 ncq 262144 in (Drive related)
Jan 21 19:56:16 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:16 Tower kernel: ata16.00: status: { DRDY } (Drive related)
Jan 21 19:56:16 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED (Minor Issues)
Jan 21 19:56:16 Tower kernel: ata16.00: cmd 60/58:00:77:ea:17/00:00:17:00:00/40 tag 2 ncq 45056 in (Drive related)
Jan 21 19:56:16 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:16 Tower kernel: ata16.00: status: { DRDY } (Drive related)
Jan 21 19:56:16 Tower kernel: ata16: hard resetting link (Minor Issues)
Jan 21 19:56:16 Tower kernel: sas: ata17: end_device-1:6: dev error handler (Drive related)
Jan 21 19:56:16 Tower kernel: ata17.00: exception Emask 0x0 SAct 0x6 SErr 0x0 action 0x6 frozen (Errors)
Jan 21 19:56:16 Tower kernel: ata17.00: failed command: READ FPDMA QUEUED (Minor Issues)
Jan 21 19:56:16 Tower kernel: ata17.00: cmd 60/00:00:77:e8:17/02:00:17:00:00/40 tag 1 ncq 262144 in (Drive related)
Jan 21 19:56:16 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:16 Tower kernel: ata17.00: status: { DRDY } (Drive related)
Jan 21 19:56:16 Tower kernel: ata17.00: failed command: READ FPDMA QUEUED (Minor Issues)
Jan 21 19:56:16 Tower kernel: ata17.00: cmd 60/58:00:77:ea:17/00:00:17:00:00/40 tag 2 ncq 45056 in (Drive related)
Jan 21 19:56:16 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)
Jan 21 19:56:16 Tower kernel: ata17.00: status: { DRDY } (Drive related)
Jan 21 19:56:16 Tower kernel: ata17: hard resetting link (Minor Issues)
Jan 21 19:56:17 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1527:mvs_I_T_nexus_reset for device[3]:rc= 0 (System)
Jan 21 19:56:17 Tower kernel: mvsas 0000:02:00.0: Phy3 : No sig fis (Drive related)
Jan 21 19:56:17 Tower kernel: sas: ata18: end_device-1:7: dev error handler (Drive related)
Jan 21 19:56:17 Tower kernel: ata18.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (Errors)
Jan 21 19:56:17 Tower kernel: ata18.00: failed command: READ DMA EXT (Minor Issues)
Jan 21 19:56:17 Tower kernel: ata18.00: cmd 25/00:00:77:e6:17/00:02:17:00:00/e0 tag 0 dma 262144 in (Drive related)
Jan 21 19:56:17 Tower kernel:          res 40/00:00:0f:29:11/00:00:17:00:00/e0 Emask 0x4 (timeout) (Errors)
<snip>

Jan 21 19:56:41 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441888 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441896 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441904 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441912 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441920 (Errors)
Jan 21 19:56:41 Tower kernel: sd 1:0:3:0: [sdo] READ CAPACITY failed (Drive related)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441928 (Errors)
Jan 21 19:56:41 Tower kernel: sd 1:0:3:0: [sdo]   (Drive related)
Jan 21 19:56:41 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System)
Jan 21 19:56:41 Tower kernel: sd 1:0:3:0: [sdo] Sense not available. (Drive related)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441936 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441944 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441952 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441960 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441968 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441976 (Errors)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441984 (Errors)
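For a rough idea of how far into disk17 the writes were when they began failing, here is a sketch that converts the first failed sector into a byte offset. It assumes 512-byte sectors, which I have not verified for this driver; the sector number and the two times are copied from the log excerpts above.

```python
from datetime import datetime

SECTOR_BYTES = 512  # assumption: md reports 512-byte sectors

first_failed_sector = 387441888  # "md: disk17 write error, sector=387441888"

# Times of day copied from the log: recovery thread start and first write error.
started = datetime.strptime("18:54:42", "%H:%M:%S")
failed = datetime.strptime("19:56:41", "%H:%M:%S")

offset_bytes = first_failed_sector * SECTOR_BYTES
elapsed_s = (failed - started).total_seconds()

print(f"offset of first failed write: {offset_bytes / 10**9:.1f} GB")
print(f"elapsed since recovery start: {elapsed_s:.0f} s")
print(f"implied average rate: {offset_bytes / elapsed_s / 10**6:.1f} MB/s")
```

Under that assumption the first failed write is a little under 200GB into the disk.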

 

The new 3TB drive shows used space that's about the same as the full 750GB drive it replaced, but the file system has only a few directories and files. Both it and the other seven drives on the SAS card are red-balled.
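To see which ports were involved, the failed commands in the syslog can be tallied per ATA port. This is only a sketch: the sample lines are copied from the error snippet earlier in this post, the function name is my own, and for the real thing the full syslog would be read from the attached file instead.

```python
import re
from collections import Counter

# A few lines copied verbatim from the error snippet earlier in the post.
SAMPLE = """\
Jan 21 19:56:14 Tower kernel: ata11.00: failed command: READ DMA EXT (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:14 Tower kernel: ata14.00: failed command: WRITE FPDMA QUEUED (Minor Issues)
Jan 21 19:56:16 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED (Minor Issues)
Jan 21 19:56:41 Tower kernel: md: disk17 write error, sector=387441888 (Errors)
"""

# Captures the port (e.g. "ata14") and the command name up to the trailing tag.
FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: ([A-Z ]+?)(?: \(|$)")


def tally_failed_commands(text: str) -> Counter:
    """Count failed commands per (port, command) pair in syslog text."""
    counts = Counter()
    for line in text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


counts = tally_failed_commands(SAMPLE)
for (port, command), n in sorted(counts.items()):
    print(f"{port}: {command} x{n}")
```

Lines that are not failed-command entries, such as the md write errors, are skipped.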

 

Since the old 750GB drive and its data are still fine, I can put it back into the system and rebuild the parity drive, then try the upgrade to the 3TB drive again. The system is currently turned off, and before I try again I'll check for seating or cabling problems with the SAS card.

 

I would like to ensure that if I try to rebuild parity from the data drives and the SAS card goes offline again, unraid doesn't corrupt the data drives. Is it possible to mount them read-only before rebuilding parity?

 

Thanks,

- Eric

syslog-2014-01-21.zip
