Jump to content

Controller crash (from: EMPTY and Remove a Drive Without Losing Parity)


EdgarWallace

Recommended Posts

The process finished with >400mio parity errors corrected. I wasn't smart enough to implement bunker yet but a quick check showed that the data was all ok.

 

I removed disk 7, started the server and was running another parity check. These errors that were causing the parity check to stop:

Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004194360
Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32
Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400

 

Next were read and write errors and finally disk 3 was disabled.

 

I ordered a new drive but in what order should I run the process? Would that be correct:

1.) preclear the new drive

2.) swap disc 3

3.) run a new parity check

 

I have the the diagnostic attached and would appreciate any further guidance. Thanks a lot.

tower-diagnostics-20150828-1354.zip

Link to comment

I removed disk 7, started the server and was running another parity check. These errors that were causing the parity check to stop:

Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004194360
Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32
Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400

 

Next were read and write errors and finally disk 3 was disabled.

 

I ordered a new drive but in what order should I run the process? Would that be correct:

1.) preclear the new drive

2.) swap disc 3

3.) run a new parity check

 

I have the the diagnostic attached and would appreciate any further guidance. Thanks a lot.

 

This kind of error is not usually a physical drive issue but an interface issue.  It usually means contact with the drive is lost, so nothing wrong with the drive, but if you cannot talk to it, you will have a lot of errors including read and write errors and then the drive is dropped by the kernel.  It's not the drive's fault, and nothing is wrong with the drive.  It could be a cabling or power issue, or a controller issue.  Sorry, I haven't looked at the diagnostics yet.  Maybe later...

Link to comment

Thanks Rob. If you would kindly look into the diagnostics...that would be excellent.

 

I have almost all data from disc 3 on my 2nd server and doing the remaining backup as we speak. The Smart Test I was starting is stuck at 10%...so I might need to first power down the server and checking the cabling. You are right..I might have damaged the cabling when I pulled off the empty disc drive but I'm going to install the new drive afterwards.

 

Let's see how the "old" drive behaves when I preclear it....this will making sure that the drive is in proper condition. I will post the results here.

 

Thanks a lot.

 

[EDIT] The data rebuild ended without any further errors. I started a SMART extended self-test (the short self-test didn't reported any errors). I will upload the SMART report as soon as it finishes.

 

However, 12 movies are damaged now. HUGE issue as I can't trust the data integrity any more. I saw this before but this is massive now.

Link to comment

Your syslog shows unRAID 6.0.1, but it's not.  Here is 6.0.1 -

Aug 21 14:57:33 Tower kernel: Linux version 4.0.4-unRAID (root@develop64) (gcc version 4.8.2 (GCC) ) #5 SMP PREEMPT Fri Jun 19 22:47:24 PDT 2015

Your unRAID appears to have been built 8 days later, probably based on 6.0.1, looks like this -

Aug 28 10:54:54 Tower kernel: Linux version 4.0.4-unRAID (root@Server) (gcc version 4.8.2 (GCC) ) #1 SMP PREEMPT Sat Jun 27 01:01:48 BST 2015

I have no idea whether that is significant or not.

 

Your SAS card (using mvsas) quit on you at Aug 28 13:38:10, just over 2.5 hours into the parity check.  First errors were -

Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0FFFFDFF,  slot [69].

Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFFFFFFD,  slot [21].

Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFFFFFEF,  slot [44].

Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004192336

Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004192344

...snipped - more parity errors...

Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32

Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400

Aug 28 13:38:41 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff880205dec400

Aug 28 13:38:41 Tower kernel: sas: sas_scsi_find_task: task 0xffff880205dec400 is aborted

Aug 28 13:38:41 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff880205dec400 is aborted

...snipped - aborted the tasks...

No clues above about what actually went wrong with the card.  When I have seen this in the past, the next drives to be accessed are unresponsive, and are dropped.  In your case, ata11 (Disk 3) was first and ata13 was second.  Some how, the drive associated with ata13 recovers after a hard reset, but ata11 does not, and is finally dropped, disabled.  That usually always means there is nothing wrong with the drive, just a lost connection, fixed by a reboot.  A SAS controller crash like this has always been hard to diagnose.  You can check for a firmware update.  Some have just replaced the card.  Wish I had better advice, but I don't.  What I *can* say is the drives are probably all fine.

 

Edit: I will probably split these last posts out, into a support topic in the General Support for v6 board.

Link to comment

Rob, yes please feel free to move the last posts wherever you think it should be posted. Only two comments:

1.) I'm running the unRAID 6.0.1 Media Build, that's why you see a different build date.

2.) I will check the firmware of my SAS card, but I do believe that it's the latest 4.0.0.1812

 

Thanks again, you advise is much appreciated! The server is up and running again, now I will check why some of the data got corrupted.

Link to comment

I'm facing new issues after having updated today to 6.1.1 Media Build today. Parity check is at around 60% and 137 errors have been corrected. This is the first time after the removal of the disc that I see parity errors:

Sep  7 13:59:58 Tower kernel: mdcmd (27): check CORRECT
Sep  7 13:59:58 Tower kernel: md: recovery thread woken up ...
Sep  7 13:59:58 Tower kernel: md: recovery thread checking parity...
Sep  7 13:59:58 Tower kernel: md: using 8192k window, over a total of 2930266532 blocks.
Sep  7 14:00:10 Tower kernel: md: correcting parity, sector=3343440
Sep  7 14:00:25 Tower kernel: md: correcting parity, sector=7459984
Sep  7 14:00:27 Tower kernel: md: correcting parity, sector=8189472
Sep  7 14:00:38 Tower kernel: md: correcting parity, sector=11363248
Sep  7 14:00:58 Tower kernel: md: correcting parity, sector=16961608
Sep  7 14:01:26 Tower kernel: md: correcting parity, sector=25038880
Sep  7 14:01:37 Tower kernel: md: correcting parity, sector=28293928
Sep  7 14:02:22 Tower kernel: md: correcting parity, sector=41052152
Sep  7 14:02:41 Tower autofan: Highest disk temp is 35°C, adjusting fan speed from: 173 (67% @ 1757rpm) to: 185 (72% @ 1814rpm)
Sep  7 14:07:49 Tower autofan: Highest disk temp is 36°C, adjusting fan speed from: 185 (72% @ 1856rpm) to: 197 (77% @ 1901rpm)
Sep  7 14:12:58 Tower autofan: Highest disk temp is 38°C, adjusting fan speed from: 197 (77% @ 1939rpm) to: 221 (86% @ 2061rpm)
Sep  7 14:14:55 Tower kernel: md: correcting parity, sector=252510240
Sep  7 14:14:58 Tower kernel: md: correcting parity, sector=253283392
Sep  7 14:15:18 Tower kernel: md: correcting parity, sector=259051408
Sep  7 14:15:57 Tower kernel: md: correcting parity, sector=269990640
Sep  7 14:15:57 Tower kernel: md: correcting parity, sector=270041944
Sep  7 14:16:21 Tower kernel: md: correcting parity, sector=276592496
Sep  7 14:16:33 Tower kernel: md: correcting parity, sector=280145144
Sep  7 14:16:47 Tower kernel: md: correcting parity, sector=283961064
Sep  7 14:17:03 Tower kernel: md: correcting parity, sector=288320288
Sep  7 14:18:05 Tower kernel: md: correcting parity, sector=304850328
Sep  7 14:18:28 Tower kernel: md: correcting parity, sector=311543808
Sep  7 14:18:41 Tower kernel: md: correcting parity, sector=314957544
Sep  7 14:18:48 Tower kernel: md: correcting parity, sector=317048336
Sep  7 14:18:52 Tower kernel: md: correcting parity, sector=318027648
Sep  7 14:19:11 Tower kernel: md: correcting parity, sector=323384584
Sep  7 14:19:57 Tower kernel: md: correcting parity, sector=336244240
Sep  7 14:20:05 Tower kernel: md: correcting parity, sector=338529168
Sep  7 14:20:12 Tower kernel: md: correcting parity, sector=340368640
Sep  7 14:20:14 Tower kernel: md: correcting parity, sector=340988768
Sep  7 14:20:17 Tower kernel: md: correcting parity, sector=341966792
Sep  7 14:20:50 Tower kernel: md: correcting parity, sector=351065160
Sep  7 14:21:34 Tower kernel: md: correcting parity, sector=363281072
Sep  7 14:21:34 Tower kernel: md: correcting parity, sector=363346856
Sep  7 14:22:00 Tower kernel: md: correcting parity, sector=370568344
Sep  7 14:22:02 Tower kernel: md: correcting parity, sector=371111200
Sep  7 14:22:12 Tower kernel: md: correcting parity, sector=374002624
Sep  7 14:23:10 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: 221 (86% @ 2057rpm) to: 233 (91% @ 2129rpm)
Sep  7 14:38:27 Tower autofan: Highest disk temp is 40°C, adjusting fan speed from: 233 (91% @ 2173rpm) to: FULL (100% @ 2261rpm)
Sep  7 14:43:21 Tower afpd[16689]: transmit: Request to dbd daemon (volume TMO) timed out.
Sep  7 14:43:21 Tower afpd[16689]: afp_openvol(/mnt/user/TMO): Fatal error: Unable to get stamp value from CNID backend
Sep  7 14:43:21 Tower afpd[16689]: dsi_stream_read: len:0, unexpected EOF
Sep  7 15:05:28 Tower kernel: md: correcting parity, sector=980278608
Sep  7 15:05:57 Tower kernel: md: correcting parity, sector=985759880
Sep  7 15:07:18 Tower kernel: md: correcting parity, sector=1001025664
Sep  7 15:07:22 Tower kernel: md: correcting parity, sector=1001759392
Sep  7 15:07:39 Tower kernel: md: correcting parity, sector=1004835824
Sep  7 15:18:25 Tower kernel: md: correcting parity, sector=1161449848
Sep  7 15:19:46 Tower kernel: md: correcting parity, sector=1182265736
Sep  7 15:20:02 Tower kernel: md: correcting parity, sector=1186601712
Sep  7 15:20:15 Tower kernel: md: correcting parity, sector=1190034936
Sep  7 15:21:11 Tower kernel: md: correcting parity, sector=1205051528
Sep  7 15:23:25 Tower kernel: md: correcting parity, sector=1240723488
Sep  7 15:24:12 Tower kernel: md: correcting parity, sector=1250962064
Sep  7 15:24:22 Tower kernel: md: correcting parity, sector=1253593120
Sep  7 15:24:26 Tower kernel: md: correcting parity, sector=1254518840
Sep  7 15:24:35 Tower kernel: md: correcting parity, sector=1257111824
Sep  7 15:25:01 Tower kernel: md: correcting parity, sector=1263885848
Sep  7 15:25:33 Tower kernel: md: correcting parity, sector=1272204680
Sep  7 15:25:45 Tower kernel: md: correcting parity, sector=1275460168
Sep  7 15:25:48 Tower kernel: md: correcting parity, sector=1276331032
Sep  7 15:26:07 Tower kernel: md: correcting parity, sector=1281163872
Sep  7 15:26:32 Tower kernel: md: correcting parity, sector=1287914224
Sep  7 15:27:35 Tower kernel: md: correcting parity, sector=1304660520
Sep  7 15:27:39 Tower kernel: md: correcting parity, sector=1305574320
Sep  7 15:31:52 Tower kernel: md: correcting parity, sector=1372377848
Sep  7 15:32:00 Tower kernel: md: correcting parity, sector=1374544968
Sep  7 15:32:32 Tower kernel: md: correcting parity, sector=1382969976
Sep  7 15:32:35 Tower kernel: md: correcting parity, sector=1383982824
Sep  7 15:33:25 Tower kernel: md: correcting parity, sector=1397184600
Sep  7 15:33:37 Tower kernel: md: correcting parity, sector=1400384120
Sep  7 15:33:45 Tower kernel: md: correcting parity, sector=1402511136
Sep  7 15:42:07 Tower kernel: md: correcting parity, sector=1531778480
Sep  7 15:42:22 Tower kernel: md: correcting parity, sector=1535854792
Sep  7 15:44:31 Tower kernel: md: correcting parity, sector=1568516280
Sep  7 15:44:46 Tower kernel: md: correcting parity, sector=1572348816
Sep  7 15:45:22 Tower kernel: md: correcting parity, sector=1581746784
Sep  7 15:45:38 Tower kernel: md: correcting parity, sector=1586174096
Sep  7 15:45:45 Tower kernel: md: correcting parity, sector=1587889496
Sep  7 15:45:57 Tower kernel: md: correcting parity, sector=1590897576
Sep  7 15:46:02 Tower kernel: md: correcting parity, sector=1592287784
Sep  7 15:46:16 Tower kernel: md: correcting parity, sector=1596037288
Sep  7 15:46:40 Tower kernel: md: correcting parity, sector=1602296856
Sep  7 15:47:17 Tower kernel: md: correcting parity, sector=1611713448
Sep  7 15:47:24 Tower kernel: md: correcting parity, sector=1613579848
Sep  7 15:47:30 Tower kernel: md: correcting parity, sector=1615184976
Sep  7 15:48:03 Tower kernel: md: correcting parity, sector=1623562832
Sep  7 15:48:10 Tower kernel: md: correcting parity, sector=1625251352
Sep  7 15:49:04 Tower kernel: md: correcting parity, sector=1639365744
Sep  7 15:52:44 Tower kernel: md: correcting parity, sector=1695258144
Sep  7 15:52:53 Tower kernel: md: correcting parity, sector=1697609320
Sep  7 15:53:36 Tower kernel: md: correcting parity, sector=1708735264
Sep  7 16:11:20 Tower kernel: md: correcting parity, sector=1973717144
Sep  7 16:11:49 Tower kernel: md: correcting parity, sector=1979377664
Sep  7 16:12:08 Tower kernel: md: correcting parity, sector=1982562488
Sep  7 16:13:43 Tower kernel: md: correcting parity, sector=1999711824
Sep  7 16:14:12 Tower kernel: md: correcting parity, sector=2005403240
Sep  7 16:14:31 Tower kernel: md: correcting parity, sector=2008878720
Sep  7 16:14:36 Tower kernel: md: correcting parity, sector=2009968720
Sep  7 16:14:40 Tower kernel: md: correcting parity, sector=2010523232
Sep  7 16:15:12 Tower kernel: md: correcting parity, sector=2016596312
Sep  7 16:15:20 Tower kernel: md: correcting parity, sector=2018135696
Sep  7 16:15:21 Tower kernel: md: correcting parity, sector=2018352872
Sep  7 16:15:59 Tower kernel: md: correcting parity, sector=2025801360
Sep  7 16:16:15 Tower kernel: md: correcting parity, sector=2028925880
Sep  7 16:16:41 Tower kernel: md: correcting parity, sector=2033556864
Sep  7 16:16:45 Tower kernel: md: correcting parity, sector=2034392456
Sep  7 16:17:19 Tower kernel: md: correcting parity, sector=2040497560
Sep  7 16:18:07 Tower kernel: md: correcting parity, stopped logging
Sep  7 16:33:02 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep  7 16:33:02 Tower kernel: sas: trying to find task 0xffff88010d2bce00
Sep  7 16:33:02 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88010d2bce00
Sep  7 16:33:02 Tower kernel: sas: sas_scsi_find_task: task 0xffff88010d2bce00 is aborted
Sep  7 16:33:02 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88010d2bce00 is aborted
Sep  7 16:33:02 Tower kernel: sas: ata7: end_device-1:0: cmd error handler
Sep  7 16:33:02 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Sep  7 16:33:02 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  7 16:33:02 Tower kernel: ata7.00: failed command: READ DMA EXT
Sep  7 16:33:02 Tower kernel: ata7.00: cmd 25/00:00:e8:23:1f/00:02:86:00:00/e0 tag 10 dma 262144 in
Sep  7 16:33:02 Tower kernel:         res 40/00:00:2f:d2:d0/00:00:83:00:00/e3 Emask 0x4 (timeout)
Sep  7 16:33:02 Tower kernel: ata7.00: status: { DRDY }
Sep  7 16:33:02 Tower kernel: ata7: hard resetting link
Sep  7 16:33:02 Tower kernel: sas: ata8: end_device-1:1: dev error handler
Sep  7 16:33:02 Tower kernel: sas: ata9: end_device-1:2: dev error handler
Sep  7 16:33:02 Tower kernel: sas: ata10: end_device-1:3: dev error handler
Sep  7 16:33:02 Tower kernel: sas: ata11: end_device-1:4: dev error handler
Sep  7 16:33:02 Tower kernel: sas: ata12: end_device-1:5: dev error handler
Sep  7 16:33:02 Tower kernel: sas: ata13: end_device-1:6: dev error handler
Sep  7 16:33:02 Tower kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Sep  7 16:33:04 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[0]:rc= 0
Sep  7 16:33:04 Tower kernel: ata7.00: configured for UDMA/133
Sep  7 16:33:04 Tower kernel: ata7: EH complete
Sep  7 16:33:04 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Sep  7 17:35:55 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: FULL (100% @ 2246rpm) to: 233 (91% @ 2163rpm)
Sep  7 17:42:46 Tower afpd[23316]: dsi_stream_read: len:0, unexpected EOF

 

Is there any way to identify, which files have been affected by the parity error correction?

 

I don't know if this is linked but here is an overview of files that are corrupted:

Atonement (2007) 		   	30.12.2010
Australia (2008)			05.01.2011
Flickan som lekte med elden (2009       05.01.2011
Unbreakable (2000)			05.01.2011
Lost in Translation (2003)		06.01.2011
Proposal, The (2009)	                06.01.2011
Vanilla Sky (2001)			09.01.2011
Vertical Limit (2000)			09.01.2011
Blade Runner (1982)			04.02.2011
Kaerlighedshistorie, En (2001)          14.02.2013

 

Very similar modification dates.  :(

Link to comment
Is there any way to identify, which files have been affected by the parity error correction?

 

I don't know if this is linked but here is an overview of files that are corrupted:

 

Chances are good to find the corrupted files, if you have a backup.

Run the rsync with the "checksum + dry run" option and you should get a list of files that will be copied

from your backup.

Link to comment

Thanks Fireball. Is that the correct syntax?

rsync -avH --stats --progress --checksum --ignore-existing --dry-run /mnt/user/Data/  rsync://tower2/mnt/user/Data/

 

It throws out a bunch of files that have been added/changed:

Number of files: 9,465 (reg: 8,703, dir: 762)
Number of created files: 37 (reg: 37)
Number of deleted files: 0
Number of regular files transferred: 37
Total file size: 20,820,956,184 bytes
Total transferred file size: 6,730,465 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 262,027
File list generation time: 0.757 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 488,694
Total bytes received: 1,028

sent 488,694 bytes  received 1,028 bytes  182.63 bytes/sec
total size is 20,820,956,184  speedup is 42,515.87 (DRY RUN)

Link to comment

Here is the next try....providing this output:

root@Tower:/boot/custom# cat rsync.log
2015/09/08 17:20:51 [2097] opening tcp connection to tower2 port 873
2015/09/08 17:20:51 [2097] Connected to tower2 (192.168.178.10)
2015/09/08 17:20:51 [2097] msg checking charset: UTF-8
2015/09/08 17:20:51 [2097] (Client) Protocol versions: remote=31, negotiated=31
2015/09/08 17:20:51 [2097] building file list
2015/09/08 17:20:51 [2097] [sender] make_file(.,*,0)
2015/09/08 17:20:51 [2097] [sender] pushing local filters for /mnt/user/Fotos/
2015/09/08 17:20:51 [2097] [sender] make_file(2001,*,2)

...followed by many many lines like this one:

2015/09/08 17:20:52 [2097] [sender] i=44 /mnt/user/Fotos 1970 - 1975/imm001.jpg mode=0100666 len=1,707,321 uid=99 gid=100 flags=0

 

Let's see how it looks like when there is any difference...hopefully that is visible then.

Link to comment

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...