Tophicles Posted December 8, 2011
Hey folks! Up until recently, it has been smooth sailing with unRAID. The other day, I came home to find that my drive 9 had 4 errors, and the syslog indicated it was a problem with bad sectors. No biggie, I pulled the offending drive out and popped in a new WD20EARS drive. The data rebuild began, but then sank to a 10KB/s transfer rate (which would have taken 230K minutes to complete). I shut it down, swapped spots on the controller, swapped controllers and cables, and no luck: each time it was transferring like molasses in Antarctica. So as a last resort I decided to forgo the data rebuild and let the new drive pre-clear first. So far it has been running for 12 hours and has only pre-cleared 9%.
Here is another strange thing: I put the ~bad~ drive into a USB cage and ran the Seagate SeaTools diagnostics on it. It has passed every single test. I am not sure what the heck is going on, but some advice would be appreciated before I lose another drive and my whole system goes into the toilet.
I have attached my syslog to this post. I was running 5b14, but rolled back to 5b12 (thinking there may have been an error in the new version). Many thanks in advance!
syslog.zip
dgaschk Posted December 9, 2011
The log shows disk9 has write errors, and disk10 and disk11 are having write errors too. You may have knocked some cables loose.
RobJ Posted December 9, 2011
Very unusual behavior in your syslog, but I don't see any actual evidence of issues with the physical drives themselves.
Dec 7 23:07:53 Angband kernel: mdcmd (1): import ...
(array is started)
...
Dec 7 23:08:03 Angband kernel: md: recovery thread rebuilding disk9 ...
...
Dec 7 23:10:05 Angband login[13103]: ROOT LOGIN on '/dev/pts/0' from '172.16.17.175'
Dec 7 23:10:34 Angband kernel: mdcmd (51): spindown 9
Dec 7 23:10:37 Angband kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 7 23:10:37 Angband kernel: ata5.00: irq_stat 0x40000001
Dec 7 23:10:37 Angband kernel: ata5.00: failed command: WRITE DMA EXT
Dec 7 23:10:37 Angband kernel: ata5.00: cmd 35/00:08:6f:12:0c/00:00:86:00:00/e0 tag 0 dma 4096 out
Dec 7 23:10:37 Angband kernel: res 61/04:08:6f:12:0c/00:00:86:00:00/e0 Emask 0x1 (device error)
Dec 7 23:10:37 Angband kernel: ata5.00: status: { DRDY DF ERR }
Dec 7 23:10:37 Angband kernel: ata5.00: error: { ABRT }
Dec 7 23:10:39 Angband kernel: ata5.00: failed to read native max address (err_mask=0x1)
Dec 7 23:10:39 Angband kernel: ata5.00: HPA support seems broken, skipping HPA handling
Dec 7 23:10:39 Angband kernel: ata5.00: failed to enable AA (error_mask=0x1)
...
Dec 7 23:11:07 Angband kernel: ata5.00: disabled
About 30 seconds after you log in, and only 2 minutes into the rebuild, a command to spin down Disk 9 is sent(!), followed immediately by all hell! All further commands to Disk 9 fail, beginning with the write command that was currently in progress. After multiple attempts, the kernel gives up on the drive and marks it disabled. At that point, you can completely ignore everything else in the syslog that relates to Disk 9.
I'm assuming that you did not try to spin the drive down yourself? And that you don't have something configured to spin it down? You may want to disable all addons, extras, and go-script commands you may have running, and try again.
Dec 7 23:11:07 Angband kernel: mdcmd (52): spindown 10
...
Dec 7 23:11:08 Angband kernel: mdcmd (53): spindown 11
These quickly follow the Disk 9 device failure, and all subsequent commands to Disk 10 and Disk 11 fail too.
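For anyone wanting to check their own syslog for the same pattern (a spindown command closely followed by a failed ATA command), here is a minimal Python sketch. It is not part of unRAID; the function name, the 10-second window, and the assumed year are illustrative choices, and the regular expressions only cover the line shapes quoted above. The match is purely by time, so it does not prove that a given ATA port belongs to the disk that was spun down.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Dec 7 23:10:34 Angband kernel: mdcmd (51): spindown 9
SPINDOWN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: mdcmd \(\d+\): spindown (\d+)"
)
# Matches lines such as:
#   Dec 7 23:10:37 Angband kernel: ata5.00: failed command: WRITE DMA EXT
FAILED = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (ata\d+\.\d+): failed command: (.+)$"
)


def parse_time(stamp, year=2011):
    # Syslog timestamps omit the year, so an assumed one is supplied here.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def spindowns_followed_by_failures(lines, window_seconds=10):
    """Return (disk, spindown_time, port, command) for every ATA failure
    logged within window_seconds after a spindown command."""
    window = timedelta(seconds=window_seconds)
    hits = []
    spindowns = []  # (time, disk number) for each spindown seen so far
    for line in lines:
        m = SPINDOWN.match(line)
        if m:
            spindowns.append((parse_time(m.group(1)), m.group(2)))
            continue
        m = FAILED.match(line)
        if m:
            failed_at = parse_time(m.group(1))
            for when, disk in spindowns:
                if timedelta(0) <= failed_at - when <= window:
                    hits.append((disk, when, m.group(2), m.group(3).strip()))
    return hits
```

To use it, read the syslog into a list of lines (for example with `open("syslog").read().splitlines()`, where the filename is whatever you saved the log as) and pass that list to `spindowns_followed_by_failures`.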
Tophicles (Author) Posted December 9, 2011
Ok, so it looks like I had a bad drive: the speed went back up to normal when I put in a brand new drive. I successfully pre-cleared the old ~bad~ drive (which passed the low-level SeaTools diags with flying colours) and the data rebuild was humming along UNTIL about an hour ago. Then I got all kinds of crap happening in the syslog. I am attaching it here to see what the heck I can do now.... I am totally lost. The data rebuild cancelled itself, so I am not sure what to do. Any thoughts?
syslog.zip
RobJ Posted December 9, 2011
There do not appear to be any drive issues here. The array was started without Disk 9, which was disabled previously, so the array is running in disabled mode, with Disk 9 emulated from the rest of the drives. (No rebuild is occurring here, because Disk 9 is not assigned; it will have to be re-assigned first.)
kernel: mptscsih: ioc0: attempting task abort! (sc=cf89a9c0)
unmenu[1284]: terminate called after throwing an instance of 'int'
You do have a bunch of errors with 2 different sub-systems, UnMENU and mptscsih; no idea if the errors are related. Perhaps JoeL can help with the UnMENU errors. Just a guess, but the mptscsih errors may be related to a disk controller. I don't know what to suggest, except to try an older kernel.
Tophicles (Author) Posted December 9, 2011
I guess where my confusion lies is in the fact that there was a data rebuild occurring, it was 93.3% finished when it suddenly spat out errors. Then the data rebuild apparently cancelled itself and now I cannot seem to load the EMHTTP interface. I suppose I should try rolling back to 5b12?
Km.
RobJ Posted December 10, 2011
Did you perhaps attach the wrong syslog? The first syslog covers a period of about five and a half minutes late on Dec 7. The second syslog begins on Dec 8 at about 12:30, the array starts at 12:31, and a shutdown completes at 18:35. During this second syslog, Disk 9 is not assigned, is emulated only, and no data rebuild is started. (The numerous unmenu and mptscsih errors occur between 12:33 and 14:15.) If you can locate the syslog covering the data rebuild that quits after 93%, that could be helpful.
I've been out of touch for a while, so please excuse the question, but has unRAID compatibility with hardware using the mptscsih module been tested/verified?
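A quick way to see how errors from each subsystem cluster in a time window like the one mentioned above is to count matching syslog lines between two times of day. The Python sketch below is only an illustration: the function name and the default keywords are made up for this example, it assumes standard `Mon D HH:MM:SS host message` syslog lines, and it ignores the date, so it is only meaningful for a log covering a single day.

```python
import re
from collections import Counter
from datetime import time

# Timestamp, hostname, then the rest of the message.
LINE = re.compile(r"^\w{3}\s+\d+ (\d\d):(\d\d):(\d\d) \S+ (.*)$")


def count_in_window(lines, start, end, keywords=("unmenu", "mptscsih")):
    """Count lines mentioning each keyword whose time of day falls
    between start and end (inclusive). The date is ignored."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not a timestamped syslog line
        stamp = time(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        if not (start <= stamp <= end):
            continue
        message = m.group(4).lower()
        for word in keywords:
            if word in message:
                counts[word] += 1
    return counts
```

For the window discussed here, the call would be `count_in_window(lines, time(12, 33), time(14, 15))`, with `lines` read from the second syslog.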
Tophicles (Author) Posted December 12, 2011
Ok, so it looks like my disk7 has a poop-tonne of read errors on it, which is why portions of the data rebuild and parity check went so slowly. I will replace this disk at a later time; right now everything is back up to snuff.
More weirdness ensues... I have added a new 2TB WD drive to the array. I pre-cleared it using the 1.13 version of the pre-clear script. I added it to the array and then started the array, agreeing to format any new disks. It formatted the new disk (disk12) and green-balled it into the array. I can see the new disk via its direct SMB share \\tower\disk12 and I can create/delete folders on it. My unMENU main screen correctly reports the total new size of the array as 2.3 TB free (I had 300 or so GB free until adding the new drive). However, my shares are set to ~most free~, and so far nothing is being written to the new drive. My SABnzbd reports there is only ~300 GB free (correct, if the new drive is not in), but that conflicts with what my unMENU screen says.
I have gone back to 5b12 and I have included links to the unMENU and unRAID main screens. Any thoughts?
unMENU Main: http://imm.io/cRw3
unRAID Main: http://imm.io/cRwb
syslog-2011-12-12.zip
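One way to cross-check conflicting free-space reports like these is to ask the operating system directly how much space each data disk has. The Python sketch below is an assumption-laden illustration, not how unRAID's "most free" allocation is actually implemented: it assumes the data disks are mounted at /mnt/disk1, /mnt/disk2, and so on, and the helper names are hypothetical.

```python
import glob
import shutil


def free_bytes_by_mount(mounts):
    """Map each mount point to the free bytes the OS reports for it."""
    return {mount: shutil.disk_usage(mount).free for mount in mounts}


def most_free(free_by_mount):
    """Return the mount point with the most free space, or None if empty."""
    if not free_by_mount:
        return None
    return max(free_by_mount, key=free_by_mount.get)


if __name__ == "__main__":
    # Assumed mount layout: one directory per data disk under /mnt.
    disks = sorted(glob.glob("/mnt/disk[0-9]*"))
    free = free_bytes_by_mount(disks)
    for mount, nbytes in free.items():
        print(f"{mount}: {nbytes / 1e9:.1f} GB free")
    print("most free:", most_free(free))
```

If the new disk shows up here with nearly all of its capacity free while an application still reports the old total, the application is probably looking at a stale or different view of the shares rather than at the disk itself.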
Tophicles (Author) Posted December 12, 2011
Ok, so after a reboot, the drive issue appears to be resolved. Why it took after this reboot and not the others, I have no idea. All hail ignorance, isn't it blissful! Thanks to those who looked over my quandary... it was appreciated.