July 2, 2010

I woke up this morning and, checking my email, found a notification that one of the disks is disabled (it happens to be my parity drive). The automated report is below, and I have attached the syslog. I don't want to reboot the system yet, so what are my options?

Subject: unRaid Failure Notification - One or more disks are disabled or invalid.

Status update for unRAID NAS
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Status: ERROR: The unRaid array needs attention. One or more disks are disabled or invalid.
Disk 0=DISK_DSBL
Server Name: NAS
Server IP: 192.168.200.10
Date: Fri Jul 2 09:47:01 GMT+7 2010

Output of /proc/mdcmd:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cmdOper=status cmdResult=ok
sbName=/boot/config/super.dat sbVersion=0.95.4 sbCreated=1271227392 sbUpdated=1278054675 sbEvents=68 sbState=0 sbNumDisks=11 sbSynced=1274810422 sbSyncErrs=0
mdVersion=0.95.4 mdState=STARTED mdNumProtected=11 mdNumDisabled=1 mdDisabledDisk=0 mdNumInvalid=1 mdInvalidDisk=0 mdNumMissing=0 mdMissingDisk=0 mdNumNew=0 mdResync=0
diskNumber.0=0 diskName.0= diskSize.0=0 diskState.0=4 diskModel.0=WDC WD20EADS-00S diskSerial.0=WD-WCAVY1446339 diskId.0=WDC_WD20EADS-00S_WD-WCAVY1446339 rdevNumber.0=0 rdevStatus.0=DISK_DSBL rdevName.0=sdl rdevSize.0=1953514552 rdevModel.0=WDC WD20EADS-00S rdevSerial.0=WD-WCAVY1446339 rdevId.0=WDC_WD20EADS-00S_WD-WCAVY1446339 rdevNumErrors.0=1670 rdevLastIO.0=0 rdevSpinupGroup.0=0
diskNumber.1=1 diskName.1=md1 diskSize.1=1953514552 diskState.1=7 diskModel.1=WDC WD20EARS-00S diskSerial.1=WD-WCAVY1867806 diskId.1=WDC_WD20EARS-00S_WD-WCAVY1867806 rdevNumber.1=1 rdevStatus.1=DISK_OK rdevName.1=sdc rdevSize.1=1953514552 rdevModel.1=WDC WD20EARS-00S rdevSerial.1=WD-WCAVY1867806 rdevId.1=WDC_WD20EARS-00S_WD-WCAVY1867806 rdevNumErrors.1=0 rdevLastIO.1=0 rdevSpinupGroup.1=360
diskNumber.2=2 diskName.2=md2 diskSize.2=1953514552 diskState.2=7 diskModel.2=WDC WD20EARS-00S diskSerial.2=WD-WCAVY2078305 diskId.2=WDC_WD20EARS-00S_WD-WCAVY2078305 rdevNumber.2=2 rdevStatus.2=DISK_OK rdevName.2=sdg rdevSize.2=1953514552 rdevModel.2=WDC WD20EARS-00S rdevSerial.2=WD-WCAVY2078305 rdevId.2=WDC_WD20EARS-00S_WD-WCAVY2078305 rdevNumErrors.2=0 rdevLastIO.2=0 rdevSpinupGroup.2=1680
diskNumber.3=3 diskName.3=md3 diskSize.3=1953514552 diskState.3=7 diskModel.3=WDC WD20EADS-00S diskSerial.3=WD-WCAVY1328762 diskId.3=WDC_WD20EADS-00S_WD-WCAVY1328762 rdevNumber.3=3 rdevStatus.3=DISK_OK rdevName.3=sdf rdevSize.3=1953514552 rdevModel.3=WDC WD20EADS-00S rdevSerial.3=WD-WCAVY1328762 rdevId.3=WDC_WD20EADS-00S_WD-WCAVY1328762 rdevNumErrors.3=0 rdevLastIO.3=0 rdevSpinupGroup.3=354
diskNumber.4=4 diskName.4=md4 diskSize.4=1953514552 diskState.4=7 diskModel.4=WDC WD20EARS-00S diskSerial.4=WD-WCAVY2755687 diskId.4=WDC_WD20EARS-00S_WD-WCAVY2755687 rdevNumber.4=4 rdevStatus.4=DISK_OK rdevName.4=sdj rdevSize.4=1953514552 rdevModel.4=WDC WD20EARS-00S rdevSerial.4=WD-WCAVY2755687 rdevId.4=WDC_WD20EARS-00S_WD-WCAVY2755687 rdevNumErrors.4=0 rdevLastIO.4=0 rdevSpinupGroup.4=1668
diskNumber.5=5 diskName.5=md5 diskSize.5=1953514552 diskState.5=7 diskModel.5=WDC WD20EARS-00S diskSerial.5=WD-WCAVY2900727 diskId.5=WDC_WD20EARS-00S_WD-WCAVY2900727 rdevNumber.5=5 rdevStatus.5=DISK_OK rdevName.5=sdd rdevSize.5=1953514552 rdevModel.5=WDC WD20EARS-00S rdevSerial.5=WD-WCAVY2900727 rdevId.5=WDC_WD20EARS-00S_WD-WCAVY2900727 rdevNumErrors.5=0 rdevLastIO.5=0 rdevSpinupGroup.5=330
diskNumber.6=6 diskName.6=md6 diskSize.6=1953514552 diskState.6=7 diskModel.6=WDC WD20EARS-00S diskSerial.6=WD-WCAVY2592832 diskId.6=WDC_WD20EARS-00S_WD-WCAVY2592832 rdevNumber.6=6 rdevStatus.6=DISK_OK rdevName.6=sdb rdevSize.6=1953514552 rdevModel.6=WDC WD20EARS-00S rdevSerial.6=WD-WCAVY2592832 rdevId.6=WDC_WD20EARS-00S_WD-WCAVY2592832 rdevNumErrors.6=0 rdevLastIO.6=0 rdevSpinupGroup.6=298
diskNumber.7=7 diskName.7=md7 diskSize.7=1953514552 diskState.7=7 diskModel.7=WDC WD20EARS-00S diskSerial.7=WD-WCAVY2461835 diskId.7=WDC_WD20EARS-00S_WD-WCAVY2461835 rdevNumber.7=7 rdevStatus.7=DISK_OK rdevName.7=sdi rdevSize.7=1953514552 rdevModel.7=WDC WD20EARS-00S rdevSerial.7=WD-WCAVY2461835 rdevId.7=WDC_WD20EARS-00S_WD-WCAVY2461835 rdevNumErrors.7=0 rdevLastIO.7=0 rdevSpinupGroup.7=1556
diskNumber.8=8 diskName.8=md8 diskSize.8=1953514552 diskState.8=7 diskModel.8=WDC WD20EARS-00S diskSerial.8=WD-WCAVY2662502 diskId.8=WDC_WD20EARS-00S_WD-WCAVY2662502 rdevNumber.8=8 rdevStatus.8=DISK_OK rdevName.8=sde rdevSize.8=1953514552 rdevModel.8=WDC WD20EARS-00S rdevSerial.8=WD-WCAVY2662502 rdevId.8=WDC_WD20EARS-00S_WD-WCAVY2662502 rdevNumErrors.8=0 rdevLastIO.8=0 rdevSpinupGroup.8=106
diskNumber.9=9 diskName.9=md9 diskSize.9=1953514552 diskState.9=7 diskModel.9=WDC WD20EARS-00S diskSerial.9=WD-WCAVY2461819 diskId.9=WDC_WD20EARS-00S_WD-WCAVY2461819 rdevNumber.9=9 rdevStatus.9=DISK_OK rdevName.9=sdh rdevSize.9=1953514552 rdevModel.9=WDC WD20EARS-00S rdevSerial.9=WD-WCAVY2461819 rdevId.9=WDC_WD20EARS-00S_WD-WCAVY2461819 rdevNumErrors.9=0 rdevLastIO.9=0 rdevSpinupGroup.9=1172
diskNumber.10=10 diskName.10=md10 diskSize.10=1953514552 diskState.10=7 diskModel.10=WDC WD20EARS-00S diskSerial.10=WD-WCAVY2461905 diskId.10=WDC_WD20EARS-00S_WD-WCAVY2461905 rdevNumber.10=10 rdevStatus.10=DISK_OK rdevName.10=sdk rdevSize.10=1953514552 rdevModel.10=WDC WD20EARS-00S rdevSerial.10=WD-WCAVY2461905 rdevId.10=WDC_WD20EARS-00S_WD-WCAVY2461905 rdevNumErrors.10=0 rdevLastIO.10=0 rdevSpinupGroup.10=660

Thanks,
James

syslog-2010-07-02.zip
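For anyone who wants to pick the problem disk out of a status dump like the one above without reading every field, here is a minimal sketch in Python. It assumes the dump is plain key=value text in which a value may contain spaces (as in the model strings) and therefore runs until the next key. The function names parse_mdcmd and problem_disks are hypothetical, made up for this illustration, and are not part of unRAID.

```python
import re

# A key looks like "mdState" or "rdevStatus.0", followed by "=".
KEY = re.compile(r"\b([A-Za-z]+(?:\.\d+)?)=")


def parse_mdcmd(text):
    """Parse key=value pairs; each value runs until the next key starts."""
    matches = list(KEY.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip()
    return fields


def problem_disks(fields):
    """Return (slot, device, status, error count) for slots that are not DISK_OK."""
    found = []
    for key, status in fields.items():
        if key.startswith("rdevStatus.") and status != "DISK_OK":
            slot = key.split(".", 1)[1]
            device = fields.get(f"rdevName.{slot}", "")
            errors = int(fields.get(f"rdevNumErrors.{slot}") or 0)
            found.append((int(slot), device, status, errors))
    return sorted(found)


# Shortened excerpt of the report above (most fields omitted).
sample = (
    "mdState=STARTED mdNumDisabled=1 "
    "diskModel.0=WDC WD20EADS-00S rdevStatus.0=DISK_DSBL rdevName.0=sdl rdevNumErrors.0=1670 "
    "diskModel.1=WDC WD20EARS-00S rdevStatus.1=DISK_OK rdevName.1=sdc rdevNumErrors.1=0"
)
print(problem_disks(parse_mdcmd(sample)))  # [(0, 'sdl', 'DISK_DSBL', 1670)]
```

On the excerpt this reports slot 0 (the parity slot), device sdl, with 1670 errors, which matches the "Disk 0=DISK_DSBL" line in the email.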
July 2, 2010

The log where the drive was taken out of service was rotated out already. Instead of /var/log/syslog, it would be helpful to see /var/log/syslog.1 and potentially /var/log/syslog.2. You can get to either from the normal unRAID web interface by typing in your browser:
//tower/log/syslog.1
or
//tower/log/syslog.2

The current log is filled with attempts to spin down the failed drive. (And that seems like it might be a bug in unRAID, since if a drive has failed, it is not likely to spin down, and filling the log with attempts to spin it down will eventually use up all the RAM.) The actual failure was in a prior log file.

Basically, once you have captured the syslog, you can:
stop the array
power down
verify the connectors (power and data) to the drive are secure
power back up
attempt to get a "smartctl" status report for the drive.

If you can get it to respond, then it might have just been a loose cable. If you cannot, it is time to purchase a replacement drive.

unRAID will not put the failed drive back into service on its own automatically, since you will need to perform a few steps to let it know how you want the failure handled.

First thing to remember: the drive was taken out of service because a "write" to it failed. That means the data that was supposed to be written to it was not. To put it back into service you'll need to first un-assign the drive, then start the array with it un-assigned (that will make it forget the model/serial number of the drive, so it will accept it as its own replacement). Then stop the array once more, re-assign the parity drive, and let it rebuild itself.

You could also use the procedure described in the wiki here:
http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily
to have the array initially trust that the parity drive is OK (even though we know it must be somewhat out of date, since writes to it failed) and let a full parity check get things back into sync.
(expect that there will be parity errors detected and corrected) Joe L.
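The "does the drive respond to smartctl" check can be scripted. Below is a rough sketch in Python, under the assumptions that smartctl (from smartmontools) is installed and that the drive still appears as /dev/sdl after the reboot (the device name may well change). The helper names smartctl_command and drive_responds are hypothetical. The sketch relies on smartctl's exit status being a bit mask in which bit 0 means the command line did not parse and bit 1 means the device could not be opened or identified.

```python
import subprocess


def smartctl_command(device):
    """Build the command line; -a asks smartctl for all SMART information."""
    return ["smartctl", "-a", device]


def drive_responds(device, run=subprocess.run):
    """Return True if smartctl was able to open and identify the device.

    Other bits of the exit status (failing attributes, logged errors) mean
    the drive did answer, so they are ignored for this particular check.
    """
    result = run(smartctl_command(device), capture_output=True, text=True)
    return (result.returncode & 0b11) == 0
```

A call such as drive_responds("/dev/sdl") would have to run as root on the server itself. A True result only says the drive answered; the SMART report still needs to be read before trusting the drive again.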
July 2, 2010
Author

Joe L., thanks for the info. I will follow your instructions and see what happens. I could not attach the logs because they are too big, so I put them on my work's server, and you can download them from here:
http://www.dpispecialtyfoods.com/syslogs.zip

Thanks,
James
July 2, 2010

Looks like the drive just stopped responding to commands.

Jul 2 00:10:46 NAS kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 2 00:10:46 NAS kernel: ata11.00: BMDMA stat 0x25
Jul 2 00:10:46 NAS kernel: ata11.00: failed command: READ DMA EXT
Jul 2 00:10:46 NAS kernel: ata11.00: cmd 25/00:00:6f:71:95/00:04:2d:00:00/e0 tag 0 dma 524288 in
Jul 2 00:10:46 NAS kernel: res 41/04:ef:6f:71:95/04:03:2d:00:00/e0 Emask 0x1 (device error)
Jul 2 00:10:46 NAS kernel: ata11.00: status: { DRDY ERR }
Jul 2 00:10:46 NAS kernel: ata11.00: error: { ABRT }
Jul 2 00:10:46 NAS kernel: ata11.00: configured for UDMA/133
Jul 2 00:10:46 NAS kernel: ata11: EH complete
Jul 2 00:10:46 NAS kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 2 00:10:46 NAS kernel: ata11.00: BMDMA stat 0x25
Jul 2 00:10:46 NAS kernel: ata11.00: failed command: READ DMA EXT
Jul 2 00:10:46 NAS kernel: ata11.00: cmd 25/00:00:6f:71:95/00:04:2d:00:00/e0 tag 0 dma 524288 in
Jul 2 00:10:46 NAS kernel: res 51/04:00:6f:71:95/04:04:2d:00:00/e0 Emask 0x1 (device error)
Jul 2 00:10:46 NAS kernel: ata11.00: status: { DRDY ERR }
Jul 2 00:10:46 NAS kernel: ata11.00: error: { ABRT }
Jul 2 00:10:46 NAS kernel: ata11.00: configured for UDMA/133
Jul 2 00:10:46 NAS kernel: ata11: EH complete
Jul 2 00:10:46 NAS kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 2 00:10:46 NAS kernel: ata11.00: BMDMA stat 0x25
Jul 2 00:10:46 NAS kernel: ata11.00: failed command: READ DMA EXT
Jul 2 00:10:46 NAS kernel: ata11.00: cmd 25/00:00:6f:71:95/00:04:2d:00:00/e0 tag 0 dma 524288 in
Jul 2 00:10:46 NAS kernel: res 51/04:00:6f:71:95/04:04:2d:00:00/e0 Emask 0x1 (device error)
Jul 2 00:10:46 NAS kernel: ata11.00: status: { DRDY ERR }
Jul 2 00:10:46 NAS kernel: ata11.00: error: { ABRT }

The last message in this series is one I'd never seen before. From 2TB to 0TB in one step.

Jul 2 00:11:15 NAS kernel: md: disk0 read error
Jul 2 00:11:15 NAS kernel: handle_stripe read error: 764769616/0, count: 1
Jul 2 00:11:15 NAS kernel: md: disk0 read error
Jul 2 00:11:15 NAS kernel: handle_stripe read error: 764769624/0, count: 1
Jul 2 00:11:15 NAS kernel: sd 3:0:0:0: [sdl] Asking for cache data failed
Jul 2 00:11:15 NAS kernel: sd 3:0:0:0: [sdl] Assuming drive cache: write through
Jul 2 00:11:15 NAS kernel: md: disk0 read error
Jul 2 00:11:15 NAS kernel: handle_stripe read error: 764769632/0, count: 1
Jul 2 00:11:15 NAS kernel: sdl: detected capacity change from 2000398934016 to 0

As I said, you can try re-seating the connectors on its cables, but other than that, purchase a new drive, install it, and then just press "Start" to rebuild the parity drive. Then, RMA the old one and use the one you get back as a replacement for more movies.

Oh yes, be careful not to dislodge the cables on the working drives as you work in the server... don't want multiple failed drives. Make sure you power down before moving any cabling.

Joe L.
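Finding these messages in a large rotated syslog by eye is tedious, so here is a small Python sketch that scans log lines for the two patterns quoted above: md read errors per disk, and a device whose capacity drops to 0. The regular expressions are written against the exact message wording in this post and are an assumption about other kernel versions. The function name summarize is hypothetical.

```python
import re

# Patterns copied from the wording of the kernel messages quoted above.
READ_ERROR = re.compile(r"kernel: md: (disk\d+) read error")
CAPACITY = re.compile(r"kernel: (sd[a-z]+): detected capacity change from (\d+) to (\d+)")


def summarize(lines):
    """Count md read errors per disk and list devices whose capacity fell to 0."""
    read_errors = {}
    vanished = []
    for line in lines:
        m = READ_ERROR.search(line)
        if m:
            read_errors[m.group(1)] = read_errors.get(m.group(1), 0) + 1
            continue
        m = CAPACITY.search(line)
        if m and int(m.group(3)) == 0:
            vanished.append((m.group(1), int(m.group(2))))
    return read_errors, vanished


sample = [
    "Jul 2 00:11:15 NAS kernel: md: disk0 read error",
    "Jul 2 00:11:15 NAS kernel: handle_stripe read error: 764769616/0, count: 1",
    "Jul 2 00:11:15 NAS kernel: md: disk0 read error",
    "Jul 2 00:11:15 NAS kernel: sdl: detected capacity change from 2000398934016 to 0",
]
print(summarize(sample))  # ({'disk0': 2}, [('sdl', 2000398934016)])
```

To run it over a real file, pass an open file object instead of the sample list, for example summarize(open("/var/log/syslog.1")).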
July 2, 2010
Author

Joe, thanks. I did reboot the system, and it thinks there is a new parity drive in the slot. I will pull it and not take any chances. I do have 2 drives in the array that have not had any data written to them yet (they show up as empty). Can I just remove one of them from the array and then add it to the parity slot?

Thanks,
James
July 4, 2010

While it may be empty, pulling it will make the server think a second drive has failed, and it will want to rebuild parity. I guess, since it seems to be the parity drive that has failed, you could do this and it would build the array from scratch. Hrm, not sure that's a great path to follow...
July 4, 2010

"The current log is filled with attempts to spin down the failed drive. (And that seems like it might be a bug in unRAID, since if a drive has failed, it is not likely to spin down, and filling the log with attempts to spin it down will eventually use up all the RAM)"

They are two separate bugs actually:

1. BUG: Issuing spindown commands to a disk 10 seconds apart. Such commands should be at worst spindownDelay minutes apart. And it doesn't have to be a bad disk. I've seen this effect on an otherwise perfectly good disk which silently ignored spindown commands.

2. BUG: Allowing the possibility of a runaway syslog to completely use up all RAM and crash the server. Such train wrecks can be so easily prevented by mounting /var/log/ into its own limited-size ramdisk. I've actually done this on my custom bzroot build by adding this single line to /etc/fstab:
none /var/log tmpfs size=128m 0 0
That ramdisk doesn't set apart any RAM. It will only start using up RAM when stuff starts filling it up, up to the set limit.

I've been trying to bring the above two issues to Limetech's attention for at least six months already. If more people email Limetech about this, then maybe we can get these bugs a little higher on the to-do list. Email: [email protected]
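To spell out what each field of that fstab line means, here is a short Python sketch that splits the entry into its six standard fields and converts the tmpfs size option to bytes. The helper names are hypothetical, and the size parser only handles the k/m/g suffixes, not the percentage form that tmpfs also accepts.

```python
def size_to_bytes(size):
    """Convert a tmpfs size value such as '128m' to bytes (k, m, g suffixes only)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = size[-1].lower()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)


def parse_fstab_line(line):
    """Split one fstab entry into its six whitespace-separated fields."""
    device, mountpoint, fstype, options, dump, passno = line.split()
    opts = {}
    for option in options.split(","):
        name, _, value = option.partition("=")
        opts[name] = value or None
    return {
        "device": device,          # "none": tmpfs has no backing device
        "mountpoint": mountpoint,  # where the ramdisk is mounted
        "fstype": fstype,
        "options": opts,
        "dump": int(dump),         # 0: skipped by dump backups
        "passno": int(passno),     # 0: not checked by fsck at boot
    }


entry = parse_fstab_line("none /var/log tmpfs size=128m 0 0")
limit = size_to_bytes(entry["options"]["size"])
print(entry["mountpoint"], entry["fstype"], limit)  # /var/log tmpfs 134217728
```

The size option is an upper bound (128 MiB here) and not a reservation, which is the point made above: a runaway syslog would fill the ramdisk and get write errors instead of consuming all of the server's RAM.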
July 4, 2010

"I do have 2 drives that have not had any data written to them yet in the array (they show up as empty), can I just remove one of them from the array and then add it to the parity slot?"

In this case, provided one of those empty disks is as large as or larger than any of your other data disks, you can move it to the Parity slot. Then power up the server and confirm the array is not Started (it will show the disk you moved as being 'missing'). Now from the console or telnet, type:
initconfig
(answer Yes to the confirmation). Next, back on the webGui, click Refresh and all drive statuses will show as 'New' (blue dots). You may now click Start, and unRAID will mount your remaining data disks and start parity-sync on the disk in the Parity slot.

IMPORTANT: this procedure is only for the case where the Parity disk has failed or been disabled and you want to commission an existing Data disk to replace it.
July 4, 2010

"They are two separate bugs actually: 1. BUG: Issuing spindown commands to a disk 10 seconds apart. Such commands should be at worst spindownDelay minutes apart. And it doesn't have to be a bad disk. I've seen this effect on an otherwise perfectly good disk which silently ignored spindown commands."

Not a BUG. If your disk does not support spindown, then set spindown delay for that disk to 'never'.

"2. BUG: Allowing the possibility of a runaway syslog to completely use up all RAM and crash the server. Such train wrecks can be so easily prevented by mounting /var/log/ into its own limited-size ramdisk. I've actually done this on my custom bzroot build by adding this single line to /etc/fstab:
none /var/log tmpfs size=128m 0 0
That ramdisk doesn't set apart any RAM. It will only start using up RAM when stuff starts filling it up, up to the set limit."

That's a really good idea & I will incorporate it in the next release.

"I've been trying to bring the above two issues to Limetech's attention for at least six months already. If more people email Limetech about this, then maybe we can get these bugs a little higher on the to-do list."

To my knowledge, I have never received an email from you. If this is incorrect, please enlighten me.
July 4, 2010

"I've been trying to bring the above two issues to Limetech's attention for at least six months already. If more people email Limetech about this, then maybe we can get these bugs a little higher on the to-do list."

"To my knowledge, I have never received an email from you. If this is incorrect, please enlighten me."

I can't update the email address in my profile, because I lost my password for this forum long ago. I am only able to still be active here thanks to the Firefox cookie. Didn't want to bother you about resetting my password or email address. But that's not really important. Since I got your attention here, I'll continue here...

"They are two separate bugs actually: 1. BUG: Issuing spindown commands to a disk 10 seconds apart. Such commands should be at worst spindownDelay minutes apart. And it doesn't have to be a bad disk. I've seen this effect on an otherwise perfectly good disk which silently ignored spindown commands."

"Not a BUG. If your disk does not support spindown, then set spindown delay for that disk to 'never'."

I only learned that the new disk does not support spindown after it crashed the whole server through the syslog. Give me a moment... I'll gather links to various previous posts, and then I'll try to make a better argument about why it is a bug.
July 4, 2010

"I can't update the email address in my profile, because I've long ago lost my password for this forum. I am only able to still be active here thanks to the Firefox cookie. Didn't want to bother you about resetting my password or email address. But that's not really important. Since I got your attention here, I'll continue here..."

Even though the email address in your profile is obviously invalid, my email address is well known. You complain about problems and suggest people email me, but as far as I can tell you have never emailed me with any issue. Most of the time, email from a customer gets my attention right away (I'm not always perfect with that).

Others have complained about what I call "maintenance" releases without a prior beta first. Well, almost all the time these releases are in response to issues that customers brought to my attention via email, and which, most of the time, involved multiple email exchanges over several days, along with customer testing via private releases.

In the early days I could read and respond to almost every forum post, but that is impossible for me now. I do not see some really critical issues, though there are some here who forward me links to threads saying, "you might want to look at this". To those people, Thank You!!

So if you have a dire problem, it is no good to complain about a non-response on the forum; you need to email me directly: [email protected]

I hope this is clear.
July 5, 2010

"So if you have a dire problem, it is no good to complain about a non-response on the forum; you need to email me directly"

No, this BUG is not a "dire problem" for me -- I have my own workaround. I disabled all of unRAID's spin up/down functionality from the web management interface. But even with this thing "disabled", I am still getting logs like these...

...
Jul 4 05:17:57 v454 kernel: md: disk0: ATA_OP_SETIDLE1 ioctl error: -5
Jul 4 05:17:57 v454 kernel: md: disk2: ATA_OP_SETIDLE1 ioctl error: -5
Jul 4 05:17:57 v454 kernel: md: disk1: ATA_OP_SETIDLE1 ioctl error: -5
...

Shouldn't "disabled" mean "disabled"?

It is strange to me how you seem more interested in lecturing me about email, when in my eyes I am doing you a favor. We aren't here as friends, freely exchanging free software. We are here as seller and customer.

Fact: Spinning down hard disks every 10 seconds is a bug, no matter what you like to call it.

----

P.S.: I was in the process of gathering links to all past posts related to the spindown BUG, to make things easier for you, when it occurred to me that you've already seen them all, as can be inferred from your latest post.

"...You complain about problems and suggest people email me..."
July 5, 2010

"So if you have a dire problem, it is no good to complain about a non-response on the forum; you need to email me directly: [email protected] I hope this is clear."

There have been numerous complaints on the forum about non-response to email.

It might help if, once someone becomes a customer, they get a customer ID and can then post to a help desk ticket system. That way the query and response are tied together, the status is viewable, and resolutions can be posted into a knowledge base.

My friend did this with his web hosting company and it worked really well. I believe he used Cerberus Helpdesk: http://www.cerberusweb.com
July 5, 2010

"2. BUG: Allowing the possibility of a runaway syslog to completely use up all RAM and crash the server. Such train wrecks can be so easily prevented by mounting /var/log/ into its own limited-size ramdisk. I've actually done this on my custom bzroot build by adding this single line to /etc/fstab:
none /var/log tmpfs size=128m 0 0
That ramdisk doesn't set apart any RAM. It will only start using up RAM when stuff starts filling it up, up to the set limit."

"That's a really good idea & I will incorporate it in the next release."

I started a new thread related to this topic here:
http://lime-technology.com/forum/index.php?topic=6910.0