hgeorges Posted April 25, 2011
Hello,
I'm running unRAID 4.7 and I have not monitored the system for a while. This morning I found a disk disabled, along with the following errors:
Apr 23 21:19:54 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 23 21:19:54 Tower kernel: ata2.00: failed command: READ DMA EXT
Apr 23 21:19:54 Tower kernel: ata2.00: cmd 25/00:00:3f:1d:18/00:04:67:00:00/e0 tag 0 dma 524288 in
Apr 23 21:19:54 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 21:19:54 Tower kernel: ata2.00: status: { DRDY }
Apr 23 21:19:59 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Apr 23 21:20:04 Tower kernel: ata2: soft resetting link
Apr 23 21:20:09 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Apr 23 21:20:14 Tower kernel: ata2: SRST failed (errno=-16)
Apr 23 21:20:14 Tower kernel: ata2: soft resetting link
Apr 23 21:20:19 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Apr 23 21:20:24 Tower kernel: ata2: SRST failed (errno=-16)
Apr 23 21:20:24 Tower kernel: ata2: soft resetting link
Apr 23 21:20:29 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Apr 23 21:20:59 Tower kernel: ata2: SRST failed (errno=-16)
Apr 23 21:20:59 Tower kernel: ata2: soft resetting link
Apr 23 21:21:04 Tower kernel: ata2: SRST failed (errno=-16)
Apr 23 21:21:04 Tower kernel: ata2: reset failed, giving up
Apr 23 21:21:04 Tower kernel: ata2.00: disabled
Apr 23 21:21:04 Tower kernel: ata2.00: device reported invalid CHS sector 0
Apr 23 21:21:04 Tower kernel: ata2: EH complete
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 18 1d 3f 00 04 00 00
Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729633599
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 18 21 3f 00 03 f8 00
Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729634623
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 00 00 42 d7 00 00 40 00
Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 17111
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code
Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
and further down:
sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 19 5f 9f 00 04 00 00
Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729716127
Apr 23 21:21:04 Tower kernel: md: disk6 read error
Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633536/2, count: 1
Apr 23 21:21:04 Tower kernel: md: disk6 read error
Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633544/2, count: 1
Apr 23 21:21:04 Tower kernel: md: disk6 read error
Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633552/2, count: 1
Apr 23 21:21:04 Tower kernel: md: disk6 read error
Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633560/2, count: 1
Apr 23 21:21:04 Tower kernel: md: disk6 read error
etc.
I'm attaching the full syslog below, and I'm sorry for the trouble. After reading the sections regarding alignment in unRAID v4.7, I'm uncertain (read: confused!) about what I need to do to hold onto my data, if it is still intact. I'd very much appreciate your guidance.
Thank you!
hg
syslog.1.zip
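A side note on reading the log above: the `CDB:` lines and the `end_request ... sector` lines describe the same failed reads. The following is a minimal Python sketch, assuming the standard 10-byte SCSI READ(10) layout (opcode 0x28, big-endian block address in bytes 2 to 5, block count in bytes 7 to 8); the function name `decode_read10` is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Return (starting LBA, block count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDB bytes copied from the first failed request in the syslog excerpt.
lba, count = decode_read10("28 00 67 18 1d 3f 00 04 00 00")
print(lba, count)
```

The decoded address equals the sector number in the matching `end_request` line, and 1024 blocks of 512 bytes (an assumed sector size) is the 524288 bytes reported in the `dma 524288 in` line. The first `handle_stripe` number is 63 lower than that sector, which would be consistent with a partition starting at sector 63, though that is an inference and not something the thread states.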
dgaschk Posted April 26, 2011
Disk6 is disabled. Please post a SMART report for /dev/sdb.
hgeorges Posted April 26, 2011
"Disk6 is disabled. Please post a SMART report for /dev/sdb."
Hello,
Thanks so much for taking the time. I have gone through some radical activities: pulling everything out of the case, cleaning out the thick dust (computers are like vacuum cleaners!) and reinstalling everything in a larger and better ventilated case. I just finished reinstalling and rebooting, and the problem persists (disk 6 disabled, due to, I guess, a misaligned MBR). Clicking on the disk link, I'm redirected to another screen, which shows just this:
disk6 settings
Partition format: MBR: unaligned
File system type: reiserfs
Spin down delay: Use default
Spinup group(s): host1
Checking the syslog more carefully, I couldn't find obvious errors related to either the hard drive or the controller. I found a few other (new, I guess) messages; I cannot tell where they come from or what I need to do about them. I've added to the message the new syslog, as well as a SMART report for each of the disks, including sdb. I've extracted those messages from the syslog and posted them below for visibility.
Apr 26 17:41:09 Tower kernel: PCI: Using MMCONFIG for extended config space
Apr 26 17:41:09 Tower kernel: ACPI Warning: Incorrect checksum in table [OEMB] - 0D, should be 08 (20090903/tbutils-314)
Apr 26 17:41:09 Tower kernel: ACPI: No dock devices found.
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: supports D1 D2
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: PME# supported from D0 D1 D2 D3hot D3cold
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: PME# disabled
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: reg 20 io port: [0x9800-0x981f]
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: supports D1 D2
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: PME# supported from D0 D1 D2 D3hot D3cold
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: PME# disabled
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: reg 20 io port: [0x9880-0x989f]
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: supports D1 D2
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: PME# supported from D0 D1 D2 D3hot D3cold
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: PME# disabled
Apr 26 17:41:09 Tower kernel: pci 0000:00:10.3: reg 20 io port: [0x9c00-0x9c1f]
Apr 26 17:41:09 Tower kernel: system 00:05: iomem range 0xfecc0000-0xfecc0fff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:07: iomem range 0xfec00000-0xfec00fff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:07: iomem range 0xfee00000-0xfee00fff has been reserved
Apr 26 17:41:09 Tower kernel: system 00:0a: iomem range 0xe0000000-0xefffffff has been reserved
Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0x0-0x9ffff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xc0000-0xcffff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xe0000-0xfffff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0x100000-0x7fffffff could not be reserved
Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xfeb00000-0xffffffff could not be reserved
Apr 26 17:41:09 Tower kernel: pci 0000:03:04.0: BAR 6: address space collision on of device [0xfea00000-0xfea7ffff]
Apr 26 17:41:09 Tower kernel: pci 0000:03:05.0: BAR 6: address space collision on of device [0xfe980000-0xfe9fffff]
Apr 26 17:41:09 Tower kernel: pci 0000:03:06.0: BAR 6: address space collision on of device [0xfeac0000-0xfeadffff]
Apr 26 17:41:09 Tower kernel: pci 0000:03:07.0: BAR 6: address space collision on of device [0xfeae0000-0xfeae3fff]
Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: IO window: 0xb000-0xbfff
Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: MEM window: 0xfe800000-0xfe8fffff
Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: PREFETCH window: 0xf4000000-0xf7ffffff
Apr 26 17:41:09 Tower kernel: pci 0000:00:02.0: PCI bridge, secondary bus 0000:02
Apr 26 17:41:09 Tower kernel: pci 0000:00:02.0: IO window: 0x1000-0x1fff
Any further suggestions for troubleshooting and resolution are greatly appreciated!
Thank you
logs.zip
dgaschk Posted April 27, 2011
The disk is disabled because a write to it failed. MBR: unaligned has nothing to do with a disk being disabled and is a correct setting for non-Advanced Format drives. The other messages you posted are informational startup messages.
ata2=disk6=sdb is disabled. In order to determine the status of this disk we require a SMART report. Please post a SMART report for this disk.
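For readers unsure how to produce the requested report: it normally comes from the `smartctl` tool in smartmontools, where `-a` prints all SMART information for a device. Below is a small Python sketch of that call, assuming `smartctl` is installed and the script runs as root; the helper names are hypothetical, and typing `smartctl -a /dev/sdb` at the console does the same thing.

```python
import shutil
import subprocess


def smart_report_command(device):
    """Build the smartctl invocation that prints all SMART information for a device."""
    return ["smartctl", "-a", device]


def smart_report(device):
    """Run smartctl and return its text output (assumes root and smartmontools)."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found on PATH")
    completed = subprocess.run(
        smart_report_command(device), capture_output=True, text=True
    )
    # smartctl uses its exit status as a bit mask, so a non-zero code is not
    # treated as a failure here; the caller reads the report text instead.
    return completed.stdout
```

Calling `smart_report("/dev/sdb")` would return the report as a string that can be saved to a file and attached to a post.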
hgeorges Posted April 27, 2011
"The disk is disabled because a write to it failed. MBR: unaligned has nothing to do with a disk being disabled and is a correct setting for non-Advanced Format drives. The other messages you posted are informational startup messages. ata2=disk6=sdb is disabled. In order to determine the status of this disk we require a SMART report. Please post a SMART report for this disk."
OK. Thanks for the clarification. Sorry for not posting the SMART reports earlier; I picked the wrong file from my disk. I've just removed it and attached a new zip file to the previous message, containing both the SMART reports and the latest syslog. I hope they will point to the problem.
Also... I just realized that I can actually access the disk, even though it is marked RED and disabled (in the unMENU screen). I can read the information on disk6 and open directories and files, as if nothing happened. I'm confused!
lionelhutz Posted April 27, 2011
"Also... I just realized that I actually can access the disk, even if it is marked RED, and disabled (in unmenu screen). I can read the information on disk6, open directories and files, as if nothing happened. I'm confused!"
That's what the parity does.
Try stopping the array and unassigning the disk from the slot. Then, start the array and stop it again. Then, assign the disk again and start the array again, allowing the disk to rebuild. If it was a connection problem then the disk will be rebuilt; otherwise the rebuild will fail.
Peter
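To illustrate why a disabled disk can still be read, here is a minimal Python sketch of single-parity reconstruction. It assumes parity is a plain byte-wise XOR across the data disks; the disk contents and the helper `xor_blocks` are made up for the example and are not taken from the server in this thread.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a collection of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of one stripe on three data disks.
disks = {
    "disk1": b"\x10\x20\x30",
    "disk2": b"\x0f\x0e\x0d",
    "disk6": b"\xaa\xbb\xcc",
}
parity = xor_blocks(disks.values())

# disk6 stops responding: its contents are recomputed from the others plus parity.
surviving = [data for name, data in disks.items() if name != "disk6"]
rebuilt = xor_blocks(surviving + [parity])
print(rebuilt == disks["disk6"])
```

A rebuild applies the same calculation to every stripe and writes the result back to the physical disk.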
RobJ Posted April 27, 2011
The SMART report for Disk 6 looks very good, and the syslog looks great, with Disk 6 found and mounted like normal, no issues at all. In fact it looks so good, I cannot tell from the syslog that it is marked Disabled! (I hope that in the future Tom will allow us to see a Disabled status in the syslog.)
Your array was running fine until it suddenly lost contact with Disk 6 at 21:19:54. It tried and tried but could not get any response from the drive, and in less than 90 seconds, had disabled the drive. ALL subsequent errors can be ignored because you obviously cannot read or write to a drive that the operating system no longer considers present.
In my experience, a drive that suddenly stops responding, with no other errors, is usually completely fine, unless in the somewhat rare case that it suffered a catastrophic failure. The attempt to retrieve a SMART report will then clearly indicate whether the drive is fine or not. Because you have a clean SMART report, the drive is not at fault, and therefore the problem must have been an anomaly with the disk controller (VIA-based), or a cable came loose. Powering off, then checking the cables, then powering back on usually clears the issue. If it happens again, you may have motherboard/controller issues.
Since the drive is probably fine, you can use the Trust My Array procedure to more quickly restore the array, or let it rebuild per Peter's instructions.
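The "less than 90 seconds" figure can be checked against the timestamps in the first post's log. This is a small Python sketch, assuming the syslog timestamp format shown there and supplying the year separately, since syslog lines do not carry one; `seconds_between` is a hypothetical helper name.

```python
from datetime import datetime


def seconds_between(first, last, year=2011):
    """Seconds elapsed between two syslog-style timestamps such as 'Apr 23 21:19:54'."""
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {first}", fmt)
    end = datetime.strptime(f"{year} {last}", fmt)
    return (end - start).total_seconds()


# First timeout on ata2.00 versus the "ata2.00: disabled" entry.
elapsed = seconds_between("Apr 23 21:19:54", "Apr 23 21:21:04")
print(elapsed)
```

For these two entries the gap is 70 seconds: five seconds to the first "link is slow to respond" message, then repeated soft resets until the kernel gave up.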
lionelhutz Posted April 27, 2011
I would strongly recommend against doing the trust my parity procedure. There could have been writes to the "virtual" drive while the real drive was missing. The rebuild will restore the real drive to match the virtual drive.
Peter
hgeorges Posted April 27, 2011
Folks,
Thanks much for your help. I went with rebuilding the drive, just because I read Peter's advice first. In retrospect, I should have plugged in a new, identical drive and kept the original drive untouched, just in case the rebuild ran into problems. Knock on wood: the rebuild is still several hours away from completion.
While that is working away, I have two more questions, if you don't mind:
1. Would you interpret the previously posted boot-up messages, where memory cannot be reserved and BAR 6 reports an address space collision, as conditions left unresolved by the kernel? What about the ACPI checksum error? Should PME be enabled in the BIOS? Do these require my intervention in any way?
2. Since I don't have a backup for my unRAID-stored data and rely exclusively on the parity drive to keep it intact, has anyone experience with mirroring the parity drive at the motherboard BIOS level? Would unRAID recognize it as one drive only, and just use it as such? Is the parity drive mirroring offering any real (additional) protection?
Thanks again very much for your help!
Joe L. Posted April 27, 2011
"Since the drive is probably fine, you can use the Trust My Array procedure to more quickly restore the array - or let it rebuild per Peter's instructions."
Make an informed decision... you can choose.
You can get back the data you were writing to the failed disk by NOT using the "trust-my-array" procedure, and instead re-constructing the failed drive as described earlier in this thread,
or
You can get back immediate parity protection by forcing the array to trust parity as correct, knowing that it will have errors because a write to the physical drive failed, so it may not be accurately usable in recovery situations until after a full parity check is performed.
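The trade-off between the two options can be sketched with the same XOR arithmetic. This Python example assumes single XOR parity and uses made-up one-byte values for one stripe; it models a write that arrives while disk6 is disabled, so only parity is updated. All names are hypothetical.

```python
def parity_of(*blocks):
    """XOR parity over integer-valued blocks (single parity assumed)."""
    result = 0
    for block in blocks:
        result ^= block
    return result


# Made-up contents of one stripe before the failure.
disk1, disk2, disk6_physical = 0x01, 0x02, 0x0A
parity = parity_of(disk1, disk2, disk6_physical)

# disk6 is disabled, then 0x0F is written to it: only parity can record the change.
disk6_virtual = 0x0F
parity = parity_of(disk1, disk2, disk6_virtual)

# Option 1, rebuild: the new data is recovered from parity and the other disks.
rebuilt = parity_of(disk1, disk2, parity)

# Option 2, trust parity with the old physical disk: the stripe no longer agrees.
sync_error = parity_of(disk1, disk2, disk6_physical) != parity

print(hex(rebuilt), sync_error)
```

In this model the rebuild returns the newer value, while trusting parity leaves the stale value on the physical disk and a mismatch for the next parity check to find and correct.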
Joe L. Posted April 27, 2011
"since I don't have a backup for my unraid stored data"
A RAID array is NOT a replacement for a backup of your precious data. All it would take is one lightning strike, flood, fire, etc. to cause you to lose all your files. Please invest in a backup drive, stored off-site, for really important files.
"has anyone experience with mirroring the parity drive at the MB BIOS level? Would unraid recognize it as one drive only, and just use it as such?"
This has been done by at least one member, but it was done for speed, not for the additional protection. The user had multiple disks being written at the same time. It will usually eliminate the capability to spin down the drive or read its temperature.
"Is the parity drive mirroring offering any real (additional) protection?"
The parity disk is no more important than any other disk in your server when re-constructing a failed disk. All are equally important.
lionelhutz Posted April 27, 2011
"Is the parity drive mirroring offering any real (additional) protection?"
Nope, it offers almost nothing extra at all. The only case where it would help is if the parity drive and a data drive failed at the same time. You'd be better off mirroring the data drives. Well, you'd actually be better off storing any critical data in an off-site backup.
Peter
hgeorges Posted April 27, 2011
Thanks for the backup advice. I know that I have to bite the bullet, and events like this show how important it is. I'm thinking of replicating the data to a friend's server (setting up something reciprocal); I just haven't found the time to do it. My idea is to use rsync over ssh to send periodic updates once we have exchanged a copy of our data.
What is the Pro community on this forum using to back up larger volumes of data? I've been putting this off for quite some time for cost reasons, but also because I will have to manage those backups (periodic verification, refresh, purging, etc.).
Coming back to the server: I guess those boot-up messages didn't trigger any thoughts? Since I'm at it now, I'm trying to resolve everything I can in one swoop. I also tried to attach the screen with the SMART history, since there are a couple of things I have kept ignoring and I'm wondering what your opinion is. The attachment didn't work (the file is 340 KB and it didn't load), so I hope you can pick it up from here: http://ifile.it/usoaqc5/Downloads.zip (a free online storage site without a registration requirement; I hope it will work).
Thanks again very much for your time!
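The rsync-over-ssh idea mentioned above could look roughly like the following Python sketch. The share path, remote account, and destination directory are hypothetical placeholders, not values from this thread; the rsync options used (`-a` for archive mode, `-e ssh` to select the transport, `--dry-run` to preview) are standard ones.

```python
import subprocess


def build_rsync_command(source_dir, remote, dest_dir, dry_run=True):
    """Build an rsync-over-ssh command that mirrors source_dir to remote:dest_dir."""
    cmd = ["rsync", "-a", "-e", "ssh"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be copied without copying
    # The trailing slash makes rsync copy the directory's contents, not the directory.
    cmd += [source_dir.rstrip("/") + "/", f"{remote}:{dest_dir}"]
    return cmd


def run_backup(source_dir, remote, dest_dir, dry_run=True):
    """Run the transfer; raises CalledProcessError if rsync exits with an error."""
    subprocess.run(build_rsync_command(source_dir, remote, dest_dir, dry_run), check=True)


# Hypothetical example: preview a sync of a photo share to a friend's server.
print(build_rsync_command("/mnt/user/photos", "backup@friend.example.org", "/mnt/user/hg-backup"))
```

The sketch deliberately leaves out `--delete`, so files removed locally stay on the remote copy; whether that is desirable depends on how the backups are meant to be purged.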
prostuff1 Posted April 27, 2011
"What is the Pro community on this forum using for backup of larger volumes of data? I've been putting this off for quite some time, for cost reasons, but also thinking that I will have to manage those backups (periodic verification, refresh, purging etc)"
I have CrashPlan running on my server and it works to back up most of my stuff. I do not keep backups of my DVD/TV shows as I can, if I have to, rip them all to the server again. For the more important stuff (photos, documents, etc.) I use CrashPlan to back up to the cloud.
hgeorges Posted April 27, 2011
Hello to all, and thanks for all the assistance. My server is back to normal:
Array Status STARTED, 10 disks in array. Parity is Valid. Last parity check < 1 day ago. Parity updated 1 times to address sync errors.
I could not find where the parity error came from, but I hope that, with all the drive statuses now green, I'm back to normal.
If anyone has some energy left before I close this thread, I'd appreciate some feedback or advice on my previous posting (boot-up messages and HDD SMART history warnings).
I've upgraded unMENU in the meantime, and it looks awesome. Thanks for this great add-on; it makes it easier to look at system details.
I'll take a closer look at CrashPlan for backup. At first sight it looked interesting, with all the data deduplication and everything.
hg
RobJ Posted April 28, 2011
"1.- would you interpret the previously posted boot-up messages where memory cannot be reserved, and BAR6 - address space collision as unresolved conditions (by the kernel)? What about ACPI checksum error? Should PME be enabled in the BIOS? Do these require my intervention in any way?"
Just ignore them; these are completely normal, and you can find variations of almost all of them in all other syslogs. It is just the Linux OS adapting itself to your machine, determining what can be reserved and where, etc.
As to the SMART display you linked, it is in a form I am not familiar with, but enough info is there for me to say that there is nothing of concern visible. Most of the warnings are about very high hours, but you probably knew that. Many of the drives are quite old, with a lot of hours on them. You can perhaps adjust the warning and error thresholds for them, to quash those warnings. The only other ones I see are 1 Reallocated_sector count, 1 UDMA_CRC count, and 1 ATA_Error count, and none of them have changed over the entire time of monitoring, so not a current problem.
Just a side note, I'm not sure it is useful to graph Power_On_Hours. Isn't it just tracking how time passes, over time? I suppose if one was shutting the machine off more one month than another month, it could be useful, but only from a server usage standpoint, hardly from a SMART standpoint.