October 29, 2007

OK, this is the contents of the email I sent to Tom on Friday, but I have not heard back.

====begin email====

Started a couple weeks ago when we had a power outage and the UPS didn't stay up long enough. I had a couple sync errors when the machine started back up, but everything seemed to be fine.

The other day I tried to shut the machine off using the Web interface. After clicking the "Stop" button to take the array offline, it just hung there. So eventually I just "gave it the finger" and when it restarted I had sync errors again, which I attributed to the forced reset.

Well, I'm at it again. I was gonna reboot to clear the logs and then run a couple parity checks consecutively and send you the log. I've tried to stop it again (through the web interface) and it's hung again. I've clicked on the other menu items but it's not responding anymore.

I telnet'd in and issued the "poweroff" command. I got a response back about "The system is going down for system halt NOW!" but that's it. It's stuck there (went down a line, but the prompt hasn't come back yet). I can log in again through the console but the machine is not going down.

How can I properly kill the system to restart it? I issued the "df -h" command to see what is still mounted, but the only thing that shows up seems to be the USB key.

I started this because I have done a couple Parity Checks and I always end up with errors. So I was gonna reboot to clear the logs and then run a couple consecutively and send you the log.

====end email====

Since that time I have forced a restart using the Reset button on the front of the computer. I have also upgraded to the latest release just to make sure that I was current with everything. After the upgrade I forced a parity check and ended up with 621 errors.
I noticed this entry in the syslog:

Oct 27 20:20:01 Tower kernel: [ 9816.546540] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Oct 27 20:20:01 Tower kernel: [ 9816.546546] ata2.00: (port_status 0x20080000)
Oct 27 20:20:01 Tower kernel: [ 9816.546553] ata2.00: cmd 25/00:00:df:00:3e/00:01:17:00:00/e0 tag 0 cdb 0x0 data 131072 in
Oct 27 20:20:01 Tower kernel: [ 9816.546555] res 50/00:00:de:01:3e/00:00:17:00:00/e0 Emask 0x2 (HSM violation)
Oct 27 20:20:01 Tower kernel: [ 9816.875259] ata2: soft resetting port
Oct 27 20:20:02 Tower kernel: [ 9817.034957] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 27 20:20:02 Tower kernel: [ 9817.182001] ata2.00: configured for UDMA/133
Oct 27 20:20:02 Tower kernel: [ 9817.182011] ata2: EH complete
Oct 27 20:20:02 Tower kernel: [ 9817.236804] sd 2:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Oct 27 20:20:02 Tower kernel: [ 9817.242078] sd 2:0:0:0: [sdc] Write Protect is off
Oct 27 20:20:02 Tower kernel: [ 9817.242083] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Oct 27 20:20:02 Tower kernel: [ 9817.284054] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This error seems to have occurred a couple times (I couldn't attach the full syslog for some reason, you can find it here.).

So. I'm not sure what to do from here.
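For anyone wanting to tally these events across a full syslog rather than eyeballing it, here is a minimal Python sketch. It assumes every event looks like the excerpt above: an "exception" line naming the device (such as ata2.00), followed a few lines later by a "res" line tagged "(HSM violation)". The function name count_hsm_violations and both regular expressions are my own illustration, not anything shipped with unRAID, and other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Assumed format: the exception line names the ATA device, e.g. "ata2.00".
EXC_RE = re.compile(r"kernel: \[\s*[\d.]+\]\s+(ata\d+\.\d+): exception")
# Assumed format: the result line carries the "(HSM violation)" tag.
HSM_RE = re.compile(r"kernel: \[\s*[\d.]+\]\s+res .*\(HSM violation\)")


def count_hsm_violations(lines):
    """Count HSM violations per ATA device in an iterable of syslog lines."""
    counts = Counter()
    current = None  # device named by the most recent exception line
    for line in lines:
        match = EXC_RE.search(line)
        if match:
            current = match.group(1)
            continue
        if current is not None and HSM_RE.search(line):
            counts[current] += 1
            current = None  # wait for the next exception line
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_hsm.py /path/to/syslog
    with open(sys.argv[1], errors="replace") as handle:
        for device, total in sorted(count_hsm_violations(handle).items()):
            print(device, total)
```

The counts are keyed by ATA device, not by sdX name; mapping ata2.00 to sdc still has to be done by reading the nearby "sd 2:0:0:0: [sdc]" lines.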
October 29, 2007

I had never heard of an 'HSM violation' before, so I Googled it and found that it seems to indicate a possible bug in hard disk firmware or in the Linux hard disk drivers, something to do with NCQ. It seems to be harmless *some* of the time, perhaps not other times. If you can turn NCQ off for sdb and sdc, you *might* be better off, although I don't know whether it is really a problem for your setup.

It does NOT seem to be related to your parity errors: 2 of the 4 'HSM violations' occurred minutes away from the nearest parity error, and only one is coincident. sdc had 1 HSM violation, then sdb had 3.

There is a patch (the fifo draining patch) being made for this, probably available in a future release of the Linux kernel, which we will receive in a future release of unRAID.

I have no ideas as to all of the parity errors.
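The timing comparison above can be repeated mechanically. This is only a sketch under stated assumptions: the function names nearest_gap and gaps_to_parity_errors are hypothetical, and the inputs are plain lists of kernel timestamps in seconds (the bracketed numbers in the syslog) that you have already pulled out for the HSM violations and for the parity errors. The numbers in the example are invented for illustration and are not taken from the posted syslog.

```python
from bisect import bisect_left


def nearest_gap(event_time, sorted_times):
    """Smallest absolute gap (seconds) between event_time and sorted_times."""
    if not sorted_times:
        return None
    i = bisect_left(sorted_times, event_time)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = sorted_times[max(i - 1, 0): i + 1]
    return min(abs(event_time - t) for t in neighbours)


def gaps_to_parity_errors(hsm_times, parity_times):
    """Pair each HSM violation time with its gap to the nearest parity error."""
    parity_sorted = sorted(parity_times)
    return [(t, nearest_gap(t, parity_sorted)) for t in hsm_times]


if __name__ == "__main__":
    # Invented example timestamps, in seconds since boot.
    hsm = [9816.5, 12000.0]
    parity = [9816.9, 10500.0, 15000.0]
    for when, gap in gaps_to_parity_errors(hsm, parity):
        print(f"HSM violation at {when:.1f}s, nearest parity error {gap:.1f}s away")
```

A gap of a second or two suggests the events may be related; gaps of several minutes, like most of the ones described above, suggest they are not.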
October 29, 2007 Author

Hey RobJ,

Thanks for the answer and the input. You know... I do this kinda thing for a living, and why-in-the-hell I didn't do my own Google search I don't know. Chalk it up to "shoemaker's kids" I guess.

It just seemed interesting out there in the middle of the parity checking errors... who knows.

I'll wait a bit for some more ideas. I've been thinking about doing the whole... start with 2 disks... Parity check... add in the next disk... Parity check... etc. until something goes wrong, but honestly... that sounds like work. And what with Guitar Hero 3 just out... and a couple levels of Portal to finish... well, you get the picture.

Thanks again
November 1, 2007 Author

I'll update this post later with specifics. The basics were identifying which drive was causing the Parity Errors and then working with that drive to determine what was wrong. In my case it was solved simply by running reiserfsck to fix up some problems on the drive.

So far so good... more later.

BTW I finished Portal and beat GH3 on medium....