Kernel panic - out of memory - unRAID Server 4.2

January 9, 2008

I've started to experience some flakiness as of late.

The other day I decided to do a parity check, so I fired it off from the WebPage before I went to bed. When I woke up in the morning, it had red-marked one of my drives. This is the second time that has happened (the first time with a different drive).

So, I stopped everything, disabled the drive in question, and rebooted so the array would notice the drive was gone. Then I shut down again, re-enabled the drive and started everything back up. The system identified the drive as "replaced" and asked me to "Rebuild" (rebuild the drive from parity) or "Restore" (add the new drive and build new parity). I told it to "Restore" the parity and let it go. Within about 30 minutes, my tower went off-line (the browser no longer got a response) and a "Kernel panic - out of memory" error was on the console. The NumLock, CapsLock and Scroll Lock lights on the keyboard were also all flashing in sync.

Hmmmm.. that's odd

So.. I started the machine back up, stopped the array, ran reiserfsck on all the drives, and found nothing. So I tried restoring again (recreating parity), this time with the syslog being written to another file. Well, the same error happened and I had a 387MB syslog which has some normal stuff at the beginning and then a whole lot of the same error. Here are what I think are the important bits:

Jan  7 18:38:25 Tower in.telnetd[1229]: connect from 192.168.2.2 (192.168.2.2)
Jan  7 18:38:27 Tower login[1230]: ROOT LOGIN  on `pts/0' from `192.168.2.2'
Jan  7 18:39:58 Tower login[1076]: ROOT LOGIN  on `tty1'
Jan  7 19:11:12 Tower kernel: [ 2448.742320] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x1380000 action 0x2 frozen
Jan  7 19:11:12 Tower kernel: [ 2448.742331] ata8.00: cmd 25/00:18:9f:0f:c4/00:03:04:00:00/e0 tag 0 cdb 0x0 data 405504 in
Jan  7 19:11:12 Tower kernel: [ 2448.742333]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  7 19:11:17 Tower kernel: [ 2454.102091] ata8: port is slow to respond, please be patient (Status 0xff)
Jan  7 19:11:22 Tower kernel: [ 2458.773187] ata8: device not ready (errno=-16), forcing hardreset
Jan  7 19:11:22 Tower kernel: [ 2458.773192] ata8: hard resetting port
Jan  7 19:11:27 Tower kernel: [ 2464.312628] ata8: port is slow to respond, please be patient (Status 0xff)
Jan  7 19:11:32 Tower kernel: [ 2468.804067] ata8: COMRESET failed (errno=-16)
Jan  7 19:11:32 Tower kernel: [ 2468.804071] ata8: hard resetting port
Jan  7 19:11:37 Tower kernel: [ 2474.343508] ata8: port is slow to respond, please be patient (Status 0xff)
Jan  7 19:11:42 Tower kernel: [ 2478.834947] ata8: COMRESET failed (errno=-16)
Jan  7 19:11:42 Tower kernel: [ 2478.834951] ata8: hard resetting port
Jan  7 19:11:48 Tower kernel: [ 2484.374389] ata8: port is slow to respond, please be patient (Status 0xff)
Jan  7 19:12:17 Tower kernel: [ 2513.778351] ata8: COMRESET failed (errno=-16)
Jan  7 19:12:17 Tower kernel: [ 2513.778359] ata8: limiting SATA link speed to 1.5 Gbps
Jan  7 19:12:17 Tower kernel: [ 2513.778361] ata8: hard resetting port
Jan  7 19:12:22 Tower kernel: [ 2518.778810] ata8: COMRESET failed (errno=-16)
Jan  7 19:12:22 Tower kernel: [ 2518.778815] ata8: reset failed, giving up
Jan  7 19:12:22 Tower kernel: [ 2518.778817] ata8.00: disabled
Jan  7 19:12:22 Tower kernel: [ 2518.778831] ata8: EH complete
Jan  7 19:12:22 Tower kernel: [ 2518.778956] sd 8:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Jan  7 19:12:22 Tower kernel: [ 2518.778962] end_request: I/O error, dev sdg, sector 79957919
Jan  7 19:12:22 Tower kernel: [ 2518.779042] sd 8:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Jan  7 19:12:22 Tower kernel: [ 2518.779046] end_request: I/O error, dev sdg, sector 79958711
Jan  7 19:12:22 Tower kernel: [ 2518.779086] sd 8:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Jan  7 19:12:22 Tower kernel: [ 2518.779090] end_request: I/O error, dev sdg, sector 79959495
Jan  7 19:12:22 Tower kernel: [ 2518.779125] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779127] handle_stripe read error: 79957856/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779132] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779134] handle_stripe read error: 79957864/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779137] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779139] handle_stripe read error: 79957872/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779142] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779144] handle_stripe read error: 79957880/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779147] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779149] handle_stripe read error: 79957888/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779152] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779154] handle_stripe read error: 79957896/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779157] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779159] handle_stripe read error: 79957904/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779162] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779164] handle_stripe read error: 79957912/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779167] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779169] handle_stripe read error: 79957920/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779172] md5: read error!
Jan  7 19:12:22 Tower kernel: [ 2518.779174] handle_stripe read error: 79957928/5, count: 1
Jan  7 19:12:22 Tower kernel: [ 2518.779176] md5: read error!

Those last parts are the ones that repeat.

So.. I again rebooted, checked the disks again and still everything came up fine.

I decided to build the array with just a parity drive and one of my data drives. That worked, no problem.

So before I went to bed, I rebooted and ran memtest all night. Woke up this morning, with no errors. Looking good.

I restarted the machine and decided to add a couple more drives (one parity, four data) and let it create parity. This one errored out again with the same message (I didn't tail the syslog this time).

I'm at a loss at this point. I'm not sure what I should do from here, so I'm hoping you guys will give me some suggestions. The array is up right now but of course I have no parity so I'm vulnerable.

I'm working on a place to put my syslog... more on that later.

Update: File attachments seem to be working again, so I attached a truncated version of the log.

January 9, 2008

Looks like some piece of h/w is starting to fail. Some suggestions for troubleshooting:

- swap the hard drives around and see if problem follows the hard drive or stays with a specific port.

- if stays with a controller port, try swapping controller cards around (if you have 2)

- if problems persist and have no pattern, probably it's the power supply

As for kernel panic - probably what's happening is that messages are getting generated to the syslog so fast that all available memory is getting allocated and the kernel starts killing processes off and eventually a critical one gets killed. The syslog subsystem is supposed to limit the size of the log file to 1MB, but I think if messages get generated too fast, it can grow beyond that in a short time - I will look into this.
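For anyone who wants to see what is flooding a log like this without reading through hundreds of megabytes, here is a minimal Python sketch that strips the timestamps and collapses numbers so that repeated messages group together. It assumes the line layout seen in the excerpt in the first post (date, hostname, `kernel:`, bracketed uptime); the function name `summarize` and the file path in the usage note are made up for illustration and are not part of unRAID.

```python
import re
from collections import Counter

# Matches a prefix such as "Jan  7 19:12:22 Tower kernel: [ 2518.779125] "
PREFIX = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s+\[\s*\d+\.\d+\]\s+"
)

def summarize(lines, top=5):
    """Return the most common kernel messages, ignoring timestamps and numbers."""
    counts = Counter()
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue  # not a kernel line in the assumed layout
        msg = line[m.end():].strip()
        # Collapse digit runs so lines differing only in sector numbers group together.
        counts[re.sub(r"\d+", "N", msg)] += 1
    return counts.most_common(top)
```

Usage would be along the lines of `with open("syslog.txt") as f: print(summarize(f))`, where `syslog.txt` stands for wherever the saved log lives. Because the file is read line by line, the script itself does not need to hold the whole log in memory.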

January 9, 2008

Author

Hey Tom,

Thanks so much for the reply !

Swapping the drives around seems like a pretty easy thing to try. I actually have two of those 5in3 enclosures now, so swapping from one enclosure to the other will not only move the drives around in the case, but also on the controller cards (2xSATA Fasttracks).

As I said, this is the second time it started with me manually starting a check, which resulted in a red-icon'd drive (sdg). But forcing a replace and rebuild solved it that time. Incidentally, I also ran Seatools on the drive in question at that time, and it came up with no problems.

We'll see what happens and I'll let you know.

Also, ya I figure the error message was that... as I said, the last syslog I tailed was 385 MB, so I'm sure it did just exhaust available memory.

January 9, 2008

Which 5-in-3 cages are you using - there is a wide range of quality out there.

January 9, 2008

Author

I'm using the Startech branded ones, which seem to be the same as those from about 5 other manufacturers.

I originally talked about them in this thread.

I also had upgraded to a 700w OCZ P/S about half a year back. I'm pretty sure it's a pretty good P/S (dual rail etc.) I don't have the model number with me at work but can supply if you like.

I'm up to 8 drives in total on the server now ( 6 x 500GB Seagates and 2 x 320GB Seagates)

        Model        Temp.  Size         Free         Reads    Writes  Errors
parity  ST3500630AS  *      488,386,552  -            0        74,499  0
disk1   ST3500630AS  *      488,386,552  23,083,180   441,874  8       0
disk2   ST3500630AS  *      488,386,552  59,655,140   74,123   8       0
disk3   ST3500630AS  *      488,386,552  69,237,520   74,122   8       0
disk4   ST3500630AS  *      488,386,552  240,801,496  74,123   8       0
disk5   ST3500630AS  *      488,386,552  17,026,220   58,883   34,361  0
disk6   ST3320620AS  *      312,571,192  3,475,556    61,237   8       0
disk7   ST3320620AS  *      312,571,192  176,048,908  61,238   8       0

disk5 was the first one to show a problem, and this last time it was disk2

Disk0, Disk1 are on mobo controller, disk2-disk5 are on the first controller card and disk6-7 are on the next card.

Updated information:

Power Supply is an OCZ GameXStream - OCZ700GXSSLI

January 10, 2008

Author

OK,

So last night I moved some hard drives around and eliminated the first controller card, which had disk2 and disk5 (these were the two that showed up disabled at different times) attached to it. So, only hard drives on the on-board controller and on controller card#2.

So I recreated the parity and everything worked. There were some "suspect" entries in the log as it was running:

Jan  9 18:59:09 Tower kernel: [ 1968.740478] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x2
Jan  9 18:59:09 Tower kernel: [ 1968.740484] ata3.00: (port_status 0x20002000)
Jan  9 18:59:09 Tower kernel: [ 1968.740491] ata3.00: cmd 25/00:e8:8f:6c:3d/00:02:05:00:00/e0 tag 0 cdb 0x0 data 380928 in
Jan  9 18:59:09 Tower kernel: [ 1968.740493]          res d0/00:e8:8f:6c:3d/00:02:05:00:00/e0 Emask 0x12 (ATA bus error)
Jan  9 18:59:09 Tower kernel: [ 1969.247518] ata3: soft resetting port
Jan  9 18:59:10 Tower kernel: [ 1969.407230] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  9 18:59:10 Tower kernel: [ 1969.549242] ata3.00: configured for UDMA/133
Jan  9 18:59:10 Tower kernel: [ 1969.549250] ata3: EH complete
Jan  9 18:59:10 Tower kernel: [ 1969.601736] sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Jan  9 18:59:10 Tower kernel: [ 1969.603777] sd 3:0:0:0: [sdc] Write Protect is off
Jan  9 18:59:10 Tower kernel: [ 1969.603782] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  9 18:59:10 Tower kernel: [ 1969.603874] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jan  9 19:01:02 Tower kernel: [ 2081.855477] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x180000 action 0x2
Jan  9 19:01:02 Tower kernel: [ 2081.855483] ata3.00: (port_status 0x20280000)
Jan  9 19:01:02 Tower kernel: [ 2081.855490] ata3.00: cmd 25/00:00:e7:9d:92/00:04:05:00:00/e0 tag 0 cdb 0x0 data 524288 in
Jan  9 19:01:02 Tower kernel: [ 2081.855492]          res 51/84:00:e7:9d:92/84:04:05:00:00/e0 Emask 0x12 (ATA bus error)
Jan  9 19:01:03 Tower kernel: [ 2082.182255] ata3: soft resetting port
Jan  9 19:01:03 Tower kernel: [ 2082.341964] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  9 19:01:03 Tower kernel: [ 2082.496120] ata3.00: configured for UDMA/133
Jan  9 19:01:03 Tower kernel: [ 2082.496127] ata3: EH complete
Jan  9 19:01:03 Tower kernel: [ 2082.558318] sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Jan  9 19:01:03 Tower kernel: [ 2082.564828] sd 3:0:0:0: [sdc] Write Protect is off
Jan  9 19:01:03 Tower kernel: [ 2082.564832] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  9 19:01:03 Tower kernel: [ 2082.587306] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jan  9 19:18:08 Tower kernel: [ 3106.166813] ata3: limiting SATA link speed to 1.5 Gbps
Jan  9 19:18:08 Tower kernel: [ 3106.166821] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x6
Jan  9 19:18:08 Tower kernel: [ 3106.166824] ata3.00: (port_status 0x20002000)
Jan  9 19:18:08 Tower kernel: [ 3106.166831] ata3.00: cmd 25/00:00:df:c3:9b/00:04:08:00:00/e0 tag 0 cdb 0x0 data 524288 in
Jan  9 19:18:08 Tower kernel: [ 3106.166833]          res d0/00:00:df:c3:9b/00:04:08:00:00/e0 Emask 0x12 (ATA bus error)
Jan  9 19:18:08 Tower kernel: [ 3106.166846] ata3: hard resetting port
Jan  9 19:18:09 Tower kernel: [ 3106.669492] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan  9 19:18:09 Tower kernel: [ 3106.824343] ata3.00: configured for UDMA/133
Jan  9 19:18:09 Tower kernel: [ 3106.824352] ata3: EH complete
Jan  9 19:18:09 Tower kernel: [ 3106.880807] sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Jan  9 19:18:09 Tower kernel: [ 3106.881005] sd 3:0:0:0: [sdc] Write Protect is off
Jan  9 19:18:09 Tower kernel: [ 3106.881009] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  9 19:18:09 Tower kernel: [ 3106.881078] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jan  9 19:57:46 Tower kernel: [ 5478.838640] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Jan  9 19:57:46 Tower kernel: [ 5478.838645] ata3.00: (port_status 0x20080000)
Jan  9 19:57:46 Tower kernel: [ 5478.838653] ata3.00: cmd c8/00:78:7f:54:ac/00:00:00:00:00/ef tag 0 cdb 0x0 data 61440 in
Jan  9 19:57:46 Tower kernel: [ 5478.838655]          res 50/00:00:f6:54:ac/00:04:08:00:00/ef Emask 0x2 (HSM violation)
Jan  9 19:57:46 Tower kernel: [ 5479.167261] ata3: soft resetting port
Jan  9 19:57:46 Tower kernel: [ 5479.326970] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan  9 19:57:46 Tower kernel: [ 5479.482640] ata3.00: configured for UDMA/133
Jan  9 19:57:46 Tower kernel: [ 5479.482650] ata3: EH complete
Jan  9 19:57:46 Tower kernel: [ 5479.529742] sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Jan  9 19:57:46 Tower kernel: [ 5479.536995] sd 3:0:0:0: [sdc] Write Protect is off
Jan  9 19:57:46 Tower kernel: [ 5479.537001] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  9 19:57:46 Tower kernel: [ 5479.537141] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I will try another rebuild tonight with the drives only on the onboard controller and on controller#1 and see what I get.

January 10, 2008

Yes, looks like something's wrong with the Promise controller.

January 10, 2008

Author

Yes, looks like something's wrong with the Promise controller.

Which one are you referring to? The one from last night's test (Controller#2) or Controller#1 from the original failure?

January 10, 2008

From the syslog you attached, controller "ata3" and "ata4" are reporting errors. These ports correspond to devices:

ata3 -> "sdc" -> ST3500630AS

ata4 -> "sdd" -> ST3500630AS

Both drives are Seagates, but the syslog doesn't show the s/n. You can look at the Devices page to see which disks correspond to 'sdc' and 'sdd'.
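The port-to-device pairing above can also be pulled out of a saved syslog. The following is a rough Python sketch, assuming (as appears to hold in the excerpts posted in this thread, e.g. ata3 next to `sd 3:0:0:0: [sdc]` and ata8 next to `sd 8:0:0:0: [sdg]`) that the number in `ataN` lines up with the SCSI host number in the `sd N:0:0:0:` lines. That correspondence is not guaranteed on other systems or kernels, and the function name `port_to_device` is hypothetical.

```python
import re

# Matches e.g. "sd 3:0:0:0: [sdc]" and captures the host number and device name.
SD_LINE = re.compile(r"sd (\d+):\d+:\d+:\d+: \[(sd[a-z]+)\]")

def port_to_device(lines):
    """Map 'ataN' -> 'sdX', ASSUMING ataN corresponds to SCSI host N."""
    mapping = {}
    for line in lines:
        m = SD_LINE.search(line)
        if m:
            mapping[f"ata{m.group(1)}"] = m.group(2)
    return mapping
```

The Devices page remains the place to confirm which physical disk (model and serial number) sits behind each `sdX` name.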

The error "ATA Bus Error" is your basic "hardware reported an error". Typical with loose cable, bad power, bad controller, bad drive, etc.

The other error being reported is "HSM violation". There are quite a number of reports of this lately by linux users, for example:

http://www.opensubscriber.com/message/[email protected]/8263422.html

I would guess the "HSM violation" is bogus and being generated as a result of the ATA Bus Error.

These kinds of problems can be very tedious to track down. Pretty much have to use a process of elimination to isolate the component(s) causing the problem.

January 10, 2008

Author

OK... well then I'm totally confused now.

Those bus errors were from when all the drives were on Controller#2, and the parity build actually finished. But when I had only the drives connected to Controller#1, I got the same kernel panic that is shown in the log file I attached to the first post. Logically speaking, I could understand it if the errors were from Controller#1.

The possible power thing worries me, but I thought I got a pretty good power supply, especially since these 5in3 cages only require 2xSATA power connectors or 3xMolex connectors.

I guess I'll have to try powering off all the drives and start adding in one by one until I get some errors again. Man what a PIA.

Things were OK for a long while there and now all of my machines are having issues. My HTPC motherboard died, so I got a new one of those. Then the video in it started locking up the machine, so I bought a new vid card. One stick of RAM went in my unRAID box. My brand spanking new home workstation wouldn't boot up with the brand spanking new ATI 3850 vid card I bought for it, so I had to RMA that (still waiting since before the holidays for that to get back), and now the unRAID box is having flakiness.

Time to take up tennis ::)

Anyway, any other recommended course-of-actions are welcome (I'm quite good at following orders), and Tom I thank you for your assistance so far.

I'll report back soon.

January 14, 2008

Author

Well.. I have done some more testing and gathered more data.

I tested with just the two drives that are on my motherboard controller and experienced no problems. I then started adding in drives one at a time and testing. I had no errors show up until I got to a total of six drives.

It should be noted that as I went through adding each drive one by one, I had no power supplied to the drives that weren't in the configuration. As I write this I realize that I have yet to try a configuration of only four drives and up, with a different selection of those drives. I think the tests will be fine, but I should do it just to confirm my theory.

So far, from my testing, I have had:

the ata Exception occur on different drives
the ata Exception occur on different controllers
the ata Exception occur in different slots in each different 5in3 cage
not a problem with the drives connected to the on-board controller
the ata Exception occur only on the 500GB drives

I am attaching all the logs I have saved from the testing in case anyone wishes to look. Please note that I have comments at the top, as well as a layout of which drive is where at that time.

So far, all I can surmise is that there is a problem with both controller cards or a power problem... or something else I can't think of.

Would it be useful to try a power supply on the second 5in3 cage or a different power supply all together?

January 14, 2008

I'd say, try a different PSU if you asked me

January 15, 2008

I doubt that my comments will help any, but they may help you clarify the problem.

You mentioned finding 'ata Exception' errors after various combinations of drives and controllers, but I believe you may have missed the fact that there are different kinds of errors occurring, one kind probably more serious than the other. The last time that 'ATA bus errors' occurred was Jan 9 at 7:18pm, with a series of them continuing until the SATA link speed was dropped from 3.0 to 1.5 Gbps, at which time the bus errors ceased. You have gotten HSM errors since, but those seem to be a very different animal, and perhaps have been happening all along. The ATA bus errors and the 'frozen' errors are what Tom was referring to above, and are probably related to hardware flakiness somewhere. The HSM errors seem, from my very limited research, to be a software problem, possibly a bus timing issue in a Linux driver. From what I can tell, they are quite recent, probably related to NCQ, and are not understood at all. There is a patch being discussed and tested in the Linux world, but to me it seems like one of those "I pressed the side of my head right here, and the pain in my arm went away" fixes. Others tried it and it seems to work, but it sure does not convey any confidence that the problem is really fixed or understood.

Until you get better advice, I would monitor but ignore the HSM errors, as soft errors. What you do need to fix are the ATA bus errors, and I have no additional ideas to give you. I too would suspect a cable, or controller, or temperatures, or power, or motherboard. One idea, have you checked your internal temps and airflow? Perhaps a drive or motherboard chipset is getting too hot?

January 15, 2008

Author

I doubt that my comments will help any, but they may help you clarify the problem.

You're right... you didn't help! You only confused me more

OK... thanks for pointing out that I have two separate errors with two levels of severity. One error is (hopefully) to be ignored while the other is not. My confusion at this point is what to look at in the logs to tell these errors apart. Currently (as you've already guessed) as soon as I see anything resembling this code "exception Emask", I assumed that it was a bad thing.

Jan 10 22:32:50 Tower kernel: [ 3780.859964] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Jan 10 22:32:50 Tower kernel: [ 3780.859969] ata8.00: (port_status 0x20080000)
Jan 10 22:32:50 Tower kernel: [ 3780.859976] ata8.00: cmd c8/00:00:57:a6:01/00:00:00:00:00/ea tag 0 cdb 0x0 data 131072 in
Jan 10 22:32:50 Tower kernel: [ 3780.859978]          res 50/00:00:56:a7:01/00:00:00:00:00/ea Emask 0x2 (HSM violation)

As I type this I think I just found where I need to look. Am I correct that I have to look at this part "(ATA bus error)" in the code below?

Jan  9 19:18:08 Tower kernel: [ 3106.166813] ata3: limiting SATA link speed to 1.5 Gbps
Jan  9 19:18:08 Tower kernel: [ 3106.166821] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x6
Jan  9 19:18:08 Tower kernel: [ 3106.166824] ata3.00: (port_status 0x20002000)
Jan  9 19:18:08 Tower kernel: [ 3106.166831] ata3.00: cmd 25/00:00:df:c3:9b/00:04:08:00:00/e0 tag 0 cdb 0x0 data 524288 in
Jan  9 19:18:08 Tower kernel: [ 3106.166833]          res d0/00:00:df:c3:9b/00:04:08:00:00/e0 Emask 0x12 (ATA bus error)

If that's true, then the error that has happened the only time I have truly let this run till failure is this one. I guess I should let the tests get to a failure point.

Jan  7 19:11:12 Tower kernel: [ 2448.742320] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x1380000 action 0x2 frozen
Jan  7 19:11:12 Tower kernel: [ 2448.742331] ata8.00: cmd 25/00:18:9f:0f:c4/00:03:04:00:00/e0 tag 0 cdb 0x0 data 405504 in
Jan  7 19:11:12 Tower kernel: [ 2448.742333]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Thanks for re-raising the temperature possibility. I will have to "open the case" for some testing too.

Sheeesh...

Soooo.. the only thing I know for sure now is...

...

January 15, 2008

You're right... you didn't help! You only confused me more

Sorry, I was afraid of that, but it seemed better to give you a clearer picture of the problem than not tell you.

My confusion at this point is what to look at in the logs to tell these errors apart. Currently (as you've already guessed) as soon as I see anything resembling this code "exception Emask", I assumed that it was a bad thing.

The "exception Emask" errors ARE a bad thing, but if you have to have one, the HSM kind seem to be the least harmful, as far as we know. The exception handler seems to recover sufficiently, and you have both the Reiser file system and the unRAID parity protection working for you. So far, I am not aware of anyone losing data, having disk errors, or unRAID crashes from HSM errors, but time will tell.

As I type this I think I just found where I need to look. Am I correct that I have to look at this part "(ATA bus error)" in the code below?

Yes, the rest is status info from the exception handler, trying to be helpful.
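To tell the kinds of exception apart across a long log, the parenthesised label at the end of each `res` line can be tallied per port. This is only a sketch in Python, based on the line shapes quoted in this thread (an `ataN.00: exception Emask ...` line followed a few lines later by a `res ... Emask 0x.. (label)` line); the function name `classify` is hypothetical, and other kernels may word these lines differently.

```python
import re
from collections import Counter

# Start of an exception report, e.g. "ata3.00: exception Emask 0x0 SAct 0x0 ..."
PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
# Result line, e.g. "res d0/00:.../e0 Emask 0x12 (ATA bus error)"
RES_LABEL = re.compile(r"\bres\s+\S+\s+Emask\s+0x[0-9a-fA-F]+\s+\(([^)]+)\)")

def classify(lines):
    """Count exceptions by (port, label), pairing each report with its res line."""
    counts = Counter()
    port = None
    for line in lines:
        p = PORT.search(line)
        if p:
            port = p.group(1)  # remember which port this report belongs to
            continue
        r = RES_LABEL.search(line)
        if r and port is not None:
            counts[(port, r.group(1))] += 1
            port = None  # wait for the next exception line
    return counts
```

Run over the saved logs, a result dominated by `HSM violation` entries would match the "monitor but ignore" advice, while `ATA bus error` or `timeout` entries would point back at the hardware checks discussed earlier in the thread.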

January 15, 2008

Author

Hey RobJ,

I just wanted to make sure you don't mistake my attempts at humour and light-heartedness as me not appreciating your input. I very much do appreciate it.

One must keep trying to find the humour in these situations... and boy am I looking!

And NLS, I still have plans to try another p/s, so thanks for your input too!

January 16, 2008

Author

I'm happy to report that I was able to get a full parity build done today without any issues. Everything was in its original spot in the case and I had the full complement of drives going.

To be clear, I had the HSM error pop up twice in the 7 hours it took to do the build.

I will be keeping an eye on things, but for now I'm still not sure what is wrong/right and why it worked this time through. However, I am thinking it may have been an overheating issue. These cages are in a new case that is a little tighter than my old Antec tower, and I think it was a little warm in the computer room (by warm I mean 73 or 74 degrees F as opposed to the mid 60's that it usually is in winter).

Most of the drives steadied around 45°C while building. I'm not sure if that's acceptable or not, but I can see how all that heat could cause the onboard controller and add-in cards to get hotter than they should. I'll keep an eye on it.

So, thanks to all for the input, help and patience. I'll post back if things re-occur.
