[SOLVED] Intermittent / inconsistent parity errors

nick5429 · May 10, 2012

I seem to have developed intermittent parity errors (i.e., repeated non-correcting parity checks don't all report the same errors).

I had an unclean shutdown a while back and it's possible I didn't let the parity recalc finish. That was stupid, but is somewhat separate from my issue.

/var/log/syslog.1:May 1 00:00:01 nickserver kernel: md: recovery thread checking parity...
/var/log/syslog.1:May 1 02:26:37 nickserver kernel: md: parity incorrect: 1542896200 <--- this one appears "real"

/var/log/syslog.1:May 1 02:40:30 nickserver kernel: md: parity incorrect: 1669823784 <--- the others are all phantom

/var/log/syslog:May 9 10:38:38 nickserver kernel: md: recovery thread checking parity...
/var/log/syslog:May 9 14:15:23 nickserver kernel: md: parity incorrect: 1542896200

/var/log/syslog:May 9 16:16:46 nickserver kernel: md: parity incorrect: 2290922496

/var/log/syslog:May 9 17:05:06 nickserver kernel: md: parity incorrect: 2558346984

/var/log/syslog:May 9 17:36:25 nickserver kernel: md: parity incorrect: 2676010000

/var/log/syslog:May 9 17:49:42 nickserver kernel: md: parity incorrect: 2740517552

/var/log/syslog:May 9 22:19:38 nickserver kernel: md: recovery thread checking parity...

/var/log/syslog:May 9 22:44:34 nickserver kernel: md: parity incorrect: 281891928

/var/log/syslog:May 9 22:49:24 nickserver kernel: md: parity incorrect: 333552304

/var/log/syslog:May 10 00:08:53 nickserver kernel: md: parity incorrect: 1158023712

/var/log/syslog:May 10 00:49:43 nickserver kernel: md: parity incorrect: 1542896200

/var/log/syslog:May 10 00:50:10 nickserver kernel: md: parity incorrect: 1546624504

/var/log/syslog:May 10 00:50:49 nickserver kernel: md: parity incorrect: 1552673536

/var/log/syslog:May 10 02:31:30 nickserver kernel: md: parity incorrect: 2374474944

/var/log/syslog:May 10 08:17:10 nickserver kernel: md: recovery thread checking parity...
/var/log/syslog:May 10 08:40:56 nickserver kernel: md: parity incorrect: 277107080

/var/log/syslog:May 10 10:43:11 nickserver kernel: md: parity incorrect: 1542896200

/var/log/syslog:May 10 10:57:54 nickserver kernel: md: parity incorrect: 1676424632

/var/log/syslog:May 10 12:12:00 nickserver kernel: md: parity incorrect: 2286487984
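Since block 1542896200 shows up in every check above while the other blocks come and go, tallying the block numbers across the rotated logs makes the pattern easier to see. This is a minimal sketch, assuming the syslog lines look like the excerpts above; `count_parity_blocks` is a hypothetical helper name, not part of unRAID.

```shell
# count_parity_blocks: read syslog text on stdin and print "<count> <block>"
# for every block reported as "parity incorrect", most frequent first.
# Assumption: the kernel message format matches the excerpts in this post.
count_parity_blocks() {
    grep -o 'md: parity incorrect: [0-9]*' \
        | awk '{print $4}' \
        | sort -n \
        | uniq -c \
        | sort -k1,1nr -k2,2n \
        | awk '{print $1, $2}'
}
```

Usage would be something like `cat /var/log/syslog.1 /var/log/syslog | count_parity_blocks`. A block whose count equals the number of checks run is a candidate for a real mismatch; blocks seen only once are more likely transient.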

Configuration:

unRAID 4.7 on a full Slackware (13.1?) installation

All drives are SATA and connected directly to motherboard headers.

I have 3 1.5TB drives and 2 2TB drives:

Status	Disk	Mounted	Device	Model/Serial	Temp	Reads	Writes	Errors	Size	Used	%Used	Free
OK	parity		/dev/sde	9VT1_5YD517KW	31°C	38725618	2774174					
OK	/dev/md1	/mnt/disk1	/dev/sda	SAMSUNG_HD154UI_S1Y6J1KS744713	*	33225543	415589		1.50T	1.50T	100%	1.38M
OK	/dev/md2	/mnt/disk2	/dev/sdc	00Z_WD-WMAVU3394155	*	24331755	242123		1.50T	1.47T	99%	26.74G
OK	/dev/md3	/mnt/disk3	/dev/sdb	SAMSUNG_HD154UI_S1Y6J1KS744712	*	30915273	381624		1.50T	915.74G	62%	584.52G
OK	/dev/md4	/mnt/disk4	/dev/sdd	00P_WD-WCAZAD107336	31°C	46038453	1765660		2.00T	488.22G	25%	1.51T
	 	 	 	 	 	 	 	Total:	6.50T	4.38T	67%	2.12T

Apparently my system configuration has some sort of log rotation turned on, so my syslog [attached] doesn't show my last boot (~2 months ago).

May 7 05:37:00 nickserver kernel: ------------[ cut here ]------------
May 7 05:37:00 nickserver kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf5/0x175()

May 7 05:37:00 nickserver kernel: Hardware name: MS-7576

May 7 05:37:00 nickserver kernel: Modules linked in: md_mod xor dm_mod fglrx(P) [last unloaded: md_mod]

May 7 05:37:00 nickserver kernel: Pid: 0, comm: swapper Tainted: P 2.6.32.9-unRAID #9

May 7 05:37:00 nickserver kernel: Call Trace:

May 7 05:37:00 nickserver kernel: [<c102523f>] warn_slowpath_common+0x65/0x7c

May 7 05:37:00 nickserver kernel: [<c12f838c>] ? dev_watchdog+0xf5/0x175

May 7 05:37:00 nickserver kernel: [<c102528a>] warn_slowpath_fmt+0x24/0x27

May 7 05:37:00 nickserver kernel: [<c12f838c>] dev_watchdog+0xf5/0x175

May 7 05:37:00 nickserver kernel: [<c1031c1b>] ? insert_work+0x41/0x49

May 7 05:37:00 nickserver kernel: [<c1031f4e>] ? __queue_work+0x2a/0x2f

May 7 05:37:00 nickserver kernel: [<c102c983>] run_timer_softirq+0x112/0x166

May 7 05:37:00 nickserver kernel: [<c12f8297>] ? dev_watchdog+0x0/0x175

May 7 05:37:00 nickserver kernel: [<c1029269>] __do_softirq+0x79/0xee

May 7 05:37:00 nickserver kernel: [<c1029304>] do_softirq+0x26/0x2b

May 7 05:37:00 nickserver kernel: [<c10293e3>] irq_exit+0x29/0x2b

May 7 05:37:00 nickserver kernel: [<c10126cf>] smp_apic_timer_interrupt+0x6f/0x7d

May 7 05:37:00 nickserver kernel: [<c10031f6>] apic_timer_interrupt+0x2a/0x30

May 7 05:37:00 nickserver kernel: [<c100843f>] ? default_idle+0x2d/0x42

May 7 05:37:00 nickserver kernel: [<c100868c>] c1e_idle+0xcd/0xd2

May 7 05:37:00 nickserver kernel: [<c1001b66>] cpu_idle+0x3a/0x50

May 7 05:37:00 nickserver kernel: [<c1346fe7>] rest_init+0x53/0x55

May 7 05:37:00 nickserver kernel: [<c14fb79d>] start_kernel+0x27b/0x280

May 7 05:37:00 nickserver kernel: [<c14fb097>] i386_start_kernel+0x97/0x9e

May 7 05:37:00 nickserver kernel: ---[ end trace 6f5f19d34dc73db0 ]---

Since a large portion of the identified "bad" blocks are >1500000000, my inclination is to think the issue lies with one of the 2TB drives (or hopefully the SATA cables attaching to them). SMART reports attached.
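The size argument can be checked with a little arithmetic. The sketch below assumes that md counts 1 KiB blocks; that is an inference from the later syslog line "over a total of 1953514552 blocks" for a 2TB parity drive, not something unRAID documentation is quoted on here. `disk_has_block` is a hypothetical helper, and the sizes used are nominal decimal capacities.

```shell
# disk_has_block BLOCK DISK_BYTES
# Exit status 0 if the md block lies within a disk of DISK_BYTES bytes.
# Assumption: md "blocks" are 1 KiB each (inferred, not confirmed).
disk_has_block() {
    local block=$1 disk_bytes=$2
    [ $(( block * 1024 )) -lt "$disk_bytes" ]
}

# Nominal capacities for illustration only.
SIZE_1_5TB=1500000000000
SIZE_2TB=2000000000000
```

For example, `disk_has_block 1542896200 $SIZE_1_5TB` fails while `disk_has_block 1542896200 $SIZE_2TB` succeeds. Under this assumption the 1.5TB boundary falls near block 1464843750, so any mismatch above that can only involve the two 2TB drives.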

Aside: has anyone figured out a good way to determine which file a block maps to with ReiserFS yet? ext2/3/4 has the 'debugfs' tool that can do it...

Any thoughts other than 'replace the SATA cables on the two 2TB drives and try another non-correcting check'?

smart.txt

syslog.txt

nick5429 · July 23, 2012

Interesting.

So I replaced all the SATA cables to no avail.

I modified the script under 'how to troubleshoot recurring parity errors' from here: http://lime-technology.com/wiki/index.php/FAQ#Hard_Drives as follows, to allow greater flexibility / better logging.

#!/bin/bash
# Read each stride of a raw disk repeatedly and compare md5 sums between reads.
# Usage: pass the device name (e.g. sda) as the only argument.
LOG_DIR=/root/hashes
#COUNT=10000000   # 5GB
COUNT=2000000     # 1GB (dd defaults to 512-byte blocks)
#COUNT=10000      # 5MB
SKIP=0            # start block
MAX=2000000000    # end block -- make sure the drive is at least this big
#MAX=1017926000   # alternate end block
TIMES=9           # for each stride, repeat this many times after the initial read

if [ $# -ne 1 ]
then
        echo "Need 1 param, got $#"
        exit 1
fi
DEVICE=$1
echo "Running hashes for device=$DEVICE skip=$SKIP count=$COUNT"

# Logs are written relative to LOG_DIR; stop if it does not exist.
cd "$LOG_DIR" || exit 1

while [ "$SKIP" -lt "$MAX" ]; do
        echo "Begin $DEVICE at block $SKIP size $COUNT."
        INITIALRESULT=$(dd if="/dev/$DEVICE" skip="$SKIP" count="$COUNT" | md5sum -b | awk '{print $1}')
        echo "Block $SKIP: $INITIALRESULT     initial" >> "$DEVICE.log"
        for i in $(seq 1 "$TIMES")
          do
            RESULT=$(dd if="/dev/$DEVICE" skip="$SKIP" count="$COUNT" | md5sum -b | awk '{print $1}')
            echo "Block $SKIP: $RESULT" >> "$DEVICE.log"
            if [ "$RESULT" != "$INITIALRESULT" ]
            then
                    # Report the miscompare on the terminal and in the log.
                    echo "!!!!ERRORERRORERROR Block $SKIP md5 $RESULT did not match expected $INITIALRESULT" | tee -a "$DEVICE.log"
            fi
          done
        SKIP=$((SKIP + COUNT))
done
exit

The script uses "dd" to read the raw contents of the same section of the disk 10 times (one initial read plus 9 repeats), computes the md5 sum of each read, and compares it to the sum from the initial read. An md5 sum will be the same for the same input data.

I ran this over the first 1.5TB of each of my disks in parallel 2-3 times, resulting in ~45TB of reads from each disk (plus a few subset runs). In all that, I found two data miscompares. Even reading these same block addresses with a higher count, I've not been able to reproduce this through anything but "luck".

Both miscompares happened on sda, but I can't reliably determine anything from a sample size of 2.

Block 430000000: 3dc262a182f0031f0c92f6e3cf4811b6 initial
Block 430000000: 3dc262a182f0031f0c92f6e3cf4811b6

Block 430000000: 3dc262a182f0031f0c92f6e3cf4811b6

Block 430000000: 3dc262a182f0031f0c92f6e3cf4811b6

Block 430000000: 3d74be3d80e1f9710837e7149f299ef3

!!!!ERRORERRORERROR Block 430000000 md5 3d74be3d80e1f9710837e7149f299ef3 did not match expected 3dc262a182f0031f0c92f6e3cf4811b6

Block 430000000: 3dc262a182f0031f0c92f6e3cf4811b6

Block 436000000: 0e4597dda7af49bb13b9c647503ca856 initial
Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 271a8b560f4911a74aec07a135c399b5

!!!!ERRORERRORERROR Block 436000000 md5 271a8b560f4911a74aec07a135c399b5 did not match expected 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

Block 436000000: 0e4597dda7af49bb13b9c647503ca856

In both cases, you see that it returns to reading the 'correct' data after having read the error, so it's not like this was caused by a write to the disk in the middle of the process.
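With tens of thousands of log lines per disk, it helps to pull out just the strides where the reads disagreed. A minimal sketch, assuming the log format shown above ("Block <n>: <md5>" with an optional "initial" suffix); `list_miscompares` is a hypothetical helper name.

```shell
# list_miscompares: read a <device>.log on stdin and print
# "<block> <number of distinct md5 sums>" for every stride whose
# repeated reads did not all hash to the same value.
list_miscompares() {
    awk '$1 == "Block" {
             block = $2
             sub(/:$/, "", block)            # strip the trailing colon
             if (!((block, $3) in seen)) {   # first time this md5 is seen here
                 seen[block, $3] = 1
                 distinct[block]++
             }
         }
         END {
             for (b in distinct)
                 if (distinct[b] > 1) print b, distinct[b]
         }' | sort -n
}
```

Usage would be along the lines of `list_miscompares < /root/hashes/sda.log`. Lines beginning with "!!!!ERRORERRORERROR" are ignored, since the mismatching sum is already logged on its own "Block" line.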

Aside: I'm not entirely confident that the "block number" reported by the unRAID parity checker correlates with the skip/count parameters to dd.
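One way to reason about that mapping: the syslog reports "over a total of 1953514552 blocks" for a 2TB parity drive, which works out to about 1 KiB per block, while dd defaults to 512-byte blocks. The sketch below converts under that assumption only; `md_block_to_dd_skip` is a hypothetical helper, and it does not account for any partition offset (md may count from the start of the data partition rather than from the start of the raw disk).

```shell
# md_block_to_dd_skip MD_BLOCK [DD_BS]
# Print the dd skip= value that corresponds to an md block number.
# Assumption: md blocks are 1 KiB; DD_BS defaults to dd's 512-byte block.
md_block_to_dd_skip() {
    local md_block=$1 dd_bs=${2:-512}
    echo $(( md_block * 1024 / dd_bs ))
}
```

Under this assumption, the recurring block 1542896200 would be `skip=3085792400` with dd's default block size, or `skip=1542896200` with `bs=1024`. That would also put it outside the 0 to 2000000000 range of 512-byte blocks that the script above covers.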

Whatever read pattern the unRAID parity checker uses seems to hit this more consistently; I've done a number of non-correcting parity checks in the course of testing this, and typically end up with 1-4 parity mismatches. In contrast, a full run of my script takes 10x as long, but found only 2 errors in 3 full runs.

The errors above are both in the 430000000 area, though I've noticed no such pattern in the parity checks. Note, not all of these ran to completion.

Jul 5 12:42:38 nickserver kernel: mdcmd (29): check NOCORRECT

Jul 5 12:42:38 nickserver kernel: md: recovery thread woken up ...

Jul 5 12:42:38 nickserver kernel: md: recovery thread checking parity...

Jul 5 12:42:38 nickserver kernel: md: using 1152k window, over a total of 1953514552 blocks.

Jul 5 14:31:51 nickserver kernel: md: parity incorrect: 1180695504

Jul 5 15:09:36 nickserver kernel: md: parity incorrect: 1542896200

Jul 5 15:16:50 nickserver kernel: md: parity incorrect: 1608572368

Jul 16 11:17:30 nickserver kernel: mdcmd (25): check NOCORRECT

Jul 16 11:17:30 nickserver kernel: md: recovery thread woken up ...

Jul 16 11:17:30 nickserver kernel: md: recovery thread checking parity...

Jul 16 11:17:30 nickserver kernel: md: using 1152k window, over a total of 1953514552 blocks.

Jul 16 11:36:18 nickserver kernel: md: parity incorrect: 218250952

Jul 16 11:56:39 nickserver kernel: md: parity incorrect: 420616808

Jul 16 16:50:29 nickserver kernel: mdcmd (30): check NOCORRECT

Jul 16 16:50:29 nickserver kernel: md: recovery thread woken up ...

Jul 16 16:50:29 nickserver kernel: md: recovery thread checking parity...

Jul 16 16:50:29 nickserver kernel: md: using 1152k window, over a total of 1953514552 blocks.

Jul 16 18:24:34 nickserver kernel: md: parity incorrect: 1017876024

Jul 18 10:58:00 nickserver kernel: mdcmd (29): check NOCORRECT

Jul 18 10:58:00 nickserver kernel: md: recovery thread woken up ...

Jul 18 10:58:00 nickserver kernel: md: recovery thread checking parity...

Jul 18 10:58:00 nickserver kernel: md: using 1152k window, over a total of 1953514552 blocks.

Jul 18 11:34:35 nickserver kernel: md: parity incorrect: 421167040

Jul 18 13:23:10 nickserver kernel: md: parity incorrect: 1535345568

Jul 18 13:38:38 nickserver kernel: md: parity incorrect: 1676847576

Jul 18 16:07:22 nickserver kernel: md: parity incorrect: 2740517552

All drives have completed an extended offline SMART self-test without error.

Currently attempting to reproduce the issue again.

Any other thoughts? Ideas?

Joe L. · July 23, 2012

Assuming you've ruled out a memory error, by running memtest on it for several cycles, and have ruled out the disk controller by trying another port, then odds are it is the disk itself and you might test the drive's suitability as a wheel-chock.

That wheel-chock test involves placing the disk behind the wheel of your car and repeatedly driving over it, testing its ability to stop the car. If, after several dozen attempts to get it to stop the car, it still returns a rare but occasional MD5 error, get a bigger car.

It will cause you far less hair-loss after going through those tests. (and, odds are, parity will consistently be failing once you are through flattening it)

Joe L.

(Oh yeah... it might not qualify for an RMA afterwards... but you'll feel better regardless)

nick5429 · July 23, 2012

Joe,

Haha, thanks for the tips!

I ran a brief memtest at the beginning of this, but you're right -- I do need to run a longer memtest check just to be sure.

I'd been hoping to be able to reproduce the error more reliably before switching to and testing another controller lest I not prove anything to myself when I don't get an error. Though now that I'm typing this out, I realize that I do have a semi-reliable way to reproduce it: unRAID's built in parity check. That's what I get for trying to outsmart myself with all this fancy testing!

Will report in a day or three after I've had a chance to run things through several cycles.

bcbgboy13 · July 24, 2012

Nick,

1. There is a reason why the commercial servers use at least ECC memory and are on UPS.

2. URE (unrecoverable read errors) rating on the consumer level disks.

I suspect every unRAID user transferring huge amounts of data 24/7 like in your tests (this is not the usual way unRAID is used) will be bitten by the statistics regarding these two factors above.

I cannot prove anything, but it is something for you to keep in mind since you are looking for ideas.

PS. Forgot to mention, but you have some Samsung hard drives. Some older ones had defective firmware, and you should check whether yours are among those affected; you can search this forum, as it was discussed here at the time.

nick5429 · July 25, 2012

While these are valid points which may potentially explain a problem like what I'm seeing...

1. There is a reason why the commercial servers use at least ECC memory and are on UPS.

If the RAM were bad, it would show up in a memtest. Any memory error, ever, is indicative of a module that should be thrown out / RMA'd.

2. URE (unrecoverable read errors) rating on the consumer level disks.

This would show up in the syslog / smart stats and again, indi

I suspect every unRAID user transferring huge amounts of data 24/7 like in your tests (this is not the usual way unRAID is used) will be bitten by the statistics regarding these two factors above.

I highly doubt it, and the test I describe is quite similar to the stress from a standard parity check. If this were true, all users of unRAID would be seeing random parity errors when doing subsequent parity checks. This issue can easily cause silent data corruption, which--even on consumer hardware--is unacceptable and unexpected.

PS. Forgot to mention, but you have some Samsung hard drives. Some older ones had defective firmware, and you should check whether yours are among those affected; you can search this forum, as it was discussed here at the time.

One of the Samsung drives is the drive I suspect of failures, but as far as I can tell, the firmware issue affects only the HD155UI and HD204UI drives; mine are HD154UI.

Anyway, an update:

* 12 hour memtest shows no errors (still running)

* The only other SATA controller I have lying around is sil3132-based, a chipset that is apparently known to be flaky. I've got two of the Syba cards from monoprice (which seem to be one of the few 'recommended' brands, and I think that is why I bought them), but switching the suspect drive onto either of them results in a drastically increased number of parity mismatches. I'm curious to do a little more testing to see if switching a suspected-good drive onto the Syba cards has a similar effect (which will lead to me promptly discarding the cards...)

EddieA · July 25, 2012

While these are valid points which may potentially explain a problem like what I'm seeing...

1. There is a reason why the commercial servers use at least ECC memory and are on UPS.
If the RAM were bad, it would show up in a memtest. Any memory error, ever, is indicative of a module that should be thrown out / RMA'd.

I wouldn't be so sure of that. I was getting a lot of single bit errors when calculating checksums using md5deep. I could run the same disk numerous times, and get different files failing each time.

memtest+ ran for 48 hours without error, but throwing prime95 onto my system, and running the Blend test blew up within 5 minutes.

Cheers.

nick5429 · July 27, 2012

memtest+ ran for 48 hours without error, but throwing prime95 onto my system, and running the Blend test blew up within 5 minutes.

::jawdrop::

I almost wouldn't have believed you if I didn't just observe the same behavior.

memtest+ stable for 36 hours, prime95 blend consistently fails within 15 minutes. prime95 'small' passes overnight (likely indicating, from the description in prime95, a 'problem with the memory or memory controller'). I'd always considered memtest the gold standard in memory subsystem stability. I guess not.

My RAM was even running underclocked -- autodetected at 1333 vs the rated 1600 @ 1.5V. CPU, everything else running at stock/autodetected speeds. I increased the RAM voltage a bit (1.6ish) and left it at 1333; so far prime95 blend has been running for ~3 hours without errors.

dgaschk · July 27, 2012

Where do you get prime95?

EddieA · July 27, 2012

memtest+ ran for 48 hours without error, but throwing prime95 onto my system, and running the Blend test blew up within 5 minutes.

::jawdrop::

I almost wouldn't have believed you if I didn't just observe the same behavior.

Neither did anyone on the hardware board I posted some questions on. I almost got laughed off it for suggesting that memtest didn't catch the errors.

Luckily my memory was all Corsair, so I RMA'd the pair that I think were bad, after using prime95 to determine.

Where do you get prime95?

Google it. It's the top entry.

Cheers.

nick5429 · July 31, 2012

Bumped up the RAM voltage. After many hours of testing/confirming, the system is totally prime95 stable and passed 3 rounds of parity checks with no mismatches.

I think I can finally call this closed, thanks for everyone's help / tips / suggestions!

jumperalex · August 1, 2012

This would seem to imply that the current Unraid received wisdom of using memtest, indeed its inclusion in the official download, be reconsidered?

GNex Tapatalk

kenoka · August 1, 2012

This would seem to imply that the current Unraid received wisdom of using memtest, indeed its inclusion in the official download, be reconsidered?

I don't think so. It's still a good test. But now we have data to suggest that it's not 100% valid in all cases. So we'll need to remember that a flaky system that has passed memtest should be checked with Prime95.

nick5429 · August 1, 2012

This would seem to imply that the current Unraid received wisdom of using memtest, indeed its inclusion in the official download, be reconsidered?

I don't think so. It's still a good test. But now we have data to suggest that it's not 100% valid in all cases. So we'll need to remember that a flaky system that has passed memtest should be checked with Prime95.

Agreed. I think, as a grand overgeneralization:

* Memtest is best at finding actual bad locations in the RAM, and it does a much more thorough job of testing all memory locations with a variety of access patterns.

* Prime95 would be better at testing stability under load. It can't know which physical memory locations it's testing, it just does a lot of memory transactions under heavy CPU and memory load. This is more likely to find problems with things like overclocked or overheating RAM or RAM which isn't getting enough voltage.

Superorb · February 4, 2013

Which version of prime95 do I download and how do I get it running on my unRAID server? I've been getting random parity mismatches and memtest returned 0 errors after 23 hours.

nick5429 · February 4, 2013

Which version of prime95 do I download and how do I get it running on my unRAID server?

The 32-bit Linux one

wget ftp://mersenne.org/gimps/p95v279.linux32.tar.gz
tar -xzvf p95v279.linux32.tar.gz
./mprime

Superorb · February 4, 2013

Ok, I ran prime95 and the process was killed instantly. Syslog:

Feb  4 17:09:43 unRAID login[4944]: ROOT LOGIN  on `pts/0' from `Ken-Windows7' (Logins)
Feb  4 17:11:11 unRAID kernel: mprime invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0 (Minor Issues)
Feb  4 17:11:11 unRAID kernel: Pid: 4992, comm: mprime Not tainted 2.6.32.9-unRAID #8 (Errors)
Feb  4 17:11:11 unRAID kernel: Call Trace: (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104ab61>] oom_kill_process+0x59/0x1cd (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104afb9>] __out_of_memory+0xef/0x102 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104b02a>] out_of_memory+0x5e/0x83 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104cfe9>] __alloc_pages_nodemask+0x375/0x42f (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1059686>] handle_mm_fault+0x254/0x8f1 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c129f124>] ? schedule+0x691/0x72f (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1017050>] do_page_fault+0x17c/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1016ed4>] ? do_page_fault+0x0/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c12a07ce>] error_code+0x66/0x6c (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1016ed4>] ? do_page_fault+0x0/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel: Mem-Info:
Feb  4 17:11:11 unRAID kernel: DMA per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:    0, btch:   1 usd:   0
Feb  4 17:11:11 unRAID kernel: Normal per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:  186, btch:  31 usd: 168
Feb  4 17:11:11 unRAID kernel: HighMem per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:  186, btch:  31 usd: 164
Feb  4 17:11:11 unRAID kernel: active_anon:354816 inactive_anon:74677 isolated_anon:0
Feb  4 17:11:11 unRAID kernel:  active_file:8 inactive_file:14 isolated_file:0
Feb  4 17:11:11 unRAID kernel:  unevictable:54391 dirty:0 writeback:0 unstable:0
Feb  4 17:11:11 unRAID kernel:  free:12293 slab_reclaimable:851 slab_unreclaimable:1810
Feb  4 17:11:11 unRAID kernel:  mapped:1521 shmem:24 pagetables:965 bounce:0
Feb  4 17:11:11 unRAID kernel: DMA free:7984kB min:64kB low:80kB high:96kB active_anon:5576kB inactive_anon:2304kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 867 1982 1982
Feb  4 17:11:11 unRAID kernel: Normal free:40568kB min:3732kB low:4664kB high:5596kB active_anon:706968kB inactive_anon:60672kB active_file:24kB inactive_file:36kB unevictable:17460kB isolated(anon):0kB isolated(file):0kB present:887976kB mlocked:0kB dirty:0kB writeback:0kB mapped:188kB shmem:0kB slab_reclaimable:3404kB slab_unreclaimable:7240kB kernel_stack:616kB pagetables:1484kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:100 all_unreclaimable? no
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 0 8924 8924
Feb  4 17:11:11 unRAID kernel: HighMem free:620kB min:512kB low:1712kB high:2912kB active_anon:706720kB inactive_anon:235732kB active_file:8kB inactive_file:20kB unevictable:200104kB isolated(anon):0kB isolated(file):0kB present:1142312kB mlocked:0kB dirty:0kB writeback:0kB mapped:5896kB shmem:96kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:2368kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 0 0 0
Feb  4 17:11:11 unRAID kernel: DMA: 0*4kB 2*8kB 2*16kB 2*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 7984kB
Feb  4 17:11:11 unRAID kernel: Normal: 104*4kB 109*8kB 73*16kB 29*32kB 23*64kB 7*128kB 12*256kB 16*512kB 5*1024kB 3*2048kB 3*4096kB = 40568kB
Feb  4 17:11:11 unRAID kernel: HighMem: 17*4kB 5*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 620kB
Feb  4 17:11:11 unRAID kernel: 54460 total pagecache pages
Feb  4 17:11:11 unRAID kernel: 0 pages in swap cache
Feb  4 17:11:11 unRAID kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb  4 17:11:11 unRAID kernel: Free swap  = 0kB
Feb  4 17:11:11 unRAID kernel: Total swap = 0kB
Feb  4 17:11:11 unRAID kernel: 515649 pages RAM
Feb  4 17:11:11 unRAID kernel: 287827 pages HighMem
Feb  4 17:11:11 unRAID kernel: 5487 pages reserved
Feb  4 17:11:11 unRAID kernel: 4837 pages shared
Feb  4 17:11:11 unRAID kernel: 495608 pages non-shared
Feb  4 17:11:11 unRAID kernel: Out of memory: kill process 4990 (mprime) score 57106 or a child (Errors)
Feb  4 17:11:11 unRAID kernel: Killed process 4990 (mprime) (Errors)

nick5429 · February 4, 2013

You might have better luck starting a separate thread, since this one is marked as solved.

Post the results of these two commands in your other thread:

ps -e -orss=,args= | sort -b -k1,1n | pr -TW$COLUMNS
free -m

Also, a list of what plugins you're running.

Superorb · February 4, 2013

You might have better luck starting a separate thread, since this one is marked as solved.

Post the results of these two commands in your other thread:
ps -e -orss=,args= | sort -b -k1,1n | pr -TW$COLUMNS
free -m
Also, a list of what plugins you're running.

Here you go, I added it onto my ongoing thread.

http://lime-technology.com/forum/index.php?topic=25672.msg224410#msg224410
