Slow parity check on Tam's server (Supermicro H8DME-2)


Hey y'all,

 

I've just started setting up a Supermicro H8DME-2 based server from Tam's Solution (24-bay 4U Supermicro case, 2x quad-core CPUs and 16GB RAM) and migrated my old 19-drive unRAID server, which was housed in a Norco 4220 case, to it. First, the good news: everything booted up normally and all I had to do was start the array to get things running. But once I started the parity check, I was really disappointed to see that it was only hitting 15-20MB/sec, whereas I've read posts saying that Tam's server should hit around 105MB/sec without any changes.

 

I have 19 drives. About 2/3 are 3TB or 4TB ones which clock in at 130-150MB/sec each (on hdparm -tT), and the slowest ones are WD 2TB EARS which manage 100-105MB/sec. This same set of drives was getting parity check speeds of 55-65MB/sec from the get-go (going up to 90-100MB/sec above the 2TB mark) in my Norco setup.

 

I've played with the disk settings to no avail (md_num_stripes at 2048, md_sync_window at 1280). There are no AHCI settings in the BIOS, and I've disabled the IDE and SATA controllers anyway since no drives are connected to the motherboard. Copying files from a Windows machine to the server, I got 33MB/sec.

 

Things to try:

- I have yet to disable int13h on the SAT-MV8s.

- I haven't updated the BIOS to 3.5a

 

I've also noticed that the openwyrn-openssh.plg plugin takes a helluva long time to start up (nearly 10 minutes). It may have always been like that in my old Norco setup, but if so I didn't realize it since the machine hardly ever reboots.

 

What am I doing wrong?

How can I speed things up?

 

Thanks

syslog.txt

Link to comment

Thank you for checking my syslog.

 

This is what a parity check looks like right now (started it, then stopped it not long after):

 

md: recovery thread woken up ...
md: recovery thread checking parity...
md: using 5120k window, over a total of 3907018532 blocks.
mdcmd (87): nocheck 
md: md_do_sync: got signal, exit...
md: recovery thread sync completion status: -4
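
For a sense of scale, the block count in that log can be turned into a rough run time. This is only a back-of-the-envelope sketch: it assumes the md driver counts 1 KiB blocks (which would make 3907018532 blocks a 4TB parity disk) and that the speed stays constant for the whole run, which it normally does not. The function name is made up for illustration.

```python
# Rough parity-check duration from the "over a total of N blocks" log line.
# Assumption: one md block is 1 KiB; speed is treated as constant.
BLOCK_BYTES = 1024


def parity_check_hours(total_blocks, speed_mb_per_s):
    """Hours needed to read total_blocks at speed_mb_per_s (MB = 10**6 bytes)."""
    total_bytes = total_blocks * BLOCK_BYTES
    seconds = total_bytes / (speed_mb_per_s * 1_000_000)
    return seconds / 3600


if __name__ == "__main__":
    blocks = 3907018532  # taken from the syslog line above
    for speed in (20, 105):
        print(f"{speed} MB/s -> {parity_check_hours(blocks, speed):.1f} hours")
```

Under those assumptions a steady 20MB/s works out to a bit over 55 hours, versus under 11 hours at 105MB/s.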

 

The parity check speed stays around 16-20MB/sec whether the server is completely idle or under slight load (streaming a movie off it, running preclear). Currently I'm preclearing a 4TB disk (started at 148MB/s, now it's hovering around 96MB/s at the 3TB mark).

 

I will try a firmware update and post an update here again.

 

Link to comment

I've flashed the motherboard's firmware to the latest version (3.5a) and there doesn't seem to be any difference at all. There is no newer firmware for the SAT2-MV8 cards either.

 

Parity check maxed at 21.6MB/s just now and curiously it seems to be really tying up the /mnt/user share (shfs). I tried to do a syslog dump to /mnt/user/syslog.txt and it just hung there for minutes, until I cancelled parity check.

 

Attached is the syslog (w/ parity check start and stop) and also a `ps -aux` dump.

 

And these:

 

root@archive:~# dmesg | grep IRQ
ACPI: BIOS IRQ0 override ignored.
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
NR_IRQS:2304 nr_irqs:744 16
spurious 8259A interrupt: IRQ7.
ACPI: PCI Interrupt Link [LNKA] (IRQs 16 17 18 19) *10
ACPI: PCI Interrupt Link [LNKB] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNEA] (IRQs 16 17 18 19) *14
ACPI: PCI Interrupt Link [LNEB] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LNEC] (IRQs 16 17 18 19) *5
ACPI: PCI Interrupt Link [LNED] (IRQs 16 17 18 19) *0, disabled.
ACPI: PCI Interrupt Link [LUB0] (IRQs 21 22 23) *14
ACPI: PCI Interrupt Link [LMAD] (IRQs 20) *11
ACPI: PCI Interrupt Link [LUB2] (IRQs 21 22 23) *7
ACPI: PCI Interrupt Link [LMAC] (IRQs 20) *10
ACPI: PCI Interrupt Link [LAZA] (IRQs 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 21 22 23) *11
ACPI: PCI Interrupt Link [LPMU] (IRQs 21 22 23) *5
ACPI: PCI Interrupt Link [LSA0] (IRQs 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSA1] (IRQs 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LATA] (IRQs 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [LSA2] (IRQs 21 22 23) *0, disabled.
PCI: Using ACPI for IRQ routing
ACPI: PCI Interrupt Link [LUB0] enabled at IRQ 23
ACPI: PCI Interrupt Link [LUB2] enabled at IRQ 22
Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 20
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 19
sata_mv 0000:03:04.0: Gen-II 32 slots 8 ports SCSI mode IRQ via INTx
ACPI: PCI Interrupt Link [LNEA] enabled at IRQ 18
sata_mv 0000:03:06.0: Gen-II 32 slots 8 ports SCSI mode IRQ via INTx
sata_mv 0000:04:06.0: Gen-II 32 slots 8 ports SCSI mode IRQ via INTx
ACPI: PCI Interrupt Link [LMAD] enabled at IRQ 20

 

root@archive:~# dmesg | grep irq
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
nr_irqs_gsi: 40
NR_IRQS:2304 nr_irqs:744 16
CPU 0 irqstacks, hard=ed00a000 soft=ed00c000
CPU 1 irqstacks, hard=ed0c0000 soft=ed0c2000
CPU 2 irqstacks, hard=ed0cc000 soft=ed0ce000
CPU 3 irqstacks, hard=ed0ec000 soft=ed0ee000
CPU 4 irqstacks, hard=ed0f8000 soft=ed0fa000
CPU 5 irqstacks, hard=ed10c000 soft=ed10e000
CPU 6 irqstacks, hard=ed128000 soft=ed12a000
CPU 7 irqstacks, hard=ed13c000 soft=ed13e000
pcieport 0000:00:0a.0: irq 40 for MSI/MSI-X
pcieport 0000:00:0d.0: irq 41 for MSI/MSI-X
pcieport 0000:00:0e.0: irq 42 for MSI/MSI-X
pcieport 0000:00:0f.0: irq 43 for MSI/MSI-X
ehci-pci 0000:00:02.1: irq 22, io mem 0xfc2bec00
ohci_hcd 0000:00:02.0: irq 23, io mem 0xfc2bf000
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
ata1: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd722000 irq 19
ata2: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd724000 irq 19
ata3: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd726000 irq 19
ata4: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd728000 irq 19
ata5: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd732000 irq 19
ata6: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd734000 irq 19
ata7: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd736000 irq 19
ata8: SATA max UDMA/133 mmio m1048576@0xfd700000 port 0xfd738000 irq 19
ata9: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd622000 irq 18
ata10: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd624000 irq 18
ata11: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd626000 irq 18
ata12: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd628000 irq 18
ata13: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd632000 irq 18
ata14: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd634000 irq 18
ata15: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd636000 irq 18
ata16: SATA max UDMA/133 mmio m1048576@0xfd600000 port 0xfd638000 irq 18
ata17: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb22000 irq 18
ata18: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb24000 irq 18
ata19: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb26000 irq 18
ata20: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb28000 irq 18
ata21: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb32000 irq 18
ata22: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb34000 irq 18
ata23: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb36000 irq 18
ata24: SATA max UDMA/133 mmio m1048576@0xfeb00000 port 0xfeb38000 irq 18

 

root@archive:~# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:         51          2          0         15         97       2450      13047     127350   IO-APIC-edge      timer
  1:          0          0          0          0          0          0          0          2   IO-APIC-edge      i8042
  7:          1          0          0          0          0          0          0          0   IO-APIC-edge    
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          0          0          0          0          3   IO-APIC-edge      i8042
 18:          1          0          0         17         10        549    1078002   10708699   IO-APIC-fasteoi   sata_mv, sata_mv
 19:          1          0          0         13         30        746    1948720    2820399   IO-APIC-fasteoi   sata_mv
 22:          0          0          0          0          0          0          3       1247   IO-APIC-fasteoi   ehci_hcd:usb1
 23:          0          0          0          0          0          0          1         38   IO-APIC-fasteoi   ohci_hcd:usb2
 44:          1          0          0          3          8       1810       9139     124227   PCI-MSI-edge      eth0
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:      12653      18382      20300      14410      22636      21462      27219      47093   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
RTR:          0          0          0          0          0          0          0          0   APIC ICR read retries
RES:      53601      14237      10420       8528     865819     834927      62112       6919   Rescheduling interrupts
CAL:      59286       2328       2320       1701         41         28         20         20   Function call interrupts
TLB:        133        824        319        225        275        744        306        142   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:         10         10         10         10         10         10         10         10   Machine check polls
ERR:          1
MIS:          0

syslog-041014.txt

ps-041014.txt

Link to comment

I've dropped the RAM from 16GB down to 4GB.

 

I've moved the SAT2-MV8 cards so they each get their own IRQ: (compare sata_mv assignments to the one above)

 

root@archive:~# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:         48          1          2          7          2          0         16       7344   IO-APIC-edge      timer
  1:          0          0          0          0          0          0          0          2   IO-APIC-edge      i8042
  7:          1          0          0          0          0          0          0          0   IO-APIC-edge    
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          0          0          0          0          3   IO-APIC-edge      i8042
 17:          0          0          0          2          0          0          0        435   IO-APIC-fasteoi   sata_mv
 18:          0          0          0          0          0          0          1        451   IO-APIC-fasteoi   sata_mv
 19:          0          0          0          4          0          0          0        214   IO-APIC-fasteoi   sata_mv
 22:          0          0          0          0          0          0          1       1127   IO-APIC-fasteoi   ehci_hcd:usb1
 23:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ohci_hcd:usb2
 44:          0          0          0          0          0          0          5       8106   PCI-MSI-edge      eth0
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:       2099       3768       2294       2152       1723       1719       2696        138   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
RTR:          0          0          0          0          0          0          0          0   APIC ICR read retries
RES:       2340       3533       3436       1313       1947       3985       1961       2722   Rescheduling interrupts
CAL:         94         80        757         63         12         14         15         14   Function call interrupts
TLB:         49        382        172         59         57        450        114         61   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          2          2          2          2          2          2          2          2   Machine check polls
ERR:          1
MIS:          0
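
For anyone comparing the two dumps, here is a small sketch that lists IRQ lines claimed by more than one device. It assumes the column layout shown in these dumps (one counter per CPU, then the interrupt controller type, then a comma-separated device list); newer kernels print an extra column, so treat the parsing as illustrative. `shared_irqs` is a made-up helper name, and the sample is trimmed from the first dump in this thread.

```python
# Sample trimmed from the first /proc/interrupts dump in this thread.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 18:          1          0          0         17         10        549    1078002   10708699   IO-APIC-fasteoi   sata_mv, sata_mv
 19:          1          0          0         13         30        746    1948720    2820399   IO-APIC-fasteoi   sata_mv
 22:          0          0          0          0          0          0          3       1247   IO-APIC-fasteoi   ehci_hcd:usb1
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
"""


def shared_irqs(interrupts_text):
    """Return {irq_number: [devices]} for IRQs listed with more than one device."""
    lines = interrupts_text.strip("\n").splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # skip NMI, LOC, ERR and other non-numeric rows
        fields = rest.split()
        # Skip the per-CPU counters and the controller type (e.g. IO-APIC-fasteoi).
        device_part = " ".join(fields[cpu_count + 1:])
        devices = [d.strip() for d in device_part.split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/interrupts instead of the sample string.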

 

I'm still stuck at 24MB/s parity check.

 

If any HDD is bottlenecking this process, it should show up with hdparm -tT, no??

 

 

Link to comment

Have you verified that all the drives are connecting at full speed?

 

I have the same server, and had swapped to some new cables with latches during a server cleanup, and found that my parity check speeds were very low. It turned out that the new cables were junk. I double- and triple-checked that they were fully seated etc., but nothing made the drives connect at full speed. I reverted to the original cables (no latches, though) and link speeds were once again solid.

 

Use dmesg | grep "SATA link" to look for the speeds. I was seeing some drives at 1.5 Gbps and some at 3.0; moving the new but obviously poor cables around moved the slow link speed to a different drive.

 

dmesg |grep "SATA link"

ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
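
With a couple of dozen ports, the same check can be scripted. The sketch below is only an illustration: the regular expression assumes the message format shown above, `slow_links` is a made-up name, and the 1.5 Gbps line in the sample is invented to show what a degraded link would look like. Links reported as "down" are ignored, since empty bays show up that way.

```python
import re

# Matches lines such as "ata3: SATA link up 3.0 Gbps (...)" or "ata4: SATA link down (...)".
LINK_RE = re.compile(r"(ata\d+): SATA link (up ([\d.]+) Gbps|down)")

# Illustrative input; the ata5 line is invented for the example.
SAMPLE = """\
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
"""


def slow_links(dmesg_text, expected_gbps=3.0):
    """Return [(port, speed_gbps)] for links negotiated below expected_gbps."""
    slow = []
    for match in LINK_RE.finditer(dmesg_text):
        port, state, speed = match.groups()
        if state != "down" and float(speed) < expected_gbps:
            slow.append((port, float(speed)))
    return slow


if __name__ == "__main__":
    print(slow_links(SAMPLE))
```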

 

Surprisingly, these Monoprice cables were the problem. But nothing against Monoprice because I have used their 8087-sata forward breakout cables with 100% success on other servers. 

 

Link to comment

@LinuxGuyGary:

I'm using the stock SATA cables that came with the system; they don't look pristine, but they seem to be working well:

 

ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata17: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
ata5: SATA link down (SStatus 0 SControl 300)
ata6: SATA link down (SStatus 0 SControl 300)
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata12: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata16: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata19: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata20: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata21: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata22: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata23: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata24: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

 

(ata4, 5 and 6 are supposed to be empty)

 

@vl1969:

Thanks, I'll stick with the stock cables for now.

Link to comment

Finally... this seems to be the problem?

 

Apr 10 23:21:28 archive kernel: INFO: rcu_sched self-detected stall on CPU { 7}  (t=6000 jiffies g=1434 c=1433 q=7796)
Apr 10 23:21:28 archive kernel: Pid: 2836, comm: unraidd Not tainted 3.9.11p-unRAID #5 (Errors)
Apr 10 23:21:28 archive kernel: Call Trace: (Errors)
Apr 10 23:21:28 archive kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Apr 10 23:21:28 archive kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Apr 10 23:21:28 archive kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Apr 10 23:21:28 archive kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Apr 10 23:21:28 archive kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Apr 10 23:21:28 archive kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Apr 10 23:21:28 archive kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Apr 10 23:21:28 archive kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Apr 10 23:21:28 archive kernel:  [<c10483d9>] ? sched_clock_cpu+0x3f/0x13f (Errors)
Apr 10 23:21:28 archive kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Apr 10 23:21:28 archive kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Apr 10 23:21:28 archive kernel:  [<c124408a>] ? xor_sse_5_pf64+0x70/0x32c (Errors)
Apr 10 23:21:28 archive kernel:  [<c12435de>] xor_blocks+0x74/0x7c (Errors)
Apr 10 23:21:28 archive kernel:  [<f88d50b8>] check_parity+0x96/0xcc [md_mod] (Errors)
Apr 10 23:21:28 archive kernel:  [<f88d5bfb>] handle_stripe+0xa29/0xceb [md_mod] (Errors)
Apr 10 23:21:28 archive kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Apr 10 23:21:28 archive kernel:  [<f88d5f2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Apr 10 23:21:28 archive kernel:  [<f88d2cb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Apr 10 23:21:28 archive kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Apr 10 23:21:28 archive kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Apr 10 23:21:28 archive kernel:  [<f88d2bdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Apr 10 23:21:28 archive kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Apr 10 23:21:28 archive kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
Apr 10 23:21:40 archive kernel: mce: [Hardware Error]: Machine check events logged (Errors)
Apr 10 23:23:28 archive kernel: INFO: rcu_sched self-detected stall on CPU { 7}  (t=6000 jiffies g=1439 c=1438 q=7157)
Apr 10 23:23:28 archive kernel: Pid: 2836, comm: unraidd Not tainted 3.9.11p-unRAID #5 (Errors)
Apr 10 23:23:28 archive kernel: Call Trace: (Errors)
Apr 10 23:23:28 archive kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Apr 10 23:23:28 archive kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Apr 10 23:23:28 archive kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Apr 10 23:23:28 archive kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Apr 10 23:23:28 archive kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Apr 10 23:23:28 archive kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Apr 10 23:23:28 archive kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Apr 10 23:23:28 archive kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Apr 10 23:23:28 archive kernel:  [<c10483d9>] ? sched_clock_cpu+0x3f/0x13f (Errors)
Apr 10 23:23:28 archive kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Apr 10 23:23:28 archive kernel:  [<c1044d0b>] ? check_preempt_curr+0x39/0x64 (Errors)
Apr 10 23:23:28 archive kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Apr 10 23:23:28 archive kernel:  [<c12440c9>] ? xor_sse_5_pf64+0xaf/0x32c (Errors)
Apr 10 23:23:28 archive kernel:  [<c12435de>] xor_blocks+0x74/0x7c (Errors)
Apr 10 23:23:28 archive kernel:  [<f88d50b8>] check_parity+0x96/0xcc [md_mod] (Errors)
Apr 10 23:23:28 archive kernel:  [<f88d5bfb>] handle_stripe+0xa29/0xceb [md_mod] (Errors)
Apr 10 23:23:28 archive kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Apr 10 23:23:28 archive kernel:  [<f88d5f2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Apr 10 23:23:28 archive kernel:  [<f88d2cb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Apr 10 23:23:28 archive kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Apr 10 23:23:28 archive kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Apr 10 23:23:28 archive kernel:  [<f88d2bdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Apr 10 23:23:28 archive kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Apr 10 23:23:28 archive kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
Apr 10 23:24:27 archive kernel: mdcmd (68): nocheck  (unRAID engine)

 

I looked up 'rcu_sched self-detected stall on CPU' and it apparently relates to the PowerNow setting. Here are the pertinent BIOS settings:

 

MTRR Mapping                     [Continuous]
Thermal Throttling               [Disabled]
PowerNow                         [Disabled]
Secure Virtual Machine Mode      [Enabled]
CPU Page Translation Table       [Enabled]
CPU Prefetching                  [Enabled]
IO Prefetching                   [Enabled]
Probe Filter                     [Auto]

 

I have a pair of 2346 HE Opterons installed (the default type from Tam's), using fanless heatsinks. The CPU monitor pegs the temperature around 30°C, as the machine is stored in a pretty cool room (21°C ambient temperature).

Link to comment

DO NOT,  I repeat DO NOT use fanless heat sinks in this box.

I can sell you several after my experience with them.

 

After several strange issues, reboots and shutdowns, and even the complete system powering off with a strange alarm blaring all over my basement, all within a one-week period, I traced it to CPU overheating. I used a solid copper fanless HS for a similar CPU, but one designed for a 1U server.

 

The airflow in this chassis is not enough to cool it.

 

I ended up getting a pair of Cooler Master T-4 HS. There is a similar model, the 212 EVO, but it is just a little bit taller than the case.

The T-4 fits perfectly; it actually fits on the existing mounts (the black plastic kind). I had removed the mounts initially to fit the fanless HS, but put them back for this one.

Using just the spring holder with the HS fit it perfectly.

 

The fans are very quiet, and if need be you can mount a second fan on the other side of the HS; my CPU stays cool as it is. Best coolers for this box...

 

 

Link to comment

@vl1969: Undone! As soon as I typed that last post, I went back to the server room to replace the stock AMD fan on one of the Opterons and took out the second CPU, to reduce the number of variables.

 

Still no joy :( Parity check at 21MB/sec.

 

I'm not sure what to think now.

 

Edit: cat /proc/cpuinfo shows the CPU frequency at 1000MHz for all cores? (It's supposed to be 1.8GHz.)

I think I should double-check the power connectors to the mobo, and maybe swap the CPU if that doesn't work. I'll also swap the PSU back to the default one and try to build a new array with my spare HDDs and a fresh unRAID stick.
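
The /proc/cpuinfo check can be repeated per core with a few lines of code. This is a sketch only: it assumes the usual x86 layout with "processor" and "cpu MHz" fields, the helper name and tolerance are made up, and the sample text is invented (including the second core at full speed, added purely for contrast).

```python
# Invented sample in the usual x86 /proc/cpuinfo layout.
SAMPLE = """\
processor\t: 0
cpu MHz\t\t: 1000.000
processor\t: 1
cpu MHz\t\t: 1800.000
"""


def cores_below_rated(cpuinfo_text, rated_mhz, tolerance_mhz=50.0):
    """Return [(core_index, mhz)] for cores reporting clearly less than rated_mhz."""
    slow = []
    core = -1
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            core = int(value)
        elif key == "cpu MHz" and float(value) < rated_mhz - tolerance_mhz:
            slow.append((core, float(value)))
    return slow


if __name__ == "__main__":
    # 1800 MHz is the rated clock mentioned in the post above.
    print(cores_below_rated(SAMPLE, rated_mhz=1800))
```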

Link to comment

This issue has completely baffled me...

 

So I let the parity check run to completion (nearly 2 days):

 

It started at 24MB/s

As soon as it passed the 2TB mark, the speed went up to 36MB/s

Then passing 3TB, it went up to 55MB/s or so.

 

I have 18 drives plus parity; about 8 of them are still 2TB, and the rest are a mix of 3TB and 4TB. I'm still slowly moving away from the 2TB drives because of their performance and age.

 

At first I thought this was a clear sign of bus bottlenecking, but I looked up PCI-X speeds and the bus should do at least 800MB/s, which means my parity check should manage at least 90MB/s at the start.

 

I've also run diskspeed.sh, which tests each HDD on the system and generates an average speed (from hdparm) for the drives. None of my drives scored below 85MB/s.

 

Help me Obi-wan.

 

Spec:

 

Supermicro H8DME-2 motherboard

AMD 2346 HE quad core

8GB ECC RAM

3x AOC-SAT2-MV8 controllers

19 drives of various makes and models (8x 2TB drives, 6x 3TB drives, 5x 4TB drives)

Enermax Revolution 87+ PSU

Link to comment

Replace one of the AOC-SAT2-MV8s with an AOC-SASLP-MV8 and make sure that the AOC-SAT2-MV8s are on separate buses.

 

+1

 

One of the first things I did with my Tam's server was replace the PCI-X SATA controllers with the SAS-MV8s. I never got to see what the old PCI-X SATA controllers would do; I only know what the SAS versions do. I get about 68MB/s, and I have a mix of 2TB and 3TB drives, 24TB in total.

Link to comment

On my original unRAID server (X7SBE MB) I used two AOC-SAT2-MV8s and the MB SATA ports for 22 drives. One SAT2-MV8 was on the 133MHz PCI-X bus and the other on the 100MHz PCI-X bus. I got 50-100MB/s on the 2TB WD Greens I had at the time on those cards. So maybe not as fast as the PCIe-based cards, but still acceptable, at least to me. When I tried to run three AOC-SAT2-MV8s with two cards on the 133MHz bus, the speeds dropped to 30-65MB/s depending on where on the platter it was reading or writing.

Link to comment

@dgaschk, zeroK, BobPhoenix: thanks a lot, guys. All signs pointed to bus bottlenecking, but I remembered a few posts about this server in which people mentioned that it's ready to use (without mods) and that people have gotten 105MB/s parity check speeds.

 

I've also read that PCI-X should do 1GB/s at 133MHz and 800MB/s at 100MHz (which should allow around 100MB/s per drive even at the SAT2-MV8's maximum capacity of 8 drives per controller).

 

....

 

It didn't make sense to me until I went back to the manual and studied the diagram:

 

 

[Attached image: diagram from the motherboard manual]

Link to comment

PCIe x8 moves about 2GBps. It easily accommodates 1000MBps + 800MBps. The issue is that PCI-X slots share the bus capacity. One of the buses has 2 cards, and 16 drives can use more than 1000MBps. If the system had 2 cards it would have acceptable speeds. How many PCI-X cards are the people who are getting 105MBps using? How many drives are connected in those systems? The third card is killing performance. Replacing one of the cards with a PCIe card should substantially improve performance.
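
The shared-bus arithmetic can be sketched as follows. These are theoretical peaks only, assuming a 64-bit PCI-X bus, no protocol overhead, and bandwidth split evenly between the drives on a bus; the function names are made up for illustration, and the nominal 133MHz clock is rounded to 133.

```python
def pcix_peak_mb_s(clock_mhz, bus_bits=64):
    """Theoretical peak of one PCI-X bus in MB/s (assumes a 64-bit bus)."""
    return clock_mhz * bus_bits / 8


def per_drive_ceiling_mb_s(clock_mhz, drives_on_bus):
    """Upper bound per drive when every drive on the bus is read at once."""
    return pcix_peak_mb_s(clock_mhz) / drives_on_bus


if __name__ == "__main__":
    print(pcix_peak_mb_s(100), pcix_peak_mb_s(133))
    # One fully populated 8-port card alone on the 100MHz bus:
    print(per_drive_ceiling_mb_s(100, 8))
    # Two fully populated cards sharing the 133MHz bus:
    print(per_drive_ceiling_mb_s(133, 16))
```

The point of the second function is that the ceiling depends on the number of drives on the whole bus, not on the number of drives per card.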

Link to comment

Ah, alright. I'll try 1 PCI-X card per bus. What about the onboard SATA connectors (nForce chipset)? Those usually operate at full speed, right?

Link to comment

I'm still baffled... Here are the stats from my tests across several HBA configurations. In the later ones I started using the onboard controller. I'm waiting for SFF-8087 breakout cables before I can deploy M1015s into the mix.

 

1:

 

Onboard: -

SLOT1 (100mhz): -

SLOT2 (100mhz): SAT2-MV8: 7 HDD

SLOT3 (133mhz): SAT2-MV8: 6 HDD

SLOT4 (133mhz): SAT2-MV8: 6 HDD

 

Parity check: 24MB/s

Total = 456MB/s

100mhz channel = 168MB/s

133mhz channel = 288MB/s

 

2:

 

Onboard: -

SLOT1 (100mhz): SAT2-MV8: 5 HDD

SLOT2 (100mhz): SAT2-MV8: 5 HDD

SLOT3 (133mhz): SAT2-MV8: 5 HDD

SLOT4 (133mhz): SAT2-MV8: 4 HDD

 

Parity check: 34MB/s

Total = 646MB/s

100mhz channel = 340MB/s

133mhz channel = 306MB/s

 

3:

 

Onboard: 6 HDD

SLOT1 (100mhz): 4HDD

SLOT2 (100mhz): 4HDD

SLOT3 (133mhz): 5HDD

SLOT4 (133mhz): -

 

Parity check: 30MB/s

Total = 570MB/s

100mhz channel = 240MB/s

133mhz channel = 150MB/s

 

4:

 

Onboard: 6 HDD

SLOT1 (100mhz): -

SLOT2 (100mhz): 4HDD

SLOT3 (133mhz): 4HDD

SLOT4 (133mhz): 5HDD

 

Parity check: 28MB/s

Total = 532MB/s

100mhz channel = 112MB/s

133mhz channel = 252MB/s

 

PS: the parity check figure is just the initial starting speed (first 5-10 minutes, taken from a few samples).

 

pps: where's the logic in all this? What is the actual bottleneck?
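
The totals in the four tests are consistent with multiplying the parity-check speed by the number of drives being read in parallel; that is an inference from the numbers, not something stated outright. A sketch that reproduces them, with made-up key and function names and the drive counts summed per channel from the slot listings:

```python
# Parity-check speed (MB/s) and drive counts per channel for each test.
TESTS = {
    1: {"speed": 24, "onboard": 0, "pcix_100": 7, "pcix_133": 12},
    2: {"speed": 34, "onboard": 0, "pcix_100": 10, "pcix_133": 9},
    3: {"speed": 30, "onboard": 6, "pcix_100": 8, "pcix_133": 5},
    4: {"speed": 28, "onboard": 6, "pcix_100": 4, "pcix_133": 9},
}


def aggregate(test):
    """Return (total, 100MHz channel, 133MHz channel) throughput in MB/s."""
    speed = test["speed"]
    drives = test["onboard"] + test["pcix_100"] + test["pcix_133"]
    return speed * drives, speed * test["pcix_100"], speed * test["pcix_133"]


if __name__ == "__main__":
    for number, test in TESTS.items():
        total, ch100, ch133 = aggregate(test)
        print(f"test {number}: total={total} 100MHz={ch100} 133MHz={ch133}")
```

In every configuration the per-channel figures stay well under the 800MB/s and 1GB/s PCI-X numbers quoted earlier in the thread, which is what makes the result so puzzling.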

Link to comment
