Server unavailable during parity check, other errors

tucansam · October 2, 2016

Well, two issues.

6.1.3

For the last few months, I've noticed that the server is unavailable during the monthly parity check. The web interface doesn't answer and no shares are accessible. I can ssh into the server, but that's it. This only started happening recently, in the last three or four months.

I went to the syslog and saw a lot of these:

--

Sep 30 19:27:31 ffs1 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen

Sep 30 19:27:31 ffs1 kernel: ata1.00: irq_stat 0x08000000, interface fatal error

Sep 30 19:27:31 ffs1 kernel: ata1: SError: { UnrecovData Handshk }

Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT

Sep 30 19:27:31 ffs1 kernel: ata1.00: cmd 35/00:40:18:cd:a8/00:05:40:00:00/e0 tag 11 dma 688128 out

Sep 30 19:27:31 ffs1 kernel: res 50/00:00:18:cd:a8/00:00:40:00:00/e0 Emask 0x10 (ATA bus error)

Sep 30 19:27:31 ffs1 kernel: ata1.00: status: { DRDY }

Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link

Sep 30 19:27:31 ffs1 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Sep 30 19:27:31 ffs1 kernel: ata1.00: configured for UDMA/133

Sep 30 19:27:31 ffs1 kernel: ata1: EH complete

--
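For anyone wanting to see how often each port throws these, here is a rough Python sketch that counts `exception Emask` lines per ATA port. The function name `count_ata_exceptions` is made up for illustration, and the regex only assumes the line shape shown in the excerpt above; other kernels may format these lines differently.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   "... kernel: ata1.00: exception Emask 0x10 SAct 0x0 ..."
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask")


def count_ata_exceptions(lines):
    """Return a Counter mapping port name (e.g. 'ata1') to exception count."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the excerpt; only the first is an exception line.
sample = [
    "Sep 30 19:27:31 ffs1 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen",
    "Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link",
]
for port, count in count_ata_exceptions(sample).most_common():
    print(f"{port}: {count} exception(s)")
```

To run it against a real log, pass an open file object instead of `sample`. If one port dominates the count, that would point at that drive's cable, port, or the drive itself rather than the whole controller.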

Not sure what is going on, or if the two issues are related. Right now the drives are blinking like it's doing its parity check (2nd of the month) and the server is 100% unavailable except for ssh. I ssh'd in and issued 'powerdown' -- no effect. So I issued 'reboot' -- no effect.

'top' shows a load average higher than I've ever seen on any unix system, ever:

--

top - 07:49:47 up 16 days, 14:27, 1 user, load average: 130.55, 124.78, 110.1

Tasks: 244 total, 3 running, 241 sleeping, 0 stopped, 0 zombie

Cpu(s): 0.4%us, 26.2%sy, 0.0%ni, 30.9%id, 31.4%wa, 0.0%hi, 11.2%si, 0.0%st

Mem: 3966868k total, 3919580k used, 47288k free, 426532k buffers

Swap: 0k total, 0k used, 0k free, 2781972k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

1531 root 20 0 0 0 0 R 26 0.0 263:02.15 unraidd

1463 root 20 0 0 0 0 D 18 0.0 109:56.53 mdrecoveryd

995 root 0 -20 0 0 0 S 13 0.0 76:30.84 kworker/1:1H

997 root 0 -20 0 0 0 R 13 0.0 79:41.14 kworker/0:1H

1430 root 20 0 9832 2660 1988 S 1 0.1 32:21.24 cpuload

22235 root 20 0 13412 2388 1968 R 0 0.1 0:00.02 top

1 root 20 0 4368 1596 1500 S 0 0.0 0:15.95 init

2 root 20 0 0 0 0 S 0 0.0 0:00.54 kthreadd

3 root 20 0 0 0 0 S 0 0.0 5:24.42 ksoftirqd/0

5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H

7 root 20 0 0 0 0 S 0 0.0 5:26.07 rcu_preempt

8 root 20 0 0 0 0 S 0 0.0 0:00.00 rcu_sched

9 root 20 0 0 0 0 S 0 0.0 0:00.00 rcu_bh

10 root RT 0 0 0 0 S 0 0.0 0:12.22 migration/0

11 root RT 0 0 0 0 S 0 0.0 0:12.64 migration/1

12 root 20 0 0 0 0 S 0 0.0 0:05.34 ksoftirqd/1

14 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/1:0H

--
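On Linux, tasks in state `D` (uninterruptible sleep, usually waiting on I/O) count toward the load average, so it can help to list them. A small Python sketch that pulls them out of pasted `top` output follows; `blocked_tasks` is a hypothetical helper, and it assumes the twelve-column layout shown above, with the state in the eighth column.

```python
def blocked_tasks(top_lines):
    """Return (pid, command) pairs for processes whose state column is 'D'.

    Assumes the column layout from the paste above:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    blocked = []
    for line in top_lines:
        fields = line.split()
        # Skip the header, summary lines, and blank lines.
        if len(fields) >= 12 and fields[0].isdigit() and fields[7] == "D":
            blocked.append((int(fields[0]), fields[11]))
    return blocked


# Three lines copied from the paste above.
sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "1531 root 20 0 0 0 0 R 26 0.0 263:02.15 unraidd",
    "1463 root 20 0 0 0 0 D 18 0.0 109:56.53 mdrecoveryd",
]
for pid, command in blocked_tasks(sample):
    print(f"{pid} {command} is in uninterruptible sleep")
```

The paste only shows the top of the process list, so on a live system the full list would need to be captured (for example with top in batch mode) to see everything that is blocked.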

No idea what to think right now. In the past, I have simply hard powered-down the server in such circumstances, and simply dealt with the forthcoming parity check after power on. Not sure I want to do that again.

Advice welcome.

BTW syslog is pretty boring (99% spindown messages) otherwise.

I have run 'diagnostics' but have no way of getting it off the system until shares come back... Also 'ls /mnt/user' (trying to remember if I mapped an nfs share to another server) hangs!!!

---

....And.... it's getting worse

root@ffs1:/# uptime

08:03:13 up 16 days, 14:40, 2 users, load average: 150.74, 144.84, 129.63

root@ffs1:/# uptime

08:05:42 up 16 days, 14:43, 2 users, load average: 153.44, 148.20, 133.09

root@ffs1:/# uptime

08:05:45 up 16 days, 14:43, 2 users, load average: 153.49, 148.29, 133.20

root@ffs1:/# uptime

08:05:48 up 16 days, 14:43, 2 users, load average: 153.49, 148.29, 133.20

root@ffs1:/# uptime

08:06:42 up 16 days, 14:44, 2 users, load average: 153.79, 149.24, 134.39

root@ffs1:/# uptime

08:06:43 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:45 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:46 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:47 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:50 up 16 days, 14:44, 2 users, load average: 153.82, 149.40, 134.60

root@ffs1:/#
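Rather than rerunning `uptime` by hand, the three load figures can be parsed and compared. This is only a sketch: `parse_load` and `is_climbing` are illustrative names, and treating "1-minute above 5-minute above 15-minute" as "still climbing" is a rule of thumb, not an exact test.

```python
import re

# Matches the tail of an uptime line: "load average: 153.82, 149.40, 134.60"
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load(line):
    """Return (1-min, 5-min, 15-min) load averages as floats, or None."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def is_climbing(loads):
    """Rule of thumb: load is still rising if each shorter window is higher."""
    one, five, fifteen = loads
    return one > five > fifteen


# Last uptime line from the paste above.
sample = "08:06:50 up 16 days, 14:44, 2 users, load average: 153.82, 149.40, 134.60"
loads = parse_load(sample)
print(loads, is_climbing(loads))
```

Every sample in the paste above satisfies that ordering, which matches the impression that the backlog of waiting tasks keeps growing instead of levelling off.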

tucansam · January 2, 2017

Still happens every month.

One server is fine, the other server is 100% unavailable, except for ssh, during parity check. Shares time out and can't be read from. Copying data to shares times out. Complete data inaccessibility from SMB/NFS.

Anyone?

JorgeB · January 2, 2017

Diagnostics?

tucansam · January 2, 2017

There's a lot of stuff in there I don't want on the internet.

What specifically do you want to see?

JorgeB · January 2, 2017

You can anonymize the diagnostics. It will be difficult to get help here without posting them.

tucansam · January 2, 2017

Syslog attached.

syslog.txt

trurl · January 2, 2017

Syslog wasn't what was asked for.

Go to Tools - Diagnostics and post the complete diagnostics zip.

tucansam · January 2, 2017

And again, that has a LOT of information in it that I don't want on the internet.

So what, specifically, is needed to diagnose? Don't see how my server's network configuration, user names, share names, permissions, etc would be helpful. So please help me to post the most minimal amount of information necessary to accomplish this.

Thanks.

gubbgnutten · January 2, 2017

Have a look at Services, maybe a troubleshooting session would be appropriate?

trurl · January 2, 2017

Post SMART for parity drive.

itimpi · January 2, 2017

And again, that has a LOT of information in it that I don't want on the internet.

So what, specifically, is needed to diagnose? Don't see how my server's network configuration, user names, share names, permissions, etc would be helpful. So please help me to post the most minimal amount of information necessary to accomplish this.

It would be worth stating what causes you a problem, as all the personal information is meant to be anonymized. Most of the things you mention do not seem at all sensitive, but if any are then the anonymizing code may need revisiting. The whole idea was to collect in one file all the information needed for troubleshooting.
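If the concern is specific items such as addresses, one option is to scrub them yourself before posting a log. The Python sketch below is not the anonymizing code used by the diagnostics tool; it is a hypothetical `redact` helper that masks only IPv4 and MAC addresses and leaves hostnames, user names, and share names untouched.

```python
import re

# Dotted-quad IPv4 addresses (no range checking; this is deliberately loose).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Colon-separated MAC addresses: six pairs of hex digits.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact(line):
    """Mask MAC and IPv4 addresses in one log line; leave the rest as-is."""
    line = MAC_RE.sub("xx:xx:xx:xx:xx:xx", line)
    return IPV4_RE.sub("x.x.x.x", line)


# Made-up example line, not taken from the poster's syslog.
print(redact("eth0: link up 192.168.1.10 mac 00:1a:2b:3c:4d:5e"))
```

Timestamps and the ATA `cmd`/`res` byte strings have fewer than six colon-separated pairs, so they pass through unchanged. A four-part version number would be masked as if it were an IP address, so check the output before posting.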
