
Server unavailable during parity check, other errors


tucansam


Well, two issues.

 

unRAID 6.1.3

 

Last few months, I've noticed that the server is unavailable during the monthly parity check.  Web interface doesn't answer, no shares are accessible.  I can ssh into the server, but that's it.  Only recently started happening, last three or four months.

 

I went to the syslog and saw a lot of these:

 

 

--

Sep 30 19:27:31 ffs1 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen

Sep 30 19:27:31 ffs1 kernel: ata1.00: irq_stat 0x08000000, interface fatal error

Sep 30 19:27:31 ffs1 kernel: ata1: SError: { UnrecovData Handshk }

Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT

Sep 30 19:27:31 ffs1 kernel: ata1.00: cmd 35/00:40:18:cd:a8/00:05:40:00:00/e0 tag 11 dma 688128 out

Sep 30 19:27:31 ffs1 kernel:        res 50/00:00:18:cd:a8/00:00:40:00:00/e0 Emask 0x10 (ATA bus error)

Sep 30 19:27:31 ffs1 kernel: ata1.00: status: { DRDY }

Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link

Sep 30 19:27:31 ffs1 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Sep 30 19:27:31 ffs1 kernel: ata1.00: configured for UDMA/133

Sep 30 19:27:31 ffs1 kernel: ata1: EH complete

--
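
For anyone reading the SError line above: the names in braces are decoded bits of the SErr value in the first log line (0x400100 once the wrapped line is rejoined). Here is a minimal sketch in Python of that decoding. It assumes the bit positions used by the Linux libata driver; the table is a partial list written from memory, and `SERR_BITS` and `decode_serr` are names made up for this example, so check it against your kernel source before relying on it.

```python
# Partial map of SATA SError register bits to the short names the kernel prints.
# Assumption: bit positions follow the Linux libata driver.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all known bits set in an SErr value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value taken from the syslog excerpt in this post.
    print(decode_serr(0x400100))
```

Both bits describe trouble on the link between controller and drive, which matches the "ATA bus error" text. That is usually read as a hint to look at the cable, port, or power for whatever sits on ata1, though it is not proof on its own.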

 

 

Not sure what is going on, or if the two issues are related.  Right now the drives are blinking like it's doing its parity check (2nd of the month) and the server is 100% unavailable except for ssh.  I ssh'd in and issued 'powerdown' -- no effect.  So I issued 'reboot' -- no effect.

 

'top' shows a load average higher than I've ever seen on any unix system, ever:

 

--

top - 07:49:47 up 16 days, 14:27,  1 user,  load average: 130.55, 124.78, 110.1

Tasks: 244 total,  3 running, 241 sleeping,  0 stopped,  0 zombie

Cpu(s):  0.4%us, 26.2%sy,  0.0%ni, 30.9%id, 31.4%wa,  0.0%hi, 11.2%si,  0.0%st

Mem:  3966868k total,  3919580k used,    47288k free,  426532k buffers

Swap:        0k total,        0k used,        0k free,  2781972k cached

 

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

1531 root      20  0    0    0    0 R  26  0.0 263:02.15 unraidd

1463 root      20  0    0    0    0 D  18  0.0 109:56.53 mdrecoveryd

  995 root      0 -20    0    0    0 S  13  0.0  76:30.84 kworker/1:1H

  997 root      0 -20    0    0    0 R  13  0.0  79:41.14 kworker/0:1H

1430 root      20  0  9832 2660 1988 S    1  0.1  32:21.24 cpuload

22235 root      20  0 13412 2388 1968 R    0  0.1  0:00.02 top

    1 root      20  0  4368 1596 1500 S    0  0.0  0:15.95 init

    2 root      20  0    0    0    0 S    0  0.0  0:00.54 kthreadd

    3 root      20  0    0    0    0 S    0  0.0  5:24.42 ksoftirqd/0

    5 root      0 -20    0    0    0 S    0  0.0  0:00.00 kworker/0:0H

    7 root      20  0    0    0    0 S    0  0.0  5:26.07 rcu_preempt

    8 root      20  0    0    0    0 S    0  0.0  0:00.00 rcu_sched

    9 root      20  0    0    0    0 S    0  0.0  0:00.00 rcu_bh

  10 root      RT  0    0    0    0 S    0  0.0  0:12.22 migration/0

  11 root      RT  0    0    0    0 S    0  0.0  0:12.64 migration/1

  12 root      20  0    0    0    0 S    0  0.0  0:05.34 ksoftirqd/1

  14 root      0 -20    0    0    0 S    0  0.0  0:00.00 kworker/1:0H

--

 

 

No idea what to think right now.  In the past, I have simply hard powered-down the server in such circumstances, and simply dealt with the forthcoming parity check after power on.  Not sure I want to do that again.

 

Advice welcome.

 

BTW syslog is pretty boring (99% spindown messages) otherwise.
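
Since the syslog is mostly spindown noise, one way to see how often the ATA errors occur is to count them per port. A rough Python sketch follows, assuming the line format shown in the excerpt earlier in this post; `ATA_LINE` and `summarize_ata` are hypothetical names, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches kernel lines such as "... kernel: ata1.00: failed command: WRITE DMA EXT"
# and captures the port ("ata1") and the message text.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata(lines):
    """Count failed commands and hard link resets per ATA port."""
    failed = Counter()
    resets = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # spindown messages and everything else are skipped
        port, message = match.groups()
        if message.startswith("failed command"):
            failed[port] += 1
        elif message.startswith("hard resetting link"):
            resets[port] += 1
    return failed, resets


if __name__ == "__main__":
    sample = [
        "Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT",
        "Sep 30 19:27:31 ffs1 kernel: ata1.00: status: { DRDY }",
        "Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link",
    ]
    failed, resets = summarize_ata(sample)
    for port in sorted(set(failed) | set(resets)):
        print(port, "failed commands:", failed[port], "link resets:", resets[port])
```

`summarize_ata` accepts any iterable of lines, so an open file handle on the real syslog can be passed in place of `sample`. If every hit lands on a single port, that narrows the search to one drive and its cabling.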

 

I have run 'diagnostics' but have no way of getting it off the system until shares come back... Also 'ls /mnt/user' (trying to remember if I mapped an nfs share to another server) hangs!!!

 

---

 

....And.... it's getting worse

 

 

root@ffs1:/# uptime

08:03:13 up 16 days, 14:40,  2 users,  load average: 150.74, 144.84, 129.63

root@ffs1:/# uptime

08:05:42 up 16 days, 14:43,  2 users,  load average: 153.44, 148.20, 133.09

root@ffs1:/# uptime

08:05:45 up 16 days, 14:43,  2 users,  load average: 153.49, 148.29, 133.20

root@ffs1:/# uptime

08:05:48 up 16 days, 14:43,  2 users,  load average: 153.49, 148.29, 133.20

root@ffs1:/# uptime

08:06:42 up 16 days, 14:44,  2 users,  load average: 153.79, 149.24, 134.39

root@ffs1:/# uptime

08:06:43 up 16 days, 14:44,  2 users,  load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:45 up 16 days, 14:44,  2 users,  load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:46 up 16 days, 14:44,  2 users,  load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:47 up 16 days, 14:44,  2 users,  load average: 153.81, 149.32, 134.50

root@ffs1:/# uptime

08:06:50 up 16 days, 14:44,  2 users,  load average: 153.82, 149.40, 134.60

root@ffs1:/#
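
A note on reading these numbers: on Linux the load average counts tasks in uninterruptible sleep (state D, like mdrecoveryd in the top output) as well as runnable ones, so a load above 150 with the CPU partly idle suggests many tasks stuck waiting on I/O rather than a busy CPU. To track the trend without retyping 'uptime', here is a small Python sketch that parses pasted uptime lines. `LOAD_RE`, `parse_uptime` and `one_minute_change` are names invented for this example, and the regular expression assumes the output format shown above.

```python
import re

# Matches lines like:
# "08:03:13 up 16 days, 14:40,  2 users,  load average: 150.74, 144.84, 129.63"
LOAD_RE = re.compile(
    r"^(\d\d:\d\d:\d\d) up .*load average: ([\d.]+), ([\d.]+), ([\d.]+)"
)


def parse_uptime(line):
    """Return (clock, load1, load5, load15), or None if the line is not uptime output."""
    match = LOAD_RE.search(line.strip())
    if not match:
        return None
    clock, one, five, fifteen = match.groups()
    return clock, float(one), float(five), float(fifteen)


def one_minute_change(lines):
    """Change in the 1-minute load between the first and last parsed sample."""
    samples = [s for s in map(parse_uptime, lines) if s is not None]
    if len(samples) < 2:
        return None
    return round(samples[-1][1] - samples[0][1], 2)


if __name__ == "__main__":
    pasted = [
        "root@ffs1:/# uptime",
        "08:03:13 up 16 days, 14:40,  2 users,  load average: 150.74, 144.84, 129.63",
        "root@ffs1:/# uptime",
        "08:06:50 up 16 days, 14:44,  2 users,  load average: 153.82, 149.40, 134.60",
    ]
    print("1-minute load change:", one_minute_change(pasted))
```

Prompt lines and anything else that is not uptime output are ignored, so a whole terminal session can be pasted in as-is.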

 

 

  • 2 months later...

And again, that has a LOT of information in it that I don't want on the internet. 

 

So what, specifically, is needed to diagnose?  Don't see how my server's network configuration, user names, share names, permissions, etc would be helpful.  So please help me to post the most minimal amount of information necessary to accomplish this.
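
One possible way to post less: keep only the kernel lines from the syslog and mask anything that looks like an address before sharing. The Python sketch below is an illustration only. It is not the anonymizer that the 'diagnostics' command uses, `redact` and `kernel_lines_only` are hypothetical names, and the two patterns cover only IPv4 and MAC addresses, so the output should still be read through before posting.

```python
import re

# Patterns for the two most common identifiers in a syslog.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact(line):
    """Mask MAC addresses first, then IPv4 addresses, in a single log line."""
    line = MAC.sub("xx:xx:xx:xx:xx:xx", line)
    return IPV4.sub("x.x.x.x", line)


def kernel_lines_only(lines):
    """Keep only kernel messages (where disk and controller errors show up), redacted."""
    return [redact(line.rstrip("\n")) for line in lines if " kernel: " in line]


if __name__ == "__main__":
    sample = [
        "Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT",
        "Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link",
    ]
    for line in kernel_lines_only(sample):
        print(line)
```

The ATA command lines (for example `cmd 35/00:40:18:cd:a8/...`) contain at most five colon-separated byte pairs in a row, so the six-pair MAC pattern leaves them untouched.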

 

Thanks.

 


And again, that has a LOT of information in it that I don't want on the internet. 

So what, specifically, is needed to diagnose?  Don't see how my server's network configuration, user names, share names, permissions, etc would be helpful.  So please help me to post the most minimal amount of information necessary to accomplish this.

It would be worth stating what causes you a problem, as all the personal information is meant to be anonymized.  Most of the things you mention do not seem at all sensitive, but if any are, then the anonymizing code may need revisiting. The whole idea was to collect in one file all the information needed for troubleshooting.
