tucansam Posted October 2, 2016
Well, two issues. 6.1.3

For the last few months, I've noticed that the server is unavailable during the monthly parity check. The web interface doesn't answer and no shares are accessible. I can ssh into the server, but that's it. This only started happening recently, in the last three or four months.

I went to the syslog and saw a lot of these:
--
Sep 30 19:27:31 ffs1 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Sep 30 19:27:31 ffs1 kernel: ata1.00: irq_stat 0x08000000, interface fatal error
Sep 30 19:27:31 ffs1 kernel: ata1: SError: { UnrecovData Handshk }
Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT
Sep 30 19:27:31 ffs1 kernel: ata1.00: cmd 35/00:40:18:cd:a8/00:05:40:00:00/e0 tag 11 dma 688128 out
Sep 30 19:27:31 ffs1 kernel:          res 50/00:00:18:cd:a8/00:00:40:00:00/e0 Emask 0x10 (ATA bus error)
Sep 30 19:27:31 ffs1 kernel: ata1.00: status: { DRDY }
Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link
Sep 30 19:27:31 ffs1 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 30 19:27:31 ffs1 kernel: ata1.00: configured for UDMA/133
Sep 30 19:27:31 ffs1 kernel: ata1: EH complete
--

Not sure what is going on, or if the two issues are related. Right now the drives are blinking like it's doing its parity check (2nd of the month) and the server is 100% unavailable except for ssh. I ssh'd in and issued 'powerdown' -- no effect. So I issued 'reboot' -- no effect.
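
Editorial sketch: since the post mentions "a lot of these" ATA errors, one way to see whether they cluster on a single port is to tally them from a copy of the syslog. The snippet below is only an illustration in Python (it is not an Unraid tool, and the function name `summarize_ata_events` is made up here). It assumes the kernel lines look like the ones pasted above and counts link resets and failed commands per `ataN` port.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... ffs1 kernel: ata1.00: failed command: WRITE DMA EXT"
#   "... ffs1 kernel: ata1: hard resetting link"
# The port is the "ataN" prefix; an optional ".NN" device suffix is ignored.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(lines):
    """Count hard link resets and failed commands per ATA port."""
    resets = Counter()
    failed = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("hard resetting link"):
            resets[port] += 1
        elif message.startswith("failed command:"):
            command = message.split(": ", 1)[1].strip()
            failed[(port, command)] += 1
    return resets, failed


# A few lines taken from the excerpt in the post.
sample = [
    "Sep 30 19:27:31 ffs1 kernel: ata1.00: irq_stat 0x08000000, interface fatal error",
    "Sep 30 19:27:31 ffs1 kernel: ata1.00: failed command: WRITE DMA EXT",
    "Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link",
    "Sep 30 19:27:31 ffs1 kernel: ata1: EH complete",
]
resets, failed = summarize_ata_events(sample)
print(dict(resets))
print(dict(failed))
```

To run it over a whole log, pass an open file instead of the sample list, for example `summarize_ata_events(open("syslog.txt"))`, where the path is whatever copy of the syslog you have at hand.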
'top' shows a system load higher than I've ever seen on any unix system, ever:
--
top - 07:49:47 up 16 days, 14:27, 1 user, load average: 130.55, 124.78, 110.1
Tasks: 244 total, 3 running, 241 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.4%us, 26.2%sy, 0.0%ni, 30.9%id, 31.4%wa, 0.0%hi, 11.2%si, 0.0%st
Mem: 3966868k total, 3919580k used, 47288k free, 426532k buffers
Swap: 0k total, 0k used, 0k free, 2781972k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 1531 root     20   0     0    0    0 R   26  0.0 263:02.15 unraidd
 1463 root     20   0     0    0    0 D   18  0.0 109:56.53 mdrecoveryd
  995 root      0 -20     0    0    0 S   13  0.0  76:30.84 kworker/1:1H
  997 root      0 -20     0    0    0 R   13  0.0  79:41.14 kworker/0:1H
 1430 root     20   0  9832 2660 1988 S    1  0.1  32:21.24 cpuload
22235 root     20   0 13412 2388 1968 R    0  0.1   0:00.02 top
    1 root     20   0  4368 1596 1500 S    0  0.0   0:15.95 init
    2 root     20   0     0    0    0 S    0  0.0   0:00.54 kthreadd
    3 root     20   0     0    0    0 S    0  0.0   5:24.42 ksoftirqd/0
    5 root      0 -20     0    0    0 S    0  0.0   0:00.00 kworker/0:0H
    7 root     20   0     0    0    0 S    0  0.0   5:26.07 rcu_preempt
    8 root     20   0     0    0    0 S    0  0.0   0:00.00 rcu_sched
    9 root     20   0     0    0    0 S    0  0.0   0:00.00 rcu_bh
   10 root     RT   0     0    0    0 S    0  0.0   0:12.22 migration/0
   11 root     RT   0     0    0    0 S    0  0.0   0:12.64 migration/1
   12 root     20   0     0    0    0 S    0  0.0   0:05.34 ksoftirqd/1
   14 root      0 -20     0    0    0 S    0  0.0   0:00.00 kworker/1:0H
--

No idea what to think right now. In the past, I have simply hard powered-down the server in such circumstances and dealt with the resulting parity check after power on. Not sure I want to do that again. Advice welcome.

BTW syslog is pretty boring (99% spindown messages) otherwise. I have run 'diagnostics' but have no way of getting it off the system until shares come back...

Also 'ls /mnt/user' (trying to remember if I mapped an nfs share to another server) hangs!!!

---

....And....
it's getting worse

root@ffs1:/# uptime
 08:03:13 up 16 days, 14:40, 2 users, load average: 150.74, 144.84, 129.63
root@ffs1:/# uptime
 08:05:42 up 16 days, 14:43, 2 users, load average: 153.44, 148.20, 133.09
root@ffs1:/# uptime
 08:05:45 up 16 days, 14:43, 2 users, load average: 153.49, 148.29, 133.20
root@ffs1:/# uptime
 08:05:48 up 16 days, 14:43, 2 users, load average: 153.49, 148.29, 133.20
root@ffs1:/# uptime
 08:06:42 up 16 days, 14:44, 2 users, load average: 153.79, 149.24, 134.39
root@ffs1:/# uptime
 08:06:43 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50
root@ffs1:/# uptime
 08:06:45 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50
root@ffs1:/# uptime
 08:06:46 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50
root@ffs1:/# uptime
 08:06:47 up 16 days, 14:44, 2 users, load average: 153.81, 149.32, 134.50
root@ffs1:/# uptime
 08:06:50 up 16 days, 14:44, 2 users, load average: 153.82, 149.40, 134.60
root@ffs1:/#
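
Editorial sketch: on Linux the load average counts tasks in uninterruptible sleep (state `D`, usually waiting on I/O) as well as runnable ones, so a climbing load with mostly idle CPU usually means more and more tasks are stuck waiting on disks. The `top` output above already shows `mdrecoveryd` in state `D`. A rough way to list such processes is to read `/proc/<pid>/stat`. The Python below is a sketch under assumptions: it needs a Python 3 interpreter on the server (which may have to be installed separately), and the helper names `parse_stat` and `blocked_tasks` are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    command = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, command, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, command))
    return sorted(blocked)


# Demonstration on a hand-written stat line (the trailing fields are dummy values).
print(parse_stat("1463 (mdrecoveryd) D 2 0 0 0"))
```

On the server itself, `blocked_tasks()` called every few seconds alongside `uptime` would show whether the set of stuck processes grows in step with the load average. It only walks per-process entries, so threads blocked inside a multi-threaded process are not listed individually.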
tucansam Posted January 2, 2017
Still happens every month. One server is fine; the other server is 100% unavailable, except for ssh, during the parity check. Shares time out and can't be read from. Copying data to shares times out. Data is completely inaccessible over SMB/NFS. Anyone?
tucansam Posted January 2, 2017
There's a lot of stuff in there I don't want on the internet. What specifically do you want to see?
JorgeB Posted January 2, 2017
You can anonymize the diagnostics; it will be difficult to get help here without posting them.
tucansam Posted January 2, 2017
Syslog attached. syslog.txt
trurl Posted January 2, 2017
Syslog wasn't what was asked for. Go to Tools - Diagnostics and post the complete diagnostics zip.
tucansam Posted January 2, 2017
And again, that has a LOT of information in it that I don't want on the internet. So what, specifically, is needed to diagnose? I don't see how my server's network configuration, user names, share names, permissions, etc. would be helpful. So please help me post the minimum amount of information necessary to accomplish this. Thanks.
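
Editorial sketch: for a plain-text log such as a syslog, a few obvious identifiers can be masked before posting. The Python below is only an illustration and is not the anonymizer built into Unraid's diagnostics. The function name `redact_line` and the replacement tokens are made up here. It masks IPv4 addresses, MAC addresses, and an optional hostname, and leaves everything else, including the ATA error lines, untouched. It may over-redact anything that merely looks like an IPv4 address, such as a four-part version number.

```python
import re

# Dotted-quad IPv4 addresses, e.g. 192.168.1.50 (no range check on each octet).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Colon-separated MAC addresses: six pairs of hex digits.
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact_line(line, hostname=None):
    """Mask MAC addresses, IPv4 addresses and, optionally, a hostname."""
    line = MAC.sub("xx:xx:xx:xx:xx:xx", line)
    line = IPV4.sub("x.x.x.x", line)
    if hostname:
        line = re.sub(r"\b" + re.escape(hostname) + r"\b", "HOST", line)
    return line


# The first line is from the thread; the second is a made-up example.
print(redact_line(
    "Sep 30 19:27:31 ffs1 kernel: ata1: hard resetting link", hostname="ffs1"))
print(redact_line(
    "Sep 30 19:27:31 ffs1 dhcpcd[1234]: eth0: leased 192.168.1.50", hostname="ffs1"))
```

Applied line by line to a copy of the log, this keeps timestamps and kernel messages intact. It does not look for user names or share names, so the result still needs a manual read-through before posting.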
gubbgnutten Posted January 2, 2017
Have a look at Services, maybe a troubleshooting session would be appropriate?
trurl Posted January 2, 2017
Post SMART for parity drive.
itimpi Posted January 2, 2017
tucansam wrote: "And again, that has a LOT of information in it that I don't want on the internet. So what, specifically, is needed to diagnose? Don't see how my server's network configuration, user names, share names, permissions, etc would be helpful. So please help me to post the most minimal amount of information necessary to accomplish this."
It would be worth stating exactly which items cause you a problem, as all the personal information is meant to be anonymized. Most of the things you mention do not seem at all sensitive, but if any are, then the anonymising code may need revisiting. The whole idea was to collect in one file all the information needed for troubleshooting.
This topic is now archived and is closed to further replies.