Tezmo Posted August 13, 2015 Would appreciate some general ideas with regard to my problem - because it involves a hard lock I can't retrieve any log files, but would value any suggestions nonetheless. Running unRAID 6.0.1 with 21 drives in the array and 1 parity drive - this is an array migrated as a 'clean' install from 5, and the drives are formatted with ReiserFS. Various sizes from 2 to 6 TB. I have a further 2TB drive outside the array (ext4 formatted) which I mount in my go script, which looks like this:

root@Stash:/mnt# cat /boot/config/go
#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &
mkdir /mnt/scratch
mount /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WMAZA4999308-part1 /mnt/scratch
cp /boot/bin/* /usr/bin/

The last line is to make available a few binaries I use, like unrar. The server also runs plugins for SABnzbd, Sick Beard, and Plex Media Server.

The symptom is a full I/O lockup of unRAID - anecdotally it seems related to copying/moving data from the drive outside the array to the array. This can be triggered either manually by me in a console or by one of the plugins. It corresponds to a 100% load on the shfs process, and any attempt to read from the array hangs in perpetuity. I cannot unmount the devices or stop the array cleanly. The web interface is inaccessible - I suspect it still works fine, it's just that it wants to perform some array I/O. The drive outside the array remains readable and writable. The network stays up, so I can telnet/ssh in, but I cannot perform any array I/O. The diagnostics command just hangs when I run it to generate logs as per the Read me first! instructions. All I can do to recover the situation is an unclean power-down.

Symptomatically it looks a lot like the ReiserFS kernel panic issue (https://lime-technology.com/forum/index.php?topic=35788.0), but I am told that has been fixed. Any pointers would be very much appreciated!
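As an aside, the mount step in the go script above assumes the by-id device always appears and that the mount succeeds. A sketch of a more defensive version (the `mount_scratch` helper name and the error message are illustrative, not unRAID conventions; the device path is the one from the go script):

```shell
#!/bin/bash
# Sketch: mount the scratch disk defensively instead of assuming success,
# so a missing or renamed device doesn't leave boot in a half-finished state.
# mount_scratch is a hypothetical helper.
mount_scratch() {
    local dev=$1 mnt=$2
    mkdir -p "$mnt" || return 1
    if ! mount "$dev" "$mnt"; then
        echo "go: failed to mount $dev on $mnt" >&2
        return 1
    fi
}

# In the go script:
# mount_scratch /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WMAZA4999308-part1 /mnt/scratch
```

Downstream steps (the plugins that write to /mnt/scratch) could then be skipped when the function returns non-zero, instead of writing into an unmounted directory on the root filesystem.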
BRiT Posted August 13, 2015 First things first, get rid of the binary copy step at the end of your go file. This is to make sure that isn't causing any issues, especially if you're accidentally overwriting 64-bit binaries with possibly 32-bit binaries or with corrupted files. Also, disable any and all plugins. You should be using Docker containers for additional stability - but keep those disabled too until you get your system stable first.
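One quick way to sanity-check that copy step before (or after) removing it is to look at the ELF header of each file in /boot/bin. A sketch, assuming a standard 64-bit Linux userland (`is_elf64` and `audit_bins` are hypothetical helpers, not unRAID tools):

```shell
# Sketch: detect whether a file is a 64-bit ELF binary by reading its
# first five bytes (the ELF magic plus the EI_CLASS byte).
is_elf64() {
    local hdr
    hdr=$(head -c 5 "$1" | od -An -tx1 | tr -d ' \n')
    # 7f 'E' 'L' 'F' followed by class 02 = 64-bit
    [ "$hdr" = "7f454c4602" ]
}

# Report anything in a directory that is NOT a 64-bit ELF (scripts,
# 32-bit binaries, truncated/corrupted files).
audit_bins() {
    local dir=$1 f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        is_elf64 "$f" || echo "not a 64-bit ELF: $f"
    done
    return 0
}

# e.g. audit_bins /boot/bin
```

Anything flagged by `audit_bins` is either harmless (a shell script) or exactly the kind of mismatched binary worth removing from the copy step.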
Tezmo Posted August 13, 2015 Thank you mate - will strip out all the extraneous bollocks. FWIW, I just checked the contents of my copy operation and there aren't any binaries in there - just bash scripts. Doing this now and will update with another stability report.
Tezmo Posted September 22, 2015 Ultimately I put this down to RAM - having swapped it out for something newer (and bigger!), the problem has gone away entirely.
Tezmo Posted October 23, 2015 But now it's back! And this time I have a syslog. CPU is stuck at 100% as before. I can't access the array when telnetted in - can't run ls, for example. The web interface is unresponsive. All the esoteric stuff in my initial post is now gone - I have a straight-up 21-drive array, plus parity, running nothing more than the latest version of unRAID. I believe it packed in at 13:33. Here is the tail end of my syslog:

Oct 23 12:58:11 Stash avahi-daemon[6433]: Received response from host 172.16.20.60 with invalid source port 57544 on interface 'eth0.0'
Oct 23 13:33:50 Stash udevd[25024]: timeout 'ata_id --export /dev/sdc'
Oct 23 13:33:50 Stash udevd[25024]: timeout 'scsi_id --export --whitelisted -d /dev/sdc'
Oct 23 13:34:43 Stash kernel: sdc: sdc1
Oct 23 19:10:19 Stash in.telnetd[3163]: connect from 172.16.22.12 (172.16.22.12)

It looks like these timeouts are to blame? Here is the top of my top output:

  PID USER  PR NI  VIRT RES SHR S %CPU %MEM     TIME+ COMMAND
11811 root  20  0 4827m 20m 800 S  100  0.1 434:17.83 shfs
 9852 root  20  0     0   0   0 S    4  0.0  92:18.90 unraidd

Here is the SMART report on sdc:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 100   253   021    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           2
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           125
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           2
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           0
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           80
194 Temperature_Celsius     0x0022 122   120   000    Old_age  Always  -           30
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 100   253   000    Old_age  Offline -           0

1) Can I bring this thing back to life without rebooting? Physical access is tricky.
2) Is sdc to blame here? My array is set to never power down drives.

Thank you all for your thoughts, Adrian.
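Since the box locks up before the diagnostics command can run, one option is to snapshot the shfs state to persistent storage every minute from cron, so there is evidence that survives the unclean power-down. A sketch (the `log_shfs` helper and the /boot paths are illustrative; on unRAID the flash drive at /boot is one place that survives a reset):

```shell
# Sketch: append a timestamped snapshot of shfs CPU usage to a log file.
# Intended to run from cron so the last entries before a hang are preserved.
# log_shfs is a hypothetical helper; requires procps-style ps (-C option).
log_shfs() {
    local out=$1
    {
        date
        ps -C shfs -o pid=,pcpu=,etime= || echo "shfs not running"
    } >> "$out" 2>&1
}

# crontab entry (illustrative):
# * * * * * /boot/bin/log_shfs.sh /boot/logs/shfs-watch.log
```

After the next lockup, the tail of that log would show whether shfs pegged the CPU gradually or jumped to 100% at a single timestamp, which helps correlate against the udevd timeouts in syslog.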
Tezmo Posted January 21, 2016 I have now managed to tie this to a particular event: it is triggered by an attempt to delete a file. Now running 6.1.6. Here is the output from syslog:

Jan 21 16:52:09 Stash kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Jan 21 16:52:09 Stash kernel: IP: [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: PGD 112bb0067 PUD 12ada1067 PMD 0
Jan 21 16:52:09 Stash kernel: Oops: 0002 [#1] PREEMPT SMP
Jan 21 16:52:09 Stash kernel: Modules linked in: md_mod xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat e1000e ptp mvsas ahci libsas i2c_i801 libahci scsi_transport_sas pps_core [last unloaded: md_mod]
Jan 21 16:52:09 Stash kernel: CPU: 6 PID: 24670 Comm: shfs Not tainted 4.1.13-unRAID #1
Jan 21 16:52:09 Stash kernel: Hardware name: ASUS All Series/Z87-PLUS, BIOS 2103 08/15/2014
Jan 21 16:52:09 Stash kernel: task: ffff88041d3cdc20 ti: ffff880002bc8000 task.ti: ffff880002bc8000
Jan 21 16:52:09 Stash kernel: RIP: 0010:[<ffffffff811538d4>] [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: RSP: 0018:ffff880002bcbb68 EFLAGS: 00010246
Jan 21 16:52:09 Stash kernel: RAX: ffff8800254d3ca8 RBX: ffff8800254d3c80 RCX: 0000000000000000
Jan 21 16:52:09 Stash kernel: RDX: 0000000000000000 RSI: ffff8800254d3c80 RDI: ffff880002bcbca8
Jan 21 16:52:09 Stash kernel: RBP: ffff880002bcbb98 R08: 0000000000000293 R09: 000000000000e3dd
Jan 21 16:52:09 Stash kernel: R10: ffffffff00001001 R11: ffffffff8116106d R12: ffff880002bcbca8
Jan 21 16:52:09 Stash kernel: R13: ffff8800254d3d20 R14: ffff880002bcbca8 R15: 0000000000000000
Jan 21 16:52:09 Stash kernel: FS: 00002b33b5e0e700(0000) GS:ffff88042fb80000(0000) knlGS:0000000000000000
Jan 21 16:52:09 Stash kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 21 16:52:09 Stash kernel: CR2: 0000000000000008 CR3: 0000000414396000 CR4: 00000000001406e0
Jan 21 16:52:09 Stash kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 21 16:52:09 Stash kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 21 16:52:09 Stash kernel: Stack:
Jan 21 16:52:09 Stash kernel: ffff8803aa92c000 ffff880002bcbca8 ffffc900077da000 ffffc900077fa1e8
Jan 21 16:52:09 Stash kernel: ffff880002bcbca8 ffff8800069d3000 ffff880002bcbbc8 ffffffff81153953
Jan 21 16:52:09 Stash kernel: ffff880002bcbca8 ffff88041d3cdc20 ffffc900077da000 ffffc900077da000
Jan 21 16:52:09 Stash kernel: Call Trace:
Jan 21 16:52:09 Stash kernel: [<ffffffff81153953>] reiserfs_discard_all_prealloc+0x44/0x4e
Jan 21 16:52:09 Stash kernel: [<ffffffff811701cc>] do_journal_end+0x4e7/0xc78
Jan 21 16:52:09 Stash kernel: [<ffffffff81170ebc>] journal_end+0xae/0xb6
Jan 21 16:52:09 Stash kernel: [<ffffffff811576f8>] reiserfs_unlink+0x1de/0x23f
Jan 21 16:52:09 Stash kernel: [<ffffffff811062b2>] vfs_unlink+0xc5/0x165
Jan 21 16:52:09 Stash kernel: [<ffffffff8110a063>] do_unlinkat+0x107/0x24c
Jan 21 16:52:09 Stash kernel: [<ffffffff81101863>] ? SyS_newlstat+0x25/0x2e
Jan 21 16:52:09 Stash kernel: [<ffffffff8110a7b0>] SyS_unlink+0x11/0x13
Jan 21 16:52:09 Stash kernel: [<ffffffff815f69ae>] system_call_fastpath+0x12/0x71
Jan 21 16:52:09 Stash kernel: Code: 1c 75 bb 0f 0b 85 c0 74 12 48 8b 93 e8 00 00 00 4c 89 ee 4c 89 e7 e8 be 6e 00 00 48 8b 4b 28 44 89 7b 1c 48 8d 43 28 48 8b 53 30 <48> 89 51 08 48 89 0a 48 89 43 28 48 89 43 30 58 5b 41 5c 41 5d
Jan 21 16:52:09 Stash kernel: RIP [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: RSP <ffff880002bcbb68>
Jan 21 16:52:09 Stash kernel: CR2: 0000000000000008
Jan 21 16:52:09 Stash kernel: ---[ end trace 2ca03a34b3b066f1 ]---

I've seen various references to this null pointer dereference in other forum posts, all of which point to it being a problem that's been fixed. Does anyone have any great ideas?
JonathanM Posted January 21, 2016 Have you run a reiserfsck check on all of your /dev/md? devices?
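For reference, checking every array device from the console can be sketched like this (the `check_all` wrapper is hypothetical; the fsck command is passed in as an argument so nothing destructive is hard-wired, and on a real array you would stop it or start in maintenance mode first):

```shell
# Sketch: run a read-only filesystem check across every array device.
# check_all is a hypothetical helper. On a real unRAID array you would
# pass "reiserfsck --check" and the /dev/md* device list.
check_all() {
    local cmd=$1; shift
    local dev rc=0
    for dev in "$@"; do
        echo "=== checking $dev ==="
        # </dev/null so reiserfsck's interactive "Yes" prompt can't hang the loop
        $cmd "$dev" </dev/null || rc=1
    done
    return $rc
}

# e.g. check_all "reiserfsck --check" /dev/md*
```

Checking the /dev/md? devices (rather than the raw /dev/sd? partitions) matters on unRAID, because writes through the md layer are what keep parity in sync.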
Tezmo Posted January 21, 2016 Can confirm I have indeed, mate (through the GUI), with no ill effects reported on any devices.
JonathanM Posted January 21, 2016 What IP address is in use by your server?
Tezmo Posted January 21, 2016 It is using 172.16.20.35.
JonathanM Posted January 21, 2016 Interesting. That is not a "normal" private IP range for consumer routers, and I think it is one of the ranges used by the internal Docker network. I assume you set it up that way on purpose?
gubbgnutten Posted January 21, 2016 ...but still a perfectly valid private IP range, and Docker is supposed to select a range that is not in use on the host machine (which in my similar case typically results in 10.0.x.x).
JonathanM Posted January 21, 2016 Very true. I was just looking things over and that was one of the details that jumped out at me. My next suggestion is to remove ReiserFS from the equation. I had some similar stuff happening (100% CPU shfs, box is unresponsive) and moving to XFS solved the issue.
Tezmo Posted January 21, 2016 It just sits on the LAN at work, and that is the pre-existing range there. I am actually in the process of an XFS migration right now, in an attempt to knock this on the head. It would be nice to get something concrete, though - I guess if it never happens again after the change to XFS, it'd only qualify anecdotally as a fix. Thank you both!