Help with unraid array IO locking up

Tezmo · August 13, 2015

Would appreciate some general ideas with regard to my problems - because they involve a hard lock I can't retrieve any log files but would value any suggestions nonetheless.

Running unraid 6.0.1 with 21 drives in the array and 1 parity drive - this is an array migrated as a 'clean' install from 5 and the drives are formatted with RFS. Various sizes from 2 to 6 TB.

I have a further 2TB drive outside the array (ext4 formatted) which I mount in my go script, which looks like this:

root@Stash:/mnt# cat /boot/config/go
#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &
mkdir /mnt/scratch
mount /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WMAZA4999308-part1 /mnt/scratch
cp /boot/bin/* /usr/bin/

The last line is to make available a few binaries I use like unrar etc. The server also runs plugins for sabnzbd, sickbeard, and plex server.

The symptom is that I am experiencing a full IO lockup of unraid - anecdotally it seems related to copying/moving data from my drive outside the array to my array - this could be triggered either manually by me in a console, or by one of my plugins. It corresponds to a 100% load on the shfs process, and any attempt to read from the array sits there in perpetuity. I cannot unmount the devices nor stop the array elegantly. The web interface is inaccessible - I suspect it still works fine, its just that it wants to perform some array IO. The drive outside of the array remains readable and writeable. The network stays up so I can telnet/ssh in but cannot perform any array IO. The diagnostics command just hangs when I run it to generate logs as per the Read me first! instructions. All I can do to recover the situation is an unclean power down.

Symptomatically it looks a lot like the ReiserFS kernel panic issue (https://lime-technology.com/forum/index.php?topic=35788.0) but I am told that has been fixed.

Any pointers would be very much appreciated!

BRiT · August 13, 2015

First things first, get rid of the binary copy step at the end of your go file. This is to make sure that isnt causing any issues, especially if you're accidently overwritting 64bit binaries with possibly 32bit binaries or with corrupted files.

Also, disable any and all plugins.

You should be using Dockers for additional stability.

But keep those disabled too until you get your system stable first.

Tezmo · August 13, 2015

Thank you mate - will strip out all the extraneous bollocks - fwiw just checked the contents of my copy operation and there aren't any binaries in there - just bash scripts.

Doing this process now and will update with another stability report.

Tezmo · September 22, 2015

Ultimately put this down to RAM - having swapped it out for something newer (and bigger!) the problem has gone away entirely.

Tezmo · October 23, 2015

But now its back! And this time I have a syslog. CPU is stuck at 100% as before. I can't access the array when telnet'd in - can't run ls, for example. Web interface is unresponsive.

All the esoteric stuff in my initial post is now gone - I have a straight up 21 drive array, plus parity. Running nothing move than latest version of unraid.

I believe it packed in at 13:33. Here is the tail end of my syslog:

Oct 23 12:58:11 Stash avahi-daemon[6433]: Received response from host 172.16.20.60 with invalid source port 57544 on interface 'eth0.0'
Oct 23 13:33:50 Stash udevd[25024]: timeout 'ata_id --export /dev/sdc'
Oct 23 13:33:50 Stash udevd[25024]: timeout 'scsi_id --export --whitelisted -d /dev/sdc'
Oct 23 13:34:43 Stash kernel: sdc: sdc1
Oct 23 19:10:19 Stash in.telnetd[3163]: connect from 172.16.22.12 (172.16.22.12)

It looks like these timeouts are to blame?

here is the top of my top output:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                         
11811 root      20   0 4827m  20m  800 S  100  0.1 434:17.83 shfs                                                                                                                                                                             
9852 root      20   0     0    0    0 S    4  0.0  92:18.90 unraidd

here is the smart report on sdc:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       125
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0022   122   120   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

1) Can I bring this thing back to life without rebooting? Physical access is tricky.

2) Is SDC to blame here?

My array is set to never power down drives.

Thank you all for your thoughts,

Adrian.

Tezmo · January 21, 2016

I have now managed to tie this to a particular event. It is triggered by attempt to delete a file.

Now running 6.1.6.

here is the output from syslog:

Jan 21 16:52:09 Stash kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Jan 21 16:52:09 Stash kernel: IP: [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: PGD 112bb0067 PUD 12ada1067 PMD 0
Jan 21 16:52:09 Stash kernel: Oops: 0002 [#1] PREEMPT SMP
Jan 21 16:52:09 Stash kernel: Modules linked in: md_mod xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat e1000e ptp mvsas ahci libsas i2c_i801 libahci scsi_transport_sas pps_core [last unloaded: md_mod]
Jan 21 16:52:09 Stash kernel: CPU: 6 PID: 24670 Comm: shfs Not tainted 4.1.13-unRAID #1
Jan 21 16:52:09 Stash kernel: Hardware name: ASUS All Series/Z87-PLUS, BIOS 2103 08/15/2014
Jan 21 16:52:09 Stash kernel: task: ffff88041d3cdc20 ti: ffff880002bc8000 task.ti: ffff880002bc8000
Jan 21 16:52:09 Stash kernel: RIP: 0010:[<ffffffff811538d4>]  [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: RSP: 0018:ffff880002bcbb68  EFLAGS: 00010246
Jan 21 16:52:09 Stash kernel: RAX: ffff8800254d3ca8 RBX: ffff8800254d3c80 RCX: 0000000000000000
Jan 21 16:52:09 Stash kernel: RDX: 0000000000000000 RSI: ffff8800254d3c80 RDI: ffff880002bcbca8
Jan 21 16:52:09 Stash kernel: RBP: ffff880002bcbb98 R08: 0000000000000293 R09: 000000000000e3dd
Jan 21 16:52:09 Stash kernel: R10: ffffffff00001001 R11: ffffffff8116106d R12: ffff880002bcbca8
Jan 21 16:52:09 Stash kernel: R13: ffff8800254d3d20 R14: ffff880002bcbca8 R15: 0000000000000000
Jan 21 16:52:09 Stash kernel: FS:  00002b33b5e0e700(0000) GS:ffff88042fb80000(0000) knlGS:0000000000000000
Jan 21 16:52:09 Stash kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 21 16:52:09 Stash kernel: CR2: 0000000000000008 CR3: 0000000414396000 CR4: 00000000001406e0
Jan 21 16:52:09 Stash kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 21 16:52:09 Stash kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 21 16:52:09 Stash kernel: Stack:
Jan 21 16:52:09 Stash kernel: ffff8803aa92c000 ffff880002bcbca8 ffffc900077da000 ffffc900077fa1e8
Jan 21 16:52:09 Stash kernel: ffff880002bcbca8 ffff8800069d3000 ffff880002bcbbc8 ffffffff81153953
Jan 21 16:52:09 Stash kernel: ffff880002bcbca8 ffff88041d3cdc20 ffffc900077da000 ffffc900077da000
Jan 21 16:52:09 Stash kernel: Call Trace:
Jan 21 16:52:09 Stash kernel: [<ffffffff81153953>] reiserfs_discard_all_prealloc+0x44/0x4e
Jan 21 16:52:09 Stash kernel: [<ffffffff811701cc>] do_journal_end+0x4e7/0xc78
Jan 21 16:52:09 Stash kernel: [<ffffffff81170ebc>] journal_end+0xae/0xb6
Jan 21 16:52:09 Stash kernel: [<ffffffff811576f8>] reiserfs_unlink+0x1de/0x23f
Jan 21 16:52:09 Stash kernel: [<ffffffff811062b2>] vfs_unlink+0xc5/0x165
Jan 21 16:52:09 Stash kernel: [<ffffffff8110a063>] do_unlinkat+0x107/0x24c
Jan 21 16:52:09 Stash kernel: [<ffffffff81101863>] ? SyS_newlstat+0x25/0x2e
Jan 21 16:52:09 Stash kernel: [<ffffffff8110a7b0>] SyS_unlink+0x11/0x13
Jan 21 16:52:09 Stash kernel: [<ffffffff815f69ae>] system_call_fastpath+0x12/0x71
Jan 21 16:52:09 Stash kernel: Code: 1c 75 bb 0f 0b 85 c0 74 12 48 8b 93 e8 00 00 00 4c 89 ee 4c 89 e7 e8 be 6e 00 00 48 8b 4b 28 44 89 7b 1c 48 8d 43 28 48 8b 53 30 <48> 89 51 08 48 89 0a 48 89 43 28 48 89 43 30 58 5b 41 5c 41 5d
Jan 21 16:52:09 Stash kernel: RIP  [<ffffffff811538d4>] __discard_prealloc+0x98/0xb3
Jan 21 16:52:09 Stash kernel: RSP <ffff880002bcbb68>
Jan 21 16:52:09 Stash kernel: CR2: 0000000000000008
Jan 21 16:52:09 Stash kernel: ---[ end trace 2ca03a34b3b066f1 ]---

I've seen various references to this null pointer dereference in other forum posts, all of which point to it being a problem that's been fixed. Does anyone have any great ideas?

JonathanM · January 21, 2016

Have you run a reiserfsck check on all of your /dev/md? devices?

Tezmo · January 21, 2016

Can affirm I have indeed mate (through the GUI) with no ill effects reported on any devices.

JonathanM · January 21, 2016

What IP address is in use by your server?

Tezmo · January 21, 2016

It is using 172.16.20.35.

JonathanM · January 21, 2016

It is using 172.16.20.35.

Interesting. That is not a "normal" private IP range for consumer routers, and I think is one of the ranges used by the internal docker network. I assume you set it up that way on purpose?

gubbgnutten · January 21, 2016

It is using 172.16.20.35.

Interesting. That is not a "normal" private IP range for consumer routers, and I think is one of the ranges used by the internal docker network. I assume you set it up that way on purpose?

...but still a perfectly valid private IP range, and Docker is supposed to select a range that is not in use on the host machine (which in my similar case typically results in 10.0.x.x).

JonathanM · January 21, 2016

It is using 172.16.20.35.

Interesting. That is not a "normal" private IP range for consumer routers, and I think is one of the ranges used by the internal docker network. I assume you set it up that way on purpose?

...but still a perfectly valid private IP range, and Docker is supposed to select a range that is not in use on the host machine (which in my similar case typically results in 10.0.x.x).

Very true. I was just looking things over and that was one of the details that jumped out at me.

My next suggestion is to remove reiserfs from the equation. I had some similar stuff happening (100% CPU shfs, box is unresponsive) and moving to xfs solved the issue.

Tezmo · January 21, 2016

It is using 172.16.20.35.

Interesting. That is not a "normal" private IP range for consumer routers, and I think is one of the ranges used by the internal docker network. I assume you set it up that way on purpose?

...but still a perfectly valid private IP range, and Docker is supposed to select a range that is not in use on the host machine (which in my similar case typically results in 10.0.x.x).

Very true. I was just looking things over and that was one of the details that jumped out at me.

My next suggestion is to remove reiserfs from the equation. I had some similar stuff happening (100% CPU shfs, box is unresponsive) and moving to xfs solved the issue.

It just sits on the LAN at work, and such is the range that was pre-existing.

I am actually in the process of an xfs migration right now, in an attempt to knock this on the head. It would be nice to get something concrete though, I guess if it never happens again after the change to xfs, it'd only qualify anecdotally as a fix. Thank you both!

Help with unraid array IO locking up

Recommended Posts

Tezmo

Link to comment

BRiT

Link to comment

Tezmo

Link to comment

Tezmo

Link to comment

Tezmo

Link to comment

Tezmo

Link to comment

JonathanM

Link to comment

Tezmo

Link to comment

JonathanM

Link to comment

Tezmo

Link to comment

JonathanM

Link to comment

gubbgnutten

Link to comment

JonathanM

Link to comment

Tezmo

Link to comment

Archived