Jump to content
Benson

[Solve] 6.5.0 RC1/RC2/RC3 Array drive spindown then can't stop

19 posts in this topic Last Reply

Recommended Posts

In RC1 & RC2, if all disk spin down, then shutdown/reboot won't work, GUI gone but telnet still function and /dev/md2 md3 .... still in mount state.

 

Only hard reset can reboot and array will auto parity check when start again.

 

diagnostics not work too .......

 

Filesystem      1K-blocks       Used  Available Use% Mounted on
rootfs           16358820     583740   15775080   4% /
tmpfs               32768        220      32548   1% /run
devtmpfs         16396408          0   16396408   0% /dev
tmpfs            16440244          0   16440244   0% /dev/shm
cgroup_root          8192          0       8192   0% /sys/fs/cgroup
tmpfs              131072        756     130316   1% /var/log
/dev/sda1         7800304     554656    7245648   8% /boot
/dev/loop0           7680       7680          0 100% /lib/modules
/dev/loop1           4608       4608          0 100% /lib/firmware
/dev/md2       5860522532 5760145344  100117472  99% /mnt/disk2
/dev/md3       5860522532 5764139424   96021264  99% /mnt/disk3
/dev/md4       5860522532 5769269168   90980816  99% /mnt/disk4
/dev/md5       5860522532 5467203020  392547092  94% /mnt/disk5
/dev/md6       5860522532 5762692312   97055912  99% /mnt/disk6
/dev/md7       5860522532  971330608 4888639312  17% /mnt/disk7
/dev/nvme0n1p1  500107576   82069920  417126992  17% /mnt/cache
root@X370:~# diagnostics
Starting diagnostics collection...

Edited by Benson

Share this post


Link to post

Try to get diagnostics , but it just hang in collection ...

BTW, dmesg woulg help ?

 

It seems md2 md3 ...... haven't spin up

 

 [   73.970071] mdcmd (60): check correct
[   73.970083] md: recovery thread: check P Q ...
[   73.973566] md: using 15360k window, over a total of 5860522532 blocks.
[  112.950447] mdcmd (61): nocheck
[  113.001749] md: md_do_sync: got signal, exit...
[  113.039032] md: recovery thread: completion status: -4
[  259.021840] mdcmd (62): spindown 0
[  259.892214] mdcmd (63): spindown 1
[  260.439064] mdcmd (64): spindown 2
[  261.259056] mdcmd (65): spindown 3
[  262.078673] mdcmd (66): spindown 4
[  262.908611] mdcmd (67): spindown 5
[  263.743781] mdcmd (68): spindown 6
[  264.670593] mdcmd (69): spindown 7
[  265.644618] mdcmd (70): spindown 29
[  290.608096] mdcmd (71): nocheck
[  290.608099] md: nocheck_array: check not active
[  290.608268] mdcmd (72): spinup 0
[  290.608286] mdcmd (73): spinup 1
[  290.608301] mdcmd (74): spinup 2
[  290.608317] mdcmd (75): spinup 3
[  290.608331] mdcmd (76): spinup 4
[  290.608340] mdcmd (77): spinup 5
[  290.608351] mdcmd (78): spinup 6
[  290.608362] mdcmd (79): spinup 7
[  290.608386] mdcmd (80): spinup 29
[  304.913977] nfsd: last server has exited, flushing export cache

Share this post


Link to post
1 minute ago, Benson said:

BTW, dmesg woulg help ?

 

/var/log/syslog would help.

Share this post


Link to post

Attach, thanks

 

Will try safe mode have problem or not

 

 

syslog

Edited by Benson

Share this post


Link to post

Try in safe mode, same problem found

 

In safe mode

 

spin down OK

spin up OK

spin down OK

reboot ...... same problem

 

Mar  3 18:36:00 X370 emhttpd: shcmd (149): /etc/rc.d/rc.samba stop
Mar  3 18:36:00 X370 emhttpd: shcmd (150): rm -f /etc/avahi/services/smb.service
Mar  3 18:36:00 X370 avahi-daemon[2088]: Files changed, reloading.
Mar  3 18:36:00 X370 avahi-daemon[2088]: Service group file /services/smb.service vanished, removing services.
Mar  3 18:36:00 X370 emhttpd: shcmd (152): /etc/rc.d/rc.nfsd stop
Mar  3 18:36:00 X370 rpc.mountd[2073]: Caught signal 15, un-registering and exiting.
Mar  3 18:36:01 X370 emhttpd: Stopping mover...
Mar  3 18:36:01 X370 emhttpd: shcmd (154): /usr/local/sbin/mover stop
Mar  3 18:36:01 X370 root: mover: not running
Mar  3 18:36:01 X370 emhttpd: Sync filesystems...
Mar  3 18:36:01 X370 emhttpd: shcmd (155): sync
Mar  3 18:36:01 X370 kernel: nfsd: last server has exited, flushing export cache
Mar  3 18:36:02 X370 emhttpd: shcmd (156): umount /mnt/user
Mar  3 18:36:02 X370 emhttpd: shcmd (157): rmdir /mnt/user
Mar  3 18:36:02 X370 emhttpd: shcmd (158): rm -f /boot/config/plugins/dynamix/mover.cron
Mar  3 18:36:02 X370 emhttpd: shcmd (159): /usr/local/sbin/update_cron
Mar  3 18:36:02 X370 emhttpd: Unmounting disks...
Mar  3 18:36:02 X370 emhttpd: shcmd (160): umount /mnt/disk1
Mar  3 18:36:02 X370 kernel: ------------[ cut here ]------------
Mar  3 18:36:02 X370 kernel: WARNING: CPU: 2 PID: 0 at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x320/0x36b
Mar  3 18:36:02 X370 kernel: Modules linked in: nfsd lockd grace sunrpc md_mod igb i2c_algo_bit ixgbe ptp pps_core mdio edac_mce_amd kvm mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel ahci libahci aes_x86_64 raid_class crypto_simd ccp scsi_transport_sas i2c_piix4 glue_helper mxm_wmi i2c_core wmi_bmof nvme cryptd nvme_core wmi button acpi_cpufreq [last unloaded: pps_core]
Mar  3 18:36:02 X370 kernel: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.23-unRAID #1
Mar  3 18:36:02 X370 kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 3803 01/22/2018
Mar  3 18:36:02 X370 kernel: task: ffff8807fb652a00 task.stack: ffffc900031b4000
Mar  3 18:36:02 X370 kernel: RIP: 0010:rcu_process_callbacks+0x320/0x36b
Mar  3 18:36:02 X370 kernel: RSP: 0018:ffff88081ec83f18 EFLAGS: 00010002
Mar  3 18:36:02 X370 kernel: RAX: ffffffffffffd800 RBX: ffff88081eca14c0 RCX: 000000000066c601
Mar  3 18:36:02 X370 kernel: RDX: 0000000000000003 RSI: ffff88081ec83f20 RDI: ffff88081eca14f8
Mar  3 18:36:02 X370 kernel: RBP: ffffffff81c399c0 R08: 0000000000024080 R09: ffffffff8108c8f5
Mar  3 18:36:02 X370 kernel: R10: ffffea001e689000 R11: 0000000000000020 R12: ffff88081eca14f8
Mar  3 18:36:02 X370 kernel: R13: 7fffffffffffffff R14: 0000000000000246 R15: fffffffffffffffc
Mar  3 18:36:02 X370 kernel: FS:  0000000000000000(0000) GS:ffff88081ec80000(0000) knlGS:0000000000000000
Mar  3 18:36:02 X370 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  3 18:36:02 X370 kernel: CR2: 000000000072d688 CR3: 00000007f8e04000 CR4: 00000000003406e0
Mar  3 18:36:02 X370 kernel: Call Trace:
Mar  3 18:36:02 X370 kernel: <IRQ>
Mar  3 18:36:02 X370 kernel: ? hrtimer_forward+0x74/0x7c
Mar  3 18:36:02 X370 kernel: __do_softirq+0xcd/0x1c2
Mar  3 18:36:02 X370 kernel: irq_exit+0x4f/0x8e
Mar  3 18:36:02 X370 kernel: smp_apic_timer_interrupt+0x7a/0x85
Mar  3 18:36:02 X370 kernel: apic_timer_interrupt+0x7d/0x90
Mar  3 18:36:02 X370 kernel: </IRQ>
Mar  3 18:36:02 X370 kernel: RIP: 0010:native_safe_halt+0x2/0x3
Mar  3 18:36:02 X370 kernel: RSP: 0018:ffffc900031b7f20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Mar  3 18:36:02 X370 kernel: RAX: ffffffff815dd668 RBX: 0000000000000000 RCX: 0100000000000000
Mar  3 18:36:02 X370 kernel: RDX: 0000000000016e32 RSI: 0000000000000000 RDI: 0000000000000000
Mar  3 18:36:02 X370 kernel: RBP: 0000000000000000 R08: 0000008621b25fdc R09: 000000000000ff00
Mar  3 18:36:02 X370 kernel: R10: ffffc900031b7eb0 R11: ffff88081efc0000 R12: 0000000000000000
Mar  3 18:36:02 X370 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar  3 18:36:02 X370 kernel: ? ldsem_down_write+0x1b4/0x1b4
Mar  3 18:36:02 X370 kernel: default_idle+0x5/0x8
Mar  3 18:36:02 X370 kernel: do_idle+0xbe/0x179
Mar  3 18:36:02 X370 kernel: cpu_startup_entry+0x18/0x1a
Mar  3 18:36:02 X370 kernel: secondary_startup_64+0xa5/0xb0
Mar  3 18:36:02 X370 kernel: Code: a8 00 00 00 eb 13 48 2b 05 eb f3 ba 00 48 39 c2 7d 07 48 89 93 90 00 00 00 48 83 7b 38 00 0f 94 c1 48 85 d2 0f 94 c0 38 c1 74 02 <0f> 0b 4c 89 f7 57 9d 0f 1f 44 00 00 4c 89 e7 e8 1f 0e 00 00 84
Mar  3 18:36:02 X370 kernel: ---[ end trace 6df911811ecdaf58 ]---
Mar  3 18:38:49 X370 sshd[3099]: Accepted password for root from 192.168.9.181 port 64862 ssh2
 

Edited by Benson

Share this post


Link to post

Actually say, this not relate shutdown/reboot.

Once disk spin down, it can't stop array (unmount failed) ....

 

Will return back some hrs later, may try more test.

Share this post


Link to post
9 minutes ago, Benson said:

Actually say, this not relate shutdown/reboot.

Once disk spin down, it can't stop array (unmount failed) ....

 

Will return back some hrs later, may try more test.

 

Do you have a CLI session or something else open which may prevent the unmounting?

Share this post


Link to post

I wonder if this is related to a new call trace I was getting, which would knock network shares offline and prohibit shutdown:

 

Mar  2 19:34:54 Brahms1 kernel: md: sync done. time=42513sec
Mar  2 19:34:54 Brahms1 kernel: md: recovery thread: completion status: 0
Mar  2 19:49:55 Brahms1 kernel: mdcmd (50): spindown 0
Mar  2 19:49:56 Brahms1 kernel: mdcmd (51): spindown 6
Mar  2 19:56:10 Brahms1 emhttpd: req (3): csrf_token=****************&title=System+Log&cmd=%2FwebGui%2Fscripts%2Ftail_log&arg1=syslog
Mar  2 19:56:10 Brahms1 emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Mar  2 20:00:16 Brahms1 kernel: ------------[ cut here ]------------
Mar  2 20:00:16 Brahms1 kernel: WARNING: CPU: 5 PID: 0 at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x320/0x36b
Mar  2 20:00:16 Brahms1 kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables vhost_net tun vhost tap veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs md_mod ipmi_devintf bonding tg3 ipmi_ssif intel_powerclamp coretemp kvm_intel kvm mpt3sas crc32c_intel igb intel_cstate raid_class intel_uncore scsi_transport_sas ptp pps_core i2c_i801 ata_piix i2c_algo_bit i2c_core button ipmi_si acpi_cpufreq [last unloaded: tg3]
Mar  2 20:00:16 Brahms1 kernel: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.14.23-unRAID #1
Mar  2 20:00:16 Brahms1 kernel: Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26    07/01/2013
Mar  2 20:00:16 Brahms1 kernel: task: ffff88013ab6bfc0 task.stack: ffffc900006b4000
Mar  2 20:00:16 Brahms1 kernel: RIP: 0010:rcu_process_callbacks+0x320/0x36b
Mar  2 20:00:16 Brahms1 kernel: RSP: 0018:ffff88013fd43f18 EFLAGS: 00010002
Mar  2 20:00:16 Brahms1 kernel: RAX: ffffffffffffd800 RBX: ffff88013fd614c0 RCX: 00000000313f7a01
Mar  2 20:00:16 Brahms1 kernel: RDX: 0000000000000001 RSI: ffff88013fd43f20 RDI: ffff88013fd614f8
Mar  2 20:00:16 Brahms1 kernel: RBP: ffffffff81c399c0 R08: 0000000000024080 R09: ffffffff8108c8f5
Mar  2 20:00:16 Brahms1 kernel: R10: ffffea000278e600 R11: 0000000000000020 R12: ffff88013fd614f8
Mar  2 20:00:16 Brahms1 kernel: R13: 7fffffffffffffff R14: 0000000000000246 R15: fffffffffffffffe
Mar  2 20:00:16 Brahms1 kernel: FS:  0000000000000000(0000) GS:ffff88013fd40000(0000) knlGS:0000000000000000
Mar  2 20:00:16 Brahms1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  2 20:00:16 Brahms1 kernel: CR2: 0000149d89457000 CR3: 0000000001c0a000 CR4: 00000000000026e0
Mar  2 20:00:16 Brahms1 kernel: Call Trace:
Mar  2 20:00:16 Brahms1 kernel: <IRQ>
Mar  2 20:00:16 Brahms1 kernel: ? hrtimer_forward+0x74/0x7c
Mar  2 20:00:16 Brahms1 kernel: __do_softirq+0xcd/0x1c2
Mar  2 20:00:16 Brahms1 kernel: irq_exit+0x4f/0x8e
Mar  2 20:00:16 Brahms1 kernel: smp_apic_timer_interrupt+0x7a/0x85
Mar  2 20:00:16 Brahms1 kernel: apic_timer_interrupt+0x7d/0x90
Mar  2 20:00:16 Brahms1 kernel: </IRQ>
Mar  2 20:00:16 Brahms1 kernel: RIP: 0010:cpuidle_enter_state+0xe0/0x135
Mar  2 20:00:16 Brahms1 kernel: RSP: 0018:ffffc900006b7ef8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Mar  2 20:00:16 Brahms1 kernel: RAX: ffff88013fd60940 RBX: 0000000000000000 RCX: 000000000000001f
Mar  2 20:00:16 Brahms1 kernel: RDX: 0000283bad811e98 RSI: 0000000000020140 RDI: 0000000000000000
Mar  2 20:00:16 Brahms1 kernel: RBP: ffff88013fd68800 R08: 00007081f2485c1f R09: 0000000000000001
Mar  2 20:00:16 Brahms1 kernel: R10: ffffc900006b7ed8 R11: 00000000002f9418 R12: 0000000000000003
Mar  2 20:00:16 Brahms1 kernel: R13: 0000283bad811e98 R14: ffffffff81c591f8 R15: 0000283bad71f531
Mar  2 20:00:16 Brahms1 kernel: ? cpuidle_enter_state+0xbb/0x135
Mar  2 20:00:16 Brahms1 kernel: do_idle+0x11a/0x179
Mar  2 20:00:16 Brahms1 kernel: cpu_startup_entry+0x18/0x1a
Mar  2 20:00:16 Brahms1 kernel: secondary_startup_64+0xa5/0xb0
Mar  2 20:00:16 Brahms1 kernel: Code: a8 00 00 00 eb 13 48 2b 05 eb f3 ba 00 48 39 c2 7d 07 48 89 93 90 00 00 00 48 83 7b 38 00 0f 94 c1 48 85 d2 0f 94 c0 38 c1 74 02 <0f> 0b 4c 89 f7 57 9d 66 66 90 66 90 4c 89 e7 e8 1f 0e 00 00 84 
Mar  2 20:00:16 Brahms1 kernel: ---[ end trace e13abcfd14c19384 ]---
Mar  2 21:00:02 Brahms1 root: /etc/libvirt: 920.5 MiB (965189632 bytes) trimmed
Mar  2 21:00:02 Brahms1 root: /var/lib/docker: 16.5 GiB (17667182592 bytes) trimmed
Mar  2 21:00:02 Brahms1 root: /mnt/cache: 377.7 GiB (405560147968 bytes) trimmed
Mar  3 00:00:02 Brahms1 Plugin Auto Update: Checking for available plugin updates
Mar  3 00:00:05 Brahms1 Plugin Auto Update: ca.update.applications.plg version 2018.03.02 does not meet age requirements to update
Mar  3 00:00:05 Brahms1 Plugin Auto Update: ca.backup2.plg version 2018.03.02 does not meet age requirements to update
Mar  3 00:00:05 Brahms1 Plugin Auto Update: community.applications.plg version 2018.03.02 does not meet age requirements to update
Mar  3 00:00:06 Brahms1 Plugin Auto Update: Community Applications Plugin Auto Update finished

 

Share this post


Link to post
3 hours ago, 1812 said:

I wonder if this is related to a new call trace I was getting, which would knock network shares offline and prohibit shutdown:

 

Does the same thing happen when you start the system in safe mode?

Share this post


Link to post
13 hours ago, bonienl said:

 

Do you have a CLI session or something else open which may prevent the unmounting?

Sure haven't

 

 

Just return back and try some test (reset many many time) , below all can't fix the problem

- Direct I/O, on / off

- BIOS CSM, on / off

- UEFI boot mode, on / off

- upgrade to RC3

 

Some more detail during troubleshoot

 

Problem : Once spin down all or most drive, even spin up before stop array, the array will sit on waiting umount. Telnet or GUI still function. This problem would not happen in "maintenance mode"

 

Additional info.

In Telnet, if I type "umount /mnt/diskX" , then it will hang the session, but disk actually be unmount

 

Thanks

 

Giveup first, going to sleep !!!!!!!!

Edited by Benson

Share this post


Link to post

Can't reproduce this issue.  Try this:

 

Start array

Spindown

Spinup

Stop array

 

Does that sequence fail?

Share this post


Link to post

Perhaps something to do with his nvme device?

(form syslog)

Mar  3 18:09:03 X370 kernel: mdcmd (108): spinup 0
Mar  3 18:09:03 X370 kernel: mdcmd (109): spinup 1
Mar  3 18:09:03 X370 kernel: mdcmd (110): spinup 2
Mar  3 18:09:03 X370 kernel: mdcmd (111): spinup 3
Mar  3 18:09:03 X370 kernel: mdcmd (112): spinup 4
Mar  3 18:09:03 X370 kernel: mdcmd (113): spinup 5
Mar  3 18:09:03 X370 kernel: mdcmd (114): spinup 6
Mar  3 18:09:03 X370 kernel: mdcmd (115): spinup 7
Mar  3 18:09:03 X370 kernel: mdcmd (116): spinup 29
Mar  3 18:09:03 X370 root:  HDIO_DRIVE_CMD(setidle) failed: Inappropriate ioctl for device
Mar  3 18:09:03 X370 root: 
Mar  3 18:09:03 X370 root: /dev/nvme0n1:
Mar  3 18:09:03 X370 root:  setting standby to 0 (off)
Mar  3 18:09:03 X370 emhttpd: shcmd (300): exit status: 25

 

Share this post


Link to post
Just now, bonienl said:

Perhaps something to do with his nvme device?

(form syslog)


Mar  3 18:09:03 X370 kernel: mdcmd (108): spinup 0
Mar  3 18:09:03 X370 kernel: mdcmd (109): spinup 1
Mar  3 18:09:03 X370 kernel: mdcmd (110): spinup 2
Mar  3 18:09:03 X370 kernel: mdcmd (111): spinup 3
Mar  3 18:09:03 X370 kernel: mdcmd (112): spinup 4
Mar  3 18:09:03 X370 kernel: mdcmd (113): spinup 5
Mar  3 18:09:03 X370 kernel: mdcmd (114): spinup 6
Mar  3 18:09:03 X370 kernel: mdcmd (115): spinup 7
Mar  3 18:09:03 X370 kernel: mdcmd (116): spinup 29
Mar  3 18:09:03 X370 root:  HDIO_DRIVE_CMD(setidle) failed: Inappropriate ioctl for device
Mar  3 18:09:03 X370 root: 
Mar  3 18:09:03 X370 root: /dev/nvme0n1:
Mar  3 18:09:03 X370 root:  setting standby to 0 (off)
Mar  3 18:09:03 X370 emhttpd: shcmd (300): exit status: 25

 

 

Don't think so.  I think it's related to the rcu issue.  The disk1 device never unmounts.

Share this post


Link to post
6 hours ago, limetech said:

Can't reproduce this issue.  Try this:

 

Start array

Spindown

Spinup

Stop array

 

Does that sequence fail?

Yes, 100% eachtime, 3rd step Spinup won't make things different.

 

i.e.

Start array

Spindown

Stop array ..... problem happen

 

This fault 1st happen in normal automatic disk spin-down, and make me to troubleshoot.

So if I want to use 6.5 then I must set all disk never spin-down.

Just wonder no one ( ryzen only ? ) report such critical problem.

 

Haven't try RC4

Edited by Benson

Share this post


Link to post
8 hours ago, bonienl said:

 

Does the same thing happen when you start the system in safe mode?

don't know. server is running normally right now on rc3, but here is another new to me message in the logs regarding hrtimer:

 

Mar  3 08:10:58 Brahms1 kernel: mdcmd (52): spindown 1
Mar  3 08:11:27 Brahms1 kernel: mdcmd (53): spindown 2
Mar  3 08:11:29 Brahms1 kernel: mdcmd (54): spindown 3
Mar  3 08:11:30 Brahms1 kernel: mdcmd (55): spindown 4
Mar  3 08:11:30 Brahms1 kernel: mdcmd (56): spindown 5
Mar  3 08:11:59 Brahms1 kernel: mdcmd (57): spindown 6
Mar  3 08:12:01 Brahms1 kernel: mdcmd (58): spindown 0
Mar  3 08:12:01 Brahms1 kernel: mdcmd (59): spindown 7
Mar  3 10:35:53 Brahms1 kernel: hrtimer: interrupt took 44531 ns
Mar  3 19:36:09 Brahms1 kernel: mdcmd (60): spindown 7

 

Share this post


Link to post

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.