Dealing with unclean shutdowns

trurl · January 21, 2022

Sometimes the flash drive can disconnect or become read-only due to corruption, but often there will be other symptoms of these problems. Booting from a USB2 port can be more reliable.

There is a timeout after which the server will go ahead and shut down or reboot even if the array hasn't stopped. This timeout can be adjusted in Disk Settings.

Instead of shutting down or rebooting, stop the array, see how long that takes, and adjust the timeout accordingly.
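As a rough illustration of "adjust accordingly", here is a minimal sketch that turns a measured array stop time into a timeout with some headroom. The function name, the 1.5x margin and the 60 second floor are assumptions chosen for the example, not Unraid defaults.

```python
import math


def suggested_timeout(measured_stop_seconds, margin=1.5, floor_seconds=60):
    """Return a shutdown timeout (seconds) with headroom over the measured stop time.

    margin and floor_seconds are illustrative assumptions, not Unraid defaults.
    """
    if measured_stop_seconds < 0:
        raise ValueError("measured stop time cannot be negative")
    # Round up so the timeout is never shorter than the scaled measurement.
    return max(floor_seconds, math.ceil(measured_stop_seconds * margin))


# Example: the array took 100 seconds to stop when timed by hand.
print(suggested_timeout(100))
```

The value it prints is only a starting point; the number to enter in Disk Settings should still be checked against how long the array really takes to stop on a busy day.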

All of this has already been discussed in this thread.

KRiSX · February 9, 2022

Hey all, I had an unclean shutdown yesterday due to a power cut. The UPS options either didn't work or weren't configured correctly; I'm not sure yet and will be testing and fixing that soon. Upon powering the server back on, I of course got a parity check, which finished with 116 errors. I just want to check whether I need to do anything else in this situation. Looking at the syslog I see "Parity Check Tuning: automatic Non-Correcting Parity Check finished (116 errors)".

Should I run the parity check again with "Write corrections to parity" enabled, or will this be fine? I didn't select or do anything to influence the parity check and everything seems to be working. I'm just in the middle of my transfer from DrivePool, so I want to be sure I'm good to keep going with something like this occurring.
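For anyone who wants to pull the error count out of the syslog rather than read it by eye, a minimal sketch follows. The regular expression is modelled only on the single message quoted in this post, so treating it as the general format of these messages is an assumption, and `parse_finished` is a hypothetical helper name.

```python
import re

# Pattern based on the one syslog message quoted in this post (an assumption
# about the general format, not a documented one).
FINISHED = re.compile(r"(\S+) Parity Check finished \((\d+) errors\)")


def parse_finished(line):
    """Return (check_kind, error_count) for a 'finished' message, else None."""
    match = FINISHED.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


line = "Parity Check Tuning: automatic Non-Correcting Parity Check finished (116 errors)"
print(parse_finished(line))
```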

itimpi · February 9, 2022

It is normal to have a small number of errors after an unclean shutdown. They nearly always occur very near the beginning of the check.

You need to run a correcting check to get those errors fixed or you risk data corruption if a disk fails and needs recovering. The correcting check should report the same number of errors, but this time it will be fixing them. Subsequent checks should then report 0 errors (assuming no further unclean shutdowns).

KRiSX · February 9, 2022

11 minutes ago, itimpi said:

You need to run a correcting check to get those errors fixed or you risk data corruption if a disk fails and needs recovering. The correcting check should report the same number of errors, but this time it will be fixing them. Subsequent checks should then report 0 errors (assuming no further unclean shutdowns).

OK thanks, I'll trigger that off now before I move more data over then. Am I correct in saying all I need to do is hit the Check button on Main with the write corrections box ticked, or is there more to it?

Is there a way to make it do this by default in the event this happens again, so I don't have to tie up my disks for 16+ hours twice?

Edited February 9, 2022 by KRiSX

itimpi · February 9, 2022

3 minutes ago, KRiSX said:

am I correct in saying all I need to do is hit the Check button on Main with the write corrections box ticked

This is all you need to do.

JonathanM · February 10, 2022

1 hour ago, KRiSX said:

Is there a way to make it do this by default in the event this happens again, so I don't have to tie up my disks for 16+ hours twice?

There are good reasons to not automatically change data without knowing why.

If you have a data drive acting up, the last thing you want to do is blindly write to the parity drive based on what is possibly bad data. Also, bad RAM can cause random parity check errors, so if you run 2 non-correcting checks in a row and get different results, you REALLY need to get to the bottom of it before writing ANY data to the server.

If a parity check finds errors, you need to fix what caused the errors before committing the changes. In your case, having an unclean shutdown is a good reason for a small number of parity errors, so you should be safe to just correct them, instead of doing a second non-correcting check to verify the results aren't changing.
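The reasoning above can be written down as a small decision helper. This is only a sketch of the advice in this post: the function name, the returned strings and the `small_limit` cut-off for "a small number" of errors are all assumptions made for illustration.

```python
def next_step(errors, previous_errors=None, unclean_shutdown=False, small_limit=200):
    """Suggest what to do after a non-correcting parity check.

    errors: error count from the latest non-correcting check.
    previous_errors: count from an earlier non-correcting check, if one was run.
    small_limit: illustrative cut-off for "a small number" (an assumption).
    """
    if errors == 0:
        return "nothing to do"
    if previous_errors is not None and previous_errors != errors:
        # Results that change between runs point at hardware such as bad RAM.
        return "investigate hardware before writing anything"
    if unclean_shutdown and errors <= small_limit:
        # A known cause and a small count: safe to just correct.
        return "run a correcting check"
    if previous_errors is None:
        # No known cause yet: verify the result is stable first.
        return "run a second non-correcting check"
    return "find the cause, then run a correcting check"


print(next_step(116, unclean_shutdown=True))
print(next_step(116, previous_errors=98))
```

The point of keeping this manual rather than automatic is the second branch: a machine that returns different counts on two runs should not be allowed to write to parity at all.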

KRiSX · February 10, 2022

6 hours ago, JonathanM said:

There are good reasons to not automatically change data without knowing why.

If you have a data drive acting up, the last thing you want to do is blindly write to the parity drive based on what is possibly bad data. Also, bad RAM can cause random parity check errors, so if you run 2 non-correcting checks in a row and get different results, you REALLY need to get to the bottom of it before writing ANY data to the server.

If a parity check finds errors, you need to fix what caused the errors before committing the changes. In your case, having an unclean shutdown is a good reason for a small number of parity errors, so you should be safe to just correct them, instead of doing a second non-correcting check to verify the results aren't changing.

Fair points, too many variables to assume it's just safe to go ahead... Oh well, I'm 8 hours in with another 8 hours to go on the correcting check. Will logs show what/if any files are affected, or is it a case of if the files are there then you're all good? I'm assuming the latter, and right now anything on here could be re-obtained without any hassle. I just want my system healthy overall.

trurl · February 10, 2022

6 hours ago, KRiSX said:

will logs show what/if any files are affected

A correcting parity check won't affect any files. Only the parity disk is written, and parity contains none of your data. The reason you don't want to corrupt parity is so you can rebuild a data disk accurately.

If it corrects the small number of sync errors you mentioned earlier, then that is the expected result. A subsequent non-correcting parity check should then return zero errors, the only acceptable result.
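To see why only parity is written and why stale parity matters for a rebuild, here is a toy model. It assumes single parity computed as a bytewise XOR across the data disks; the byte values and names are made up for illustration and this is not Unraid's actual driver code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny "data disks" (made-up contents) and their parity.
disks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(disks)

# Rebuilding disk 1 from the other disks plus good parity recovers it exactly.
rebuilt = xor_blocks([disks[0], disks[2], parity])
print(rebuilt == disks[1])

# Parity left stale by an unclean shutdown (first byte wrong here).
stale_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
bad_rebuild = xor_blocks([disks[0], disks[2], stale_parity])
print(bad_rebuild == disks[1])

# A correcting check recomputes parity from the data; the data is untouched.
corrected_parity = xor_blocks(disks)
print(corrected_parity == parity)
```

In this model the correcting pass never touches `disks`, which is why no files are affected, while the stale parity produces a wrong rebuild of the missing disk.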

Bob@unraid · February 11, 2022

Hello everyone,

A few days ago I upgraded the CPU and mainboard, but tonight the system shut itself down uncleanly.

I really have no clue what caused the problem; normally the system should go to sleep with the Dynamix S3 Sleep plugin.

I am running the latest Version: 6.10.0-rc2

It would be great if someone could check the attached syslog file. I also added the diagnostics.

Thanks

syslog (2)

unraid-diagnostics-20220212-0019.zip

Edited February 11, 2022 by Bob@unraid

trurl · February 11, 2022

5 hours ago, Bob@unraid said:

attached syslog file

Also attach current diagnostics

kuhnamatata · February 22, 2022

I have been trying to figure out what is causing my unclean shutdowns on this server for a couple of years and finally decided to ask for help.

It typically goes offline after a few days of uptime. I have attached the failure log file along with an after-startup diagnostics zip, and would really appreciate it if anyone with more experience could take a look at them. The latest time it went offline (not accessible through the webpage or SSH), the shares were still accessible.

proto-diagnostics-20220222-1732.zip proto syslog.log

dlandon · February 23, 2022

You have a lot of call traces. This goes on and on. I also see a lot of the myservers messages like those at the top of this snippet. Someone who knows more about how to interpret this would be better able to help you.

Feb 22 01:41:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:42:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:43:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:44:10 Proto kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 22 01:44:10 Proto kernel: rcu: #01118-....: (16260291 ticks this GP) idle=e52/1/0x4000000000000000 softirq=26144979/26144982 fqs=4053314 
Feb 22 01:44:10 Proto kernel: #011(t=16260292 jiffies g=129526613 q=14094513)
Feb 22 01:44:10 Proto kernel: NMI backtrace for cpu 18
Feb 22 01:44:10 Proto kernel: CPU: 18 PID: 1458 Comm: sensors Tainted: P           O      5.14.15-Unraid #1
Feb 22 01:44:10 Proto kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Feb 22 01:44:10 Proto kernel: Call Trace:
Feb 22 01:44:10 Proto kernel: <IRQ>
Feb 22 01:44:10 Proto kernel: dump_stack_lvl+0x46/0x5a
Feb 22 01:44:10 Proto kernel: ? lapic_can_unplug_cpu+0x93/0x93
Feb 22 01:44:10 Proto kernel: nmi_cpu_backtrace+0x7d/0x8f
Feb 22 01:44:10 Proto kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Feb 22 01:44:10 Proto kernel: rcu_dump_cpu_stacks+0xc3/0xea
Feb 22 01:44:10 Proto kernel: rcu_sched_clock_irq+0x22e/0x608
Feb 22 01:44:10 Proto kernel: ? trigger_load_balance+0x204/0x28a
Feb 22 01:44:10 Proto kernel: ? tick_sched_do_timer+0x3e/0x3e
Feb 22 01:44:10 Proto kernel: update_process_times+0x8c/0xab
Feb 22 01:44:10 Proto kernel: tick_sched_timer+0x38/0x65
Feb 22 01:44:10 Proto kernel: __hrtimer_run_queues+0xfa/0x18a
Feb 22 01:44:10 Proto kernel: hrtimer_interrupt+0x92/0x160
Feb 22 01:44:10 Proto kernel: __sysvec_apic_timer_interrupt+0x99/0xdb
Feb 22 01:44:10 Proto kernel: sysvec_apic_timer_interrupt+0x61/0x7d
Feb 22 01:44:10 Proto kernel: </IRQ>
Feb 22 01:44:10 Proto kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 22 01:44:10 Proto kernel: RIP: 0010:smp_call_function_single+0xca/0xf7
Feb 22 01:44:10 Proto kernel: Code: 50 08 80 e2 01 74 04 f3 90 eb f4 83 48 08 01 4d 89 77 10 4c 89 fe 44 89 e7 4d 89 6f 18 e8 a2 fe ff ff 85 db 74 0d 41 8b 57 08 <80> e2 01 74 04 f3 90 eb f3 48 8b 54 24 38 65 48 2b 14 25 28 00 00
Feb 22 01:44:10 Proto kernel: RSP: 0018:ffffc90020fa7cc0 EFLAGS: 00000202
Feb 22 01:44:10 Proto kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Feb 22 01:44:10 Proto kernel: RDX: 0000000000000011 RSI: ffffc90020fa7cc0 RDI: 0000000000000009
Feb 22 01:44:10 Proto kernel: RBP: ffffc90020fa7d28 R08: 0000000000000009 R09: ffff88810bd56180
Feb 22 01:44:10 Proto kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000009
Feb 22 01:44:10 Proto kernel: R13: ffffc90020fa7d38 R14: ffffffff813e5fa1 R15: ffffc90020fa7cc0
Feb 22 01:44:10 Proto kernel: ? pldmfw_flash_image+0x7fe/0x7fe
Feb 22 01:44:10 Proto kernel: ? pldmfw_flash_image+0x7fe/0x7fe
Feb 22 01:44:10 Proto kernel: rdmsr_on_cpu+0x48/0x71
Feb 22 01:44:10 Proto kernel: show_temp+0x68/0xc3 [coretemp]
Feb 22 01:44:10 Proto kernel: dev_attr_show+0x20/0x42
Feb 22 01:44:10 Proto kernel: sysfs_kf_seq_show+0x75/0xc0
Feb 22 01:44:10 Proto kernel: seq_read_iter+0x156/0x347
Feb 22 01:44:10 Proto kernel: new_sync_read+0x7c/0xaf
Feb 22 01:44:10 Proto kernel: vfs_read+0xc6/0x108
Feb 22 01:44:10 Proto kernel: ksys_read+0x76/0xbe
Feb 22 01:44:10 Proto kernel: do_syscall_64+0x83/0xa5
Feb 22 01:44:10 Proto kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb 22 01:44:10 Proto kernel: RIP: 0033:0x146f4ac243ce
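With a syslog this long it can help to summarize it instead of scrolling. The sketch below counts RCU stall reports and tallies which kernel modules show up in the trace frames. The patterns are written against the lines pasted above, so how well they generalize to other kernels is an assumption, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the excerpt above (assumed, not an official log format).
STALL = re.compile(r"rcu: INFO: \w+ self-detected stall on CPU")
MODULE_FRAME = re.compile(r"kernel: \S+\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def summarize(lines):
    """Return (stall_count, Counter of module names seen in trace frames)."""
    stalls = 0
    modules = Counter()
    for line in lines:
        if STALL.search(line):
            stalls += 1
        frame = MODULE_FRAME.search(line)
        if frame:
            modules[frame.group(1)] += 1
    return stalls, modules


sample = [
    "Feb 22 01:44:10 Proto kernel: rcu: INFO: rcu_sched self-detected stall on CPU",
    "Feb 22 01:44:10 Proto kernel: rdmsr_on_cpu+0x48/0x71",
    "Feb 22 01:44:10 Proto kernel: show_temp+0x68/0xc3 [coretemp]",
]
stalls, modules = summarize(sample)
print(stalls, dict(modules))
```

To run it on a real file, pass an open file object, for example `summarize(open("syslog"))` with whatever path the syslog was saved to.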

kuhnamatata · February 23, 2022

2 hours ago, dlandon said:

You have a lot of call traces. This goes on and on. I also see a lot of the myservers messages like those at the top of this snippet. Someone who knows more about how to interpret this would be better able to help you.

Thanks for taking the time to look at the files

kuhnamatata · February 23, 2022

Found some info on the flash backup "adding task" messages. I doubt that's what is causing my server to go offline, but I'll try it.

ljm42 · February 23, 2022

Yeah let's clean up the flash backup messages. Run the commands a bit further up in the conversation, starting here:

https://forums.unraid.net/topic/112745-stop-useless-backups/?tab=comments#comment-1026714

If the `git show` command throws an error message then we'll go a different direction from that thread.

But I agree, these are a symptom, not the cause.

kuhnamatata · February 24, 2022

Ran "git show" and it displayed config changes from this morning's nextcloud and mariaDB update

ljm42 · February 24, 2022

2 hours ago, kuhnamatata said:

Ran "git show" and it displayed config changes from this morning's nextcloud and mariaDB update

OK so no repeated backups of the same file that would be cause for concern? In that case it sounds like files are legitimately being changed and backed up, so no more to worry about related to flash backup.

kuhnamatata · February 24, 2022

This is what I got when I ran the git show command:

commit f6bb1dd183b5c22ddf2f930e44ea0576c348eba5 (HEAD -> master, origin/master)
Author: gitbot <[email protected]>
Date: Mon Dec 27 14:08:03 2021 -0500

Config change

diff --git a/config/plugins/fix.common.problems.plg b/config/plugins/fix.common.problems.plg
index 10d8c63..0536655 100644
--- a/config/plugins/fix.common.problems.plg
+++ b/config/plugins/fix.common.problems.plg
@@ -2,8 +2,8 @@
<!DOCTYPE PLUGIN [
<!ENTITY name "fix.common.problems">
<!ENTITY author "Andrew Zawadzki">
-<!ENTITY version "2021.08.05">
-<!ENTITY md5 "28271a759e6b795e4595ed77d22a18eb">
+<!ENTITY version "2021.12.26">
+<!ENTITY md5 "d4f455460e7d5cf64dfe6cf2598c07ce">
<!ENTITY launch "Settings/FixProblems">
<!ENTITY plugdir "/usr/local/emhttp/plugins/&name;">
<!ENTITY github "Squidly271/fix.common.problems">
@@ -11,6 +11,9 @@
]>
<PLUGIN name="&name;" author="&author;" version="&version;" launch="&launch;" pluginURL="&pluginURL;" icon="warning" min="6.7.0" support="http://lime-technology.com/forum/index.php?topic=48972.0">
<CHANGES>
+###2021.12.26
+- Check for blank or invalid characters within TLD
+
###2021.08.05
- Remove Scaling governor tests - not relevant anymore

LTech · March 9, 2022

Hi, today after starting my Unraid server I got an Unclean Shutdown error, but I don't know why. Before that, I used the normal shutdown button in the WebGUI, and everything else should have been shut down (like VMs, Docker, etc.).

I even have a UPS to prevent that.

I hope you can help me identify the problem.

unraid-vault-diagnostics-20220308-1610.zip

Squid · March 9, 2022

Hi there.

What is this pool named cache_zfs_test? Is it actually a ZFS pool? If so, it should not be mounted within /mnt. Use /mnt/disks instead. At first glance it appears that the ZFS plugin wasn't properly unmounting it, possibly because everything within /mnt should be reserved for OS-managed devices only (hence use /mnt/disks/... instead).

LTech · March 9, 2022

Hi, no, this pool is not a ZFS pool; the name is just a leftover from some testing I did with ZFS. I still have the ZFS plugin installed, but it is just a normal cache pool with 4 drives in RAID 10. This should not pose a problem as far as I know. At least it hasn't so far.

MattB425 · October 8, 2022

I've been having on and off problems with unclean shutdowns and I have no idea why. Last one caused over 1,000 parity errors so I'm assuming a disk is going bad. How do I check this?

Any help would be appreciated.

mainframe-diagnostics-20221008-0751.zip syslog

dlandon · October 8, 2022

7 minutes ago, MattB425 said:

I've been having on and off problems with unclean shutdowns and I have no idea why. Last one caused over 1,000 parity errors so I'm assuming a disk is going bad. How do I check this?

Any help would be appreciated.

mainframe-diagnostics-20221008-0751.zip syslog

Remove the following lines from your go file. Unraid is now handling this for you.

modprobe i915
chmod -R 777 /dev/dri

Reboot.

MattB425 · October 8, 2022

7 hours ago, dlandon said:
Remove the following lines from your go file. Unraid is now handling this for you.
modprobe i915
chmod -R 777 /dev/dri
Reboot.

Done. Hoping that helps. Thank you very much for your help.

Edit: Unfortunately I'm still getting crashes/unclean shutdowns and don't know why.

Edited October 8, 2022 by MattB425

harley dmello · October 18, 2022

The unclean shutdown is required when you want to delete the history of the forum?
