Dealing with unclean shutdowns


Sometimes the flash drive can disconnect or become read-only due to corruption, but there will often be other symptoms of these problems. Booting from a USB 2 port can be more reliable.

 

There is a timeout after which the server will go ahead and shut down or reboot even if the array hasn't stopped. This timeout can be adjusted in Disk Settings.

 

Instead of shutting down or rebooting, stop the array, see how long that takes, and adjust the timeout accordingly.
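As a rough illustration of turning a measured stop time into a timeout, here is a minimal sketch. The function name, the 1.5x safety margin and the rounding step are assumptions made for the example; they are not Unraid defaults or official recommendations.

```python
import math


def suggested_timeout(measured_stop_seconds: float, margin: float = 1.5, step: int = 10) -> int:
    """Return a shutdown timeout (in seconds) with headroom over the measured stop time.

    `margin` and `step` are illustrative assumptions, not values taken from Unraid.
    """
    if measured_stop_seconds < 0:
        raise ValueError("measured stop time cannot be negative")
    padded = measured_stop_seconds * margin
    # Round up to the next multiple of `step` so the value is easy to enter by hand.
    return int(math.ceil(padded / step) * step)


# Example: the array took 140 seconds to stop when stopped manually.
print(suggested_timeout(140))
```

Pick the margin to suit how much your stop time varies, for example when Docker containers or VMs are slow to exit.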

 

All of this has already been discussed in this thread.

Link to comment
  • 3 weeks later...

Hey all, I had an unclean shutdown yesterday due to a power cut. The UPS options either didn't work or weren't configured correctly; I'm not sure yet and will be testing and fixing that soon. On powering the server back on, the unclean shutdown was detected and a parity check ran, which resulted in 116 errors. I just want to check whether I need to do anything else in this situation. Looking at the syslog I see "Parity Check Tuning: automatic Non-Correcting Parity Check finished (116 errors)".

 

Should I run the parity check again with "Write corrections to parity" enabled, or will this be fine? I didn't select or do anything to influence the parity check and everything seems to be working. I'm just in the middle of my transfer from DrivePool, so I want to be sure I'm good to keep going with something like this occurring.

Link to comment

It is normal to have a small number of errors after an unclean shutdown. They nearly always occur very near the beginning of the check.
 

You need to run a correcting check to get those errors fixed or you risk data corruption if a disk fails and needs recovering.  The correcting check should report the same number of errors, but this time it will be fixing them.   Subsequent checks should then report 0 errors (assuming no further unclean shutdowns).
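To compare the error counts of successive checks without reading the syslog by eye, a small parser along these lines could help. It is only a sketch: the pattern is modeled on the single syslog line quoted earlier in this thread, and other Unraid or plugin versions may word the message differently.

```python
import re

# Modeled on the one line quoted in this thread (an assumption about the
# general format): "... <mode> Parity Check finished (<n> errors)".
PATTERN = re.compile(
    r"(?P<mode>[\w-]+) Parity Check finished \((?P<errors>\d+) errors?\)"
)


def parse_parity_result(line):
    """Return (mode, error_count) for a matching syslog line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("mode"), int(match.group("errors"))


line = "Parity Check Tuning: automatic Non-Correcting Parity Check finished (116 errors)"
print(parse_parity_result(line))
```

After the correcting check, the expectation described above is the same count once more, and then 0 on the next check.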

Link to comment
11 minutes ago, itimpi said:

You need to run a correcting check to get those errors fixed or you risk data corruption if a disk fails and needs recovering.  The correcting check should report the same number of errors, but this time it will be fixing them.   Subsequent checks should then report 0 errors (assuming no further unclean shutdowns).

 

OK thanks, I'll kick that off now before I move more data over. Am I correct in saying all I need to do is hit the Check button on Main with the write corrections box ticked, or is there more to it?

 

is there a way to make it do this by default in the event this happens again? so i don't have to tie up my disks for 16+ hours twice

Edited by KRiSX
Link to comment
1 hour ago, KRiSX said:

 

is there a way to make it do this by default in the event this happens again? so i don't have to tie up my disks for 16+ hours twice

There are good reasons to not automatically change data without knowing why.

 

If you have a data drive acting up, the last thing you want to do is blindly write to the parity drive based on what is possibly bad data. Also, bad RAM can cause random parity check errors, so if you run 2 non-correcting checks in a row and get different results, you REALLY need to get to the bottom of it before writing ANY data to the server.

 

If a parity check finds errors, you need to fix what caused the errors before committing the changes. In your case, having an unclean shutdown is a good reason for a small number of parity errors, so you should be safe to just correct them, instead of doing a second non-correcting check to verify the results aren't changing.
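The reasoning above can be written down as a small decision helper. This is only a sketch of the advice in this post, with hypothetical names; whether an error count is "small" enough to be explained by an unclean shutdown is still a judgement call left to the caller.

```python
def next_step(first_errors, second_errors=None, unclean_shutdown=False):
    """Suggest what to do after one or two non-correcting parity checks.

    first_errors / second_errors: error counts from non-correcting checks
    (second_errors is None if only one check was run).
    unclean_shutdown: True if a recent unclean shutdown plausibly explains
    a small number of errors (caller's judgement, not computed here).
    """
    if first_errors == 0 and second_errors in (None, 0):
        return "nothing to correct"
    if second_errors is not None and second_errors != first_errors:
        # Changing results point at something like bad RAM or a failing disk.
        return "investigate hardware before writing anything"
    if unclean_shutdown:
        return "run a correcting check"
    return "find the cause before correcting"


print(next_step(116, unclean_shutdown=True))
print(next_step(116, second_errors=97))
```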

Link to comment
6 hours ago, JonathanM said:

There are good reasons to not automatically change data without knowing why.

 

If you have a data drive acting up, the last thing you want to do is blindly write to the parity drive based on what is possibly bad data. Also, bad RAM can cause random parity check errors, so if you run 2 non-correcting checks in a row and get different results, you REALLY need to get to the bottom of it before writing ANY data to the server.

 

If a parity check finds errors, you need to fix what caused the errors before committing the changes. In your case, having an unclean shutdown is a good reason for a small number of parity errors, so you should be safe to just correct them, instead of doing a second non-correcting check to verify the results aren't changing.

 

Fair points, too many variables to assume it's just safe to go ahead... Oh well, I'm 8 hours in with another 8 hours to go on the correcting check. Will the logs show which files, if any, are affected, or is it a case of if the files are there then you're all good? I'm assuming the latter, and right now anything on here could be re-obtained without any hassle; I just want my system healthy overall.

Link to comment
6 hours ago, KRiSX said:

will logs show what/if any files are affected

A correcting parity check won't affect any files. Only the parity disk is written, and parity contains none of your data. The reason you don't want to corrupt parity is so you can rebuild a data disk accurately.

 

If it corrects the small number of sync errors you mentioned earlier then that is the expected result. Then a non-correcting parity check should return zero, the only acceptable result.
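To see why stale parity matters only when a disk has to be rebuilt, here is a toy model. It assumes single parity behaves like a byte-wise XOR across the data disks; it is an illustration with made-up three-byte "disks", not Unraid's actual implementation.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny made-up "data disks".
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
disk3 = bytes([0x00, 0xFF, 0x10])

parity = xor_parity([disk1, disk2, disk3])

# With correct parity, a "failed" disk2 is rebuilt from the other disks plus parity.
rebuilt = xor_parity([disk1, disk3, parity])
print(rebuilt == disk2)

# With stale parity (one flipped bit, standing in for a sync error),
# the same rebuild silently produces wrong data.
stale_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
bad_rebuild = xor_parity([disk1, disk3, stale_parity])
print(bad_rebuild == disk2)
```

The data disks themselves are never touched in either case, which matches the point above: a correcting check only rewrites parity.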

Link to comment

Hello everyone,

 

A few days ago I upgraded the CPU and mainboard, but tonight the system shut itself down uncleanly.

 

I really have no clue what caused the problem; normally the system should go to sleep with the Dynamix S3 Sleep plugin.

 

I am running the latest version: 6.10.0-rc2.

 

It would be great if someone could check the attached syslog file. I have also added the diagnostics.

 

Thanks

syslog (2)

unraid-diagnostics-20220212-0019.zip

Edited by Bob@unraid
Link to comment
  • 2 weeks later...

I have been trying to figure out what is causing my unclean shutdowns on this server for a couple of years and finally decided to ask for help.

 

It typically goes offline after a few days of uptime. I have attached the failure log file along with a diagnostics zip taken after startup, and would really appreciate it if anyone with more experience could take a look at them. The latest time it went offline (not accessible through the web page or SSH), the shares were still accessible.

proto-diagnostics-20220222-1732.zip proto syslog.log

Link to comment

You have a lot of call traces, and they go on and on. I also see a lot of My Servers messages like those at the top of this snippet. Someone who knows more about how to interpret this would be better placed to help you.

 

Feb 22 01:41:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:42:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:43:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:44:10 Proto kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 22 01:44:10 Proto kernel: rcu: #01118-....: (16260291 ticks this GP) idle=e52/1/0x4000000000000000 softirq=26144979/26144982 fqs=4053314 
Feb 22 01:44:10 Proto kernel: #011(t=16260292 jiffies g=129526613 q=14094513)
Feb 22 01:44:10 Proto kernel: NMI backtrace for cpu 18
Feb 22 01:44:10 Proto kernel: CPU: 18 PID: 1458 Comm: sensors Tainted: P           O      5.14.15-Unraid #1
Feb 22 01:44:10 Proto kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Feb 22 01:44:10 Proto kernel: Call Trace:
Feb 22 01:44:10 Proto kernel: <IRQ>
Feb 22 01:44:10 Proto kernel: dump_stack_lvl+0x46/0x5a
Feb 22 01:44:10 Proto kernel: ? lapic_can_unplug_cpu+0x93/0x93
Feb 22 01:44:10 Proto kernel: nmi_cpu_backtrace+0x7d/0x8f
Feb 22 01:44:10 Proto kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Feb 22 01:44:10 Proto kernel: rcu_dump_cpu_stacks+0xc3/0xea
Feb 22 01:44:10 Proto kernel: rcu_sched_clock_irq+0x22e/0x608
Feb 22 01:44:10 Proto kernel: ? trigger_load_balance+0x204/0x28a
Feb 22 01:44:10 Proto kernel: ? tick_sched_do_timer+0x3e/0x3e
Feb 22 01:44:10 Proto kernel: update_process_times+0x8c/0xab
Feb 22 01:44:10 Proto kernel: tick_sched_timer+0x38/0x65
Feb 22 01:44:10 Proto kernel: __hrtimer_run_queues+0xfa/0x18a
Feb 22 01:44:10 Proto kernel: hrtimer_interrupt+0x92/0x160
Feb 22 01:44:10 Proto kernel: __sysvec_apic_timer_interrupt+0x99/0xdb
Feb 22 01:44:10 Proto kernel: sysvec_apic_timer_interrupt+0x61/0x7d
Feb 22 01:44:10 Proto kernel: </IRQ>
Feb 22 01:44:10 Proto kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 22 01:44:10 Proto kernel: RIP: 0010:smp_call_function_single+0xca/0xf7
Feb 22 01:44:10 Proto kernel: Code: 50 08 80 e2 01 74 04 f3 90 eb f4 83 48 08 01 4d 89 77 10 4c 89 fe 44 89 e7 4d 89 6f 18 e8 a2 fe ff ff 85 db 74 0d 41 8b 57 08 <80> e2 01 74 04 f3 90 eb f3 48 8b 54 24 38 65 48 2b 14 25 28 00 00
Feb 22 01:44:10 Proto kernel: RSP: 0018:ffffc90020fa7cc0 EFLAGS: 00000202
Feb 22 01:44:10 Proto kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Feb 22 01:44:10 Proto kernel: RDX: 0000000000000011 RSI: ffffc90020fa7cc0 RDI: 0000000000000009
Feb 22 01:44:10 Proto kernel: RBP: ffffc90020fa7d28 R08: 0000000000000009 R09: ffff88810bd56180
Feb 22 01:44:10 Proto kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000009
Feb 22 01:44:10 Proto kernel: R13: ffffc90020fa7d38 R14: ffffffff813e5fa1 R15: ffffc90020fa7cc0
Feb 22 01:44:10 Proto kernel: ? pldmfw_flash_image+0x7fe/0x7fe
Feb 22 01:44:10 Proto kernel: ? pldmfw_flash_image+0x7fe/0x7fe
Feb 22 01:44:10 Proto kernel: rdmsr_on_cpu+0x48/0x71
Feb 22 01:44:10 Proto kernel: show_temp+0x68/0xc3 [coretemp]
Feb 22 01:44:10 Proto kernel: dev_attr_show+0x20/0x42
Feb 22 01:44:10 Proto kernel: sysfs_kf_seq_show+0x75/0xc0
Feb 22 01:44:10 Proto kernel: seq_read_iter+0x156/0x347
Feb 22 01:44:10 Proto kernel: new_sync_read+0x7c/0xaf
Feb 22 01:44:10 Proto kernel: vfs_read+0xc6/0x108
Feb 22 01:44:10 Proto kernel: ksys_read+0x76/0xbe
Feb 22 01:44:10 Proto kernel: do_syscall_64+0x83/0xa5
Feb 22 01:44:10 Proto kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb 22 01:44:10 Proto kernel: RIP: 0033:0x146f4ac243ce
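For a long syslog it can be easier to count these events than to scroll through them. The sketch below counts three message types that appear in the excerpt above; the marker strings are copied from that excerpt, and the function and key names are made up for the example.

```python
# Substrings taken from the log excerpt in this post.
MARKERS = {
    "rcu_stall": "rcu_sched self-detected stall on CPU",
    "call_trace": "kernel: Call Trace:",
    "flash_backup": "flash_backup: adding task:",
}


def count_markers(syslog_text):
    """Count how many lines contain each marker substring."""
    counts = {name: 0 for name in MARKERS}
    for line in syslog_text.splitlines():
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts


sample = """\
Feb 22 01:42:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:43:23 Proto flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Feb 22 01:44:10 Proto kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 22 01:44:10 Proto kernel: Call Trace:
"""
print(count_markers(sample))
```

To run it against a real log, read the file into a string first, for example with `open(path).read()`, where `path` is wherever you saved the syslog.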

 

Link to comment
2 hours ago, kuhnamatata said:

Ran "git show" and it displayed config changes from this morning's Nextcloud and MariaDB update

 

OK so no repeated backups of the same file that would be cause for concern? In that case it sounds like files are legitimately being changed and backed up, so no more to worry about related to flash backup.

Link to comment

This is what I got when I ran the git show command:

 

commit f6bb1dd183b5c22ddf2f930e44ea0576c348eba5 (HEAD -> master, origin/master)
Author: gitbot <[email protected]>
Date:   Mon Dec 27 14:08:03 2021 -0500

    Config change

diff --git a/config/plugins/fix.common.problems.plg b/config/plugins/fix.common.problems.plg
index 10d8c63..0536655 100644
--- a/config/plugins/fix.common.problems.plg
+++ b/config/plugins/fix.common.problems.plg
@@ -2,8 +2,8 @@
 <!DOCTYPE PLUGIN [
 <!ENTITY name "fix.common.problems">
 <!ENTITY author "Andrew Zawadzki">
-<!ENTITY version "2021.08.05">
-<!ENTITY md5 "28271a759e6b795e4595ed77d22a18eb">
+<!ENTITY version "2021.12.26">
+<!ENTITY md5 "d4f455460e7d5cf64dfe6cf2598c07ce">
 <!ENTITY launch "Settings/FixProblems">
 <!ENTITY plugdir "/usr/local/emhttp/plugins/&name;">
 <!ENTITY github "Squidly271/fix.common.problems">
@@ -11,6 +11,9 @@
 ]>
 <PLUGIN name="&name;" author="&author;" version="&version;" launch="&launch;" pluginURL="&pluginURL;" icon="warning" min="6.7.0" support="http://lime-technology.com/forum/index.php?topic=48972.0">
   <CHANGES>
+###2021.12.26
+- Check for blank or invalid characters within TLD
+
 ###2021.08.05
 - Remove Scaling governor tests - not relevant anymore

Link to comment
  • 2 weeks later...

Hi there. 

 

What is this pool named cache_zfs_test? Is it actually a ZFS pool? If so, it should not be mounted within /mnt. Use /mnt/disks instead. It appears at first glance that the ZFS plugin wasn't properly unmounting it, possibly because everything within /mnt should be reserved for OS-managed devices only (hence use /mnt/disks/... instead).
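One way to check for this is to look at the mount table for ZFS filesystems that sit under /mnt but outside /mnt/disks. The sketch below works on text in the standard Linux /proc/mounts format (device, mount point, filesystem type, ...); the function name and the sample pool names are invented for the example.

```python
def zfs_mounts_outside_disks(proc_mounts_text, root="/mnt", allowed="/mnt/disks"):
    """Return ZFS mount points that are under `root` but not under `allowed`."""
    offenders = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if fs_type != "zfs":
            continue
        under_root = mount_point.startswith(root + "/")
        under_allowed = mount_point == allowed or mount_point.startswith(allowed + "/")
        if under_root and not under_allowed:
            offenders.append(mount_point)
    return offenders


# Made-up sample in /proc/mounts format.
sample = (
    "tank /mnt/tank zfs rw 0 0\n"
    "pool2 /mnt/disks/pool2 zfs rw 0 0\n"
    "/dev/sdb1 /mnt/cache btrfs rw 0 0\n"
)
print(zfs_mounts_outside_disks(sample))
```

On a live system the input would come from reading /proc/mounts. An empty result means no ZFS filesystem is mounted in the reserved area.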

Link to comment

Hi, no, this pool is not a ZFS pool; it is just a remnant of some testing I did with ZFS. I still have the ZFS plugin installed, but it is just a normal cache pool with 4 drives in RAID 10. This should not pose a problem as far as I know. At least it hasn't so far.

Link to comment
  • 6 months later...
7 minutes ago, MattB425 said:

I've been having on and off problems with unclean shutdowns and I have no idea why. Last one caused over 1,000 parity errors so I'm assuming a disk is going bad. How do I check this?

 

Any help would be appreciated. 

mainframe-diagnostics-20221008-0751.zip 127.13 kB · 0 downloads syslog 614.39 kB · 0 downloads

Remove the following lines from your go file.  Unraid is now handling this for you.

modprobe i915
chmod -R 777 /dev/dri

Reboot.
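If you would rather script the edit than do it by hand, something like the following sketch would drop just those two lines. It assumes the go file lives at /boot/config/go and that the lines appear exactly as quoted above; the sample file content is only a guess at a typical go file. Back up the file before changing it.

```python
# The two lines named in this post; matched after trimming whitespace.
LINES_TO_DROP = {"modprobe i915", "chmod -R 777 /dev/dri"}


def strip_go_lines(go_text):
    """Return the go file text with the two obsolete lines removed."""
    kept = [ln for ln in go_text.splitlines() if ln.strip() not in LINES_TO_DROP]
    return "\n".join(kept) + "\n"


# Assumed example content, not taken from anyone's actual server.
sample = (
    "#!/bin/bash\n"
    "modprobe i915\n"
    "chmod -R 777 /dev/dri\n"
    "/usr/local/sbin/emhttp &\n"
)
print(strip_go_lines(sample), end="")
```

The function only transforms text; reading the file and writing the result back is left to you so nothing is modified by accident.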

Link to comment
7 hours ago, dlandon said:

Remove the following lines from your go file.  Unraid is now handling this for you.

modprobe i915
chmod -R 777 /dev/dri

Reboot.

 

Done. Hoping that helps. Thank you very much for your help. 

 

Edit: Unfortunately I'm still getting crashes/unclean shutdowns and don't know why.

Edited by MattB425
Link to comment