eribob, posted September 2, 2020 (edited September 2, 2020 by eribob)
Hi! My array had been working perfectly until today, when one of the disks was suddenly disabled due to read errors. However, the SMART report seems to indicate that the disk is healthy (every attribute shows FAILED: Never, as far as I can see). I have posted diagnostics. Should I replace the disk, or could this be some other kind of bug? I never had a warning from Unraid about the disk before today, which is strange if it was failing.
Best regards
Erik
Attachment: monsterservern-diagnostics-20200902-1225.zip
JorgeB, posted September 2, 2020
The diagnostics were taken after rebooting, so we can't see what happened, but the disk looks fine. I suggest replacing/swapping the cables first, just to rule them out, and then rebuilding on top.
eribob, posted September 2, 2020
OK, thanks! So I should replace the SATA cable and then rebuild parity?
JorgeB, posted September 2, 2020
Replace/swap both cables (or, if you are using an enclosure, swap the disk's slot with another disk), then rebuild the disabled disk, not parity.
https://wiki.unraid.net/Troubleshooting#Re-enable_the_drive
eribob, posted September 2, 2020
Ah, thank you. I will try that. /Erik
eribob, posted September 2, 2020
Hi again! I am following your instructions. I replaced the disk's data cable and removed the disk from the array. After that I re-added it, and it is now rebuilding. However, the rebuild keeps getting paused with this message:
Parity Tuning Operation: 2020-09-02 16:05 Notification unknown action: recon D1 (1.6% completed) Pause
I can resume the process when it pauses, and it runs for another couple of minutes or so before the same thing happens again. The system log also mentions the drives overheating. Is that what is causing the recon D1 pauses?
Sep 2 16:00:34 Monsterservern kernel: md: recovery thread: recon D1 ...
Sep 2 16:05:01 Monsterservern parity.check.tuning.php: Paused unknown action: recon D1 (1.6% completed) : Following drives overheated: 34 34 34 31
Sep 2 16:05:01 Monsterservern kernel: mdcmd (44): nocheck PAUSE
Sep 2 16:05:01 Monsterservern kernel:
Sep 2 16:05:02 Monsterservern kernel: md: recovery thread: exit status: -4
Sep 2 16:08:04 Monsterservern kernel: mdcmd (45): check Resume
Sep 2 16:08:04 Monsterservern kernel: md: recovery thread: recon D1 ...
Sep 2 16:10:02 Monsterservern parity.check.tuning.php: Paused unknown action: recon D1 (2.1% completed) : Following drives overheated: 34 34 34 31
Sep 2 16:10:02 Monsterservern kernel: mdcmd (46): nocheck PAUSE
Sep 2 16:10:02 Monsterservern kernel:
Sep 2 16:10:03 Monsterservern kernel: md: recovery thread: exit status: -4
Perhaps I should remove the side panels from the case and try to continue?
Attachment: monsterservern-diagnostics-20200902-1614.zip
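To see at a glance how often the rebuild was paused, at what progress, and with which temperatures, the relevant lines can be pulled out of the syslog with a short script. This is a minimal sketch that assumes the message format shown in the log excerpt above (`parity.check.tuning.php: Paused ... (N% completed) : Following drives overheated: ...`); the names `PauseEvent` and `find_tuning_pauses` are hypothetical, and the plugin's wording may differ between versions.

```python
import re
from typing import List, NamedTuple


class PauseEvent(NamedTuple):
    timestamp: str    # syslog timestamp, e.g. "Sep 2 16:05:01"
    action: str       # text after "Paused", e.g. "unknown action: recon D1"
    percent: float    # progress when the operation was paused
    temps: List[int]  # temperatures listed after "overheated:"


# Assumed format, copied from the log lines posted in this thread.
PAUSE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+parity\.check\.tuning\.php:\s+"
    r"Paused\s+(?P<action>.+?)\s+\((?P<pct>[\d.]+)% completed\)"
    r"\s*:\s*Following drives overheated:\s*(?P<temps>[\d\s]+)$"
)


def find_tuning_pauses(lines):
    """Return a PauseEvent for every temperature-related pause line."""
    events = []
    for line in lines:
        m = PAUSE_RE.search(line.strip())
        if m:
            events.append(PauseEvent(
                timestamp=m.group("ts"),
                action=m.group("action"),
                percent=float(m.group("pct")),
                temps=[int(t) for t in m.group("temps").split()],
            ))
    return events


# A few lines from the excerpt above; kernel lines are ignored.
SAMPLE = [
    "Sep 2 16:00:34 Monsterservern kernel: md: recovery thread: recon D1 ...",
    "Sep 2 16:05:01 Monsterservern parity.check.tuning.php: Paused unknown action: "
    "recon D1 (1.6% completed) : Following drives overheated: 34 34 34 31",
    "Sep 2 16:05:01 Monsterservern kernel: mdcmd (44): nocheck PAUSE",
    "Sep 2 16:10:02 Monsterservern parity.check.tuning.php: Paused unknown action: "
    "recon D1 (2.1% completed) : Following drives overheated: 34 34 34 31",
]

if __name__ == "__main__":
    for ev in find_tuning_pauses(SAMPLE):
        print(f"{ev.timestamp}: {ev.action} paused at {ev.percent}% (temps: {ev.temps})")
```

To run it against a real log, pass the lines of your syslog file instead of `SAMPLE`; the file location is an assumption that depends on the system.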
trurl, posted September 2, 2020
I am sure this behavior is being caused by the Parity Check Tuning plugin. Check its settings or remove it.
eribob, posted September 2, 2020
38 minutes ago, trurl said:
"I am sure this behavior is being caused by the Parity Check Tuning plugin. Check its settings or remove it."
Genius! It actually said so in the logs; I was too stressed to check properly. Thank you!
itimpi, posted September 2, 2020
2 hours ago, eribob said:
"Genius! It actually said so in the logs, I am too stressed to check properly. Thank you!"
The message being output by the Parity Check Tuning plugin does not look quite right. Is there any chance you could go into its settings, set the Debug logging to ‘testing’ level, reproduce the symptoms, and then post a copy of the syslog (or diagnostics, which include the syslog) so I can get more detail on exactly what is happening? After doing that, turn the level down again. My system does not suffer from drives getting too hot, so I have trouble testing all real-world scenarios relating to the temperature checks.
On the face of it, you may have the temperature thresholds set too low, but I am not sure. You can also disable the option to pause/resume based on temperature.
eribob, posted September 2, 2020
I disabled the option "pause and resume array operations if disks overheat" in the Parity Check Tuning plugin. I had the warning disk temperature at 45 °C and critical at 55 °C (I believe these are the defaults, since I can't remember ever changing them). I have now raised the warning to 50 °C and critical to 60 °C as well.
After disabling the overheat pause, the rebuild has been progressing without problems (now at 39%). So most likely it was pausing because of the temperature settings.
Since I have important data on my array and no parity protection until the rebuild is finished, I want to let the rebuild complete, so I do not want to try to reproduce the error right now. Thanks again for the quick support!