[6.9.2] SMART settings wiped occasionally

trurl · April 27, 2021

Don't know why you posted a bug report (without Diagnostics).

This seems to be more about your RAID controller (not recommended for these reasons and more) than about Unraid.

codefaux · April 27, 2021

8 hours ago, trurl said:

Don't know why you posted a bug report (without Diagnostics).

Oh, I can help with that.

1 - Because there's a bug.

2 - Because unRAID's configuration management scripts are eating perfectly formatted, correct SMART configuration data

3 - Because I don't have the Diagnostics

4 - Because I thought there might be enough information here to actually figure out something, IE there's only a (relatively) small section of code responsible for handling the SMART Settings for the disks

5 - Because my diagnostic.zip got lost, as I said, and because if it included the log directory it would've been over 128MB, as indicated by the full log partition.

I apologize that the logic I used to arrive at seeking help was simply beyond your grasp, but now you know. Had you simply said, "We can't help at all without a diagnostic.zip" I would've been dubious but accepting. I'm curious, what was your logic in replying to a post seeking help, while also not providing anything that could be mistaken as helpful? Is this what unRAID support is? Is this guy important?

8 hours ago, trurl said:

This seems to be more about your RAID controller (not recommended for these reasons and more) than about Unraid.

Could you explain to me how my RAID controller is erasing the config settings from smart_one.cfg on the USB drive which isn't connected to it?

Can you explain to me how my RAID controller is causing in-memory scripts to lose their contents?

I'm deeply curious, because I'm using what I have. I don't care what's ideal, I don't have that and I can't afford it. Unless you wish to provide me with a suitable replacement suggestion (Four SAS slots per PCIe card, ideally an HBA) maybe let someone else weigh in. I know EXCEEDINGLY WELL why using RAID controllers sucks, I'm DEALING WITH IT.

I don't care if they're recommended, they're "supported" and there's a bug with unRAID's configuration manipulation that is erasing the contents of its file. This is not "RAID controller related" any more than that I need to use the PROVIDED CONFIGURATION METHODS to adjust settings because my controller does not support direct polling of disks.

trurl · April 28, 2021

codefaux · April 28, 2021

6 hours ago, trurl said:

Thanks for the reference, ~~but it covers up to eight disks on one controller. This does not support my use case,~~ as I require a minimum of 12 disks per controller, preferably 16 -- as indicated by "Four SAS slots per PCIe card" in my post above.

---

EDIT: Upon further review, the link -clearly- has options for higher port counts, but they are well outside my price range. An SAS expander and a smaller controller may be an option, but then I fear for my bandwidth, with 30+ disks. I accept that using a RAID controller in JBOD mode is not ideal, but it's the option I can afford - these controllers were $12 each, and provide up to 16 JBOD disks apiece via four internal SAS connectors. With proper smartctl flags, they even work reasonably well with unRAID. I now have a price goal for my next upgrade, but it's definitely gonna be a while. In the meantime, I'd like to fix what I've got.

---

I'm also using SAS cabling (obviously, to those who know most of the technical terminology I've used) both internal and external to my casing, so a controller with SATA connectors simply won't cut it without me also having to buy eight 4x-SATA-host-to-SAS adapter cables.

Furthermore, I have neither the need nor the desire to change my controller and disk layout, nor do I have the finances. This worked before, it'll continue to work. In fact, while I'm mentioning it, this worked a few updates ago without kernel log flooding, so something recently changed.

Oh! While I'm at it, thank you for being helpful, but I'd ask that unless you intend to be helpful WITH MY PROBLEM, please go away. There is a bug in unRAID that I intend to see addressed or at least acknowledged. I don't need to convert from RAID controllers to a direct HBA. My RAID controllers are in JBOD mode, that's the closest I can get to ideal where I'm at.

"Buy something different" is not a support answer, it is not helpful, it is lazy and annoying and petulant. I get that it's not ideal, but there is infrastructure in place to make it work, and THAT INFRASTRUCTURE ISN'T WORKING. I'd like to get it fixed, rather than being avoidant and spending money I simply don't have, to acquire hardware I'm still unable to find in the first place. You obviously don't understand my requirements, my setup, my situation, or even the problem I'm attempting to address despite your derailing.

Please stop sidetracking my support thread. I don't need new disk controllers, I need someone to look into the issue I'm raising. Your first post was antagonistic, your second post was minimum-effort and didn't even begin to address the requirements I laid out in terminology you should understand if you're posing as a support agent on a forum for a storage-based OS. Your presence here is negative, I'm asking politely and professionally for help and you seem like a troll. All I want is for my fileserver to run properly.

Edited April 28, 2021 by codefaux
Further review, clarification

codefaux · April 28, 2021

I apologize for being/sounding like an ass. I'm dealing with a lot of crap right now (including some issues with a hardware upgrade last night, thus not being able to provide diagnostics.zip presently) and I really just wanted someone to address the problem listed above -- unRAID seems to eat some/all of the contents of an important config file at some point when changing lots of entries one at a time.

As a coder, if someone tells me "there's a bug with X" in my project, sometimes I can find it just looking through the area carefully, knowing it misbehaves. I understand that may not be the case here, and I understand internal devs may not be willing to DO that without further information. It's reasonable to say "We need the diagnostics.zip to proceed" as a dev.

I'll stabilize this stack of crap and reproduce the bug again (if I can, entropy hates me lately) and happily send the zip along. I don't intend to be unreasonable, I just anticipated any efforts to help would be toward the resolution of the bug, not toward changing everything I've got and spending money to avoid it.

SimonF · April 28, 2021

Do you need to be on 6.9.2 for the new motherboard? Which vers was it working on previously? Is it an option to downgrade to a different unraid vers in the short term?

I have looked at the smartctl code and it is still using ioctl for 3ware so not sure those messages in the log can be suppressed unless smartctl is updated by the owners, this would be outside of Limetech's control.

Current diags will be helpful, but if you can't provide them at present, are you able to post smart-one.cfg and disks.cfg?

JorgeB · April 28, 2021

40 minutes ago, codefaux said:

unRAID seems to eat some/all of the contents of an important config file at some point when changing lots of entries one at a time.

This is a known issue, there are already multiple bug reports about this, e.g.:

codefaux · April 28, 2021

1 hour ago, SimonF said:

Do you need to be on 6.9.2 for the new motherboard? Which vers was it working on previously? Is it an option to downgrade to a different unraid vers in the short term?

I have looked at the smartctl code and it is still using ioctl for 3ware so not sure those messages in the log can be suppressed unless smartctl is updated by the owners, this would be outside of Limetech's control.

Current diags will be helpful, but if you can't provide them at present, are you able to post smart-one.cfg and disks.cfg?

The new motherboard has only run 6.9.2 and up until today I was experiencing significant instability -- the same instability I upgraded motherboards to escape, much to my dismay. I may have found the issue (c-states) and a workaround (disabling them) and expect to be stable enough to provide the diagnostic.zip if this is the case -- and if it's even still required (see below, apparently it's a known issue anyway.)

I could downgrade to an older version of unRAID, especially if it will quiet the logspam because JUST WOW. I don't remember which version I was running before these issues cropped up, is there one you might recommend?

Also, for what it's worth -- my smart-one.cfg is *likely* to be absolutely useless as a diagnostic, because I wrote it myself from scratch with the bash snippet in my first post, using clever code to scan my connected controllers and match them to named entries etc etc.. If it's wrong, it's because I wrote it wrong. It seems to be working though, except for still spamming log messages every several minutes. Attaching it regardless, as well as disk.cfg -- I assume you meant disk.cfg and not disks.cfg, correct?

smart-one.cfg disk.cfg

41 minutes ago, JorgeB said:

This is a known issue, there are already multiple bug reports about this, e.g.:

Fantastic, but to be fair the single referenced bug report (of seemingly multiple bug reports, I understand) reports specifically that they could not set warning/critical temperature, not that other settings were also being reset. Perhaps someone should ask the poster or a mod to clarify the title and/or contents of that report, if it's an issue, as the title and contents directly exclude my issue due to their level of specificity and also not mentioning other settings disappearing. Thanks to my additional, seemingly extra report, the scope of the problem is a bit more clearly defined now, so I believe it has merit. Is there disagreement?

EDIT: Hey also, if this is a "known issue" -- is there some manner of note somewhere? Did I miss a "Known Issues" thread or something? This could have saved me a considerable amount of effort and time, where should I look for your Known Issues in the future? I think I just figured out another problem I'm having (as mentioned) and would love to browse your Known Issues list. Maybe in the future I won't have to post at all, because this time I looked and didn't find anyone with SMART settings clearing EXCEPT the one you linked, which I already knew about but ignored because it seemed like it could be a __different__ issue..

Edited April 28, 2021 by codefaux

SimonF · April 28, 2021

4 hours ago, codefaux said:

is there one you might recommend?

It would be 6.8.3 as I believe the issue started with 6.9, but will depend on hardware, i.e. if your new MB needs a newer kernel etc.

4 hours ago, codefaux said:

I assume you meant disk.cfg and not disks.cfg, correct?

Yes

If you went straight to 6.9.2 from previous vers, then you could look at the changes file in the previous dir as below.

root@unraid:/boot/previous# cat changes.txt
### Version 6.8.2 2020-01-26

codefaux · April 29, 2021

13 hours ago, SimonF said:

It would be 6.8.3 as I believe the issue started with 6.9, but will depend on hardware, i.e. if your new MB needs a newer kernel etc.

Yes

If you went straight to 6.9.2 from previous vers, then you could look at the changes file in the previous dir as below.

root@unraid:/boot/previous# cat changes.txt
### Version 6.8.2 2020-01-26

Thank you very much. I'm using a pair of Xeon X5670 CPUs -- nothing I have is 'new' enough to need modern support..frankly it's become clear that it's quite the opposite, in my case.

Looks like 6.8.2 lives in the /previous folder. My understanding - and I'm being paranoid-cautious about this because of how much time I've blown on this so far - is that I should back up the existing files to, say, backup-6.9.2 and then copy the files from /boot/previous into /boot and all should be well, correct? Will I need to downgrade plugins? I assume there's no settings migration required, but obviously a full-USB backup beforehand would be wise.

Am I missing anything?

SimonF · April 29, 2021

If you click on the flash drive from the Main window, you will have an option to back up the flash.

In Tools > Update OS you can revert to the previous version by clicking Restore.

The other thing you will need to do is put the cache drive back, if you have one. As part of 6.9, the cache config gets moved into the multi-pool config files.


Reverting back to 6.8.3

If you have a cache disk/pool it will be necessary to either:

restore the flash backup you created before upgrading (you did create a backup, right?), or
on your flash, copy 'config/disk.cfg.bak' to 'config/disk.cfg' (restore 6.8.3 cache assignment), or
manually re-assign storage devices assigned to cache back to cache
 

This is because to support multiple pools, code detects the upgrade to 6.9.0 and moves the 'cache' device settings out of 'config/disk.cfg' and into 'config/pools/cache.cfg'.  If you downgrade back to 6.8.3 these settings need to be restored.
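The second option above (copying 'config/disk.cfg.bak' back over 'config/disk.cfg') could be sketched in shell roughly as follows. This is a minimal sketch, not an official procedure: it assumes the flash drive is mounted at /boot, as the /boot/previous path earlier in the thread suggests, and `restore_cache_cfg` and the `.pre-downgrade` suffix are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Sketch only: restore the 6.8.3 cache assignment from the backup file
# left on the flash drive by the 6.9 upgrade.
# ASSUMPTIONS: the flash is mounted at /boot; restore_cache_cfg and the
# ".pre-downgrade" suffix are hypothetical, not part of Unraid.
restore_cache_cfg() {
    local flash="${1:-/boot}"
    local cfg="$flash/config/disk.cfg"
    local bak="$flash/config/disk.cfg.bak"

    if [ ! -f "$bak" ]; then
        echo "no $bak found; re-assign the cache device manually instead" >&2
        return 1
    fi

    # Keep a copy of the current file so the change can be undone.
    if [ -f "$cfg" ]; then
        cp "$cfg" "$cfg.pre-downgrade" || return 1
    fi

    # Put the pre-6.9 settings back in place.
    cp "$bak" "$cfg"
}
```

Called with no argument the function would act on /boot; passing a different directory makes it possible to try it against a copy of the flash contents first.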

With regard to plugins, it depends which ones you use. Have you installed any new ones like Nvidia etc, as they only work on 6.9?

codefaux · April 29, 2021

Oh, so much fantastic information, literally half an hour after I just jumped the gun and did as I said. Haha... I got home and had crashy stuff again, the whole docker macvlan thing which I've now discovered ISN'T my hardware dying slowly but is in fact a third issue I've had with the new 6.9 series, so I jumped RIGHT on reverting.

The backup/revert options would have been nice, and I probably should've seen them, but I actually did just move the files around on the flash drive. It seems to have survived.

You're right, Cache didn't re-mount after restart, and thus Docker didn't start, but I was able to figure that out pretty quickly since my Cache disk was listed next to my Docker disk in Unassigned devices. I just re-added it and started the array, nothing caught fire so far.

I WAS using the Unraid-Kernel-Helper from Community Apps --

I'm not sure if this is one of the "new ones like Nvidia etc" which you reference, but after a quick read it seems I should still be able to use it -- please let me know if I'm wrong.

Thanks!

SimonF · April 29, 2021

16 minutes ago, codefaux said:

I should still be able to use it

I know it used to support 6.8.3 but not sure if that has been deprecated now as plugins on 6.9 can install modules.

What parts did you enable in @ich777 plugin?

codefaux · April 29, 2021

2 minutes ago, SimonF said:

I know it used to support 6.8.3 but not sure if that has been deprecated now as plugins on 6.9 can install modules.

What parts did you enable in @ich777 plugin?

I actually cannot seem to find how to install the Unraid Kernel Helper template from CA again, so it seems that may be the case.

Honestly, I only remember installing the Docker container template, allowing it to compile, then copying the requisite files from the container to the boot partition. I don't think I used the plugin, or adjusted any options much.

All I needed it for was NVidia support for the Docker, though. It appears there's still a prebuilt 6.8.3 available on that page, so I'll likely just use that and call it good enough -- unless that's not a good approach.

trurl · April 29, 2021

On 4/28/2021 at 4:40 AM, codefaux said:

I apologize

me too

codefaux · May 1, 2021

On 4/28/2021 at 7:04 AM, SimonF said:

It would be 6.8.3 as I believe the issue started with 6.9, but will depend on hardware, i.e. if your new MB needs a newer kernel etc.

I finally got around to re-configuring my SMART values after reverting days ago. Turns out the 6.8.x smart-one.cfg stores its entries by diskX/parity/cache, and starting with 6.9 they are stored by disk ID (in my case, 1AMCC_etcetc as seen previously), so my handy one-liner only works with 6.9+. Other than that, almost everything is normal.

After re-writing all of the configurables for SMART, I'm still getting one instance of the SG_IO message sporadically, instead of one instance per disk for 31 disks.

[Attached image: image.png]

As you can see it's not all that often. I've attached a diagnostic.zip this time, since it isn't over-filled with useless garbage.

I SUSPECT the single message is related to my single unassigned device, but it is also configured correctly and reporting in the webUI so that also feels unlikely.

Honestly this is less of a "how do I fix this" and more "I really hope someone can explain this behavior because I am so curious" -- does anyone have a guess as to what's handled differently, internally? The unassigned device seems like the only one that would be handled out-of-band, yes?

Clearly I can ignore that level of spam. I just like knowing why too. Naturally curious, and all that.

Beyond that, no further needs. Thanks a ton guys.

codefaux-tower-diagnostics-20210430-1739.zip

Edited May 1, 2021 by codefaux
Forgot to attach the damn ZIP derp

SimonF · May 1, 2021

7 hours ago, codefaux said:

I just like knowing why too. Naturally curious, and all that.

Looking at the logs etc., is /dev/sdp an Unassigned Disk? Smart is failing for it, so that is likely the one producing the error.

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.107-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdp failed: AMCC/3ware controller, please try adding '-d 3ware,N',
you may need to replace /dev/sdp with /dev/twlN, /dev/twaN or /dev/tweN

codefaux · May 1, 2021

57 minutes ago, SimonF said:

Smart is failing

~~I mean, it probably was when I first booted or some such, but it shows in the webUI as valid.~~

See below, lol

And under Identity;

Interestingly, in the webUI, there's nowhere to put SMART configuration data for the Unassigned Device. I put it into smart-one.cfg by hand, using a little script I had written to reference which drives, and guessed that 'sdp' was the header it would look under given the URL patterns matched the smart-one.cfg header patterns... So, the webUI is using the information I gave it which it wasn't designed for, but something else in the system...isn't?

I've uploaded a new diagnostic, taken between screenshots with the SMART data showing in the webUI. Huh.

Well, that turned out to be more interesting than I expected. Thanks for humoring me.

codefaux-tower-diagnostics-20210501-0127.zip

Edited May 1, 2021 by codefaux

SimonF · May 1, 2021

2 hours ago, codefaux said:

SMART configuration data for the Unassigned Device.

What do you get if you run

cat /sys/block/sdp/queue/rotational

Before 6.9, UD tries to spin down spinners every 15 mins using hdparm. It reads the value above to tell whether the device is an SSD or a spinner. My guess is that the RAID card reports the device as a spinner (value 1), so UD is trying to spin it down.

It also runs the following to get the drive temp, which I think is the likely cause of the log entries.

/usr/sbin/smartctl -n standby -A $dev | /bin/awk 'BEGIN{t="*"} $1=="Temperature:"{t=$2;exit};$1==190||$1==194{t=$10;exit} END{print t}'

But it isn't applying the override for the controller with -d .......
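To see what the awk stage of that pipeline extracts, here is a small sketch that wraps the same awk program in a function so it can be fed saved `smartctl -A` output. `parse_temp` is a hypothetical name used only for illustration; the logic is taken from the command quoted above.

```shell
#!/bin/bash
# Sketch: the temperature-extraction step on its own.
# parse_temp is a hypothetical helper; it reads `smartctl -A` output on stdin.
parse_temp() {
    awk 'BEGIN { t = "*" }                                  # default when nothing matches
         $1 == "Temperature:"     { t = $2;  exit }         # "Temperature: NN ..." style line
         $1 == 190 || $1 == 194   { t = $10; exit }         # attribute 190/194, raw value column
         END { print t }'
}

# Example from the thread (not run here): with the controller override that
# codefaux reports working, the pipeline would look like
#   /usr/sbin/smartctl -n standby -A -d 3ware,7 /dev/twa1 | parse_temp
```

When smartctl cannot open the device, as in the /dev/sdp error shown earlier, none of the patterns match and the function prints the `*` placeholder.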

Do you get an error for hdparm -S180 /dev/sdp?

Edited May 1, 2021 by SimonF

codefaux · May 2, 2021

16 hours ago, SimonF said:

cat /sys/block/sdp/queue/rotational

root@Tower:~# cat /sys/block/sdp/queue/rotational
1

You are correct in this much, sir. The RAID controller is definitely reporting it as a spinner, despite it being distinctly non-rotational.
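The spinner check described above can be sketched as a small shell helper. This is only an illustration of reading the sysfs flag, not the plugin's actual code; `is_rotational` and its optional sysfs-root argument are hypothetical.

```shell
#!/bin/bash
# Sketch: decide whether a block device is flagged as a spinner.
# is_rotational is a hypothetical helper; the second argument (sysfs root)
# exists only so the function can be exercised against a fake tree.
is_rotational() {
    local dev="$1" sysroot="${2:-/sys}"
    local flag_file="$sysroot/block/$dev/queue/rotational"

    # Return 2 when the flag cannot be read (unknown device).
    [ -r "$flag_file" ] || return 2

    # 1 means the kernel reports a rotational device, 0 a non-rotational one.
    [ "$(cat "$flag_file")" = "1" ]
}

# Example (not run here): only set a standby timer on devices flagged as spinners.
#   if is_rotational sdp; then hdparm -S180 /dev/sdp; fi
```

For hdparm, `-S` values from 1 to 240 are multiples of 5 seconds, so `-S180` corresponds to the 15-minute timer mentioned above. Because the controller reports 1 for this SSD, a check like this would still treat it as a spinner.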

16 hours ago, SimonF said:

/usr/sbin/smartctl -n standby -A $dev | /bin/awk 'BEGIN{t="*"} $1=="Temperature:"{t=$2;exit};$1==190||$1==194{t=$10;exit} END{print t}'

Manually running that (replacing $dev and removing everything after the pipe) I get an error, but no new log entry.

root@Tower:~# /usr/sbin/smartctl -n standby -A /dev/sdp
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.107-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdp failed: AMCC/3ware controller, please try adding '-d 3ware,N',
you may need to replace /dev/sdp with /dev/twlN, /dev/twaN or /dev/tweN

I'd paste the kernel log around that time, but it's difficult to capture a lack of a change, lol. I also ran it with proper parameters (/dev/twa1 -d 3ware,7) and it returns normal data, still no kernel log message.

16 hours ago, SimonF said:

hdparm -S180 /dev/sdp

root@Tower:~# hdparm -S180 /dev/sdp

/dev/sdp:
 setting standby to 180 (15 minutes)
 HDIO_DRIVE_CMD(setidle) failed: Invalid argument

Well that gave AN error message in the kernel log, but not the one I'm seeing spammed.

[181527.479504] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.

SimonF · May 2, 2021

"/usr/sbin/hdparm -C $dev 2>/dev/null | /bin/grep -c standby"

is used for checking if the drive is spun down.

Try uninstalling the UD plugin just to see if the errors go away.

codefaux · May 4, 2021

On 5/2/2021 at 12:08 AM, SimonF said:

"/usr/sbin/hdparm -C $dev 2>/dev/null | /bin/grep -c standby"

Two lines at the exact same time, both a different message than the logspam.

[364448.237758] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
[364448.238036] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.

On 5/2/2021 at 12:08 AM, SimonF said:

uninstalling the UD plugin

If you mean the Unassigned Devices plugin, I'm using that. Stores some of my Docker container volumes. I could likely work around it temporarily if you really suspect that's the cause.

SimonF · May 4, 2021

I just checked your poll timer on Disk Settings and it is set to 5 mins. Are you able to change it to, say, 10 mins to see if the log entries then happen every 10 mins?
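One way to compare the log entries against the poll interval is to print the gap, in seconds, between consecutive occurrences of the message. The sketch below assumes kernel-log lines carry the `[seconds.microseconds]` prefix seen in the excerpts earlier in this thread; `log_gaps` is a hypothetical helper and the search text is whatever identifies the recurring message.

```shell
#!/bin/bash
# Sketch: print the time gap (whole seconds) between consecutive kernel-log
# lines that contain a given string.
# log_gaps is a hypothetical helper; it reads log lines on stdin.
log_gaps() {
    local pattern="$1"
    awk -v pat="$pattern" '
        index($0, pat) {
            ts = $0
            sub(/^\[ */, "", ts)      # drop the leading "[" and any padding
            sub(/\].*$/, "", ts)      # drop everything from "]" onwards
            if (seen) printf "%.0f\n", ts - prev
            prev = ts
            seen = 1
        }'
}

# Example (not run here):
#   dmesg | log_gaps "SG_IO"
```

Gaps clustering near 300 would line up with a 5-minute poll, and near 600 with a 10-minute one.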

codefaux · May 4, 2021

I assume you mean tunable poll_attributes, on Disk Settings? I've changed it, we'll see how that goes.
