
S.M.A.R.T Self Test seems to run constantly


aglyons


This wasn't happening when I first got the server set up. This is new. I also have the "Starting Services..." notice at the bottom of the UI which was mentioned in another thread. I'm wondering if they are related somehow.

 

I have a Dell PERC 310 running in HBA mode and, from what I understand, Dell's HBA mode and LSI's IT mode are the same thing.

 

Quote

Dell's Firmware has RAID mode and HBA mode. In HBA mode, there is no RAID functionality, it's disabled, and the individual disks can each be set to non-RAID, and they will pass through to the host. IT is not a mode, it's a separate firmware. It has nothing to do with Dell's firmware either, IT is an LSI firmware that has no RAID functionality in it. It simply passes through the disks to the host by default. There is no big difference between Dell's HBA mode and LSI's IT firmware, they both accomplish the same thing.

 

I've done some more research and, according to the Dell specs, the PERC 310 does support S.M.A.R.T. management, at least in RAID mode according to the spec sheet - https://i.dell.com/sites/doccontent/shared-content/data-sheets/Documents/dell-perc-h310-spec-sheet.pdf.

 

I'm still looking for confirmation that it is supported in HBA mode, but if it's supported in RAID mode I think it's safe to assume it would be.

Edited by aglyons
Link to comment

Poking around, I found the hardware profile tool, which I had never run before. According to the output, from what I can understand, the card has already been flashed to the LSI firmware.

 

<node id="raid" claimed="true" class="storage" handle="PCI:0000:05:00.0" modalias="pci:v00001000d00000073sv00001028sd00001F78bc01sc04i00">
         <description>RAID bus controller</description>
         <product>MegaRAID SAS 2008 [Falcon]</product>
         <vendor>Broadcom / LSI</vendor>
         <physid>0</physid>
         <subproduct>Dell</subproduct>
         <subvendor>Dell</subvendor>
         <businfo>pci@0000:05:00.0</businfo>
         <logicalname>scsi1</logicalname>
         <version>03</version>
         <width units="bits">64</width>
         <clock units="Hz">33000000</clock>
         <configuration>
          <setting id="driver" value="megaraid_sas" />
          <setting id="latency" value="0" />
         </configuration>
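To double-check which kernel driver has claimed the controller, the `driver` setting can be pulled out of the hardware profile XML. This is a minimal sketch, assuming the XML comes from `lshw -xml` (the layout above looks like lshw output, but the post doesn't name the tool); `lshw_driver` is just an illustrative helper name.

```shell
# Hypothetical helper: print the value of every <setting id="driver" .../>
# element found in lshw-style XML read from stdin.
lshw_driver() {
    sed -n 's/.*<setting id="driver" value="\([^"]*\)".*/\1/p'
}

# Example (needs root and the lshw tool, so it is left commented out):
#   lshw -xml -class storage | lshw_driver
```

For the excerpt above this would print `megaraid_sas`. As the later reply in this thread points out, a true LSI HBA would normally show `mpt3sas` instead.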

 

Link to comment

I changed the default Smart Controller Type in disk settings to Megaraid since that is the driver that is being used according to the hardware report.

 


 

The continuous SMART test seems to have gone away, but I have now noticed that each individual drive's SMART config has some extra fields.

 

It appears as though the Device Name is populated automatically and correctly. But Disk Index does not have a value. 

 


 

I've searched online and through the docs and can't find anything that refers to the Disk Index and where/how I would get that.

 

Link to comment

So can you help me understand the difference between HBA (which is non-raid passthrough) and IT mode (which is non-raid passthrough)?

 

Sorry, I'm getting frustrated as my system is running with data on it. If I have to do something dramatic to correct the situation, it's not going to be a fun time.

Link to comment
Just now, aglyons said:

So can you help me understand the difference between HBA (which is non-raid passthrough) and IT mode (which is non-raid passthrough)?

 

Sorry, I'm getting frustrated as my system is running with data on it. If I have to do something dramatic to correct the situation, it's not going to be a fun time.

The two terms are not really equivalent. HBA is just a generic term (it stands for Host Bus Adapter, I think). IT mode is one way in which an HBA can operate.

Link to comment

"IT mode stands for "initiator target". It presents each drive individually to the host."

 

https://dannyda.com/2021/09/22/what-are-it-mode-hba-mode-raid-mode-in-sas-controllers/

 

Quote

RAID mode: (Redundant Array of Independent Disks) mode, the controller will work in RAID mode, the operating system will not see each individual disks

 

HBA mode: (Host Bus Adapter) mode, the controller will not work in RAID mode, so that the operating system can see each disks individually

(Dell’s Firmware has RAID mode and HBA mode)

 

IT mode: (Initiator Target) mode, the controller will not work in RAID mode, so that the operating system can see each disks individually

 

(LSI/Broadcom’s Firmware has RAID mode and IT mode)

 

Basically HBA mode and IT mode are the same, just different vendors give the non-RAID mode different names

 

Link to comment

Querying smartctl from the CLI shows that it identifies the disks as plain SCSI devices AND also lists them separately as megaraid devices.

 

root@KNOXX:~# smartctl --scan
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/bus/1 -d megaraid,0 # /dev/bus/1 [megaraid_disk_00], SCSI device
/dev/bus/1 -d megaraid,1 # /dev/bus/1 [megaraid_disk_01], SCSI device
/dev/bus/1 -d megaraid,2 # /dev/bus/1 [megaraid_disk_02], SCSI device
/dev/bus/1 -d megaraid,3 # /dev/bus/1 [megaraid_disk_03], SCSI device
/dev/bus/1 -d megaraid,4 # /dev/bus/1 [megaraid_disk_04], SCSI device
/dev/bus/1 -d megaraid,5 # /dev/bus/1 [megaraid_disk_05], SCSI device
/dev/bus/1 -d megaraid,6 # /dev/bus/1 [megaraid_disk_06], SCSI device
/dev/bus/1 -d megaraid,7 # /dev/bus/1 [megaraid_disk_07], SCSI device
/dev/bus/1 -d megaraid,10 # /dev/bus/1 [megaraid_disk_10], SCSI device
/dev/bus/1 -d megaraid,11 # /dev/bus/1 [megaraid_disk_11], SCSI device
/dev/bus/1 -d megaraid,12 # /dev/bus/1 [megaraid_disk_12], SCSI device
/dev/bus/1 -d megaraid,13 # /dev/bus/1 [megaraid_disk_13], SCSI device
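The number after `megaraid,` in the scan output is the per-disk index that smartctl expects in `-d megaraid,N`. It is plausibly the value the Disk Index field is asking for, though that mapping is an assumption and is not confirmed anywhere in this thread. A small sketch that lists those indices from scan output (the function name `megaraid_indices` is made up for illustration):

```shell
# Hypothetical helper: read `smartctl --scan` output on stdin and print the
# N from every "-d megaraid,N" entry, one per line.
megaraid_indices() {
    awk '$2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ {
        sub(/^megaraid,/, "", $3)
        print $3
    }'
}

# Example (needs the real hardware, so it is left commented out):
#   smartctl --scan | megaraid_indices
#   smartctl -i -d megaraid,0 /dev/bus/1
```

Note that the indices are not necessarily contiguous: in the scan above, 8 and 9 are absent.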

 

Querying one of the SCSI devices (/dev/sdc) directly shows that smartctl sees the drive, identifies it correctly, and reports that SMART is available and enabled.

 

root@KNOXX:~# smartctl -i /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST6000VN001-2BB186
Serial Number:    ZR12XXYK
LU WWN Device Id: 5 000c50 0e39de1db
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Feb 12 15:35:54 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
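The same check can be repeated across all drives by looking for the final `SMART support is: Enabled` line. A rough sketch, assuming the output format shown above; `smart_enabled` is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: read `smartctl -i` output on stdin and succeed
# (exit status 0) only if it contains a "SMART support is: Enabled" line.
smart_enabled() {
    grep -Eq '^SMART support is:[[:space:]]+Enabled'
}

# Example (needs the real drives, so it is left commented out):
#   for d in /dev/sd[b-m]; do
#       smartctl -i "$d" | smart_enabled && echo "$d: SMART enabled"
#   done
```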

 

Link to comment

OK, fixed it.

 

With a Dell PERC 310 running in HBA mode, you have to set the SMART controller type to SCSI.

 

Once I did that, I could see the "Last Smart test results:" field spin and then return "Completed". I checked the report download and was presented with SMART stats.

 

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:   8
Rotation Rate:        5425 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500e39dc12a
Serial number:        ZR12XYE5
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Sun Feb 12 15:40:56 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     27 C
Drive Trip Temperature:        0 C

Error Counter logging not supported


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -      70                 - [-   -    -]

Long (extended) Self-test duration: 43980 seconds [12.2 hours]

Device does not support Background scan results logging

 

Link to comment

And I spoke too soon.

 

Triggering a manual short test produced an error message:

 

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

 

There are a bunch of hits online suggesting this is actually a bug in smartctl.

 

One post, at the very end of https://bugzilla.redhat.com/show_bug.cgi?id=1907729, mentions blacklisting the individual USB disk. That report was about SMART failing on external drives.

 

Quote

OK, so I found a most helpful solution here: https://forum.openmediavault.org/index.php?thread/26601-how-to-get-smart-on-external-drives/

It may only be a temporary solution, but here it is, and I quote: 'You need to UAS blacklist your individual Seagate USB3 disks since otherwise SMART will always fail.'

Do an lsusb to find the USB id of your Seagate disk, 0bc2:2343 in my case, then:

echo "options usb-storage quirks=0bc2:2343:u" >> /etc/modprobe.d/usb-storage-quirks.conf

and re-boot. Then, this command works:

smartctl -d sat /dev/sdb -a

NB. Specify 'sat' instead of 'scsi' in the above command. No '--smart=on' flag is needed. Hopefully this will help some-one until the problem is fixed in a later kernel (I assume).

 

The search continues

Link to comment

I think this IS actually resolved. 

 

Manually triggering a short self-test does return a message in red saying "Errors occurred - Check SMART report", but checking the SMART report shows no problem. And if you refresh the screen, the message changes to green and says "Completed".

 

The interesting thing is that this error message only pops up for 3.5-inch spinning drives that are part of the Unraid array. I don't see the same error message on SSDs that are part of the cache pool.

Edited by aglyons
Link to comment

OK, so this long rabbit hole brought me to this conclusion.

 

There was nothing wrong with SMART testing to begin with. There is, however, what I think is a bug in the Unraid UI.

 

With the default SMART controller type switched back to auto, manually triggered tests run, report their progress %, and complete successfully without error.

 

BUT,

 

The disk settings page for each drive reports that a test is running, and the self-test page also shows "self-test in progress, 100% complete". This is shown even though no self-tests are actually running. I've confirmed with smartctl that no test is currently running on /dev/sdc, yet the GUI shows a test in progress.

 


 

root@KNOXX:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST6000VN001-2BB186
Serial Number:    ZR12XXYK
LU WWN Device Id: 5 000c50 0e39de1db
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Feb 12 16:41:53 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 729) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   006    Pre-fail  Always       -       134257168
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       124
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       125890195
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7136
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       55
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   047   040    Old_age   Always       -       33 (Min/Max 31/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       284
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       855
194 Temperature_Celsius     0x0022   033   053   000    Old_age   Always       -       33 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       134257168
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6333h+41m+47.380s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       54866540544
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       264620654405

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7136         -
# 2  Short offline       Completed without error       00%      7136         -
# 3  Short offline       Completed without error       00%      7135         -
# 4  Short offline       Completed without error       00%      5617         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

The line that shows whether a test is currently running is:

 

Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.

 

If a test WAS currently running, that line would show the current progress, like this:

 

Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
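The number in parentheses can be decoded without relying on the descriptive text. For ATA drives, smartctl treats a code whose upper four bits are all set (240 to 255) as "in progress", with the lower four bits giving the remainder in steps of 10%, so 249 means 90% remaining. A minimal sketch under that assumption; `selftest_code` and `selftest_state` are hypothetical helper names.

```shell
# Hypothetical helper: pull the numeric code out of the
# "Self-test execution status: ( NNN)" line of smartctl output on stdin.
selftest_code() {
    sed -n 's/^Self-test execution status:[[:space:]]*([[:space:]]*\([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: print "idle", "running N% remaining",
# "other (code N)", or "unknown" if the status line is missing.
selftest_state() {
    code=$(selftest_code)
    if [ -z "$code" ]; then
        echo "unknown"
    elif [ "$code" -eq 0 ]; then
        echo "idle"
    elif [ $((code / 16)) -eq 15 ]; then
        echo "running $(( (code % 16) * 10 ))% remaining"
    else
        echo "other (code $code)"
    fi
}

# Example (needs the real drive, so it is left commented out):
#   smartctl -a /dev/sdc | selftest_state
```

Comparing this against what the GUI shows makes it easy to tell a stale display from a test that is really running.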

 

Edited by aglyons
Link to comment
17 hours ago, aglyons said:

So the next question is, why is Unraid using the Megaraid driver for the controller that is running in HBA mode?

Because it's not a true HBA. A true LSI HBA will use the mpt3sas driver, and some LSI controllers even use it when they are in RAID mode. You have a controller that has an HBA mode but is not a true HBA, so, as mentioned, SMART might not work correctly.
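The driver bound to the card is a quick way to tell the two cases apart. A rough sketch, assuming only the two driver names that appear in this thread (`megaraid_sas` from the hardware report, `mpt3sas` from the reply above); other kernels or cards may use other names, and `hba_kind` is a made-up helper name.

```shell
# Hypothetical helper: classify a storage controller by its kernel driver name.
# Only the two drivers mentioned in this thread are recognised.
hba_kind() {
    case "$1" in
        mpt3sas)      echo "true-hba" ;;   # may also appear on some RAID-mode cards
        megaraid_sas) echo "megaraid" ;;   # RAID-family firmware, even in HBA mode
        *)            echo "unknown" ;;
    esac
}

# Example using the PCI address from the hardware report (left commented out
# because it needs the real card):
#   hba_kind "$(basename "$(readlink /sys/bus/pci/devices/0000:05:00.0/driver)")"
```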

Link to comment