aglyons Posted February 11, 2023 (edited)

I've just noticed that the SMART self-test seems to be running constantly and is stuck at 100% on all drives. I've rebooted the server, and the self-test still starts up again, at 100%, on every drive in the array, including the cache and parity drives.

knoxx-diagnostics-20230211-1550.zip

Edited February 11, 2023 by aglyons

JorgeB Posted February 12, 2023

SMART will likely not work correctly with a RAID controller. Use an LSI HBA in IT mode if possible.

aglyons (Author) Posted February 12, 2023 (edited)

This wasn't happening when I first got the server set up. This is new. I also have the "Starting Services..." notice at the bottom of the UI, which was mentioned in another thread. I'm wondering if they are related somehow.

I have a Dell PERC 310 running in HBA mode, and from what I understand, Dell's HBA mode and LSI's IT mode are the same thing.

Quote:
Dell's Firmware has RAID mode and HBA mode. In HBA mode, there is no RAID functionality, it's disabled, and the individual disks can each be set to non-RAID, and they will pass through to the host. IT is not a mode, it's a separate firmware. It has nothing to do with Dell's firmware either, IT is an LSI firmware that has no RAID functionality in it. It simply passes through the disks to the host by default. There is no big difference between Dell's HBA mode and LSI's IT firmware, they both accomplish the same thing.

I've done some more research, and according to the Dell specs, the PERC 310 does support S.M.A.R.T. management, at least in RAID mode according to the spec sheet - https://i.dell.com/sites/doccontent/shared-content/data-sheets/Documents/dell-perc-h310-spec-sheet.pdf. I'm looking for clarification on HBA mode, but if it's supported in RAID mode I think it's safe to say it would be.

Edited February 12, 2023 by aglyons

aglyons (Author) Posted February 12, 2023

Poking around, I found the hardware profile tool, which I had never run before. According to the output, from what I can understand, the card has already been flashed to the LSI firmware.

<node id="raid" claimed="true" class="storage" handle="PCI:0000:05:00.0" modalias="pci:v00001000d00000073sv00001028sd00001F78bc01sc04i00">
  <description>RAID bus controller</description>
  <product>MegaRAID SAS 2008 [Falcon]</product>
  <vendor>Broadcom / LSI</vendor>
  <physid>0</physid>
  <subproduct>Dell</subproduct>
  <subvendor>Dell</subvendor>
  <businfo>pci@0000:05:00.0</businfo>
  <logicalname>scsi1</logicalname>
  <version>03</version>
  <width units="bits">64</width>
  <clock units="Hz">33000000</clock>
  <configuration>
    <setting id="driver" value="megaraid_sas" />
    <setting id="latency" value="0" />
  </configuration>

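For anyone reading a report like this, the `modalias` attribute can be decoded to see which PCI IDs the card presents. The following is a minimal Python sketch, assuming the usual Linux PCI modalias layout (`pci:v…d…sv…sd…bc…sc…i…`, all hexadecimal); the function name `parse_pci_modalias` is made up for illustration. Note that these IDs identify the chip and the subsystem vendor only. They do not say which firmware is flashed, so the `driver` setting in the report is the more telling field.

```python
import re

# Assumed layout of a Linux PCI modalias string (all fields hexadecimal):
# pci:v<vendor>d<device>sv<subvendor>sd<subdevice>bc<class>sc<subclass>i<prog-if>
MODALIAS_RE = re.compile(
    r"^pci:v(?P<vendor>[0-9A-Fa-f]{8})"
    r"d(?P<device>[0-9A-Fa-f]{8})"
    r"sv(?P<subvendor>[0-9A-Fa-f]{8})"
    r"sd(?P<subdevice>[0-9A-Fa-f]{8})"
    r"bc(?P<base_class>[0-9A-Fa-f]{2})"
    r"sc(?P<sub_class>[0-9A-Fa-f]{2})"
    r"i(?P<prog_if>[0-9A-Fa-f]{2})$"
)


def parse_pci_modalias(modalias):
    """Split a PCI modalias string into integer fields (hypothetical helper)."""
    match = MODALIAS_RE.match(modalias.strip())
    if match is None:
        raise ValueError("not a PCI modalias string: %r" % modalias)
    # Convert every captured hex field to an int.
    return {name: int(value, 16) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    # The modalias value copied from the hardware report above.
    ids = parse_pci_modalias("pci:v00001000d00000073sv00001028sd00001F78bc01sc04i00")
    for name, value in ids.items():
        print("%-10s 0x%04x" % (name, value))
```

For the string in the report, the vendor field is 0x1000 (listed as Broadcom / LSI) and the subsystem vendor is 0x1028 (listed as Dell), which matches the `<vendor>` and `<subvendor>` elements.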
aglyons (Author) Posted February 12, 2023

I changed the default SMART controller type in disk settings to MegaRAID, since that is the driver being used according to the hardware report.

The continuous SMART test seems to have gone away, but I have now noticed that each individual drive's SMART config has some extra fields. The Device Name appears to be populated automatically and correctly, but Disk Index does not have a value. I've searched online and through the docs and can't find anything that explains what the Disk Index is or where I would get it.

JorgeB Posted February 12, 2023

The controller is using the MegaRAID driver, so it's not flashed to IT mode. SMART with MegaRAID is known to not work correctly, at least most of the time.

aglyons (Author) Posted February 12, 2023

So can you help me understand the difference between HBA (which is non-RAID passthrough) and IT mode (which is non-RAID passthrough)?

Sorry, I'm getting frustrated, as my system is running with data on it. If I have to do something dramatic to correct the situation, it's not going to be a fun time.

itimpi Posted February 12, 2023

Just now, aglyons said:
So can you help me understand the difference between HBA (which is non-raid passthrough) and IT mode (which is non-raid passthrough)? Sorry, I'm getting frustrated as my system is running with data on it. If I have to do something dramatic to correct the situation, It's not going to be a fun time.

The two terms are not really equivalents. HBA is just a generic term (it stands for Host Bus Adapter, I think). IT mode is one way in which an HBA can operate.

aglyons (Author) Posted February 12, 2023

"IT mode stands for "initiator target". It presents each drive individually to the host."

https://dannyda.com/2021/09/22/what-are-it-mode-hba-mode-raid-mode-in-sas-controllers/

Quote:
RAID mode: (Redundant Array of Independent Disks) mode, the controller will work in RAID mode, the operating system will not see each individual disks
HBA mode: (Host Bus Adapter) mode, the controller will not work in RAID mode, so that the operating system can see each disks individually (Dell’s Firmware has RAID mode and HBA mode)
IT mode: (Initiator Target) mode, the controller will not work in RAID mode, so that the operating system can see each disks individually (LSI/Broadcom’s Firmware has RAID mode and IT mode)
Basically HBA mode and IT mode are the same, just different vendors give the non-RAID mode different names

aglyons (Author) Posted February 12, 2023 (edited)

So the next question is, why is Unraid using the MegaRAID driver for a controller that is running in HBA mode?

@limetech If you have a moment, can you help me clear this up?

Edited February 12, 2023 by aglyons

aglyons (Author) Posted February 12, 2023

Querying smartctl from the CLI shows that it identifies the SCSI devices AND also identifies the MegaRAID disks separately.

root@KNOXX:~# smartctl --scan
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/bus/1 -d megaraid,0 # /dev/bus/1 [megaraid_disk_00], SCSI device
/dev/bus/1 -d megaraid,1 # /dev/bus/1 [megaraid_disk_01], SCSI device
/dev/bus/1 -d megaraid,2 # /dev/bus/1 [megaraid_disk_02], SCSI device
/dev/bus/1 -d megaraid,3 # /dev/bus/1 [megaraid_disk_03], SCSI device
/dev/bus/1 -d megaraid,4 # /dev/bus/1 [megaraid_disk_04], SCSI device
/dev/bus/1 -d megaraid,5 # /dev/bus/1 [megaraid_disk_05], SCSI device
/dev/bus/1 -d megaraid,6 # /dev/bus/1 [megaraid_disk_06], SCSI device
/dev/bus/1 -d megaraid,7 # /dev/bus/1 [megaraid_disk_07], SCSI device
/dev/bus/1 -d megaraid,10 # /dev/bus/1 [megaraid_disk_10], SCSI device
/dev/bus/1 -d megaraid,11 # /dev/bus/1 [megaraid_disk_11], SCSI device
/dev/bus/1 -d megaraid,12 # /dev/bus/1 [megaraid_disk_12], SCSI device
/dev/bus/1 -d megaraid,13 # /dev/bus/1 [megaraid_disk_13], SCSI device

Querying one of the SCSI devices, /dev/sdc, directly shows that smartctl sees the drive, identifies it just fine, and reports that the device is SMART capable and enabled.

root@KNOXX:~# smartctl -i /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST6000VN001-2BB186
Serial Number:    ZR12XXYK
LU WWN Device Id: 5 000c50 0e39de1db
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Feb 12 15:35:54 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

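The scan listing above has a regular shape (device path, an optional `-d TYPE`, then a `#` comment), so it can be turned into per-disk smartctl command lines. This is a rough Python sketch that only parses text in the format shown in this post; the helper names `parse_scan` and `smartctl_args` and the shortened `SAMPLE` string are illustrative, and nothing here runs smartctl itself.

```python
# Two lines copied from the scan output above, used as sample input.
SAMPLE = """\
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/bus/1 -d megaraid,0 # /dev/bus/1 [megaraid_disk_00], SCSI device
"""


def parse_scan(text):
    """Parse `smartctl --scan` style lines into dicts (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and pure comment lines
        spec, _, comment = line.partition("#")
        tokens = spec.split()
        if not tokens:
            continue
        dev_type = None
        if "-d" in tokens:
            idx = tokens.index("-d")
            if idx + 1 < len(tokens):
                dev_type = tokens[idx + 1]
        entries.append(
            {"device": tokens[0], "type": dev_type, "comment": comment.strip()}
        )
    return entries


def smartctl_args(entry, option="-i"):
    """Build an argument list such as ['smartctl', '-i', '-d', 'scsi', '/dev/sdc']."""
    args = ["smartctl", option]
    if entry["type"]:
        args += ["-d", entry["type"]]
    args.append(entry["device"])
    return args


if __name__ == "__main__":
    for entry in parse_scan(SAMPLE):
        print(" ".join(smartctl_args(entry)))
```

The argument lists could be handed to `subprocess.run` on the server to compare what each access path (plain SCSI versus `megaraid,N`) reports for the same physical disk.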
aglyons (Author) Posted February 12, 2023

OK, fixed it.

Dell PERC 310 running in HBA mode: you have to set the SMART controller type to SCSI. Once I did that, I could see "Last Smart test results:" spin and then return "Completed". I checked the report download and was presented with SMART stats.

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:   8
Rotation Rate:        5425 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500e39dc12a
Serial number:        ZR12XYE5
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Sun Feb 12 15:40:56 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     27 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -        70              - [-   -    -]

Long (extended) Self-test duration: 43980 seconds [12.2 hours]
Device does not support Background scan results logging

aglyons (Author) Posted February 12, 2023

And I spoke too soon. Triggering a manual short test produced an error message:

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

There are a bunch of hits online suggesting this is actually a bug in smartctl. One post at the very end of https://bugzilla.redhat.com/show_bug.cgi?id=1907729 mentions blacklisting the individual USB disk, although that report is about external drives and SMART failing.

Quote:
OK, so I found a most helpful solution here: https://forum.openmediavault.org/index.php?thread/26601-how-to-get-smart-on-external-drives/
It may only be a temporary solution, but here it is, and I quote: 'You need to UAS blacklist your individual Seagate USB3 disks since otherwise SMART will always fail.'
Do an lsusb to find the USB id of your Seagate disk, 0bc2:2343 in my case, then:
echo "options usb-storage quirks=0bc2:2343:u" >> /etc/modprobe.d/usb-storage-quirks.conf
and re-boot. Then, this command works:
smartctl -d sat /dev/sdb -a
NB. Specify 'sat' instead of 'scsi' in the above command. No '--smart=on' flag is needed. Hopefully this will help some-one until the problem is fixed in a later kernel (I assume).

The search continues.

aglyons (Author) Posted February 12, 2023 (edited)

I think this IS actually resolved.

Manually triggering a short self-test does return a message in red saying "Errors occurred - Check SMART report", but the SMART report shows no problem. And if you refresh the screen, the message changes to green and says "Completed".

The interesting thing is that this error message only pops up for the 3.5-inch spinning drives that are part of the Unraid array. I don't see the same error message on the SSDs that are part of the cache pool.

Edited February 12, 2023 by aglyons

aglyons (Author) Posted February 12, 2023 (edited)

OK, so this long rabbit hole brought me to this conclusion: there was nothing wrong with SMART testing to begin with. There is, however, what I think is a bug in the Unraid UI.

With the default SMART controller type switched back to Automatic, manually triggered tests run, report their progress percentage, and complete successfully without error.

BUT the UI for each drive's disk settings reports that a test is running, and the self-test page also shows "self-test in progress, 100% complete". This is shown even though no self-tests are actually running.

I've confirmed with smartctl that no tests are currently running on /dev/sdc, yet the GUI is showing a test in progress.

root@KNOXX:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST6000VN001-2BB186
Serial Number:    ZR12XXYK
LU WWN Device Id: 5 000c50 0e39de1db
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Feb 12 16:41:53 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 729) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   006    Pre-fail  Always       -       134257168
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       124
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       125890195
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7136
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       55
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   047   040    Old_age   Always       -       33 (Min/Max 31/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       284
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       855
194 Temperature_Celsius     0x0022   033   053   000    Old_age   Always       -       33 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       134257168
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6333h+41m+47.380s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       54866540544
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       264620654405

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7136         -
# 2  Short offline       Completed without error       00%      7136         -
# 3  Short offline       Completed without error       00%      7135         -
# 4  Short offline       Completed without error       00%      5617         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The line to check for a currently running test is:

Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.

If a test WAS currently running, that line would show the current progress, like this:

Self-test execution status:      ( 249) Self-test routine in progress... 90% of test remaining.

Edited February 12, 2023 by aglyons

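The check described above can be scripted, so that the GUI's claim can be compared against what the drive itself reports. Below is a small Python sketch, assuming ATA-style `smartctl -a` output like the listing in this post. It relies on the number in parentheses being the drive's self-test execution status byte, where (as far as I understand the ATA layout) the upper four bits hold the state, with 15 meaning "in progress", and the lower four bits hold the remaining work in 10% steps. The function name `self_test_state` is made up for illustration.

```python
import re

# Matches e.g. "Self-test execution status:      ( 249)" in ATA smartctl output.
STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def self_test_state(smartctl_output):
    """Return the decoded self-test status, or None if the line is missing."""
    match = STATUS_RE.search(smartctl_output)
    if match is None:
        return None  # e.g. SCSI/SAS style output, which has no such line
    code = int(match.group(1))
    state = code >> 4                 # upper nibble: execution state
    remaining = (code & 0x0F) * 10    # lower nibble: tenths of the test left
    in_progress = state == 0xF
    return {
        "code": code,
        "in_progress": in_progress,
        "percent_remaining": remaining if in_progress else 0,
    }


if __name__ == "__main__":
    idle = "Self-test execution status:      (   0) The previous self-test routine completed"
    busy = "Self-test execution status:      ( 249) Self-test routine in progress..."
    print(self_test_state(idle))
    print(self_test_state(busy))
```

With the two status lines quoted above, the value 0 decodes as "not running" and 249 (0xF9) decodes as "in progress, 90% remaining", which agrees with the text smartctl prints next to each value.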
JorgeB Posted February 13, 2023

17 hours ago, aglyons said:
So the next question is, why is Unraid using the Megaraid driver for the controller that is running in HBA mode?

Because it's not a true HBA. A true LSI HBA will use the mpt3sas driver, and some LSI controllers even use it when they are in RAID mode. You have a controller that has an HBA mode but is not a true HBA, so, as mentioned, SMART might not work correctly.

Conan the Barbarian Posted March 26, 2023

Hi guys, I'm getting the same UI issue on one of my Unraid servers, which has an HP P822 SAS controller. On a Dell server with a PERC 750, it works fine.

unCoreX Posted June 21, 2023

I have the same problem. I have an HP ProLiant ML350p G8.