ChristianMingle Posted June 24, 2022
So... clearly this is not the ideal SMART scan, and it never shows up as "healthy". How bad is this? The drive in question is a 4x200 GB Sun Oracle F80 800GB PCIe Flash Accelerator. I bought it second-hand a little over 6 months ago. It shows as having been online for 54,000 hours, which comes out to around 6.2 years. I've heard these things can read/write petabytes before dying, which is definitely doable in 6 years, but I'd imagine these drives have a little more endurance than your normal drives. With that being the case, how concerning are these results? I am using 1x200 GB as a cache drive, and the other 3x200 GB are just sitting there holding random data. Should I disable the 3x200 GB and just leave the 1x200 GB as a cache drive until it dies? Ideally I'd like to use it until it dies, since I basically just bought it. I don't really know how to read a SMART report beyond knowing that anything above the threshold probably isn't good, and I don't know what a normal report or normal wear and tear looks like. Any insight would be appreciated, thank you.
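As a quick sanity check on the power-on figure in the post above, the hours-to-years conversion can be sketched as follows. This is only an illustration: the function name is made up for the example, and a 365-day year of continuous uptime is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, drive powered on continuously


def power_on_years(hours: float) -> float:
    """Convert SMART power-on hours into years of continuous uptime."""
    return hours / HOURS_PER_YEAR


years = power_on_years(54_000)
print(f"{years:.1f} years")  # prints: 6.2 years
```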
JorgeB Posted June 24, 2022
The normalized values look good, and the RAW value is probably not the main indicator for this device. I wouldn't worry for now.
ChristianMingle (Author) Posted August 4, 2022
On 6/24/2022 at 5:02 AM, JorgeB said: "Normalized values look good, RAW value probably not the main indicator for this device, I wouldn't worry for now."
Should I be worried about this? I noticed my log was at 100%, and it is all from this one file. It contains a bunch of repeated "read errors" from disks 5, 6, and 7. I tried uploading the log, but it is 127.9 MB, so the upload fails. I tried to pastebin it, but it's over 300,000 lines and crashes my tab whenever I try to paste it, so it's much easier to just upload it elsewhere and paste a Mega link: https://mega.nz/file/QkRXibpa#k7kktDsPwLWgEueh9Ov2BLdNzYGmLIsJ3QLDybrRhCE
Sorry for the late reply, but any info would be helpful. Should I increase the log size? Is this a problem? Do I need to suppress the errors or something?
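For a log that is too large to paste, one option is to collapse the repeats into a short summary first. The sketch below is only a rough illustration of that idea: `summarize` is a hypothetical helper, the classic "Mon DD HH:MM:SS host" syslog prefix is assumed, and the sample "read error" lines are invented for the demo rather than copied from the actual log.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed prefix format: "Jul 17 09:32:48 Tower " (month, day, time, hostname).
_PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+")
# Standalone numbers only, so names such as "disk5" are left untouched.
_NUMBER = re.compile(r"\b\d+\b")


def summarize(lines: Iterable[str], top: int = 5) -> list[tuple[str, int]]:
    """Count messages after stripping timestamps and masking standalone numbers."""
    counts: Counter = Counter()
    for line in lines:
        message = _PREFIX.sub("", line.strip())
        message = _NUMBER.sub("N", message)
        if message:
            counts[message] += 1
    return counts.most_common(top)


# Hypothetical sample lines, for illustration only.
sample = [
    "Jul 17 09:32:49 Tower kernel: md: disk5 read error, sector=1000",
    "Jul 17 09:32:49 Tower kernel: md: disk5 read error, sector=1008",
    "Jul 17 09:32:50 Tower kernel: md: disk6 read error, sector=1000",
]
summary = summarize(sample)
for message, count in summary:
    print(f"{count:>6}  {message}")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which keeps memory use low even for a file with hundreds of thousands of lines.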
JorgeB Posted August 4, 2022
Please post the diagnostics.
ChristianMingle (Author) Posted August 5, 2022
On 8/4/2022 at 4:21 AM, JorgeB said: "Please post the diagnostics."
I have attached the diagnostics. Also, is this concerning?
Last check completed on Sat 30 Jul 2022 09:07:14 PM PDT (six days ago)
Finding 48827965 errors
Duration: 17 hours, 37 minutes, 13 seconds. Average speed: 154.4 MB/sec
I assume finding 0 errors is the goal, so almost 50 million has my eyebrows raised, lol.
Attachment: tower-diagnostics-20220805-0953.zip
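A rough cross-check of that report: multiplying the duration by the average speed gives the amount of data the check covered. The sketch below assumes decimal units (1 TB = 1,000,000 MB); the function name is illustrative only.

```python
def checked_terabytes(hours: int, minutes: int, seconds: int, mb_per_sec: float) -> float:
    """Estimate how much data a check covered from its duration and average speed."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds * mb_per_sec / 1_000_000  # MB -> TB, decimal units assumed


covered_tb = checked_terabytes(17, 37, 13, 154.4)
print(f"{covered_tb:.2f} TB")  # prints: 9.79 TB
```

The result is close to the 9.8 TB figure shown in a later parity check readout in this thread, which suggests the check ran over the full parity size rather than aborting early.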
JorgeB Posted August 5, 2022
There's basically nothing in the syslog. As for the sync errors, run another check; if the last one was a correcting check, it should find 0 errors.
trurl Posted August 5, 2022
1 hour ago, JorgeB said: "There's basically nothing in the syslog"
Literally only a few lines, which is itself abnormal. Did you delete logs? What do you get from the command line with this?
ls -lah /var/log/syslog*
2 hours ago, ChristianMingle said: "Finding 48827965 errors"
My first thought is that you did something to invalidate parity. If the next parity check still has errors, that suggests a hardware problem.
trurl Posted August 5, 2022
Never mind.
7 minutes ago, trurl said: "Literally only a few lines which itself is abnormal. Did you delete logs? What do you get from command line with this? ls -lah /var/log/syslog*"
On 8/4/2022 at 3:17 AM, ChristianMingle said: "its much easier to just upload it"
It's easier for everyone if you just zip that syslog and attach it.
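Zipping the syslog as suggested can be done with any archive tool. As a minimal sketch in Python, assuming only the standard `zipfile` module, something like the following would work; `zip_log` is a hypothetical helper, and the demo compresses a throwaway file of repeated text rather than a real log.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_log(src: Path) -> Path:
    """Write src into a deflate-compressed .zip next to it and return the archive path."""
    dest = src.with_name(src.name + ".zip")
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(src, arcname=src.name)
    return dest


# Demo on a temporary file; the repeated line stands in for a real syslog.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "syslog"
    log.write_text("Tower kernel: example repeated message\n" * 10_000)
    archive_path = zip_log(log)
    original_size = log.stat().st_size
    zipped_size = archive_path.stat().st_size
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(f"{original_size} bytes -> {zipped_size} bytes")
```

Logs made of the same message repeated many times tend to compress very well, which is why a file that is too big to upload raw can often be attached once zipped.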
ChristianMingle (Author) Posted August 5, 2022
On 8/4/2022 at 3:17 AM, ChristianMingle said: "[...] its much easier to just upload it and paste a mega link: https://mega.nz/file/QkRXibpa#k7kktDsPwLWgEueh9Ov2BLdNzYGmLIsJ3QLDybrRhCE [...]"
This Mega link is the log file. However, the diagnostics I posted later on were taken AFTER I deleted the log file, because it was at 100% and I figured it'd be fine since I had a copy of the actual log file. I'm not sure what else is helpful besides the log file and the diagnostics, but I can get anything required once I get back home from my girlfriend's birthday weekend. Thanks for the help, guys. Sorry, I'm a newbie, lol.
JorgeB Posted August 5, 2022
8 minutes ago, ChristianMingle said: "this mega link is the log file"
We usually don't like external links, but I checked the log and it was an HBA problem:
Jul 17 09:32:48 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
No idea if this is the latest firmware, so check the Broadcom site:
Jul 17 08:16:29 Tower kernel: mpt2sas_cm1: WarpDrive: FWVersion(113.05.03.01), ChipRevision(0x03), BiosVersion(110.00.01.00)
Other than that, make sure it's well seated and sufficiently cooled. You can also try a different slot if available.
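To compare the installed firmware against whatever the vendor site lists, the version fields can be pulled out of that kernel line. The sketch below is an illustration only: the regular expression is written against the single line quoted above, the helper names are hypothetical, and other driver versions may format the line differently.

```python
import re
from typing import Optional

# Pattern written against the one WarpDrive line quoted in this thread (assumed format).
_VERSIONS = re.compile(
    r"FWVersion\((?P<fw>[\d.]+)\),\s*"
    r"ChipRevision\((?P<chip>0x[0-9a-fA-F]+)\),\s*"
    r"BiosVersion\((?P<bios>[\d.]+)\)"
)


def parse_versions(line: str) -> Optional[dict]:
    """Return the firmware, chip, and BIOS versions found in a kernel log line."""
    match = _VERSIONS.search(line)
    return match.groupdict() if match else None


def version_tuple(version: str) -> tuple:
    """Turn '113.05.03.01' into (113, 5, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


line = (
    "Jul 17 08:16:29 Tower kernel: mpt2sas_cm1: WarpDrive: "
    "FWVersion(113.05.03.01), ChipRevision(0x03), BiosVersion(110.00.01.00)"
)
info = parse_versions(line)
print(info)
```

Comparing `version_tuple(info["fw"])` with the tuple for the version published by the vendor then shows whether an update is available, without being tripped up by leading zeros.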
ChristianMingle (Author) Posted August 5, 2022
39 minutes ago, JorgeB said: "We usually don't like external links, but I checked the log and it was a HBA problem: Jul 17 09:32:48 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!! [...] Other than that make sure it's well seated and sufficiently cooled, you can also try a different slot if available."
Is the SAS host being non-operational something that can happen due to a bug and is fixed on restart, or is it something that generally won't go away once it shows up until the underlying problem is fixed? Will it only show up during parity checks, or will it cause problems during day-to-day usage? I haven't noticed anything weird besides the 50 million errors on the first parity check in 90 days and the log file being full (I deleted it and it seems fine now), so maybe it's just a parity check thing? Either way, I'd like to fix it, since parity problems are never a good thing, so I'll look into your recommendations.
JorgeB Posted August 5, 2022
1 minute ago, ChristianMingle said: "Is the SAS host being non-operational something that can happen due to a bug and is fixed on restart,"
I would not call it a bug. The HBA stops responding, which is usually more of a hardware issue, though it could also be firmware related. A reboot/reset should bring it back online, until it happens again (or not).
3 minutes ago, ChristianMingle said: "so maybe its just a parity check thing?"
It's more likely to happen during heavy load, like a parity check. Do what I recommended and run another check; maybe it was a one-time thing.
ChristianMingle (Author) Posted August 5, 2022
35 minutes ago, JorgeB said: "I would not call it a bug, the HBA stops responding, this is usually more a hardware issue, could also be firmware related, reboot/reset should bring it back online, until it happens again (or not). More likely to happen during heavy load, like a parity check, do what I recommended and run another check, maybe it was a one time thing."
9.8 TB
Elapsed time: less than a minute
Current position: 4.89 GB (0.0 %)
Estimated speed: 168.5 MB/sec
Estimated finish: 16 hours, 8 minutes
Sync errors corrected: 1193109
I'm assuming 1 million+ errors within the first 30 seconds of the parity check is not what we are looking for... so I will try updating the SAS firmware later and keep it spun down until I get the chance.
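The estimated finish in that readout follows from the remaining data divided by the current speed. A minimal sketch of the arithmetic is below, assuming the leading "9.8 TB" is the total size being checked and that decimal units are used; the function name is illustrative only.

```python
def remaining_time(total_tb: float, done_gb: float, mb_per_sec: float) -> tuple[int, int]:
    """Return (hours, minutes) left at the current speed, using decimal units."""
    remaining_mb = total_tb * 1_000_000 - done_gb * 1_000
    seconds = int(remaining_mb / mb_per_sec)
    hours, leftover = divmod(seconds, 3600)
    return hours, leftover // 60


eta = remaining_time(total_tb=9.8, done_gb=4.89, mb_per_sec=168.5)
print(f"{eta[0]} hours, {eta[1]} minutes")  # prints: 16 hours, 8 minutes
```

The estimate moves as the speed changes over the course of a check, so it is only a snapshot of the current rate.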
ChristianMingle (Author) Posted August 5, 2022
2 minutes ago, ChristianMingle said: "[...] Sync errors corrected:1193109 im assuming 1mil+ errors within the first 30 seconds of the parity check is not what we are looking for... so i will try updating the SAS firmware later and keep it spun down until i get the chance"
Also, is it fair to assume that the 3 drives with 50 million errors are probably not in working order? I got the 4x200 GB (one device) used, and it has had errors ever since I started using it. I haven't noticed any problems, but I also try not to use it very much. It was HEAVILY used by the previous owner, with something like 60k hours on it, but it is a Sun Oracle F80 800GB, so it's meant to handle a pretty heavy-use lifecycle. Still, the errors concern me, as every other drive I have is at 0. I think the errors are just bad sectors, and I was told that the drive knows that and avoids using the bad sectors, but I don't know more than that.
JorgeB Posted August 6, 2022
13 hours ago, ChristianMingle said: "im assuming 1mil+ errors within the first 30 seconds of the parity check is not what we are looking for."
No, but they are likely the result of the previous check, where those devices could not be read. If it's a correcting check, let it finish, then run another one.
13 hours ago, ChristianMingle said: "also, is it fair to assume that the 3 drives with 50mil errors are probably not in working order?"
Post new diags.
trurl Posted August 6, 2022
Why do you have an SSD in the array (disk1)? SSDs in the array cannot be trimmed and can only be written at parity speed. A better use for it is in a pool. Personally, I wouldn't even bother to use any of those very small disks (5, 6, 7).
ChristianMingle (Author) Posted August 8, 2022
On 8/6/2022 at 4:21 AM, JorgeB said: "No, but they are likely the result of the previous check, where those devices could not be read, if it's a correcting check let it finish then run another one. Post new diags."
Last check completed on Sat 06 Aug 2022 05:11:45 AM PDT (yesterday)
Finding 48827869 errors
Duration: 17 hours, 37 minutes, 51 seconds. Average speed: 154.4 MB/sec
Diagnostics attached. I haven't done any of the recommended steps yet, as I just got back home, but I figured I'd post these here because the check ran while I was gone.
Attachment: tower-diagnostics-20220807-2224.zip
trurl Posted August 8, 2022
Disks 5, 6, and 7 all disconnected. I would do a New Config without that SSD in the array and without any of those 200 GB disks, rebuild parity, and go forward from there.
ChristianMingle (Author) Posted August 8, 2022
3 hours ago, trurl said: "Disks 5,6,7 all disconnected. I would New Config without that SSD in the array, and without any of those 200GB disk, rebuild parity, and go forward from there."
Ideally, I could combine the cache and disks 5, 6, and 7 into the 800 GB cache disk I originally planned on having. (The cache and disks 5, 6, and 7 are all the same Sun Flash Accelerator F80, but I didn't know how to flash it into one 800 GB drive.) If the cache drive goes offline, is that less problematic than the others going offline? If not, should I make the SSD the cache drive and just remove the Sun Flash Accelerator F80? I wish I didn't have to, since I overpaid for it to begin with and now it's causing problems, lol.
ChristianMingle (Author) Posted August 8, 2022
3 hours ago, trurl said: "Disks 5,6,7 all disconnected. I would New Config without that SSD in the array, and without any of those 200GB disk, rebuild parity, and go forward from there."
Also, sorry, but can you explain how one of these can be healthy while the other 3 have errors, even though they're all the same drive? I know it's actually four 200 GB drives, but is it just a firmware problem? I find it hard to believe that the one drive I made cache just happened to be the only healthy drive and the other 3 are crapping out, lol. (The cache and disks 5, 6, and 7 are all the same Sun Flash Accelerator F80, in case you didn't see the other post yet.)
JorgeB Posted August 8, 2022
Try adding all those drives to a cache pool to see if they work there; they don't work on the array.
trurl Posted August 8, 2022
In reply to ChristianMingle's post from 5 hours ago:
What do you see when you hover over those SMART warnings?
ChristianMingle (Author) Posted August 8, 2022
4 hours ago, trurl said: "What do you see when you hover over those SMART warnings?"
A lot of the values are also over the threshold (I provided more info on the middle drive). Also, I know the F80 has over 55,000 powered-on hours and is over 6 years old, but everyone always says that they "last forever", so I'm hoping that this one isn't dying yet, lol.
trurl Posted August 8, 2022
Every one of those is trash.