... is this SMART result concerning...



QRBypzZ.png

So... clearly this is not the ideal SMART scan, and it never shows up as "healthy". How bad is this? The drive in question is a Sun Oracle F80 800GB PCIe Flash Accelerator (4x200GB). I bought it second hand a little over 6 months ago. It shows as having been online for 54,000 hours, which comes out to around 6.2 years. I've heard these things can read/write petabytes before dying, which is definitely doable in 6 years, but I'd imagine these drives have a little more endurance than normal drives. With that being the case, how concerning are these results? I am using 1x200GB as a cache drive and the other 3x200GB are just holding random data. Should I disable the 3x200GB and just leave the 1x200GB as a cache drive until it dies? Ideally I'd like to use it until death since I basically just bought it. I don't really know how to read a SMART report beyond knowing that anything above the threshold probably isn't good, and I don't know what a normal report or normal wear and tear looks like. Any insight would be appreciated, thank you.
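For anyone else trying to read a report like this: in the attribute table printed by `smartctl -A`, the normalized VALUE counts down toward THRESH, so an attribute is flagged as failing when VALUE is at or below THRESH, not above it. A rough sketch of that check follows. The function name `smart_failing` is made up for illustration, and it assumes the F80's flash modules report an ATA-style attribute table, which may not hold for every device.

```shell
#!/bin/sh
# Hypothetical helper: list SMART attributes whose normalized VALUE
# has dropped to or below THRESH.
# Reads `smartctl -A` style rows on stdin; expected columns are
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_failing() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
    value = $4 + 0    # normalized value (higher is better)
    thresh = $6 + 0   # failure threshold (0 means "no threshold")
    if (thresh > 0 && value <= thresh)
      print $2, "value=" value, "thresh=" thresh
  }'
}
```

Usage would be something like `smartctl -A /dev/sdX | smart_failing`, where `/dev/sdX` is a placeholder for the actual device. No output means no attribute has reached its threshold, which is a different question from whether the RAW values look alarming.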

  • 1 month later...
On 6/24/2022 at 5:02 AM, JorgeB said:

Normalized values look good, RAW value probably not the main indicator for this device, I wouldn't worry for now.

Should I be worried about this? I noticed my log was at 100%, and it is all from this one file. It contains a bunch of repeated "read errors" from disks 5, 6 and 7.

I tried uploading the log, but it is 127.9 MB so the upload fails. I tried to pastebin it, but it's over 300,000 lines and crashes my tab whenever I try to paste it, so it's much easier to just upload it elsewhere and paste a mega link: https://mega.nz/file/QkRXibpa#k7kktDsPwLWgEueh9Ov2BLdNzYGmLIsJ3QLDybrRhCE

Sorry for the late reply, but any info would be helpful. Should I increase the log size? Is this a problem? Do I need to suppress the errors or something?

On 8/4/2022 at 4:21 AM, JorgeB said:

Please post the diagnostics.

I have attached the diagnostics. Also, is this concerning?

"Last check completed on Sat 30 Jul 2022 09:07:14 PM PDT (six days ago)
Finding 48827965 errors Duration: 17 hours, 37 minutes, 13 seconds. Average speed: 154.4 MB/sec"

I assume finding 0 errors is the goal, so almost 50 million has my eyebrows raised lol.

 

tower-diagnostics-20220805-0953.zip

1 hour ago, JorgeB said:

There's basically nothing in the syslog

Literally only a few lines which itself is abnormal. Did you delete logs?

 

What do you get from command line with this?

ls -lah /var/log/syslog*

 

2 hours ago, ChristianMingle said:

Finding 48827965 errors

My first thought is you did something to invalidate parity. If next parity check still has errors that suggests a hardware problem.


  Nevermind

7 minutes ago, trurl said:

Literally only a few lines which itself is abnormal. Did you delete logs?

 

What do you get from command line with this?

ls -lah /var/log/syslog*

 

 

On 8/4/2022 at 3:17 AM, ChristianMingle said:

its much easier to just upload it

Easier for everyone if you just zip that syslog and attach it.
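Something along these lines should shrink it enough to attach, since a syslog full of repeated lines compresses very well. This is only a sketch: `pack_log` is a made-up helper name, and it uses `gzip` instead of `zip` because `gzip` is more likely to be present on a minimal system. If the forum only accepts `.zip` attachments, substitute `zip`.

```shell
#!/bin/sh
# Hypothetical helper: compress a copy of a log file for attaching.
# Usage: pack_log SOURCE [DEST]   (DEST defaults to ./<basename>.gz)
pack_log() {
  src="$1"
  dest="${2:-$(basename "$src").gz}"
  # -c writes to stdout, so the original log is left untouched.
  gzip -c "$src" > "$dest" || return 1
  # Report before/after sizes in bytes.
  echo "$(wc -c < "$src") -> $(wc -c < "$dest") bytes: $dest"
}
```

For example, `pack_log /var/log/syslog /boot/syslog.gz` would write the compressed copy to the flash drive. The destination path here is just an example.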

On 8/4/2022 at 3:17 AM, ChristianMingle said:

Should i be worried about this? I noticed my log file was at 100% and it is all from this one file. It contains a bunch of repeats of "read errors" from disk 5,6,7

I tried uploading the log, but it is 127.9mb so it fails. i tried to pastebin it, but its over 300,000 lines and crashes my tab whenever i try to paste it, so its much easier to just upload it and paste a mega link: https://mega.nz/file/QkRXibpa#k7kktDsPwLWgEueh9Ov2BLdNzYGmLIsJ3QLDybrRhCE

sorry for the late reply, but any info would be helpful. Should i increase the log size? is this a problem? do i need to suppress the errors or something?

This mega link is the log file. However, the diagnostics I posted later were taken AFTER I deleted the log file, because it was at 100% and I figured it'd be fine since I had a copy of the actual log file. I'm not sure what else is helpful besides the log file and the diagnostics, but I can get anything required once I get back home from my girlfriend's birthday weekend. Thanks for the help guys, sorry I'm a noobie lol

8 minutes ago, ChristianMingle said:

this mega link is the log file

We usually don't like external links, but I checked the log and it was a HBA problem:

 

Jul 17 09:32:48 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!

 

No idea if this is the latest firmware, so check the Broadcom site:

 

Jul 17 08:16:29 Tower kernel: mpt2sas_cm1: WarpDrive: FWVersion(113.05.03.01), ChipRevision(0x03), BiosVersion(110.00.01.00)

 

Other than that make sure it's well seated and sufficiently cooled, you can also try a different slot if available.
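If you want to check a saved copy of the syslog for the same messages yourself, a rough sketch is below. `hba_summary` is a made-up name, and the search strings are simply the two lines quoted above, so other driver versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize controller faults in a saved syslog.
# Usage: hba_summary LOGFILE
hba_summary() {
  log="$1"
  # How many times the driver reported the controller as non-operational.
  echo "non-operational events: $(grep -c 'SAS host is non-operational' "$log")"
  # First firmware banner found in the log, if any.
  grep -m 1 'FWVersion' "$log"
  return 0
}
```

A count above zero means the controller dropped out at least once during the period the log covers. Comparing the timestamps of those lines against the parity check window shows whether it only happens under load.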

39 minutes ago, JorgeB said:

We usually don't like external links, but I checked the log and it was a HBA problem:

 

Jul 17 09:32:48 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!

 

No idea if this is the latest firmware, so check the Broadcom site:

 

Jul 17 08:16:29 Tower kernel: mpt2sas_cm1: WarpDrive: FWVersion(113.05.03.01), ChipRevision(0x03), BiosVersion(110.00.01.00)

 

Other than that make sure it's well seated and sufficiently cooled, you can also try a different slot if available.

Is the SAS host being non-operational something that can happen due to a bug and gets fixed on restart, or is it something that generally won't go away once it shows up until the underlying problem is fixed? Will it only show up during parity checks, or will it cause problems during day-to-day usage? I haven't noticed anything weird besides the 50 million errors on the first parity check in 90 days and the log being full (I deleted it and it seems fine now), so maybe it's just a parity check thing? Either way I'd like to fix it since parity problems are never a good thing, so I'll look into your recommendations.

1 minute ago, ChristianMingle said:

Is the SAS host being non-operational something that can happen due to a bug and is fixed on restart,

I would not call it a bug: the HBA stops responding. This is usually more of a hardware issue, though it could also be firmware related. A reboot/reset should bring it back online, until it happens again (or not).

 

3 minutes ago, ChristianMingle said:

so maybe its just a parity check thing?

It's more likely to happen during heavy load, like a parity check. Do what I recommended and run another check; maybe it was a one-time thing.

35 minutes ago, JorgeB said:

I would not call it a bug, the HBA stops responding, this is usually more a hardware issue, could also be firmware related, reboot/reset should bring it back online, until it happens again (or not).

 

More likely to happen during heavy load, like a parity check, do what I recommended and run another check, maybe it was a one time thing.

9.8 TB

Elapsed time: less than a minute

Current position: 4.89 GB (0.0 %)

Estimated speed: 168.5 MB/sec

Estimated finish: 16 hours, 8 minutes

Sync errors corrected: 1193109

I'm assuming 1 million+ errors within the first 30 seconds of the parity check is not what we are looking for... so I will try updating the SAS firmware later and keep it spun down until I get the chance.

2 minutes ago, ChristianMingle said:

9.8 TB

Elapsed time:less than a minute

Current position:4.89 GB (0.0 %)

Estimated speed:168.5 MB/sec

Estimated finish:16 hours, 8 minutes

Sync errors corrected:1193109
im assuming 1mil+ errors within the first 30 seconds of the parity check is not what we are looking for... so i will try updating the SAS firmware later and keep it spun down until i get the chance

Also, is it fair to assume that the 3 drives with 50 million errors are probably not in working order? I got the 4x200GB (one device) used and it has had errors ever since I started using it. I haven't noticed any problems, but I also try not to use it very much. It was HEAVILY used by the previous owner, with something like 60k hours on it, but it is a Sun Oracle F80 800GB, so it's meant to have a pretty heavily used lifecycle. Still, the errors concern me, as every other drive I have is at 0. I think the errors are just bad sectors, and I was told that the drive knows that and avoids using the bad sectors, but I don't know more than that.

 

 

 

errors.thumb.png.4e48837f7feec0f3d80f00907fd93938.png

13 hours ago, ChristianMingle said:

im assuming 1mil+ errors within the first 30 seconds of the parity check is not what we are looking for.

No, but they are likely the result of the previous check, where those devices could not be read. If it's a correcting check, let it finish, then run another one.

 

13 hours ago, ChristianMingle said:

also, is it fair to assume that the 3 drives with 50mil errors are probably not in working order?

Post new diags.

On 8/6/2022 at 4:21 AM, JorgeB said:

No, but they are likely the result of the previous check, where those devices could not be read, if it's a correcting check let it finish then run another one.

 

Post new diags.

Last check completed on Sat 06 Aug 2022 05:11:45 AM PDT (yesterday)
Finding 48827869 errors Duration: 17 hours, 37 minutes, 51 seconds. Average speed: 154.4 MB/sec

Diagnostics attached.

 

I haven't done any of the recommended steps yet as I just got back home, but I figured I'd post these here because the check ran while I was gone.

 

 

tower-diagnostics-20220807-2224.zip

3 hours ago, trurl said:

Disks 5,6,7 all disconnected.

 

I would New Config without that SSD in the array, and without any of those 200GB disk, rebuild parity, and go forward from there.

Ideally I could combine the cache and disks 5, 6 and 7 into the single 800GB cache disk I originally planned on having. (Cache and disks 5, 6 and 7 are all the same Sun Flash Accelerator F80, but I didn't know how to flash it into one 800GB drive.) If the cache drive goes offline, is it less problematic than the others? If not, should I make the SSD the cache drive and just remove the Sun Flash Accelerator F80? I wish I didn't have to, since I overpaid for it to begin with and now it's causing problems lol

3 hours ago, trurl said:

Disks 5,6,7 all disconnected.

 

I would New Config without that SSD in the array, and without any of those 200GB disk, rebuild parity, and go forward from there.

Also sorry, but can you explain how one of these can be healthy while the other 3 have errors, even though they're all the same device? I know it's actually four 200GB drives, but is it just firmware problems? I find it hard to believe that the one drive I made cache just happened to be the only healthy drive and the other 3 are crapping out lol. (Cache and disks 5, 6 and 7 are all the same Sun Flash Accelerator F80, in case you didn't see the other post yet.)

08-08-2022--02-56-14AM--BlaringBordercollie.png

4 hours ago, trurl said:

What do you see when you hover over those SMART warnings?

A lot of the values are also over the threshold (I provided more info on the middle drive).
Also, I know the F80 has over 55,000 powered-on hours on it and is over 6 years old, but everyone always says that they "last forever", so I'm hoping that this one isn't dying out yet lol

08-08-2022--12-41-45PM--WeirdCoypu.png

08-08-2022--12-41-53PM--ThankfulCoypu.png

08-08-2022--12-42-01PM--ImmenseOregonsilverspotbutterfly.png

08-08-2022--12-43-24PM--NarrowSquid.png
