
Red Ball Disk


bkastner


I received my first notification of a failed disk. When I look in the GUI the disk has been taken offline and reports 2 read errors.

 

I figured I would look at the SMART history, but everything in the smarthistory folder is from Nov 2014, which isn't good.

 

So, I have a couple of questions...

 

1) What happened to the SMART reports? Why would they have stopped being generated? (I had SMART history installed via UnMENU, but I don't use UnMENU anymore.)

2) How do I best go about testing the state of the disk?

3) Do I just assume the disk is toast and look to replace?

 

As mentioned, I've never had a disk issue before (I've been lucky for 4 years), so I am not sure of the correct process at this point.

Link to comment

What happens when you click the disk link on the Main page and look at the S.M.A.R.T data via the GUI?

 

I would have expected that you would be able to see the most recent S.M.A.R.T attributes from there?

 

I couldn't get the latest smart test results as the GUI said the disk needs to be spun up, which it couldn't do.

 

I tried rebooting to see if the disk came back online, and it now just reports as 'not installed'.

 

I decided to check the warranty, and got excited as it expires in Oct 2015, but then remembered it was an external drive that was on sale that I cracked open, so I am guessing I am screwed.

Link to comment

I couldn't get the latest smart test results as the GUI said the disk needs to be spun up, which it couldn't do.

 

That suggests it might have dropped offline for some reason.

 

I tried rebooting to see if the disk came back online, and it now just reports as 'not installed'.

Do you mean in unRAID? That is a bit strange as I would expect it to say the disk was missing rather than not installed.

 

I decided to check the warranty, and got excited as it expires in Oct 2015, but then remembered it was an external drive that was on sale that I cracked open, so I am guessing I am screwed.

In the vast majority of cases where a disk appears to fail, the cause turns out not to be the disk itself but an external factor, such as a power spike or a power/SATA cabling issue.

 

If you can get the drive to be recognised by the system (even if only at the BIOS level) then it should be possible to obtain a SMART report to see what that says about the disk's health.
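For anyone wanting to pull that report from the command line rather than the GUI, here is a minimal Python sketch around smartctl from smartmontools. It assumes smartctl is installed and that you run it as root; /dev/sdX is a placeholder for the real device, and the helper names (smartctl_command, run_smartctl) are made up for this example.

```python
import subprocess

# smartctl options used here (all part of smartmontools):
#   -H           overall health self-assessment
#   -a           full SMART information, including attributes and logs
#   -t long      start an extended (long) self-test in the background
#   -l selftest  show the self-test log
SMART_ACTIONS = {
    "health": ["-H"],
    "report": ["-a"],
    "start_long_test": ["-t", "long"],
    "selftest_log": ["-l", "selftest"],
}


def smartctl_command(action, device):
    """Build the argv list for one smartctl action (hypothetical helper)."""
    if action not in SMART_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return ["smartctl", *SMART_ACTIONS[action], device]


def run_smartctl(action, device):
    """Run smartctl and return its text output; needs root and real hardware."""
    result = subprocess.run(
        smartctl_command(action, device),
        capture_output=True,
        text=True,
        # smartctl's exit status is a bitmask and can be non-zero even
        # when it produced useful output, so do not raise on it.
        check=False,
    )
    return result.stdout
```

Typical use would be print(run_smartctl("report", "/dev/sdX")) for the full report, or run_smartctl("start_long_test", "/dev/sdX") followed later by run_smartctl("selftest_log", "/dev/sdX") to read the result.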

Link to comment

Okay.... so life is strange.

 

As I mentioned, when I rebooted I got the 'not installed' message. This server is in a Norco 4224 case, so there are no individual cables to the drives, they are all backplanes.

 

I stopped the array, unseated and re-seated each drive, and rebooted again. My actual disk 10 (the disk in error) showed up as an unassigned drive. I stopped the array and re-assigned it as disk10, and now it's rebuilding the data on the drive (even though the data was already there).

 

I am also running an extended smart test on the disk and will post the results.

 

I am confused about why it didn't recognize the same drive signature as the original disk10.

Link to comment

Okay, I am truly confused as to what is going on.

 

Here is a summary:

 

- With no prior warning I got a notice that disk10 had issues:

Event: unRAID Disk 10 error

Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)

Description: (sdk)

Importance: alert

 

- I tried stopping the array and restarting without success, so I rebooted and got:

Event: unRAID Disk 10 error

Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)

Description: No device identification present

Importance: alert

 

as well as:

Event: unRAID Status

Subject: Notice [CYDSTORAGE] - array health report [FAIL]

Description: Array has 15 disks (including parity & cache)

Importance: alert

 

Parity - (sdd) - active 25°C [OK]

Disk 1 - (sdi) - active 27°C [OK]

Disk 2 - (sdo) - active 24°C [OK]

Disk 3 - (sdn) - active 26°C [OK]

Disk 4 - (sdm) - active 27°C [OK]

Disk 5 - (sdp) - active 25°C [OK]

Disk 6 - (sdh) - active 26°C [OK]

Disk 7 - (sdg) - active 27°C [OK]

Disk 8 - (sdk) - active 28°C [OK]

Disk 9 - (sdl) - active 27°C [OK]

Disk 10 - No device identification present - standby [DISK DSBL]

Disk 11 - (sdc) - active 26°C [OK]

Disk 12 - (sdj) - active 28°C [OK]

Disk 13 - (sde) - active 24°C [OK]

Cache - (sdf) - active 30°C [OK]

 

Data is invalid

Last checked on Tue 16 Jun 2015 05:58:12 PM EDT (4 days ago), finding 0 errors.

Duration: unavailable (system reboot or log rotation)

 

- Midday yesterday I went out and bought a replacement 4TB disk and started pre-clearing it.

 

About 30 mins later I got:

 

Event: unRAID Disk 1 error

Subject: Alert [CYDSTORAGE] - Disk 1 in error state (disk missing)

Description: No device identification present

Importance: alert

 

However, when I look in the GUI, there is no issue with Disk 1.

 

I also saw that disk 10 was 'unassigned' but seemed fine, and the GUI was indicating Disk 10 was 'not installed'. I re-added Disk 10 and it started to rebuild from parity (which I still don't understand). I also started a long SMART test on disk 10.

 

Event: unRAID Disk 10 error

Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)

Description: No device identification present

Importance: alert

 

Event: unRAID Disk 10 error

Subject: Warning [CYDSTORAGE] - Disk 10, drive not ready, content being reconstructed

Description: WDC_WD40EZRX-00SPEB0_WD-WCC4E0425712 (sdl)

Importance: warning

 

Data rebuild finished around 1am this morning:

 

Event: unRAID Disk 10 message

Subject: Notice [CYDSTORAGE] - Disk 10 returned to normal operation

Description: WDC_WD40EZRX-00SPEB0_WD-WCC4E0425712 (sdl)

Importance: normal

 

Event: unRAID Data rebuild:

Subject: Notice [CYDSTORAGE] - Data rebuild: finished (0 errors)

Description: Duration: 12 hours, 14 minutes, 17 seconds. Average speed: 136.2 MB/sec

Importance: normal

 

Then around 3:40am this morning I got a report of an issue with my Parity disk:

 

Event: unRAID Parity disk error

Subject: Alert [CYDSTORAGE] - Parity disk in error state (disk dsbl)

Description: WDC_WD60EFRX-68MYMN1_WD-WX51H3421101 (sdd)

Importance: alert

 

___

 

Event: unRAID array errors

Subject: Warning [CYDSTORAGE] - array has errors

Description: Array has 1 disk with read errors

Importance: warning

 

Parity disk - (sdd) (errors 80)

 

So, in the GUI the Parity drive has a red X on it, but I am still pre-clearing the replacement for disk 10.

 

I will try and reboot once the pre-clear is done, but am sitting on the edge at the moment. I've started copying data off Disk10 and have had no issues.

 

The Long SMART test completed without errors and the reports look good:

 

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed without error      00%      9811        -

 

No Errors Logged
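As an aside, a self-test log line like the one above can be checked by script instead of by eye. This is a rough Python sketch, assuming the column layout shown in this post (fields separated by runs of two or more spaces); the function name and dictionary keys are made up for the example, and other smartctl versions may format the line differently.

```python
import re

# Matches one entry of the SMART self-test log, e.g.
# "# 1  Extended offline    Completed without error      00%      9811        -"
SELFTEST_LINE = re.compile(
    r"^#\s*(\d+)\s+"      # entry number
    r"(.+?)\s{2,}"        # test description
    r"(.+?)\s{2,}"        # status text
    r"(\d+)%\s+"          # percent remaining
    r"(\d+)\s+"           # power-on lifetime in hours
    r"(\S+)\s*$"          # LBA of first error, or "-" if none
)


def parse_selftest_line(line):
    """Return a dict for one self-test log entry, or None if it is not one."""
    match = SELFTEST_LINE.match(line.strip())
    if match is None:
        return None
    num, description, status, remaining, hours, lba = match.groups()
    return {
        "num": int(num),
        "description": description,
        "status": status,
        "remaining_percent": int(remaining),
        "lifetime_hours": int(hours),
        "lba_of_first_error": None if lba == "-" else lba,
        "passed": status == "Completed without error",
    }
```

Fed the line from this post, it reports a passed extended test at 9811 power-on hours with no first-error LBA, which is consistent with the disk itself being healthy.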

 

I am right royally confused at this point. As mentioned, I've never had a disk issue, and now they are popping up and down without reason.

 

All my drives are in a Norco 4224 using a backplane, so there are no actual cable attachments. I've removed and re-seated each drive yesterday when I added the new replacement disk, but all seemed fine.

 

I don't know if it will help, but I am including my current syslog.

 

I am looking for suggestions, and maybe some insight into what is going on (if possible). I am utterly confused at this point.

syslog.zip

Link to comment

Replace the SAS cables.

 

 

All my drives are in a Norco 4224 using a backplane, so there are no actual cable attachments. I've removed and re-seated each drive yesterday when I added the new replacement disk, but all seemed fine.

 

 

Really? Have you seen this before?

Link to comment


Not this exactly. But it fits the symptoms.

Link to comment


So, sort of stupid question, but this server never gets touched (except when I added the drive). It's isolated and the case hasn't been opened in months. Are you thinking the cables have just degraded over time (which would only be around 2 years)? I could understand if I was shifting them around and wearing out the connectors, but once they were hooked up to the backplanes they've not been touched.

Link to comment


Things wear out. The ground shakes when a big truck goes by... who knows? It is less likely that multiple disks are failing than that a component common to those drives is failing. Either the SAS port is bad, the cable is bad, the backplane is bad, or the power connectors or cable are loose or bad. You have to start replacing parts until the problem is resolved.
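To make the "common component" reasoning concrete, here is a small Python sketch that counts how many of the alerting disks sit behind each shared part. The wiring map is entirely hypothetical (the thread never says which bay or cable each disk is on), so it has to be filled in from the actual chassis; the function name is also made up for the example.

```python
from collections import Counter

# Disks that raised alerts in this thread.
ALERTED = ["Disk 10", "Disk 1", "Parity"]

# HYPOTHETICAL wiring map: disk -> shared component (backplane, SAS cable,
# controller port, power lead). Replace with the real layout of your case.
WIRING = {
    "Parity": "backplane-A",
    "Disk 1": "backplane-A",
    "Disk 10": "backplane-A",
    "Disk 11": "backplane-A",
    "Disk 2": "backplane-B",
    "Disk 3": "backplane-B",
}


def suspect_components(alerted, wiring):
    """Rank shared components by how many alerting disks sit behind them.

    Returns (component, alerting_disks, total_disks) tuples, most hits first.
    """
    totals = Counter(wiring.values())
    hits = Counter(wiring[disk] for disk in alerted if disk in wiring)
    rows = [(comp, hits[comp], totals[comp]) for comp in hits]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    for component, bad, total in suspect_components(ALERTED, WIRING):
        print(f"{component}: {bad} of {total} disks have alerted")
```

If most of the alerting disks land on one entry, that cable, port, or backplane is the first part to swap; if they are spread evenly, power or the controller becomes the more likely suspect.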

Link to comment

Archived

This topic is now archived and is closed to further replies.
