bkastner Posted June 21, 2015

I received my first notification of a failed disk. When I look in the GUI, the disk has been taken offline and reports 2 read errors. I figured I would look at the SMART history, but everything in the smarthistory folder is from Nov 2014, which isn't good.

So, I have a couple of questions...

1) What happened to the SMART reports? Why would these no longer be reporting? (I had it installed via UnMENU, but don't use UnMENU anymore.)
2) How do I best go about testing the state of the disk?
3) Do I just assume the disk is toast and look to replace it?

As mentioned, I've never had a disk issue before (I've been lucky for 4 years), so I am not sure of the correct process at this point.
danioj Posted June 21, 2015

What happens when you click the disk link on the Main page and look at the S.M.A.R.T data via the GUI? I would have expected that you would be able to see the most recent S.M.A.R.T attributes from there.
bkastner Posted June 21, 2015

I couldn't get the latest SMART test results as the GUI said the disk needs to be spun up, which it couldn't do.

I tried rebooting to see if the disk came back online, and it now just reports as 'not installed'.

I decided to check the warranty, and got excited as it expires in Oct 2015, but then remembered it was an external drive that was on sale that I cracked open, so I am guessing I am screwed.
itimpi Posted June 21, 2015

bkastner wrote: "I couldn't get the latest smart test results as the GUI said the disk needs to be spun up, which it couldn't do."

That suggests it might have dropped offline for some reason.

bkastner wrote: "I tried rebooting to see if the disk came back online, and it now just reports as 'not installed'."

Do you mean in unRAID? That is a bit strange, as I would expect it to say the disk was missing rather than not installed.

bkastner wrote: "I decided to check the warranty, and got excited as it expires in Oct 2015, but then remembered it was an external drive that was on sale that I cracked open, so am guessing I am screwed."

The vast majority of the time, when a disk "fails" it turns out not to be the disk itself but an external factor such as a power spike or a power/SATA cabling issue. If you can get the drive to be recognised by the system (even if only at the BIOS level) then it should be possible to obtain a SMART report to see what that says about the disk's health.
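For anyone following along: once the drive is visible to the system again, a SMART report can usually be pulled from the command line with smartctl (part of smartmontools). Below is a minimal Python sketch of that step. It assumes smartctl is installed and on the PATH, that the script has permission to query the device, and that the function names are made up for illustration; /dev/sdX is a placeholder, not a device from this thread.

```python
import shutil
import subprocess
from typing import List


def smart_report_command(device: str) -> List[str]:
    """Build the smartctl call that prints all SMART information for one device."""
    # "-a" asks smartctl for the full report: health status, attributes,
    # error log and self-test log.
    return ["smartctl", "-a", device]


def read_smart_report(device: str) -> str:
    """Run smartctl and return its text output (hypothetical helper)."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl was not found on PATH")
    # smartctl uses a non-zero exit status to flag disk problems as well as
    # command failures, so the exit code is not turned into an exception here.
    result = subprocess.run(
        smart_report_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `read_smart_report("/dev/sdX")` with the real device node in place of the placeholder would return the report text. If the drive has dropped off the bus entirely, expect an error message from smartctl rather than a report.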
bkastner Posted June 21, 2015

Okay.... so life is strange. As I mentioned, when I rebooted I got the 'not installed' message. This server is in a Norco 4224 case, so there are no individual cables to the drives; they are all on backplanes.

I stopped the array, unseated and re-seated each drive, and rebooted again. My actual disk 10 (the disk in error) shows up as an unassigned drive. I stopped the array and re-added it as disk10, and now it's rebuilding the data on the drive (even though it already existed).

I am also running an extended SMART test on the disk and will post the results. I am confused as to why it didn't recognize the same drive signature as the original disk10.
bkastner Posted June 22, 2015

Okay, I am truly confused as to what is going on. Here is a summary:

- With no prior warning I got a notice that disk10 had issues:

Event: unRAID Disk 10 error
Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)
Description: (sdk)
Importance: alert

- I tried stopping the array and restarting without success, so I rebooted and got:

Event: unRAID Disk 10 error
Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)
Description: No device identification present
Importance: alert

as well as:

Event: unRAID Status
Subject: Notice [CYDSTORAGE] - array health report [FAIL]
Description: Array has 15 disks (including parity & cache)
Importance: alert

Parity - (sdd) - active 25°C [OK]
Disk 1 - (sdi) - active 27°C [OK]
Disk 2 - (sdo) - active 24°C [OK]
Disk 3 - (sdn) - active 26°C [OK]
Disk 4 - (sdm) - active 27°C [OK]
Disk 5 - (sdp) - active 25°C [OK]
Disk 6 - (sdh) - active 26°C [OK]
Disk 7 - (sdg) - active 27°C [OK]
Disk 8 - (sdk) - active 28°C [OK]
Disk 9 - (sdl) - active 27°C [OK]
Disk 10 - No device identification present - standby [DISK DSBL]
Disk 11 - (sdc) - active 26°C [OK]
Disk 12 - (sdj) - active 28°C [OK]
Disk 13 - (sde) - active 24°C [OK]
Cache - (sdf) - active 30°C [OK]

Data is invalid
Last checked on Tue 16 Jun 2015 05:58:12 PM EDT (4 days ago), finding 0 errors.
Duration: unavailable (system reboot or log rotation)

- Mid day yesterday I went out and bought a replacement 4TB disk and started pre-clearing it. About 30 mins later I got:

Event: unRAID Disk 1 error
Subject: Alert [CYDSTORAGE] - Disk 1 in error state (disk missing)
Description: No device identification present
Importance: alert

However, when I look in the GUI, there is no issue with Disk 1. I also saw that disk 10 was 'unassigned' but seemed fine, and the GUI was indicating Disk 10 was 'not installed'. I re-added Disk 10 and it started to rebuild from parity (which I still don't understand).
I also started a long SMART test on disk 10.

Event: unRAID Disk 10 error
Subject: Alert [CYDSTORAGE] - Disk 10 in error state (disk dsbl)
Description: No device identification present
Importance: alert

Event: unRAID Disk 10 error
Subject: Warning [CYDSTORAGE] - Disk 10, drive not ready, content being reconstructed
Description: WDC_WD40EZRX-00SPEB0_WD-WCC4E0425712 (sdl)
Importance: warning

Data rebuild finished around 1am this morning:

Event: unRAID Disk 10 message
Subject: Notice [CYDSTORAGE] - Disk 10 returned to normal operation
Description: WDC_WD40EZRX-00SPEB0_WD-WCC4E0425712 (sdl)
Importance: normal

Event: unRAID Data rebuild:
Subject: Notice [CYDSTORAGE] - Data rebuild: finished (0 errors)
Description: Duration: 12 hours, 14 minutes, 17 seconds. Average speed: 136.2 MB/sec
Importance: normal

Then around 3:40am this morning I got a report of an issue with my Parity disk:

Event: unRAID Parity disk error
Subject: Alert [CYDSTORAGE] - Parity disk in error state (disk dsbl)
Description: WDC_WD60EFRX-68MYMN1_WD-WX51H3421101 (sdd)
Importance: alert
___
Event: unRAID array errors
Subject: Warning [CYDSTORAGE] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Parity disk - (sdd) (errors 80)

So, in the GUI the Parity drive has a red X on it, but I am still pre-clearing the replacement for disk 10. I will try and reboot once the pre-clear is done, but am sitting on the edge at the moment.

I've started copying data off Disk10 and have had no issues. The Long SMART test completed without errors and the reports look good:

Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        9811             -

No Errors Logged

I am right royally confused at this point. As mentioned, I've never had a disk issue, and now they are popping up and down without reason. All my drives are in a Norco 4224 using a backplane, so there are no actual cable attachments.
I've removed and re-seated each drive yesterday when I added the new replacement disk, but all seemed fine. I don't know if it will help, but I am including my current syslog. I am looking for suggestions, and maybe some insight into what is going on (if possible). I am utterly confused at this point.

syslog.zip
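With 15 disks in the health report, the one bad entry is easy to miss. The following is a rough Python sketch that picks out every disk whose status is not OK. It assumes one disk per line in the shape pasted in the post above (slot, identification, spin state, optional temperature, status in square brackets); that shape is inferred only from this one report, and the function name is invented for illustration.

```python
import re
from typing import List, Tuple

# Assumed line shape, based only on the report quoted in this thread:
#   "<slot> - <identification> - <active|standby>[ <temp>°C] [<STATUS>]"
LINE_RE = re.compile(
    r"^(?P<slot>.+?) - (?P<ident>.+) - (?P<state>active|standby)"
    r"(?: (?P<temp>\d+)°C)? \[(?P<status>[^\]]+)\]$"
)


def disks_not_ok(report: str) -> List[Tuple[str, str]]:
    """Return (slot, status) for every disk line whose status is not OK."""
    flagged = []
    for raw in report.splitlines():
        match = LINE_RE.match(raw.strip())
        # Lines that do not look like a disk entry are skipped silently.
        if match and match.group("status") != "OK":
            flagged.append((match.group("slot"), match.group("status")))
    return flagged


# A few lines copied from the health report above.
SAMPLE = """\
Parity - (sdd) - active 25°C [OK]
Disk 9 - (sdl) - active 27°C [OK]
Disk 10 - No device identification present - standby [DISK DSBL]
Cache - (sdf) - active 30°C [OK]
"""

print(disks_not_ok(SAMPLE))  # [('Disk 10', 'DISK DSBL')]
```

A different unRAID version may word these lines differently, so the pattern would need checking against a real notification before being relied on.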
dgaschk Posted June 22, 2015

bkastner wrote: "All my drives are in a Norco 4224 using a backplane, so there are no actual cable attachments. I've removed and re-seated each drive yesterday when I added the new replacement disk, but all seemed fine."

Replace the SAS cables.
bkastner Posted June 22, 2015

Really? Have you seen this before?
dgaschk Posted June 22, 2015

Not this exactly. But it fits the symptoms.
bkastner Posted June 23, 2015

So, sort of stupid question, but this server never gets touched (except when I added the drive). It's isolated and the case hasn't been opened in months. Are you thinking the cables have just degraded over time (which would only be around 2 years)? I could understand if I was shifting them around and wearing out the connectors, but once they were hooked up to the backplanes they've not been touched.
dgaschk Posted June 23, 2015

Things wear out. The ground shakes when a big truck goes by... who knows? It is less likely that multiple disks are failing than that a component common to those drives is failing. Either the SAS port is bad, the cable is bad, the backplane is bad, or the power connectors or cable are loose or bad. You have to start replacing parts until the problem is resolved.
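The "common component" reasoning above can be sketched in a few lines of Python: list what each drive depends on, then intersect those lists for the drives that have thrown errors. The wiring table below is entirely hypothetical. The thread never says which slot sits on which backplane, cable, or controller port, so the component names are placeholders to be replaced with the real layout of the case.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def shared_components(
    affected: Iterable[str], topology: Dict[str, Set[str]]
) -> List[str]:
    """Return the components that every affected disk depends on."""
    disks = set(affected)
    counts = Counter()
    for disk in disks:
        counts.update(topology[disk])
    # A component is a suspect only if all affected disks rely on it.
    return sorted(name for name, n in counts.items() if n == len(disks))


# HYPOTHETICAL wiring, invented for illustration only; it does not describe
# the server discussed in this thread.
TOPOLOGY = {
    "parity": {"backplane-1", "sas-cable-1", "hba-port-0", "psu"},
    "disk1": {"backplane-1", "sas-cable-1", "hba-port-0", "psu"},
    "disk10": {"backplane-1", "sas-cable-1", "hba-port-0", "psu"},
    "disk11": {"backplane-2", "sas-cable-2", "hba-port-1", "psu"},
}

print(shared_components(["parity", "disk1", "disk10"], TOPOLOGY))
# ['backplane-1', 'hba-port-0', 'psu', 'sas-cable-1']
```

The result is a shortlist, not a diagnosis: it only narrows down which parts to swap first, which matches the advice to replace parts one at a time until the errors stop.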