2 Disk Issues


thefly

Here is my overview:

 

1. Yesterday I replaced disk 8 and followed johnnie.black's instructions (running mdcmd set invalidslot 8 in the terminal)

2. Drives 8 and 10 were unmountable during the parity-sync

3. I cancelled the sync at about 13% and shut down the server

4. Today I started the server to find Disk 2 unavailable for assignment BUT Disk 10, which was a problem yesterday, now is assignable

5. I shut down and I have checked all cables

6. With the new drive in slot 8, disk 2 does not appear for assignment at all

7. On startup with the dead drive reinstalled in slot 8, disk 2 appears available for assignment, but once assigned it disappears after a few minutes

 

I am attaching diagnostics based on my current state with the new drive in slot 8. Your help during this freak out is very much appreciated.

 

 

tower-diagnostics-20180315-1249.zip

Link to comment

Disk2 dropped offline again, so there's no SMART report for it now. I didn't check its SMART data yesterday since it wasn't one of the problem disks, but looking at yesterday's reports, it was already failing:

 

  5 Reallocated_Sector_Ct   0x0033   081   081   036    Pre-fail  Always       -       25920
197 Current_Pending_Sector  0x0012   096   096   000    Old_age   Always       -       768
198 Offline_Uncorrectable   0x0010   096   096   000    Old_age   Offline      -       768

 

Now looking at all the disk SMART reports there are more problems:

 

Disk18:

 

  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       1560
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8


Cache:

 

187 Reported_Uncorrect      0x0032   093   093   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8

Disk 6:

 

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2

Disk 7:

 

  5 Reallocated_Sector_Ct   0x0033   041   041   140    Pre-fail  Always   FAILING_NOW 1265
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       1265
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1394

So the server is in quite a bad state. Some of the disks might still work, at least for now, and the pending sectors may be false positives, but to be sure I recommend you run an extended SMART test on all disks. You can skip disk2, as that one is bad for sure, but we need to know how many more are failing to decide the best way to proceed.
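
If it helps to triage the reports quoted above without reading every row by eye, here is a minimal Python sketch. It assumes the rows follow the ten-column attribute table layout shown in this thread; the function names, the `WATCHED` set and the "failing"/"warning" labels are my own illustration, not anything Unraid itself provides.

```python
# Attributes whose raw count is normally 0 on a healthy disk (illustrative choice).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
}


def parse_smart_row(line):
    """Parse one attribute row into a dict; return None if it is not a row."""
    parts = line.split()
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),      # normalized value
        "thresh": int(parts[5]),     # failure threshold
        "when_failed": parts[8],     # "-" when the attribute has not failed
        "raw": int(parts[9]),        # raw count
    }


def assess(lines):
    """Return (attribute, severity, raw) tuples for rows that need attention."""
    findings = []
    for line in lines:
        row = parse_smart_row(line)
        if row is None:
            continue
        below_thresh = row["thresh"] > 0 and row["value"] <= row["thresh"]
        if row["when_failed"] != "-" or below_thresh:
            findings.append((row["name"], "failing", row["raw"]))
        elif row["name"] in WATCHED and row["raw"] > 0:
            findings.append((row["name"], "warning", row["raw"]))
    return findings


# The Disk 7 rows quoted earlier in this post.
DISK7_ROWS = [
    "  5 Reallocated_Sector_Ct   0x0033   041   041   140    Pre-fail  Always   FAILING_NOW 1265",
    "196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       1265",
    "197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1394",
]

findings = assess(DISK7_ROWS)
for name, severity, raw in findings:
    print(f"{severity:8} {name} raw={raw}")
```

This only reads numbers that are already in the reports, so it cannot replace the extended test itself. From the console, that test is normally started with `smartctl -t long /dev/sdX`, where `sdX` is a placeholder for the actual device.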

 

PS: keep disk2; some data should still be recoverable if needed, e.g. using ddrescue.
 

Link to comment

Which brings this back. You should have known about all these disk issues before they multiplied. Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

23 hours ago, trurl said:

In Settings, does it have Notifications? Have you configured it to send you emails? That is a lot of reallocations to appear all at once so I'm thinking it should have told you about them before the other disk had problems.

 

Parity will only help if all other disks are good. The parity calculation requires parity plus all other disks to calculate the data to recover a disk.
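
To show why one more bad disk breaks the rebuild, here is a toy Python sketch. It assumes a single parity disk computed as a byte-wise XOR across the data disks and uses tiny in-memory byte strings as stand-in "disks"; it is an illustration of the principle, not Unraid's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" of two bytes each (made-up contents).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# One disk lost: parity plus every surviving disk gives it back exactly.
rebuilt = xor_blocks([parity, data[0], data[2]])
print(rebuilt == data[1])

# Two disks lost: all that remains is the XOR of the two missing disks,
# which cannot be split back into the individual disks.
leftover = xor_blocks([parity, data[0]])
print(leftover == xor_blocks([data[1], data[2]]))
```

The same logic explains why read errors on any other disk during a rebuild end up as corrupt data on the rebuilt disk.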

 

Looks like you may end up losing data, possibly from multiple bad disks. Do you have backups?

Link to comment
14 hours ago, trurl said:

Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

I think lots of people never visit the dashboard - I might visit the dashboard once every 3 months. But I rely on centralized supervision.

 

Anyone who isn't using centralized supervision really must make sure they get mail from the system, and make sure they react if the mails stop arriving or indicate problems.

 

Storage servers can run stand-alone for a long time, but they do require someone to step in as soon as disks or fans start to have issues, just as people shouldn't keep driving their cars without enough coolant in the radiator.

Link to comment
