2 Disk Issues

thefly · March 15, 2018

I will investigate those possible issues, reboot and update this thread. Thank you.

JorgeB · March 15, 2018

On the upside there's a good chance old disk8 is also good and it dropped because of similar issues, but you'll need to figure out the problems or it will be difficult to complete a parity sync.

thefly · March 15, 2018

Here is my overview:

1. Yesterday I replaced disk 8 and followed johnnie.black's instruction (using terminal mdcmd set invalidslot

2. Drive 8 and 10 were unmountable during the parity-sync

3. I cancelled the sync at about 13% and shut down the server

4. Today I started the server to find Disk 2 unavailable for assignment BUT Disk 10, which was a problem yesterday, now is assignable

5. I shut down and I have checked all cables

6. With new drive in slot 8, disk 2 does not appear for assignment at all

7. On startup with the dead drive reinstalled in slot 8, disk 2 appears available for assignment, but once assigned it disappears after a few minutes

I am attaching diagnostics based on my current state with the new drive in slot 8. Your help during this freak out is very much appreciated.

tower-diagnostics-20180315-1249.zip

JorgeB · March 15, 2018

Disk2 dropped offline again, so there's no SMART report for it now. I didn't check its SMART yesterday since it wasn't one of the problem disks, but looking at yesterday's reports it was already failing:

  5 Reallocated_Sector_Ct   0x0033   081   081   036    Pre-fail  Always       -       25920
197 Current_Pending_Sector  0x0012   096   096   000    Old_age   Always       -       768
198 Offline_Uncorrectable   0x0010   096   096   000    Old_age   Offline      -       768

Now looking at all the disk SMART reports there are more problems:

Disk18:

  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       1560
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8

Cache:

187 Reported_Uncorrect      0x0032   093   093   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8

Disk 6:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2

Disk 7:

  5 Reallocated_Sector_Ct   0x0033   041   041   140    Pre-fail  Always   FAILING_NOW 1265
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       1265
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1394

So the server is in quite a bad state. Some of the disks might still work, at least for now, and the pending sectors could be false positives, but to be sure I recommend you run an extended SMART test on all disks. You can skip disk2, as that one is bad for sure, but we need to know how many more are failing to decide the best way to proceed.
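For anyone who wants to triage attribute rows like the ones quoted above without reading them by eye, here is a minimal Python sketch. It assumes the rows follow the usual ten-column smartctl attribute layout (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value). The set of watched IDs and the helper names `parse_smart_line` and `flag_problems` are my own choices for illustration, not part of any tool.

```python
# Attribute IDs treated as warning signs in this sketch (an assumption, not an
# official list): reallocated sectors, reported uncorrectable errors,
# reallocation events, pending sectors, and offline uncorrectable sectors.
WATCHED_IDS = {5, 187, 196, 197, 198}


def parse_smart_line(line):
    """Parse one SMART attribute row, or return None if it is not a row."""
    fields = line.split()
    # Expected columns:
    # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    if len(fields) < 10:
        return None
    numeric = [fields[0], fields[3], fields[4], fields[5], fields[9]]
    if not all(f.isdigit() for f in numeric):
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "when_failed": fields[8],
        "raw": int(fields[9]),
    }


def flag_problems(lines):
    """Return watched attributes with a nonzero raw value or a failure flag."""
    problems = []
    for line in lines:
        attr = parse_smart_line(line)
        if attr is None or attr["id"] not in WATCHED_IDS:
            continue
        if attr["raw"] > 0 or attr["when_failed"] != "-":
            problems.append(attr)
    return problems


if __name__ == "__main__":
    # Rows copied from the Disk 7 report in this thread.
    disk7 = [
        "  5 Reallocated_Sector_Ct   0x0033   041   041   140    Pre-fail  Always   FAILING_NOW 1265",
        "196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       1265",
        "197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1394",
    ]
    for attr in flag_problems(disk7):
        print(attr["id"], attr["name"], attr["raw"], attr["when_failed"])
```

A nonzero raw value on these attributes is only a prompt to look closer; the extended test is still what tells you whether the disk can be trusted.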

PS. keep disk2, some data should still be recoverable if needed using e.g. ddrescue.

trurl · March 15, 2018

Which brings this back. You should have known about all these disk issues before they multiplied. Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

23 hours ago, trurl said:

In Settings, does it have Notifications? Have you configured it to send you emails? That is a lot of reallocations to appear all at once so I'm thinking it should have told you about them before the other disk had problems.

Parity will only help if all other disks are good. The parity calculation requires parity plus all other disks to calculate the data to recover a disk.
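To illustrate why that is, here is a small Python sketch. It assumes single parity is a plain byte-wise XOR across the data disks; the three tiny "disks" and the helper names `xor_blocks` and `rebuild_missing` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving):
    """Rebuild one missing disk from parity plus every surviving data disk."""
    return xor_blocks([parity, *surviving])


if __name__ == "__main__":
    # Three hypothetical data disks, four bytes each.
    d1 = bytes([0x10, 0x20, 0x30, 0x40])
    d2 = bytes([0x01, 0x02, 0x03, 0x04])
    d3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
    parity = xor_blocks([d1, d2, d3])

    # One disk lost, all others good: the rebuild is exact.
    print(rebuild_missing(parity, [d1, d3]) == d2)

    # One disk lost and another returning bad data: the rebuild is wrong.
    d3_bad = bytes([0xAA, 0xBB, 0x00, 0xDD])
    print(rebuild_missing(parity, [d1, d3_bad]) == d2)
```

In the second case the bad byte on the surviving disk ends up in the rebuilt disk at the same offset, which is why a second failing disk during a rebuild costs data.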

Looks like you may end up losing data, possibly from multiple bad disks. Do you have backups?

pwm · March 16, 2018

14 hours ago, trurl said:

Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

I think lots of people never visit the dashboard - I might visit the dashboard once every 3 months. But I rely on centralized supervision.

Anyone who isn't using centralized supervision really must make sure they get mail from the system, and make sure they react if the mails stop arriving or if the mails indicate problems.

Storage servers can run stand-alone for a long time, but they do require someone to step in as soon as disks or fans start to have issues, just as people shouldn't continue to run their cars without enough coolant in the radiator.
