thefly (Author), posted March 15, 2018
I will investigate those possible issues, reboot and update this thread. Thank you.
JorgeB, posted March 15, 2018
On the upside, there's a good chance old disk8 is also good and it dropped because of similar issues, but you'll need to figure out the problems or it will be difficult to complete a parity sync.
thefly (Author), posted March 15, 2018
Here is my overview:
1. Yesterday I replaced disk 8 and followed johnnie.black's instruction (using terminal mdcmd set invalidslot)
2. Drives 8 and 10 were unmountable during the parity sync
3. I cancelled the sync at about 13% and shut down the server
4. Today I started the server to find disk 2 unavailable for assignment, BUT disk 10, which was a problem yesterday, is now assignable
5. I shut down and checked all cables
6. With the new drive in slot 8, disk 2 does not appear for assignment at all
7. On startup with the dead drive reinstalled in slot 8, disk 2 appears available for assignment, but once assigned it disappears after a few minutes

I am attaching diagnostics based on my current state, with the new drive in slot 8. Your help during this freak-out is very much appreciated.

tower-diagnostics-20180315-1249.zip
JorgeB, posted March 15, 2018
Disk2 dropped offline again, so there's no SMART report for it now. I didn't check its SMART yesterday since it wasn't one of the problem disks, but looking at yesterday's reports, it was already failing:

      5 Reallocated_Sector_Ct   0x0033 081 081 036 Pre-fail Always  -           25920
    197 Current_Pending_Sector  0x0012 096 096 000 Old_age  Always  -           768
    198 Offline_Uncorrectable   0x0010 096 096 000 Old_age  Offline -           768

Now, looking at all the disk SMART reports, there are more problems:

Disk18:
      5 Reallocated_Sector_Ct   0x0033 098 098 036 Pre-fail Always  -           1560
    197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  -           8
    198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline -           8

Cache:
    187 Reported_Uncorrect      0x0032 093 093 000 Old_age  Always  -           7
    197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  -           8
    198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline -           8

Disk 6:
    197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  -           2

Disk 7:
      5 Reallocated_Sector_Ct   0x0033 041 041 140 Pre-fail Always  FAILING_NOW 1265
    196 Reallocated_Event_Count 0x0032 001 001 000 Old_age  Always  -           1265
    197 Current_Pending_Sector  0x0032 196 196 000 Old_age  Always  -           1394

So the server is in quite a bad state. Some of the disks might still work, at least for now, and the pending sectors might be false positives, but to be sure I recommend you run an extended SMART test on all disks. You can skip disk2, as that one is bad for sure, but we need to know how many more are failing to decide the best way to proceed.

PS: Keep disk2; some data should still be recoverable if needed, using e.g. ddrescue.
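For anyone wanting to triage many SMART reports the way JorgeB did by eye, here is a minimal Python sketch. It assumes the reports are the plain attribute table printed by `smartctl -A`, with the same ten columns as the rows quoted above. The function name `flag_attributes`, the `WATCHED` set and the choice of which attributes to watch are illustrative assumptions, not anything from Unraid or smartmontools. The sample text is disk 7's rows from the post.

```python
import re

# Assumption: these are the attributes worth flagging when their raw value is non-zero.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Disk 7's rows as quoted in the thread.
DISK7_REPORT = """\
  5 Reallocated_Sector_Ct   0x0033 041 041 140 Pre-fail Always FAILING_NOW 1265
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age  Always -           1265
197 Current_Pending_Sector  0x0032 196 196 000 Old_age  Always -           1394
"""


def flag_attributes(report):
    """Return (name, raw_value, failing_now) for watched attributes with raw > 0."""
    findings = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # not an attribute row (header, blank line, etc.)
        name = match.group(2)
        raw = int(match.group(9))
        failing_now = match.group(8) == "FAILING_NOW"
        if name in WATCHED and raw > 0:
            findings.append((name, raw, failing_now))
    return findings


if __name__ == "__main__":
    for name, raw, failing in flag_attributes(DISK7_REPORT):
        note = " (FAILING_NOW)" if failing else ""
        print(f"{name}: raw={raw}{note}")
```

A non-zero raw value is only a prompt to look closer. As the post says, pending sectors can be false positives, which is why an extended test (for example `smartctl -t long /dev/sdX`, with the device name filled in for each disk) is the next step.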
trurl, posted March 15, 2018
Which brings this back. You should have known about all these disk issues before they multiplied. Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

23 hours ago, trurl said: In Settings, does it have Notifications? Have you configured it to send you emails?

That is a lot of reallocations to appear all at once, so I'm thinking it should have told you about them before the other disk had problems.

Parity will only help if all other disks are good. The parity calculation requires parity plus all the other disks to calculate the data needed to recover a disk. It looks like you may end up losing data, possibly from multiple bad disks. Do you have backups?
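To illustrate why parity only helps when every other disk is readable, here is a small Python sketch. It assumes a single parity disk computed as the byte-wise XOR of the data disks. It is a toy model of the idea, not Unraid's implementation, and the names `xor_blocks` and `rebuild` are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(disks, parity, missing):
    """Rebuild the disk at index `missing` from parity plus all other disks.

    `disks` is a list of byte blocks; an unreadable disk is represented by None.
    """
    survivors = [d for i, d in enumerate(disks) if i != missing]
    if any(d is None for d in survivors):
        # A second unreadable disk leaves the XOR equation with two unknowns.
        raise ValueError("another disk is unreadable; single parity cannot rebuild")
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_blocks(data)

    # One missing disk: XOR of parity and the remaining disks recovers it.
    print(rebuild(data, parity, missing=1) == data[1])

    # Two unreadable disks: the rebuild cannot proceed.
    try:
        rebuild([None, data[1], data[2]], parity, missing=1)
    except ValueError as err:
        print(err)
```

In this model, every read error on a surviving disk during a rebuild turns into wrong or missing data on the rebuilt disk, which is why the failing disks listed earlier matter even though only one slot is being rebuilt.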
pwm, posted March 16, 2018
14 hours ago, trurl said: Did the Dashboard never give you warnings on these disks? Have you configured it to not give you warnings for some reason?

I think lots of people never visit the dashboard. I might visit it once every 3 months, but I rely on centralized supervision.

Anyone who isn't using centralized supervision really must make sure they get mail from the system, and make sure they react if the mails stop arriving or if the mails indicate problems.

Storage servers can run stand-alone for a long time, but they do require someone to step in as soon as disks or fans start to have issues, just as people shouldn't continue to run their cars without enough coolant in the radiator.