Same two drives dropping when stopping server


On the X2M machine in my signature I keep running into the same problem when stopping the server: the same two drives drop from the array.

 

At the point where it is trying to unmount drives, the server locks up and pegs all CPU cores at 100% (according to the Unraid UI). No matter how long I leave it, I have to perform a hard power down to regain control of the machine. This of course deletes the logs.

 

Once I power the server back up, the UI reports that Parity 1 and Data 1 are emulated, and I have to stop, unmount them, restart, stop, remount them, then rebuild.

 

It doesn't do it every time, just about once in, say, four shutdowns. It is ALWAYS those two drives.

 

The drives are on my LSI SAS9201-8i which has two 4:1 SAS to SATA cables. Is it more likely to be the cable or the card? I never see the other two drives on that cable being dropped.

Link to comment
20 minutes ago, DanielCoffey said:

This of course deletes the logs.

If you enable the syslog server available in 6.7.2 then you can have a log that survives the forced shutdown.
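For anyone wanting to see what the receiving end of a remote syslog looks like, here is a minimal sketch in Python. It assumes the server has been set to forward its syslog over UDP to another machine on the network; that setup, the function names and the log path are my own assumptions for illustration, not settings taken from this thread. It is not a replacement for the built-in syslog server, just a way to show why a log kept on a second machine survives a hard power off.

```python
import socket
from datetime import datetime


def decode_pri(message):
    """Split a leading '<PRI>' syslog prefix into (facility, severity, rest)."""
    if message.startswith("<") and ">" in message[:5]:
        end = message.index(">")
        digits = message[1:end]
        if digits.isdigit():
            pri = int(digits)
            # PRI = facility * 8 + severity
            return pri // 8, pri % 8, message[end + 1:]
    return None, None, message


def format_record(sender, payload, now=None):
    """Turn one received datagram into a single log line."""
    now = now or datetime.now()
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    _facility, severity, rest = decode_pri(text)
    sev = "?" if severity is None else str(severity)
    return f"{now.isoformat(timespec='seconds')} {sender} sev={sev} {rest}"


def serve(log_path, host="0.0.0.0", port=514):
    """Append every received syslog datagram to log_path (runs until killed).

    Binding to port 514 usually needs elevated privileges; any other port
    works as long as the sender is configured to match.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        with open(log_path, "a", encoding="utf-8") as log:
            while True:
                payload, (sender, _port) = sock.recvfrom(8192)
                log.write(format_record(sender, payload) + "\n")
                log.flush()  # flush each line so a crash loses as little as possible
```

Calling `serve("unraid-syslog.log")` on the second machine (the file name is hypothetical) would keep appending lines until stopped, so the last messages before a lockup would still be on disk there.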

 

21 minutes ago, DanielCoffey said:

The drives are on my LSI SAS9201-8i which has two 4:1 SAS to SATA cables. Is it more likely to be the cable or the card? I never see the other two drives on that cable being dropped.

You could try swapping the cables over to see whether the problem moves with the cable or stays with the drive (since Unraid recognizes drives by serial number, it will not care). If this makes no difference then it is probably something else.
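The reasoning behind the swap test can be written down as a small sketch. Everything here is hypothetical (the function name, the port labels, the serials); it only spells out how to read the result once the drives have been moved and another drop has occurred.

```python
def classify_swap(failed_before, failed_after, port_before, port_after):
    """Interpret a cable/port swap.

    failed_before / failed_after: sets of drive serials that dropped.
    port_before / port_after: dicts mapping serial -> port or cable label.
    """
    if not failed_after:
        return "inconclusive"  # nothing has dropped since the swap yet

    bad_ports = {port_before[s] for s in failed_before}

    # The same drives drop again even though they now sit on other ports.
    repeat = failed_after & failed_before
    if repeat and all(port_after[s] != port_before[s] for s in repeat):
        return "follows drive"

    # Different drives drop, and they now sit where the bad ones used to be.
    fresh = failed_after - failed_before
    if fresh and all(port_after[s] in bad_ports for s in fresh):
        return "follows cable or port"

    return "inconclusive"
```

A "follows drive" result points at the disk itself (or something tied to it), while "follows cable or port" points at the cable, the connector or the HBA port.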

Link to comment

Please could someone explain to a beginner how to use the syslog server in 6.7.2?

 

I can see where the option to turn it on or off is located but I don't know how to use it properly. I wouldn't mind pointing the logs to the cache or even an unassigned device but don't understand how to use it.

 

For my issue above, I have fitted a brand new SAS-SATA cable, thinking that Parity 1 and Data 1 were on the same cable, but discovered that the two drives that dropped each time before were the first two drives on each of the two SAS ports on the card.

 

I would like to have the syslog running to catch it when it next does it.

Link to comment

An update - the array did its party trick again today with Disk 1 and Parity 1 being dropped. I had changed the SAS-to-SATA cable to those two disks earlier and all seemed well. However, about four sleep/wake cycles later, the array froze and became totally non-responsive again. I had to hard power off to regain control. I don't have logs, I am afraid.

 

Earlier I had run the DiskSpeed docker and it did show anomalous behaviour on Disk 1 (see pics below) so despite the SMART report seeming clean I shall be performing a Parity rebuild then taking that disk out of the array.

 

I have one unassigned disk which I shall clear of my scratch backups and bring it in to replace Disk 1.

DiskSpeedBenchmark01.jpg

DiskSpeedBenchmark02.jpg

DiskSpeedBenchmark04.jpg

Link to comment

I have completed an Extended SMART test on the drive that has been giving me issues. Please could someone have a look at the results to see if there is anything that would indicate a problem with the drive?

 

The drive itself is not in the array now so I am free to pull it and submit it for warranty replacement but I would like to know if there is something I could point a finger to in the SMART report rather than just relying on the DiskSpeed results.
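As a rough first pass over a report like the one attached, here is a sketch that scans the attribute table of a smartctl-style text report for the attributes most often watched for trouble: reallocated sectors (5), reported uncorrectable errors (187), pending sectors (197), offline uncorrectable (198) and UDMA CRC errors (199). The choice of IDs, the function name and the assumed ten-column table layout are my own assumptions, and a clean result does not prove the drive is healthy.

```python
# Attribute IDs commonly checked first; this list is an assumption, not a standard.
WATCHED_IDS = {5, 187, 197, 198, 199}


def flag_attributes(report_text):
    """Return (id, name, raw_value) for watched attributes with a non-zero raw value.

    Assumes the usual attribute table layout, where each row starts with the
    numeric ID and the raw value is the tenth whitespace-separated column.
    """
    flagged = []
    for line in report_text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # header, blank line or some other section of the report
        attr_id, name, raw = int(parts[0]), parts[1], parts[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged.append((attr_id, name, int(raw)))
    return flagged
```

Reading the saved report with `open(path).read()` and passing the text to `flag_attributes` gives an empty list when none of the watched counters has moved. A non-zero UDMA CRC count usually points at cabling rather than the disk.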

 

The data on the drive has never been compromised (as far as I am aware) but it really hates shutting down (and I do recall one or two lockups on boot a fair while ago which may be related).

WDC_WD80EFZX-68UW8N0_VK1DZHAY-20190822-0554.txt

Link to comment
  • 3 weeks later...

Well I have to say I am REALLY impressed with WD's warranty support!

 

Despite changing the SAS-SATA cable, trying the drive on a new HBA card and deleting the partition, it still did its odd shutdown thing one time in four. It still showed a slower transfer rate between 0-1TB and that odd wobble between 5-6TB compared to the other seven identical drives. All the while it performed perfectly well as a functioning hard drive in that it never lost data.
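The comparison against the identical drives can be made a little less eyeball-based with a sketch like this. It assumes you have copied transfer rates for each drive at the same positions across the disk into a dict by hand; the function name, the 85% tolerance and the data layout are all hypothetical and have nothing to do with how DiskSpeed itself works.

```python
from statistics import median


def slow_regions(speeds, suspect, tolerance=0.85):
    """Find positions where the suspect drive is well below its peers.

    speeds: dict of drive label -> list of MB/s readings, one per position,
            with every drive measured at the same positions.
    Returns a list of (position_index, suspect_speed, peer_median).
    """
    peers = [readings for label, readings in speeds.items() if label != suspect]
    flagged = []
    for i, value in enumerate(speeds[suspect]):
        peer_mid = median(readings[i] for readings in peers)
        if value < tolerance * peer_mid:
            flagged.append((i, value, peer_mid))
    return flagged
```

With identical drives the curves should sit almost on top of each other, so any position that comes back flagged is worth a second benchmark run before drawing conclusions.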

 

First WD had me run their own SMART data collecting tool, which the drive of course passed. They advised that they couldn't really accept the DiskSpeed docker results as evidence in a warranty case since it wasn't their own tool, but agreed that it was clear the drive had "something odd" about it. They just turned round and said that I should send it in and they would just replace it with a new one.

 

Not bad service for a two year old drive (which of course has a three year warranty). I feel that their opinion was that it was simply not worth the effort of sending the drive to a technician to be examined and analysed. Just replace it and move on.

 

Customer happy? Yup.

Link to comment
45 minutes ago, DanielCoffey said:

they would just replace it with a new one.

Which is RMA department speak for a refurb drive. Not saying that's bad, but you could be getting somebody else's "something odd" drive that passed all their tests and has been reprogrammed with a new serial number and reset SMART stats.

 

Test thoroughly before trusting, which actually applies to all drives, new or refurb.

Link to comment
