More problems with Supermicro server - three failed drives, one parity one data

ashman70 · January 15, 2017

So just now, with about 7 hrs left to complete the parity check, my second parity drive and one of my 8TB data drives dropped out of the array with red X's. The parity check cancelled itself. I attached diags before I stopped the array and am now trying to reboot the server, although the console is stuck at 'Unmounting remote filesystems:'

Not sure what is going on......

tower-diagnostics-20170115-1341.zip

ashman70 · January 15, 2017

So now after the reboot, both my second parity disk and an 8TB data disk have red X's beside them. What am I to do now? Diags after the reboot attached.

tower-diagnostics-20170115-1356.zip

ashman70 · January 15, 2017

I've stopped the array and it now says, next to the parity disk that has the red X, that all data on this disk will be erased when the array is started.

I don't actually think these disks are bad, but that something has forced them to drop out of the array.

I'm guessing if I start the array it's going to rebuild parity and fix both the 8TB data disk with a red X and the second parity disk with a red X?

John_M · January 15, 2017

If you still have IOMMU enabled you might like to read this thread: https://lime-technology.com/forum/index.php?topic=54469.0

I'm not saying that's your problem but if you exclude the possibility you stand a better chance of fixing it.

JorgeB · January 15, 2017

Agreed. I'm on my phone so can't look at the logs, but 2 disks failing points to a controller/cable/etc. issue. Disable VT-d to rule it out before rebuilding the data disk and syncing parity; you can do both at the same time.

ashman70 · January 15, 2017

OK, I am in my BIOS now and I have a couple of questions.

Should I have EFI optimized boot enabled? It's currently disabled.

Should I have processor C3 enabled? It's currently disabled.

Should Processor C6 be enabled? It's currently enabled.

Hyper threading is enabled

Core Multi-processing is enabled

Execute Disabled Bit is enabled

VT for Direct I/O is enabled, I assume this is what you want me to disable

Hardware prefetcher is enabled

Adjacent Cache Line Prefetch is enabled

Direct Cache Access (DCA) is enabled

In the PCI configuration of the BIOS there are two settings:

Maximize memory below 4GB, currently disabled

Maximize Mapped I/O above 4GB currently disabled (should this be enabled?)

Thanks

JorgeB · January 15, 2017

Disable only vt-d for now.

ashman70 · January 15, 2017

OK, I have disabled VT-d and booted back up. I set the array not to start automatically, so it has not started yet. Should I start it now? It still says that all data on the parity2 disk will be erased, and the other data disk still has a red X next to it, as does the second parity disk.

JorgeB · January 15, 2017

Parity warning is normal. You may need to start the array with the other disabled disk unassigned, stop the array, reassign the disk and start again to begin the rebuild.

ashman70 · January 15, 2017

So should I do that for both the data disk and parity disk together?

Unassign them both, start the array.

Stop the array, assign them.

Start the array, starting parity check?

JorgeB · January 15, 2017

Yes, better to do both at the same time.

John_M · January 15, 2017

It was your two Seagate VX disks that dropped off line - Disk 24 and Parity 2. Does that tally with what you saw?

ST4000DM000-1F2168_Z303SBXG has 4794 UDMA errors so check cable or seating into backplane. Other SMART reports are ok.

I see a lot of messages like this in your syslog:

Jan 14 00:25:27 Tower kernel: sas: ata30: end_device-1:1:35: dev error handler

Jan 14 00:25:27 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I don't recognise them, but then I've never used a SAS expander. However "failed:1" doesn't look good.

Also SAS broadcasts:

Jan 14 00:36:48 Tower kernel: sas: broadcast received: 0

that are also outside of my experience.

Immediately after array start and loading of Ubooquity I see this:

Jan 14 00:25:22 Tower emhttp: Start failed: PID created but no process exists

Jan 14 00:25:25 Tower root: Fix Common Problems Version 2016.12.16

Jan 14 00:25:27 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

which doesn't bode well.

There's an awful lot going on in your syslog and as well as having a lot of hardware you have a complex configuration too. My approach would be to strip it right back to a basic NAS, and get that stable first. Then add back the bells and whistles.

John_M · January 15, 2017

After your reboot, with two disks disabled, your syslog still has a lot of this:

Jan 15 13:54:28 Tower vsftpd[10523]: connect from 127.0.0.1 (127.0.0.1)

Jan 15 13:54:33 Tower emhttp: Start failed: PID created but no process exists

Jan 15 13:54:36 Tower root: Fix Common Problems Version 2016.12.16

Jan 15 13:54:36 Tower vsftpd[10584]: connect from 127.0.0.1 (127.0.0.1)

Jan 15 13:54:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

Jan 15 13:54:37 Tower kernel: sas: ata5: end_device-1:0:20: cmd error handler

Jan 15 13:54:37 Tower kernel: sas: ata1: end_device-1:0:13: dev error handler

immediately after array start, which I find troubling but may be expected.
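To see which devices the SAS error handler keeps hitting, the messages can be tallied per device. This is only a rough sketch: the sample lines are the ones quoted above, the regular expression assumes every such message has exactly this shape, and the function name `tally_sas_errors` is made up for illustration. For a real run you would feed it the lines of the syslog from the diagnostics zip.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpts quoted in this thread.
SAMPLE = """\
Jan 14 00:25:27 Tower kernel: sas: ata30: end_device-1:1:35: dev error handler
Jan 15 13:54:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 15 13:54:37 Tower kernel: sas: ata5: end_device-1:0:20: cmd error handler
Jan 15 13:54:37 Tower kernel: sas: ata1: end_device-1:0:13: dev error handler
"""

# Assumed message shape: "sas: ataN: end_device-H:E:P: cmd|dev error handler"
PATTERN = re.compile(
    r"kernel: sas: (ata\d+): (end_device-\d+:\d+:\d+): (cmd|dev) error handler"
)

def tally_sas_errors(lines):
    """Count error-handler messages per (ata port, end device) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

if __name__ == "__main__":
    for (port, device), n in tally_sas_errors(SAMPLE.splitlines()).most_common():
        print(f"{port} {device}: {n}")
```

If the same few devices dominate the count and they all sit behind one cable or backplane, that would point towards the connection rather than the disks.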

ashman70 · January 16, 2017

I've now got file system errors on my console that I have no idea what to do with, and I've got problems with more drives now. I don't know if this is coincidence or bad luck. I cancelled the parity check because it looked like it was still running although there was no drive light activity on my drives. The array is still mounted. Diags attached.

tower-diagnostics-20170115-2012.zip

John_M · January 16, 2017

Three more disks dropped off-line but the others look OK and the one ending SBXG hasn't accumulated any more UDMA errors.

I see the same SAS messages as before and lots and lots of this:

Jan 15 18:02:43 Tower vsftpd[5373]: connect from 192.168.111.121 (192.168.111.121)

Jan 15 18:02:43 Tower vsftpd[5373]: [ashman] OK LOGIN: Client "192.168.111.121"

Jan 15 18:02:44 Tower vsftpd[5377]: connect from 192.168.111.120 (192.168.111.120)

Jan 15 18:02:44 Tower vsftpd[5377]: [ashman] OK LOGIN: Client "192.168.111.120"

Do you have any idea what that is all about? I doubt that it has much to do with your problem (though it might be a symptom, rather than a cause). As I said, I'd stop dockers and anything else that's trying to access the array.

I think the disks are fine in themselves. You may well have file system corruption and you now have too many disabled disks to be able to rebuild. Ultimately you'll probably want to do a New Config and rebuild both parities but before you do that you need to investigate the SAS problems. It looks to me as though it's something between the controllers and the disks themselves - cables/SAS expander/back-plane, maybe. Since I'm seeing messages I haven't seen before, I suspect the expander, though that's purely a guess.

If it was my server I'd have to break the problem down into simpler pieces and tackle each piece at a time. I'd put the pile of disks to one side and pick up a spare and do some testing, but someone else might have a better suggestion.

ashman70 · January 16, 2017

I did some testing with this setup before I swapped my disks. When I did that I had the HBA in the only x16 slot on this MB, but johnnie.black found that my board only ran the x16 at x4 electrically, so I put the card in an x8 slot where it has been since yesterday, up until then it was running fine although I hadn't run a parity check. I had started to run a parity check in the x16 slot but it was running so slow unRAID said it was going to take a week. This is what he found about my board. I am just wondering if changing slots had anything to do with it.

What do you suggest should be my next course of action? I don't want to do a new config and rebuild parity if disks are going to drop off the array again.

Intel® Server Board S5500HCV: Five expansion slots

o One PCI Express* Gen 2 slot (x16 Mechanically, x4 Electrically)

o Two PCI Express* Gen 2 x8 slots

o One PCI Express* Gen 1 slot (x8 Mechanically, x4 Electrically) shared with SAS Module slot. This PCI Express* Gen 1 slot is not available when the SAS module slot is in use and vice versa
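For a rough feel of what those slots mean for a parity check, here is a back-of-the-envelope calculation. It uses the usual per-lane figures after 8b/10b encoding (about 250 MB/s for Gen 1 and 500 MB/s for Gen 2, per direction). The disk count of 24 is only an illustrative assumption, and the result is a theoretical ceiling: protocol overhead, the expander and the HBA itself will all lower it, so it does not by itself account for the week-long estimate.

```python
# Approximate usable bandwidth per PCIe lane, per direction, in MB/s.
PER_LANE_MB_S = {1: 250, 2: 500}

def slot_bandwidth_mb_s(gen, lanes):
    """Theoretical ceiling for a slot of the given generation and electrical width."""
    return PER_LANE_MB_S[gen] * lanes

def per_disk_ceiling_mb_s(gen, lanes, disks):
    """Ceiling per disk when all disks on the HBA are read at once, as in a parity check."""
    return slot_bandwidth_mb_s(gen, lanes) / disks

if __name__ == "__main__":
    disks = 24  # assumption for illustration only, not the actual disk count
    slots = [
        ("Gen 2 x16 mechanical, x4 electrical", 2, 4),
        ("Gen 2 x8", 2, 8),
        ("Gen 1 x8 mechanical, x4 electrical", 1, 4),
    ]
    for label, gen, lanes in slots:
        total = slot_bandwidth_mb_s(gen, lanes)
        share = per_disk_ceiling_mb_s(gen, lanes, disks)
        print(f"{label}: {total} MB/s total, {share:.0f} MB/s per disk")
```

On these numbers the x8 slot has twice the ceiling of the x16 slot that only runs at x4.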

John_M · January 16, 2017

I'd wait and see if anyone who has experience of the SAS Expander you're using can make any suggestions. I've used SAS HBAs and simple port multipliers but nothing more sophisticated than that. I'm afraid I would have used more HBAs instead if I had been building such a large array.

In the meantime I'd investigate hosts 192.168.111.120 and 192.168.111.121 to see why they are spamming your ftp daemon.
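A quick way to see how often each host is connecting is to count the vsftpd "connect from" lines per client address. A minimal sketch, assuming the log lines look like the ones quoted earlier; the function name `ftp_connections_by_host` is made up for illustration.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt quoted in this thread.
SAMPLE = """\
Jan 15 18:02:43 Tower vsftpd[5373]: connect from 192.168.111.121 (192.168.111.121)
Jan 15 18:02:43 Tower vsftpd[5373]: [ashman] OK LOGIN: Client "192.168.111.121"
Jan 15 18:02:44 Tower vsftpd[5377]: connect from 192.168.111.120 (192.168.111.120)
Jan 15 18:02:44 Tower vsftpd[5377]: [ashman] OK LOGIN: Client "192.168.111.120"
"""

# Only the "connect from" lines are counted, so each session is counted once.
CONNECT = re.compile(r"vsftpd\[\d+\]: connect from (\d+\.\d+\.\d+\.\d+)")

def ftp_connections_by_host(lines):
    """Count vsftpd connections per client IP address."""
    counts = Counter()
    for line in lines:
        match = CONNECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for host, n in ftp_connections_by_host(SAMPLE.splitlines()).most_common():
        print(f"{host}: {n} connections")
```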

ashman70 · January 16, 2017

I will certainly listen to anything anyone has to say but at this point I believe I have narrowed down the problem.

Those hosts are exterior security cameras that FTP video footage to my server, sometimes they pick up people or cars driving by the front of the house.

JorgeB · January 16, 2017

One more disk got disabled, file system issues are probably because they both can't be correctly emulated since parity2 is invalid.

Problem seems to be the SAS2LP, it's timing out. It could be a cable/backplane issue; maybe try the other x8 slot and make sure the controller is well seated. The first 2 disabled disks are on the back backplane but the 3rd one is on the front.

You'll need to do a new config and although parity1 should be mostly in sync it would need a parity check, so faster to just sync both.

I don't remember seeing these before:

Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0

But they look harmless, maybe someone else has more info.

ashman70 · January 16, 2017

Thanks, because everything was working fine before I moved the card to the 8x slot and that stretched the cable to the rear backplane, I'm thinking the new longer cable should sort things out. I just wish there was a way to get the parity check done faster. Thanks for your help.

John_M · January 16, 2017

I didn't realise you had a possibly damaged cable - I don't remember you mentioning it. I included cable faults in my list of suspects several posts ago.

If you want faster parity checks couldn't you add another SAS HBA or make use of some of the motherboard SATA ports?

ashman70 · January 16, 2017

It's not that I think it's damaged, but that there was so much tension from it being stretched that it's likely the cause of the drives dropping off.

The thing is if I get another HBA I am not sure how to connect it, the rear backplane has two connectors and the front backplane has three. I may email Supermicro tech support and ask them.

JorgeB · January 16, 2017

Do you know the exact model of the front backplane, so we can see if it supports dual link?

ashman70 · January 16, 2017

No, I've emailed Supermicro tech support. They did send me this image with part numbers; I googled it but couldn't come up with anything.

JorgeB · January 16, 2017

Supermicro manual is not very clear but from what I could find it does support dual link.

https://forums.servethehome.com/index.php?threads/playing-with-my-sas-expander.7420/

The HBA needs to support it as well, I know the SASLP and the LSI9211-8i (H310/M1015) support it because I tested those, but never tested the SAS2LP and can't find anything about it on the net, you can try it or I may test mine when I get the chance.
