
Server: Can't complete initial parity sync (6.5.0-rc1)


david11129

Recommended Posts

I'm having a tough time with this one. I have bought 4 new drives for my Unraid. One kept dropping out, so I assumed it was the culprit. I took it out of the array, but I still have not been able to complete an initial parity sync. It starts off well, gets to around 60 percent done, then slows way down. Once it hangs, the speed reported is 1 MB/s, but I see no evidence it's writing anything at all. The initial speed is around 180 MB/s. This morning, I got an error message saying "Drive mounted read-only or completely full," followed by another error message saying "Docker image either full or corrupted." I have no idea what randomly locks up. I started the parity sync yesterday with troubleshooting mode on. I'm attaching the last diagnostics it saved, sometime around 4:30 AM.
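In case it helps, here's roughly what I've been running from the console to check on those two error messages (paths assume the stock Unraid layout):

# Show free space on the array, the cache, and the Docker loop mount
df -h /mnt/user /mnt/cache /var/lib/docker

# List any filesystems that have dropped to read-only
mount | grep '(ro'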

 

Prior to the new drives, Unraid has been working wonderfully for me for the last year or so. I also upgraded the CPU when changing the drives. So I have two things to troubleshoot. Can anyone point me in the right direction?

All help appreciated. 

tower-diagnostics-20180306-0430.zip

Link to comment
18 minutes ago, 1812 said:

Abandon rc1 and move to rc5, OR downgrade to 6.4.1, for starters.

 

Then see if you're still getting the call traces/I/O errors your syslog shows.

I can do that. I'm at work now, so we'll see when I get home. I was on rc5; I downgraded to rc1 to see if that had any impact. Another thing I've noticed when trying to use the web terminal or the OS upgrade plugin is that I get NGINX server errors. Both the terminal and the OS upgrade page pull up a blank white screen that simply says NGINX error. This doesn't happen after a reboot, only after the parity build hangs.
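If it's useful for debugging, I can pull up whatever nginx logged right before those blank pages; I believe the stock log path on Unraid is:

# Last entries nginx wrote before the blank "NGINX error" page
tail -n 50 /var/log/nginx/error.log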

Link to comment
24 minutes ago, 1812 said:

Abandon rc1 and move to rc5, OR downgrade to 6.4.1, for starters.

 

Then see if you're still getting the call traces/I/O errors your syslog shows.

As far as call traces go, they usually pop up when I leave a browser window logged in during a reboot, so I was mostly ignoring those. The I/O errors worried me. So did the Docker errors. I shut down all containers prior to the parity build.

Link to comment
19 minutes ago, david11129 said:

So did the Docker errors. I shut down all containers prior to the parity build.

 

The corrupt docker image can maybe be worked around for now by stopping the Docker service. You're going to have to delete and recreate it later anyway. Don't worry, it's easy to get your apps back just as they were from Community Applications - Previous Apps.

 

Note that merely stopping each container is not enough. You have to stop the service itself by going to Settings - Docker.
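If the GUI itself is misbehaving, the same thing should be doable from the console; this assumes the stock Slackware-style init script Unraid ships:

# Stop the Docker service itself, not just the individual containers
/etc/rc.d/rc.docker stop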

 

The other thing throwing I/O errors looks like your cache pool, and that is likely the reason for the corrupt docker image as well. Did you change anything about the cache when you did the New Config to remove the new disk?
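One way to confirm it's the cache pool, assuming the cache is btrfs (the Unraid default for pools), is to check the error counters; non-zero read/write I/O errors point at a dropped or flaky device:

# Per-device error counters for the cache pool
btrfs device stats /mnt/cache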

 

Link to comment

Why would the cache dropping have anything to do with the parity build, though? I had a drive that kept dropping out during reboots; one minute it was there, the next it wasn't, and it would show up as missing. I swapped the power cable from the splitter it's on with the SSD's. I figured that since the cache wasn't part of the array, I could still get my parity synced to maintain protection. The cache must be dropping out because there is a problem with the plug it's connected to. That's good to know. Would that still cause the parity to not build, though? If I get the SSD put back on a new SATA power plug, will I even have to do anything like restoring containers? Or should they all come up properly once the drive is stable?

Link to comment

Besides the cache, another disk (WDC_WD80EMAZ-00WJTA0_7SGNTVAC) also dropped offline, possibly also a cabling issue:

 

Mar  5 20:28:16 Tower kernel: sd 7:0:8:0: device_block, handle(0x000e)
Mar  5 20:28:18 Tower kernel: sd 7:0:8:0: device_unblock and setting to running, handle(0x000e)
Mar  5 20:28:18 Tower kernel: sd 7:0:8:0: [sdj] Synchronizing SCSI cache
Mar  5 20:28:30 Tower kernel: sd 7:0:8:0: timing out command, waited 15s
Mar  5 20:28:30 Tower kernel: sd 7:0:8:0: [sdj] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Mar  5 20:28:30 Tower kernel: sd 7:0:8:0: [sdj] tag#1 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00

But your syslog is very confusing, with lots of connects and disconnects, so it would be best to reboot, reconnect the SSD so Docker/apps start working again, then try a new sync and post new diags if it happens again.
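While the new sync runs, you can watch for those same events in real time, e.g.:

# Follow the syslog for the block/unblock and timeout messages above
tail -f /var/log/syslog | grep -E 'device_block|device_unblock|timing out command'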

Link to comment

Johnnie, that disk is the one I swapped with the cache drive. I never had any cache problems before, just issues with that disk dropping out. I left it out of this parity build so it wouldn't have an effect. It's strange that swapping power cables led to both drives dropping randomly. They are connected to a Molex to 2x SATA adapter. I would think that if the adapter were bad, the problem would have followed to the cache, not affected both drives, when I switched the power cables. My drives used to be connected to an LSI 9210-8i HBA. In my troubleshooting, I pulled that card and attached the drives directly to the onboard LSI 2308 controller integrated in my mobo. Could any of these issues be due to the new CPU? I wish I hadn't changed the CPU and upgraded my drives at the same time!

 

Also, the SSD was never disconnected on purpose. It seems to be disconnecting on its own since I swapped power connectors with the other drive that's dropping out.

Link to comment

Since you are mentioning Molex-to-SATA power cables, let me point you to this YouTube video (there are probably another fifty up there that show this problem in more graphic detail):

 

         https://www.youtube.com/watch?v=TataDaUNEFc

 

 

It shows the type to avoid as well as the ones which have few problems. Plus, the molded type seems to have more bad-connection issues besides the shorting problem.

Link to comment
11 minutes ago, johnnie.black said:

Not likely. Check cables, reboot, and try again so the log is clean.

I rebooted last night, turned troubleshooting mode on, and uploaded the latest diagnostics file. I'm willing to try again, but I'm fairly sure the log will be full of these same results. I'll pick up a new SATA adapter today and see if anything changes.

Link to comment
9 minutes ago, Frank1940 said:

Since you are mentioning Molex-to-SATA power cables, let me point you to this YouTube video (there are probably another fifty up there that show this problem in more graphic detail):

 

         https://www.youtube.com/watch?v=TataDaUNEFc

 

 

It shows the type to avoid as well as the ones which have few problems. Plus, the molded type seems to have more bad-connection issues besides the shorting problem.

I actually do have the molded kind right now, unfortunately. I was just on Monoprice a few days ago meaning to order these: https://www.monoprice.com/product?p_id=8794&gclid=CjwKCAiAlfnUBRBQEiwAWpPA6UCqNPh9p0Bli92MuT_SMTfKU28cKaJO08hEZt5pEK7n-4EzbO95OxoCLhkQAvD_BwE

 

Those should do the trick, right?

 

Edit: Anyone know where to order some of these? https://www.amazon.com/gp/product/B01M9K0XAF/ref=oh_aui_search_detailpage?ie=UTF8&psc=1

That would make my cables so much easier to organize. Although I've read that having cables too close together can cause interference. Would you worry about that with these?

Link to comment
2 minutes ago, david11129 said:

I rebooted last night, turned troubleshooting mode on, and uploaded the latest diagnostics file. I'm willing to try again, but I'm fairly sure the log will be full of these same results. I'll pick up a new SATA adapter today and see if anything changes.

You need to check/replace cables before trying again; if more than one device keeps dropping, you have serious hardware issues and need to fix those first.

Link to comment
1 hour ago, johnnie.black said:

You need to check/replace cables before trying again; if more than one device keeps dropping, you have serious hardware issues and need to fix those first.

I ordered some new power adapters and will update once they arrive. I never had any problems until I upgraded my drives; it was stable up until then.

Link to comment
5 hours ago, david11129 said:

I'm having a tough time with this one. I have bought 4 new drives for my Unraid.

 

You also should be checking your PSU to make sure it can supply the +12V power required to handle four new drives. It has long been a rule of thumb that each drive draws 2A at peak current during spin-up. Supplies with dual +12V rails often have issues with this.
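As a rough sketch of that rule of thumb, four added drives at ~2A each is about 8A, or roughly 96W, of extra spin-up load on the +12V rail:

# Back-of-the-envelope spin-up load (rule-of-thumb estimate, not measured)
drives=4; amps=2; volts=12
echo "$((drives * amps))A peak, ~$((drives * amps * volts))W on the +12V rail"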

Link to comment
17 minutes ago, Frank1940 said:

 

You also should be checking your PSU to make sure it can supply the +12V power required to handle four new drives. It has long been a rule of thumb that each drive draws 2A at peak current during spin-up. Supplies with dual +12V rails often have issues with this.

Power shouldn't be an issue. While adding the new drives, I removed four 2TB 7200 RPM drives; that was actually part of the reason to upgrade, since I wanted to save some power. The replacement drives are 8TB 5400 RPM WD Reds.

Link to comment
2 hours ago, johnnie.black said:

You need to check/replace cables before trying again; if more than one device keeps dropping, you have serious hardware issues and need to fix those first.

Since the new cables won't be in until Friday, would it be worth it to put my drives in another machine to test? I have a machine with an E5-1650, 48GB of RAM, enough SATA ports on the mobo, and a power supply able to handle all the drives.

Link to comment

It's still going strong, although it usually takes longer than this to start screwing up. Monitoring the log, though, I'm seeing quite a few entries like this:

Mar  6 21:44:48 Tower kernel: sd 7:0:4:0: attempting task abort! scmd(ffff880408fdb948)
Mar  6 21:44:48 Tower kernel: sd 7:0:4:0: [sdg] tag#0 CDB: opcode=0x85 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
Mar  6 21:44:48 Tower kernel: scsi target7:0:4: handle(0x000d), sas_address(0x4433221105000000), phy(5)
Mar  6 21:44:48 Tower kernel: scsi target7:0:4: enclosure_logical_id(0x500304801c82c501), slot(6)
Mar  6 21:44:48 Tower kernel: sd 7:0:4:0: task abort: SUCCESS scmd(ffff880408fdb948)

Normal? What is that even saying?
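From what I've read, opcode 0x85 is an ATA pass-through command (often SMART polling), so these may just be health checks timing out and being retried. In the meantime I'm counting how often they pile up (stock Unraid log path):

# Count task-abort events so far in this boot's log
grep -c 'attempting task abort' /var/log/syslog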

Link to comment

Archived

This topic is now archived and is closed to further replies.
