Takes multiple reboots to get all drives to show correctly


Sparkum


Really hoping someone can tell me what to do,

 

Every time I want to reboot my unRAID server, and this has been the case since day one, it turns into an hour+ job of rebooting, finding disks unassigned, rebooting again, etc., until they are all back up.

 

For example, I went to upgrade unRAID last night. It took me 90 minutes and the update actually failed, so I had to roll back.

 

I would say I rebooted 15+ times

 

It's definitely slowly making me hate unRAID when a simple reboot is actually 5+ reboots, and I typically have to type "shutdown -r now" into the terminal at least once.

 

Please! Let me know what you need from me or want me to do and I will do it!

 

thanks

Link to comment

Attached Diagnostics.

 

Sorry, I'm not sure how it works; I assume it just clumps it all together. Essentially every reboot yesterday was a fail until the last one.

 

And when the computer reboots I do see every single hard drive; they are just sitting in unassigned.

 

Um....

 

I can definitely get more specific after work.

 

ASRock mobo, i5, flashed LSI card, 16GB RAM, 500W power supply,

 

11 hard drives, another 2 parity, and 2 cache drives,

 

Mainly all 2TB, I think three 1.5TB, then 240GB Kingstons for cache.

 

Then a 4GB Lexar flash drive.

tower-diagnostics-20170209-1139.zip

Link to comment

..... yep no got that..

 

I'm asking questions, making comments, initiating 2 way conversation.

 

Not sure how it's being taken, but apparently not how it's meant.

 

My last comment was stating that I didn't provide nor do I apparently have any logs covering the time frame asked for, if that wasn't clear.

 

My previous comment was me being surprised that Trurl said it was a hardware fault, since I could see the hard drives pre-unRAID....

 

So ya, apparently this isn't going anywhere and I shouldn't bother asking for help.

 

 

.... with respect.

 

 

Link to comment

Logs do not cover reboots obviously. You already posted a log in your diagnostics.

 

What do you think unRAID is doing differently when you reboot? The correct answer is nothing, since rebooting does not reconfigure anything about the software. That's why I said rebooting is seldom the solution.

 

But rebooting does have an effect on your hardware and sometimes you get lucky and it works. That's why I said it was a hardware problem.

Link to comment

Ya I see that about the logs now.

Next time it happens I'll start grabbing logs (if they are of any help).

 

Typically it's just a panic trying to get it back up.

 

Here's my question though, and I fully get what you are saying.

 

So this has been happening for... roughly 4 months now, so I've done a lot of googling since then.

 

I read a post that suggested rebooting with IE.

 

Now when I do this, I have a much greater chance of everything coming back up.

 

Just a dumb coincidence perhaps?

 

Additionally, I've used the ControlR app to reboot the server (just once) and that worked flawlessly.

 

So I seem to get much different results depending on what I use.

Link to comment

There have been some reports that some settings in the webUI are not saved correctly by some browsers. The solution has been to use a different browser to save the settings. Once the setting is saved it should work from then on until you change it.

 

Also, the webUI doesn't always work well with adblockers, so if you are using one, whitelist your server.

Link to comment

I definitely use an adblocker on Firefox (my browser of choice).

 

I typically try to remember to reboot with IE though (as I have a much higher success rate with it)

 

Could any of this be due to my USB?

I'd definitely say my USB might not be the best. It's just a Lexar I had lying around; it worked, so I continued on with it.

 

But if not, then I won't worry about it.

Link to comment

Haha, so far (minus this if this is mobo related) I'm a fan of it.

 

Maybe I'll start paying more attention to which drives.

 

See if it's an "always drives 3, 5, 6, 9" kind of thing.

 

I have an LSI card and mobo connections, and the 8 or so times I rebooted yesterday I saw the LSI card come up on the splash screen 100% of the time.

 

So I'll def start pen-and-papering this and ya, maybe it's the mobo and I just need a second LSI card.
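For anyone who prefers a script to pen and paper, here is a minimal sketch of that tally. Everything in it is made up for illustration: the serial numbers, the "LSI"/"mobo" labels, and the `tally_missing` helper are all hypothetical, and the list of drives seen after each boot would have to be typed in by hand.

```python
from collections import Counter

def tally_missing(expected, boots):
    """Count how often each expected drive was absent.

    expected: dict mapping drive serial -> controller label (hypothetical labels).
    boots:    list of sets, each holding the serials that showed up after one boot.
    Returns (missing_per_drive, missing_per_controller) as Counters.
    """
    missing = Counter()
    for seen in boots:
        for serial in expected:
            if serial not in seen:
                missing[serial] += 1
    by_controller = Counter()
    for serial, count in missing.items():
        by_controller[expected[serial]] += count
    return missing, by_controller

# Made-up example data: three drives, two boots.
expected = {"SERIAL-A": "LSI", "SERIAL-B": "mobo", "SERIAL-C": "mobo"}
boots = [{"SERIAL-A", "SERIAL-B"}, {"SERIAL-A"}]
per_drive, per_controller = tally_missing(expected, boots)
print(dict(per_drive))
print(dict(per_controller))
```

If the same serials or the same controller keep showing up in the tally, that narrows things down; if the misses are spread evenly, something shared by all drives (such as power) becomes more plausible.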

 

That would be the best $100 I've ever spent because I cringe when I have to do a reboot, and I literally set aside hours to do it.

 

I was up until almost 2 am last night and had to be up for 6:30 because I needed to get it online before work.

Link to comment

Did you make any progress?

 

Really hoping someone can tell me what to do,

 

Please! Let me know what you need from me or want me to do and I will do it!

 

Your SAS card seems to be resetting repeatedly. I'm sure others can give you more advice but I'd check the cables between it and the drives and make sure it's seated in its slot properly.

 

syslog:

Feb  9 02:56:58 Tower kernel: mpt2sas_cm0: fault_state(0x265d)!

Feb  9 02:56:58 Tower kernel: mpt2sas_cm0: sending diag reset !!

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS

Feb  9 02:57:00 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: log_info(0x30030100): originator(IOP), code(0x03), sub_code(0x0100)

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: log_info(0x30030100): originator(IOP), code(0x03), sub_code(0x0100)

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(02.15.63.00), ChipRevision(0x03), BiosVersion(07.01.09.00)

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: Protocol=(

Feb  9 02:57:00 Tower kernel: Initiator,Target

Feb  9 02:57:00 Tower kernel: ), Capabilities=(

Feb  9 02:57:00 Tower kernel: Raid,TLR

Feb  9 02:57:00 Tower kernel: ,EEDP,Snapshot Buffer

Feb  9 02:57:00 Tower kernel: ,Diag Trace Buffer,Task Set Full

Feb  9 02:57:00 Tower kernel: ,NCQ<6>)

Feb  9 02:57:00 Tower kernel: mpt2sas_cm0: sending port enable !!

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: port enable: SUCCESS

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for end-devices: start

Feb  9 02:57:07 Tower kernel: scsi target1:0:1: handle(0x0009), sas_addr(0x4433221100000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:1: enclosure logical id(0x5782bcb00a076a00), slot(7)

Feb  9 02:57:07 Tower kernel: scsi target1:0:2: handle(0x000a), sas_addr(0x4433221103000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:2: enclosure logical id(0x5782bcb00a076a00), slot(4)

Feb  9 02:57:07 Tower kernel: scsi target1:0:4: handle(0x000b), sas_addr(0x4433221101000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:4: enclosure logical id(0x5782bcb00a076a00), slot(6)

Feb  9 02:57:07 Tower kernel: handle changed from(0x000c)!!!

Feb  9 02:57:07 Tower kernel: scsi target1:0:5: handle(0x000c), sas_addr(0x4433221104000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:5: enclosure logical id(0x5782bcb00a076a00), slot(3)

Feb  9 02:57:07 Tower kernel: handle changed from(0x000d)!!!

Feb  9 02:57:07 Tower kernel: scsi target1:0:6: handle(0x000d), sas_addr(0x4433221105000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:6: enclosure logical id(0x5782bcb00a076a00), slot(2)

Feb  9 02:57:07 Tower kernel: handle changed from(0x000e)!!!

Feb  9 02:57:07 Tower kernel: scsi target1:0:7: handle(0x000e), sas_addr(0x4433221106000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:7: enclosure logical id(0x5782bcb00a076a00), slot(1)

Feb  9 02:57:07 Tower kernel: handle changed from(0x000f)!!!

Feb  9 02:57:07 Tower kernel: scsi target1:0:0: handle(0x000f), sas_addr(0x4433221107000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:0: enclosure logical id(0x5782bcb00a076a00), slot(0)

Feb  9 02:57:07 Tower kernel: handle changed from(0x0010)!!!

Feb  9 02:57:07 Tower kernel: scsi target1:0:3: handle(0x0010), sas_addr(0x4433221102000000)

Feb  9 02:57:07 Tower kernel: scsi target1:0:3: enclosure logical id(0x5782bcb00a076a00), slot(5)

Feb  9 02:57:07 Tower kernel: handle changed from(0x000b)!!!

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for end-devices: complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for raid volumes: start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for responding raid volumes: complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for expanders: start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: search for expanders: complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: _base_fault_reset_work: hard reset: success

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: removing unresponding devices: start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: removing unresponding devices: end-devices

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: removing unresponding devices: volumes

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: removing unresponding devices: expanders

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: removing unresponding devices: complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: expanders start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: expanders complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: phys disk start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: break from phys disk scan: ioc_status(0x0022), loginfo(0x00000000)

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: phys disk complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: volumes start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: break from volume scan: ioc_status(0x0022), loginfo(0x00000000)

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: volumes complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: end devices start

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: end devices complete

Feb  9 02:57:07 Tower kernel: mpt2sas_cm0: scan devices: complete

Feb  9 02:57:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  9 02:57:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  9 02:57:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  9 02:57:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  9 02:57:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

 

This repeated 398 times in 7 hours. It isn't something that rebooting will fix. Also check the backplane, if you have one, and power to the drives.
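A count like that can be reproduced from the syslog with a few lines of Python. This is only a sketch: it assumes the log lines look like the excerpt above, and the `count_faults` helper is a name invented here, not part of unRAID or the driver.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, for example:
#   Feb  9 02:56:58 Tower kernel: mpt2sas_cm0: fault_state(0x265d)!
FAULT_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"mpt2sas_cm\d+: fault_state\((?P<code>0x[0-9a-fA-F]+)\)"
)

def count_faults(lines):
    """Return (total, per_hour) for controller fault_state events.

    per_hour is a Counter keyed by (month, day, hour) taken from the timestamp.
    """
    per_hour = Counter()
    for line in lines:
        m = FAULT_RE.match(line.strip())
        if m:
            per_hour[(m["month"], int(m["day"]), int(m["hour"]))] += 1
    return sum(per_hour.values()), per_hour

# Small demo using two lines copied from the excerpt above.
sample = [
    "Feb  9 02:56:58 Tower kernel: mpt2sas_cm0: fault_state(0x265d)!",
    "Feb  9 02:56:58 Tower kernel: mpt2sas_cm0: sending diag reset !!",
]
total, per_hour = count_faults(sample)
print("fault_state events:", total)
for key, n in per_hour.items():
    print(key, n)
```

To run it against a real log, pass an open file object for the syslog from the diagnostics zip to `count_faults` instead of the sample list.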

 

Link to comment

Hey.

 

So I just reseated the SAS card, even put it into a different slot, and checked all wires/cords.

 

Wrote down all serials of drives connected to the SAS card as well as all of them connected to the mobo.

 

I'm currently in my reboot loop trying to get all drives to come back up, and every time it's the mobo drives that aren't coming up.

 

Not once has it been a SAS card drive, so since I have a second slot for another SAS card..... that might just be my easiest solution (really dumb though).

 

I dunno, I'm also maybe thinking power? Might be time to go to a bigger size.

 

EDIT:

 

Spoke too soon; the latest reboot had 2 SAS card drives missing.

Link to comment

Yes, I understand that.

 

It does however give me a short term fix (until I reboot again) otherwise I just have a heavy paperweight.

 

I was just stating my findings and my guesses on the problem.

 

I stated that I not only reseated my SAS card but also put it into another slot, that all cords were checked and double-checked, and that my power supply "may" be underpowered.

 

Additionally that both mobo and SAS card drives were dropping off.

Link to comment

I would have thought that the most likely cause of the symptoms you are having is a power supply that is under-rated and thus not capable of handling the maximum current when the system tries to spin up all drives simultaneously, as is normal at power on. Reboots work because at that point some of the drives are probably still spinning, so the total current required is less. What power supply do you have, and how many drives are in the system?
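To put rough numbers on the spin-up argument, here is a back-of-the-envelope sketch. The 2.0 A per drive at 12 V is a placeholder assumption, not a measured value for these drives, and the 12 V rail capacity and "other load" figures are hypothetical; the real values come from the drive datasheets and the label on the power supply.

```python
def spinup_load_watts(n_drives, amps_per_drive=2.0, volts=12.0):
    """Estimated peak 12 V load if every drive spins up at the same moment.

    amps_per_drive is an assumed placeholder; check the datasheet of each drive.
    """
    return n_drives * amps_per_drive * volts

def headroom_watts(rail_watts, n_drives, other_12v_watts=0.0, amps_per_drive=2.0):
    """12 V rail capacity minus the other 12 V load and the estimated spin-up load.

    A negative result suggests the rail may be overloaded at power on.
    """
    return rail_watts - other_12v_watts - spinup_load_watts(n_drives, amps_per_drive)

# Example with 13 spinning drives (11 data + 2 parity, as listed earlier in the thread).
# The 400 W rail and 100 W for CPU/board are hypothetical numbers.
print(spinup_load_watts(13))
print(headroom_watts(rail_watts=400.0, n_drives=13, other_12v_watts=100.0))
```

The estimate only covers the moment of simultaneous spin-up; once the drives are spinning, their 12 V draw is typically much lower, which fits the observation that a warm reboot is easier on the supply than a cold power on.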

Link to comment

I would have thought that the most likely cause of the symptoms you are having is a power supply that is under-rated and thus not capable of handling the maximum current when the system tries to spin up all drives simultaneously, as is normal at power on. Reboots work because at that point some of the drives are probably still spinning, so the total current required is less. What power supply do you have, and how many drives are in the system?

 

Reasonable question. I've seen stranger things caused by a flaky or underpowered PSU (reported on the forum, not personally). I once had power issues and had to power on twice in rapid succession to get all drives spinning. Once all drives were spun up, everything was good. This doesn't feel like that, but I can't rule it out.

 

I wanted to answer the question about the USB. No, this is not related to the USB.

 

Also, concerning the swipe at ASRock: lots of people here have ASRock MBs (me included) and I don't agree with the sentiment that they are poor quality. But regardless, MBs do go bad, and I have had a few go bad over the years. They can definitely cause some funky symptoms. MB failures are unique; sometimes they just die. But when they flake out, these are the kinds of things that can happen. I'd put this relatively high on the list, and subject to being ruled out.

 

Stating the obvious, not a single other user is reporting the problems you are reporting. This is something unique to your system and your hardware. It could be an incompatibility, intermittent connectivity (board not fully inserted, cabling loose in the connection chain, etc.), a bad motherboard, bad PSU, bad controller, or bad something else.

 

I can't tell you what it is, but I can tell you what I would do in your situation. I'd break down the server to its most basic setup. Remove add-on controllers. Connect one drive to one motherboard port. On your USB, make a backup of the config folder and then delete the file called super.dat from the real config folder. Boot unRAID. See if it comes up, reboots, and does all the things it is supposed to do. I'd avoid writing to any disks, but I would probably assign the disk to the disk1 slot so you can start the "array". Once you've convinced yourself it is working well, power down. If this works fine, repeatedly, it reduces the likelihood of a MB failure. This test is especially important, because knowing you are on a solid foundation is critical to the isolation process. Once you are convinced, add your controller and connect the one drive there. Repeat. I can't tell you the step-by-step progression, but you want to slowly and methodically add components/drives to the system, confirming it is solid at each step before moving on. I've done this quite a few times with my computers over the years and have had very good luck isolating weird issues. Occasionally, when I was done, everything just worked, which means there was some connectivity issue that I fixed in the process.
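The backup-then-delete step could be scripted roughly as below, for example with the flash drive plugged into another computer. This is a sketch under assumptions: the function name and both paths are hypothetical, and the only detail taken from the advice above is that the file to remove is super.dat inside the config folder.

```python
import shutil
import time
from pathlib import Path

def backup_and_reset_assignments(config_dir, backup_root):
    """Copy the whole config folder, then delete super.dat from the original.

    config_dir:  path to the flash drive's config folder (hypothetical path).
    backup_root: folder where a timestamped backup copy is created.
    Returns (backup_dir, removed), where removed says whether super.dat was deleted.
    """
    config_dir = Path(config_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / f"config-backup-{stamp}"
    shutil.copytree(config_dir, backup_dir)  # fails if backup_dir already exists

    super_dat = config_dir / "super.dat"
    removed = False
    # Delete only after confirming the backup really contains a copy.
    if super_dat.exists() and (backup_dir / "super.dat").exists():
        super_dat.unlink()
        removed = True
    return backup_dir, removed

# Hypothetical usage, paths are placeholders:
# backup_and_reset_assignments("/path/to/flash/config", "/path/to/backups")
```

Putting the original super.dat back from the backup copy should restore the previous drive assignments once the testing is done.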

 

It is important that you not be in denial that this is something unique to your system and not some flakiness of unRaid itself.  You need to be in that mindset to be an effective investigator to unravel the mystery.

 

Good luck!

Link to comment

Hey.

 

Thanks all.

 

Yes, I am definitely on board with the idea that it's me. (Now, I definitely wasn't when I first got here.)

 

I'm leaning towards power myself. It's a Corsair 450 Gold (if memory serves). I definitely didn't have as many hard drives when I bought it, but there are currently 14 spinning drives and 2 SSDs connected.

 

 

Link to comment
