
SOLVED: unRAID 6 beta 12 hung while running 'Mover'


Porterhouse


Hi,

I see a few people reporting 'hanging' problems with v6b12.

I also have a v5.0.5 server which has been running fine and stable since day 1, a good year ago now.

I've been wanting to build a ('duplicate') v6 server to experiment with plugins and Docker containers, and gave up waiting for the finished version 6. I had just finished the original (2-day) parity build of the new server, changed my DNS pointers around to make it #1, and demoted my v5 server to #2.

I kicked off 'Mover' within a day of that parity build completing, to copy a wad of new rips, but Mover never came back to me with any progress updates. I gave it plenty of time, well beyond the expected duration for the volume of rips queued, and then tried to 'wake it up'. By that stage I could still login on a console which I happened to still have in place (but not to telnet as I hadn't yet installed unmenu) and I could still ping out and ping it, but I couldn't read any drives/mounts from the console or from my W8 laptop. Although the lights were still on, there was 'no-one home'. 'powerdown' would not execute (in the background or in the current shell), and 'shutdown now' would not execute either. Cache appeared (from my http session) to still not be empty but all attempts to wake it up had failed. Pressed reset switch - the svr appeared to reboot but did not come back to life, so had to power it off/on.

When re-powered it came back with parity errors so parity had to be re-done. 2 days later ... it's all back to working state, but HUGE worry that I might have lost weeks' worth of work. Can someone please check my syslog and advise what may have been (or may still be) the cause? I'm now seeing a lot of worrying errors indicating a scsi issue.

Syslog is far too big to attach so I will have to 'trim' it down somehow - any suggestions on what to lose are welcome.

Thanks.

 

syslog.20150127.subset.txt

Link to comment

Don't completely understand your description of what you have done to troubleshoot.

... By that stage I could still login on a console which I happened to still have in place (but not to telnet as I hadn't yet installed unmenu)
unMenu is not required for telnet. Perhaps you were thinking about the screen package that you install from unMenu, but telnet is built-in to unRAID and nothing else needs to be installed. Did you actually try to telnet?
and I could still ping out and ping it, but I couldn't read any drives/mounts from the console
how did you try to read the drives from console? Did you try
ls -al /mnt

or what?

... Cache appeared (from my http session) to still not be empty
Are you referring here to the webGUI shares page where it indicates that there is still data on cache? If not then I don't know what you mean by "http session".
but all attempts to wake it up had failed. Pressed reset switch - the svr appeared to reboot but did not come back to life, so had to power it off/on.

When re-powered it came back with parity errors so parity had to be re-done.

You don't specifically mention it at this point, but normally if you power down without stopping the array, then unRAID will automatically begin a correcting parity check when it restarts. Is this what happened, or did you have to start a parity check yourself, or do you mean that you did a parity rebuild in some other way? Did unRAID actually say it had parity errors?
2 days later ... it's all back to working state, but HUGE worry that I might have lost weeks worth of work.
If it is working now, you should be able to see if you have lost anything so I'm not sure what you mean here. Is your data there or not?
Can someone please check my syslog and advise what may have been (or may still be) the cause and I'm now seeing a lot of worrying errors indicating a scsi issue.

Syslog is far too big to attach so I will have to 'trim' it down somehow - any suggestions on what to lose are welcome.

Thanks.

This syslog is mostly just the same stuff over and over, and it might be just as well that you trimmed it if there was nothing else to see. However, it is not very useful and a full syslog from boot up would be better.

 

Syslogs are seldom too big to attach despite many people saying so on this forum. They are just text and so they compress very well. Just zip it next time. Also, the syslog "rotates" and you can often find the earlier parts of your syslog in /var/log with names like syslog.1, syslog.2, etc.
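As a rough sketch of the compression suggestion above, the current syslog and any rotated copies can be bundled into one archive from the console. The function name `bundle_logs` and the default output path on the flash drive (`/boot`) are assumptions for illustration, and `tar` with gzip is swapped in for zip since it is more likely to be present on a bare console.

```shell
# Bundle the live syslog plus rotated copies (syslog.1, syslog.2, ...)
# into a single compressed archive that is small enough to attach.
# bundle_logs is a hypothetical helper name; the default paths are assumptions.
bundle_logs() {
    log_dir="${1:-/var/log}"
    out="${2:-/boot/syslog-$(date +%Y%m%d).tar.gz}"
    # Run in a subshell so the caller's working directory is untouched.
    # "out" should be an absolute path because of the cd.
    ( cd "$log_dir" && tar -czf "$out" syslog* ) || return 1
    ls -l "$out"
}
```

Calling `bundle_logs` with no arguments would use the assumed defaults; text logs of this kind typically shrink to a small fraction of their original size.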

 

Does unRAID indicate any drive problems right now in the webGUI? Try to get us a better syslog.

Link to comment

Apologies, my 30 year old unix skills are very rusty now, so pls excuse my half soaked explanation.

 

Syslog copied from /var/log was 1.4 MB (1,433,600 bytes) so would not attach to the post as-is, hence the reason for arbitrarily cutting out the middle. At some point the entries since boot were overwritten despite not rebooting again since 26/01, so there is no log data from around the reboot time. The start of the file IS the start of the log as at the time of the copy from /var/log.

 

You are correct in that I couldn't actually telnet, as my crappy W8 doesn't support telnet natively and I didn't have an alternative app in place to do this, but I tried //servername:8080 and it did not connect. I could still ping out from the server to the router, etc, so I concluded that networking at least was still working on the server even if some services were hung.

 

w.r.t mounted disks, I tried 'cd /mnt/cache' etc, and the command would not complete. So had to ^C out.
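A hedged sketch of how a hung mount could be probed without tying up the console: wrap the directory listing in `timeout` so the shell gets control back after a few seconds. The helper name `probe_mount` is hypothetical, and this assumes the coreutils `timeout` command is available. A process stuck in uninterruptible I/O may not respond to the signal at all, so this only helps in cases like the one described, where ^C still worked.

```shell
# Try to list a directory, giving up after a number of seconds.
# probe_mount is a hypothetical helper; "timeout" comes from GNU coreutils.
probe_mount() {
    dir="$1"
    secs="${2:-5}"
    if timeout "$secs" ls -A "$dir" >/dev/null 2>&1; then
        echo "$dir: responsive"
    else
        echo "$dir: no response within ${secs}s (or not readable)"
        return 1
    fi
}
```

For example, `for d in /mnt/cache /mnt/disk1 /mnt/user; do probe_mount "$d" 5; done` would show which mount points still answer (the disk names here are only examples).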

 

Cache appeared to still not be empty from the WebGUI despite having been left FAR longer than it would normally take to move that volume of cache'd data to a disk.

 

Yes, parity-check started automatically when unRAID came back up (powered off/on) and yes the data is now back, recovered, thankfully.

 

But my concern is, what are the chances this will happen again SOON? I now have 16 TB and rising, with more disks on order. I'm now getting lots of scsi error messages (from the LSI 3ware 9650SE24) in the log which did not appear until this hang occurred. So I'm wondering if the hang was caused by a scsi error/failure or a disk problem, or if the scsi errors are a result of the reboot.

I would appreciate some advice on how to debug this further: whether to now shut down, reboot and fsck on boot, or to fsck via unmenu (now installed) before rebooting, or whether to just bite the bullet and buy a new LSI card. Note that I have the exact same LSI card in my 5.0.5 server and it's rock solid. Both have been upgraded to the latest firmware following LSI advice to get above-2TB compatibility from this card.

 

Any advice/guidance/thoughts gratefully received.

 

The current tail end of /var/log/syslog, taken from the unMenu Syslog page, is:

 

Jan 27 22:51:59 MEDIAsvr1 emhttp: shcmd: shcmd (18820): exit status: 22 (Other emhttp)
Jan 27 22:52:09 MEDIAsvr1 emhttp: shcmd (18822): /usr/sbin/hdparm -y /dev/sdj &> /dev/null (Drive related)
Jan 27 22:52:09 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:09 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:09 MEDIAsvr1 emhttp: shcmd: shcmd (18822): exit status: 22 (Other emhttp)
Jan 27 22:52:19 MEDIAsvr1 emhttp: shcmd (18824): /usr/sbin/hdparm -y /dev/sdj &> /dev/null (Drive related)
Jan 27 22:52:20 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:20 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:20 MEDIAsvr1 emhttp: shcmd: shcmd (18824): exit status: 22 (Other emhttp)
Jan 27 22:52:23 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:23 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:23 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:23 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:24 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:30 MEDIAsvr1 emhttp: shcmd (18826): /usr/sbin/hdparm -y /dev/sdj &> /dev/null (Drive related)
Jan 27 22:52:30 MEDIAsvr1 emhttp: shcmd: shcmd (18826): exit status: 22 (Other emhttp)
Jan 27 22:52:30 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:30 MEDIAsvr1 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85. (Errors)
Jan 27 22:52:31 MEDIAsvr1 unmenu[12834]: awk: ./unmenu.awk:473: fatal error: internal error: segfault (Errors)
Jan 27 22:52:31 MEDIAsvr1 unmenu[12834]: ./uu: line 55: 30397 Aborted                awk -W re-interval -f ./unmenu.awk 2>&1 < /dev/null
Jan 27 22:52:31 MEDIAsvr1 unmenu-status: Exiting unmenu web-server, exit status code = 134
Jan 27 22:52:31 MEDIAsvr1 unmenu[12834]: exit status 134 - unmenu.awk will be re-started
Jan 27 22:52:31 MEDIAsvr1 unmenu-status: Starting unmenu web-server

Total Lines: 5000
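Since a log like this is mostly the same few messages repeated, one way to trim it sensibly is to strip the timestamp and hostname and count each distinct message. This is a minimal sketch under the assumption that every line starts with the usual "Mon DD HH:MM:SS host" prefix; `summarize_syslog` is a hypothetical name, not an unRAID or unMenu tool.

```shell
# Count distinct syslog messages, ignoring the "Mon DD HH:MM:SS host " prefix.
# Usage: summarize_syslog FILE [MAX_LINES]   (hypothetical helper)
summarize_syslog() {
    sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' "$1" \
        | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

Running `summarize_syslog /var/log/syslog` puts the most frequent messages first, which makes it easier to see what is flooding the log and what the rare, possibly more interesting, entries are.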

 

Link to comment

Very hard to look at your log in this form. Can you not zip a more complete log and attach it?

 

What exactly are you proposing to run fsck on?

 

You might start with a memtest since that would be easy and could eliminate one potential problem.

 

Could you elaborate on your hardware a little more? I don't really have any experience with SAS cards but plenty of people here do so maybe someone else will comment.

 

Link to comment

My hardware is a Gigabyte Z97N mobo (Mini-ITX, DDR3, USB3, Gb LAN, HDMI) with an Intel G3258 CPU (dual core 3.20GHz, Socket H3 LGA-1150) and an LSI 9650SE 24-port PCIe card, connected via 8-way SAS-SATA fan-out cables and 3x 5-way hot swap bays to WD Red SATA disks.

At the moment I only have 4x 4TB data disks + 4TB parity + a 256GB Samsung Pro SSD cache. I do have 4 random discs fitted as placeholders but not assigned. These were only intended as placeholders for SSDs to be added to the cache pool later, but I haven't got that far yet.

 

I have attached several zipped files, including the syslog and output from lsdev, lspci, etc.

 

Hope this is now more informative

 

I presume you're suggesting that I reboot and then run memtest at the Grub menu - I'm not aware how else to run memtest (it doesn't appear to be an option in unMenu or elsewhere). I have so far avoided rebooting because I gather this flushes the logs, but it looks like it's already doing that itself, daily.

 

Please confirm that you want me to reboot and then run memtest from the Grub menu - or give me the syntax to run at the console.

 

Thanks 

Logs__Config_files.zip

Link to comment

Do you mean you have some drives attached but not currently assigned? Since they aren't doing anything you might try eliminating them to simplify your system for diagnostic purposes.

 

Yes, you have to boot into memtest. Memory cannot be tested with the OS running. You can select it from the boot menu that appears before unRAID begins to load.

 

While you have things shut down to eliminate unused disks, you might as well also reseat cards, cables, and power connectors.

Link to comment
  • 4 weeks later...

The investigations with LSI (Avago.com) took many 'cycles' because their feedback is so cryptic that it requires more questions to understand what they are telling you. Each cycle takes >2 days, or 4-5 days if it overlaps a weekend, plus time to then investigate further (whilst also working away from home 4-5 days a week). Hence this investigation has taken me weeks.

LSI did point to errors from certain disks (captured from the 'lsigetlunix' command/dump) and lots of errors could also clearly be seen in the unRAID logs, but it turned out that the 'errors' filling up the logs were actually due to commands issued by the OS (e.g. by hdparm) which this particular LSI/3Ware/MegaRAID (Dell/HP rebranded) card didn't recognise. I also got the expected 'unsupported ... suggest replacing' advice from LSI.
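For context, SCSI opcode 0x85 is ATA PASS-THROUGH (16), the kind of wrapped ATA command that hdparm sends, which fits the explanation above. A small sketch for seeing which drives the spin-down calls were aimed at: it pulls the device name out of each logged hdparm command and counts them. The helper name `hdparm_targets` is hypothetical, and the pattern assumes the log format shown earlier in this thread (`/usr/sbin/hdparm -y /dev/sdX`).

```shell
# Count how often each device appears as the target of a logged hdparm call.
# hdparm_targets is a hypothetical helper; it only reads the log file.
hdparm_targets() {
    grep -o '/usr/sbin/hdparm [^ ]* /dev/[a-z]*' "$1" \
        | awk '{ print $NF }' \
        | sort | uniq -c
}
```

If every hit is a disk behind the RAID card, that points at the card rejecting the command rather than at the disks themselves.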

So, given that the data and the effort required to create it were worth more than the £1000 9650SE/24i RAID card causing the problem, I decided to bite the bullet and replace it with an LSI 9201-16i HBA costing around £340, and hey, no more errors, and it's reading the disks at SATA III 6Gb/s spec now.

Also note that I was extremely worried about changing 'controllers' in case it forced a reformat, but I needn't have worried. Since my disks were set up in single disk mode on the 9650 RAID card, I had no problem moving them to the 9201 HBA. Just follow this process: first disable parity (stop array, unassign parity, start array); then stop the array and power off; move the disks to the new HBA; power up, run 'New Config', assign the parity disk, restart the array, and wait (a long time) until parity is rebuilt. Job done.

The only weird thing about this HBA is that it seems to pick up the disks in a random order, not consistent with the SAS/SATA port numbers etc, but hey, it's no biggy I suppose. 
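Because the sdX letters can change between boots or controllers, it can help to record which serial number sits behind which device before and after a move. On Linux systems with udev, the symlinks under /dev/disk/by-id embed the model and serial. This is a sketch assuming that directory exists on the server; `list_disk_ids` is a hypothetical helper name.

```shell
# Print "persistent-name -> device" for each whole disk, skipping partitions.
# list_disk_ids is a hypothetical helper; the default directory is the usual
# udev location and is an assumption about the server in question.
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        case "$link" in *-part[0-9]*) continue ;; esac
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink -f "$link")"
    done
}
```

Saving the output to a file before the hardware change and comparing it afterwards shows that the same serials are present even if the device letters have shuffled.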

A note for anyone interested enough to read the outcome of this: LSI are in the process of launching a 9300 series HBA, which will apparently be faster still than the 9201, at 12Gb/s I believe.

I guess I'll never know for sure if the volume of 'errors' did cause unRAID to hang, but I think it's a fair bet, so I will close this thread as RESOLVED.

Link to comment

Archived

This topic is now archived and is closed to further replies.
