willdouglas

Everything posted by willdouglas

  1. I think this is solved. Like a true crawlspace scientist, I changed two things at once, so I'm not sure which was the solution, but it's fixed and I'm out of the crawlspace. I swapped out my SATA cables for new cables and moved half the drives to a new splitter fed from a different power feed, which is probably on another rail. No more scrolling errors, and the rebuild is proceeding at a rate I can comprehend as a human.
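     (For anyone landing here with the same symptoms: a rough way to confirm a cable/power swap actually stopped the errors is to keep an eye on the kernel log while the rebuild runs. Nothing below is specific to my setup.)
     ```
     # Show only recent disk/link-related kernel messages.
     dmesg -T | grep -iE 'ata[0-9]+|link reset|i/o error' | tail -n 20

     # Re-check every minute to confirm no new errors are appearing.
     watch -n 60 "dmesg -T | grep -iE 'ata[0-9]+|reset|error' | tail -n 10"
     ```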
  2. I moved the drives from an LSI card with SAS breakout cables to onboard SATA the last time I encountered an issue like this, and at the time I assumed the card was failing. I am using power splitters. I wouldn't think it was a power issue, but it is the last common factor between the ports experiencing issues. New splitters and cables are inbound today; I'll swap them in when they arrive. If the issue persists I'll swap the PSU and see if that makes a difference.
  3. I had a disk fail, and every replacement so far has also errored out, run incredibly slow, refused to mount, failed to format, failed the rebuild, or experienced a combination of the aforementioned issues. They're not used disks, and I'm pretty sure the first few had previously been through pre-clear, though that was many months ago and they've just been hanging around unassigned waiting for a failure. I've RMA'd the original failure and two brand new drives from the same batch thinking I just had bad luck, but I've moved on to a newer batch and different manufacturers with no better luck. I've tried swapping cables and moving to different ports but see the same behavior. At this point the server is refusing to see one drive and refuses to mount or pre-clear two others, but is otherwise functional. I'm pretty sure the original data was lost, so I'm not so concerned about rebuilding for the sake of recovery as I am about rebuilding for the sake of having a fault-tolerant array again, so I can move it all over to new hardware I have staged and waiting. I've been through a couple of read checks, I haven't been offered the opportunity to rebuild the array in a few weeks, and at this point attempting to select any of the drives that do show up causes the selected drive to disappear from the dropdown. I'm seeing errors scroll in the logs no matter where I look, so I'm not finding the signal in the noise.
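     (For anyone trying to cut through similar log noise, something along these lines is a rough starting point; device names are placeholders and will differ per system.)
     ```
     # Pull only the disk-related lines out of the syslog (Unraid keeps it at /var/log/syslog).
     grep -iE 'ata[0-9]+|i/o error|medium error|reset' /var/log/syslog | tail -n 50

     # SMART health summary for each drive, focusing on the attributes that
     # actually predict failures (and UDMA_CRC, which usually points at cabling).
     for dev in /dev/sd[a-z]; do
         echo "=== $dev ==="
         smartctl -H -A "$dev" | grep -E 'overall-health|Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
     done
     ```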
  4. Had some life events get in the way of my troubleshooting. Memtest wouldn't boot for me from the unraid install, so I pulled the newest version and ran that. It took a few days but came out perfect, four passes and no errors. I've moved on to full rebuild mode in the interest of having a working system in place during coronavirus lockdown. New bootable USB drive with a fresh image of 6.8.3. Currently pre-clearing all my disks because I was seeing errors on one of the new drives already.
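     (In case it helps anyone else burning in replacements: after a pre-clear or any full write pass, these are the SMART counters worth eyeballing before trusting a new drive. The device name is a placeholder.)
     ```
     # /dev/sdX is a placeholder. Non-zero pending or reallocated sectors on a
     # brand new drive are a bad sign.
     smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'

     # Kick off a long self-test, then read the result once it finishes
     # (this can take many hours on large drives).
     smartctl -t long /dev/sdX
     smartctl -l selftest /dev/sdX
     ```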
  5. Ran the file system check and let it correct the errors; the outcome wasn't pretty. I did upgrade the RAM in this host a few weeks before the first failure. I'm leaving memtest running overnight and tracking down some spare DIMMs. I also checked the drives that have been pulled in a different machine; they're definitely a mixture of dead/dying/busted. One refuses to spin up at all, and the other two throw tons of errors but will allow me to view the directory structure before refusing to function beyond that.
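     (For reference, assuming XFS data disks, the file system check the GUI wraps is xfs_repair; run by hand, with the array started in maintenance mode, it looks roughly like this. The disk number is a placeholder.)
     ```
     # Dry run: report problems without writing anything. /dev/md1 is disk1; adjust the number.
     xfs_repair -n /dev/md1

     # If the dry run looks sane, run the actual repair.
     xfs_repair /dev/md1

     # If it complains about a dirty log, -L zeroes the log as a last resort;
     # that can discard recent metadata, which may be why the outcome "wasn't pretty".
     # xfs_repair -L /dev/md1
     ```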
  6. If there's a low-effort way to get the shares back to normal I'll give it a shot, but I think I can pull from the individual drives via SCP and get a decent transfer rate. I'll sort it out on the other side.
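     (Pulling straight from the individual disk mounts works for this, since each data disk is mounted on its own at /mnt/diskN regardless of the state of the user shares. Hostname and paths below are placeholders; rsync tends to be friendlier than scp for multi-TB pulls because it can resume.)
     ```
     # One-shot copy of a directory from a single data disk.
     scp -r root@tower:/mnt/disk1/Media /local/destination/

     # Resumable copy with progress, run from the receiving machine.
     rsync -avP root@tower:/mnt/disk1/Media/ /local/destination/Media/
     ```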
  7. Rebuild completed, disk check/repair run, and user shares are available, but several show as empty when navigated over the network. I can browse the filesystem, for instance my TV directory, from the webgui, but actually attempting to access it over the network displays an empty directory. cadance-diagnostics-20200308-0943.zip
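     (A quick way to narrow down whether this is the user-share layer or just SMB misbehaving; the share name is an example.)
     ```
     # If the files show up here, the user share itself is intact and the
     # problem is on the SMB side; if this is empty too, the share is the issue.
     ls /mnt/user/TV | head

     # Compare against the individual disks that should hold that share.
     ls -d /mnt/disk*/TV

     # Basic SMB-side checks: validate the config and list active sessions.
     testparm -s
     smbstatus
     ```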
  8. Will wait for rebuild completion, run file system checks, and report back. Thanks!
  9. Had some drives fail, luckily within warranty. The first drive was replaced with a "hot spare" drive I leave in the server but don't allocate. During the rebuild another drive failed, but the rebuild completed. The first drive was removed and replaced, and I tried to begin the next rebuild, but I observed some funny behavior that persisted through reboots: the array had difficulty starting and stopping, missing VMs, missing docker apps. The issues cleared when I pulled the second failed disk out. I also moved all the drives to the SATA ports, since all the failures were on the same SAS breakout. The rebuild completed, but another drive reported as bad during the juggling. I'm pretty sure I went beyond my failure limit at some point during the rebuilds and all the swapping around; my disk space utilization has dropped about 5TB from start to finish. My array is currently reporting good drive health, but it's also reporting no user shares, and when I try to start VMs I'm getting an error. I thought I might be able to add the shares back, but when I try, nothing happens and "starting services..." is added to the status at the bottom of the GUI. I'm not looking to get back to where I was, more just wondering how to make the media I can recover available so it can be pulled off. Ideally I'd like to recover what I can and re-provision the whole setup from scratch. There have been some hardware and configuration changes to the machine through its life, and I'm sure I've got a nice pile of mistakes stacked precariously on top of other mistakes. I don't want to make things worse for myself during the recovery process. There is nothing I care about on this array, but there is a lot of stuff, so I'd like to save time/bandwidth/admin work if possible. cadance-diagnostics-20200307-1231.zip
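     (To get a rough picture of what actually survived on each data disk before deciding what's worth pulling off; these commands only read, nothing is modified.)
     ```
     # Per-disk usage, to see where the ~5TB drop actually landed.
     df -h /mnt/disk*

     # Top-level folders per disk, to see which shares still have content.
     du -sh /mnt/disk*/* 2>/dev/null | sort -h | tail -n 30
     ```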
  10. Ten days of uptime. I'm comfortable leaving well enough alone. End thread.
  11. Figured I should follow up with this. I never figured out what the issue was or managed to pull usable info out of a crash. I replaced the USB drive with a newer one, and on the fresh install I only installed the official plex docker. All other apps are running in an unraid VM until I feel like investing time in this again.
  12. Alright, cutting down the docker apps to sonarr, nzbget, and plex I've somehow extended average uptime to three or four days, and I have a new symptom to add. If I catch the lockup after network access drops, but before the "hangcheck value past margin" messages start appearing, there is limited access to the box at the console. I wasn't able to log in, but I was able to get it to respond to key presses at what felt like a fifteen- to twenty-second delay. I ended up rebooting rather than fight with it because there was a request for Plex access from family, but I'll re-enable logging and see if I can catch it quickly enough. I do have an IPMI controller onboard, so I'll see if I can get into the box next time to poke around and maybe dump logs to another box for easy viewing.
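     (Two things that tend to help with catching lockups like this after the fact: a serial-over-LAN console through the IPMI controller, which stays reachable when the network stack wedges, and the BMC's own event log. The IP address and credentials below are placeholders.)
     ```
     # Open a serial-over-LAN console from another machine on the LAN.
     ipmitool -I lanplus -H 192.168.1.50 -U admin -P password sol activate

     # Dump the BMC's hardware event log, which sometimes records the cause
     # of a hang (thermal trips, ECC errors, power events).
     ipmitool -I lanplus -H 192.168.1.50 -U admin -P password sel list
     ```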
  13. I'm going to, possibly unfairly, blame the binhex-radarr plugin. They did warn me; it said it was under active development, but I've been using radarr for a while, how bad could it be?! I'm at just over 48 hours of uptime, and I'm going to put the box back in the crawlspace where it lives and continue without radarr for now, maybe run it on another host for the time being.
  14. Fail! Made it somewhere around nineteen hours, I think. Going to leave Radarr disabled and see how long of an uptime counter I can accrue. EDIT: forgot to mention I upgraded to 6.5.0 before starting radarr, so I was on a fresh reboot and upgrade, possibly hosing my troubleshooting by changing a variable. It seemed so innocent, so safe! Uptime was great! FCPsyslog_tail.txt cadence-diagnostics-20180314-0542.zip
  15. 72 hours uptime. Enabled deluge, nzbget and sonarr all at the same time yesterday and triggered a couple searches to see if they were playing nicely together. No lockups in the last 24 hours. At this point I only have radarr and plexpy to re-enable and I'm kinda tempted to just not enable them, but I will forge ahead and enable radarr.
  16. I've broken 48 hours of uptime. Feel safe ruling out plex at this point. Time for the next app!
  17. I'm breaking records over here with twenty hours of uptime without a lockup. Once I hit 24 hours I'll start the first app, probably plex, then 24 hours later start another, and so on. I thought I had them all installed and running before the lockups started, but I might not have. I did update the library shares before the crashes started, so maybe it'll be tied to plex usage as people started adding load.
  18. I'm not seeing anything pop up using stress or memory testers. I also updated my post with the correct SAS controller; it's an LSI 9211-8i. I also verified it has the most recent IT firmware flashed (20.00.07.00-IT), nothing newer on their site anyway. Going to start trying permutations of apps, I guess. I feel like the hardware is good; the only new purchases for this were the drives and the controller, and the server previously did duty as a virtualization server for training, so if it were going to flake out I'd hope to have seen it impacting VMs before now.
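     (In case anyone wants to run the same check: the firmware and BIOS versions on a 9211-8i can be read back with LSI's sas2flash utility, roughly like this.)
     ```
     # List every attached LSI controller with its firmware and BIOS versions.
     sas2flash -listall

     # More detail for a specific controller (index 0 here).
     sas2flash -c 0 -list
     ```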
  19. Have you observed any output at console when the system locks up? I'm seeing a similar issue and am also elbow deep in hardware tests that seem to pass with flying colors.
  20. Logs from second lockup. cadence-diagnostics-20180310-0257.zip FCPsyslog_tail.txt
  21. Daily lockup achieved, logs captured and attached. cadence-diagnostics-20180309-1217.zip FCPsyslog_tail.txt
  22. I will do that now. 24 hours of memory tests haven't revealed any issues so it's time for the next thing.
  23. Too impatient to let this thing repeatedly freak out while I'm away; it feels like flirting with disaster. Redid the PSU calcs to make sure I'm not underpowering anything, and I have quite a bit of leeway. Removed the tuner card since it wasn't being used anyway, flashed the BIOS (probably the same revision, forgot to check), and I'm currently running Memtest to verify the RAM and looking for some good burn-in software.
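     (On the burn-in front, the usual low-effort option is stress or stress-ng; a rough example that loads CPU, memory, and I/O for an hour. Worker counts and sizes are just examples, tune them to the hardware.)
     ```
     # 8 CPU workers, 4 memory workers at 2GB each, 2 I/O workers, for one hour.
     stress --cpu 8 --vm 4 --vm-bytes 2G --io 2 --timeout 3600s
     ```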