weirdcrap

Everything posted by weirdcrap

  1. I just replaced the SATA cable going to the parity drive. So: new non-Marvell-based controller and a brand new SATA cable. Hopefully that will address my parity disk's link resets during the parity checks. Sent from my Pixel XL using Tapatalk
  2. Fair enough. Any ideas about the rest of it? Those parity errors are currently at the bottom of my list of concerns with this server; it's all duplicated to my main server and replaceable.
  3. I understand that part, I just wasn't sure if it did some writes to the array or anything as part of preparing for the update on reboot. I had started a parity check before the OS update like I said, and it didn't pick up any of those errors. It wasn't until after I had applied the OS update during my check, rebooted the server, and restarted my check that they appeared. It seems odd to me that the parity errors would be caused by anything else, at least as far as writes to the array go. Data only gets copied to this server once a night, around 2 AM CST, via a scheduled rsync script, and it's all moved to the array first thing the following morning by the mover.
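For context, the nightly copy is nothing fancy; it's roughly the kind of cron-driven rsync job sketched below. The script path, host name, and share paths here are placeholders, not my actual setup:

```bash
# Illustrative sketch of the nightly pull described above; paths and host are placeholders.
# cron entry (runs at 2 AM daily):
# 0 2 * * * /boot/custom/nightly-rsync.sh

#!/bin/bash
# Pull last night's new data from the primary server onto this box's cache share;
# the mover then migrates it onto the array first thing in the morning.
rsync -avh --partial primary-server:/mnt/user/share/ /mnt/cache/share/
```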
  4. Update on this.
     -- Moved the disk; the parity check completed fine, but with a couple of initial link resets on the parity disk that I did think (not anymore) may have been caused by the Marvell controller it was hooked up to.
     -- During the check I started to get a bunch of call traces from the CPU, but the server always seemed to recover and never became unresponsive. Attached are the two zipped logs: the first runs from when I started the parity check all the way through to when I restarted the server after it was complete; the second is from after the server restart. syslogs.zip
     -- Let it ride while I waited for my new non-Marvell controller.
     Fast forward to today: I got and installed my new 2-port SATA card. I immediately started a parity check, and I am STILL seeing those initial hard link resets on the parity disk -- always only two, right at the start of the check. In addition, I now get these call traces whenever I run a parity check, which was not the case last month. They don't seem to occur at any time other than during my parity checks, and they appear at seemingly random intervals throughout the check. This is my log currently, as the parity check is in progress: void-syslog-20181204-1829.zip
     The call traces seem to me like they indicate the CPU is upset because of how busy it is? I have only two Docker containers on this server: DuckDNS and Plex. Plex should be sitting idle, as I'm the only user and it only gets used for local streaming when my internet is down.
     Ignore the 5 parity errors; I caused those earlier and will correct them once I get everything else figured out. I accidentally applied the OS update while my parity check was running in the background and Unraid got kind of wonky on me (which I believe is where these errors came from). The errors appeared after that restart.
     EDIT: I am on the latest OS version, v6.6.6, now btw.
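(For anyone digging through the attached logs, the call traces are easy to pull out with a grep like the one below; the syslog filename is whatever your extracted copy of the diagnostics uses.)

```bash
# Pull the kernel call traces (plus a little context) out of a syslog for easier reading.
# Adjust the path to point at the syslog extracted from the diagnostics/syslog zip.
grep -B 2 -A 25 'Call Trace' syslog
```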
  5. So far the parity check hasn't canceled or thrown any errors for disk 18 since moving it, about 40% into the check. The parity disk is throwing more of these now (it only seems to happen during parity checks):
     Dec 2 06:37:22 VOID kernel: ata7: hard resetting link
     Dec 2 06:37:32 VOID kernel: ata7: softreset failed (1st FIS failed)
     Dec 2 06:37:32 VOID kernel: ata7: hard resetting link
     Dec 2 06:37:38 VOID kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Dec 2 06:37:38 VOID kernel: ata7.00: configured for UDMA/133
     Dec 2 06:37:38 VOID kernel: ata7: EH complete
     Dec 2 06:40:44 VOID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x190002 action 0xe frozen
     Dec 2 06:40:44 VOID kernel: ata7.00: irq_stat 0x80400000, PHY RDY changed
     Dec 2 06:40:44 VOID kernel: ata7: SError: { RecovComm PHYRdyChg 10B8B Dispar }
     Dec 2 06:40:44 VOID kernel: ata7.00: failed command: READ DMA EXT
     Dec 2 06:40:44 VOID kernel: ata7.00: cmd 25/00:60:30:62:35/00:02:4d:00:00/e0 tag 4 dma 311296 in
     Dec 2 06:40:44 VOID kernel: res 50/00:00:2f:62:35/00:00:4d:00:00/e0 Emask 0x10 (ATA bus error)
     Dec 2 06:40:44 VOID kernel: ata7.00: status: { DRDY }
     Dec 2 06:40:44 VOID kernel: ata7: hard resetting link
     Dec 2 06:40:54 VOID kernel: ata7: softreset failed (1st FIS failed)
     Dec 2 06:40:54 VOID kernel: ata7: hard resetting link
     Dec 2 06:40:59 VOID kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Dec 2 06:40:59 VOID kernel: ata7.00: configured for UDMA/133
     Dec 2 06:40:59 VOID kernel: ata7: EH complete
     So I went ahead and bought a replacement SATA controller card (ASM1061) that does not have a Marvell chipset. Hopefully that will resolve these new errors on the parity disk. Only the parity disk and one data disk are attached to that controller. void-diagnostics-20181202-0702.zip
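In case it helps anyone else chasing the same thing, a quick way to tally how often each ATA port is resetting (the log path is an assumption; point it at your own syslog):

```bash
# Count "hard resetting link" events per ATA port in the syslog.
# $6 is the "ataN:" field in the standard "Mon DD HH:MM:SS host kernel: ataN: ..." line format.
grep 'hard resetting link' /var/log/syslog | awk '{print $6}' | sort | uniq -c
```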
  6. I will after the rebuild is done. I swapped the "bad" disk with one connected directly to the mobo SATA controller. FYI, the parity disk didn't reset when the rebuild started and all the disks spun up, and it is still on the Marvell controller. I don't recall seeing that in the other logs either, but I will keep my eye out for it as well.
  7. All of those disks (parity, the one that failed last time, and this most recent one) are in the top Norco 5-bay enclosure, so they share a common power connection (2x Molex connectors). Wouldn't a power issue from the PSU or Molex connectors affect multiple disks, not just 1 or 2 at random? Especially since a parity check spin-up, which spins up all 20 disks at once, draws the most power. Yes, it is a Marvell controller; I never had issues with it before, even when all my other Marvell controllers started acting up and I replaced them with LSIs. I'd be open to swapping it out for something else that can support 2-4 disks and will fit in a PCIe x1 slot (the only open slots I have left). Suggestions? The issue is never in the same bay, but so far they have all been in that same top 5-bay enclosure.
  8. New month, new parity check, and another disk is now exhibiting the exact same symptoms. It's on the same LSI controller as the last one and even shares the same breakout cable as the first disk. This is not the disk I mentioned above that I pulled in from the primary server; that one has had zero issues so far on the same card and breakout cable. So I'm guessing another disk has died. I'm going to try swapping it to a different bay after I rebuild again, for the 4th or 5th time in as many months. If these F-ing Seagate disks are just going to keep dying one after the other, I will never buy another Seagate drive again. I have 10+ year old Samsung SpinPoints still going strong, and these NAS-grade drives crap out in 3 years. I never understood the hate for Seagate, but I am starting to see why many people tell me they don't buy Seagate on principle. void-diagnostics-20181201-0837.zip
  9. Thank you for the workaround. Didn't even realize I had a problem until I saw this thread.
  10. Fair enough. The good news is I'm about to pull another 4TB disk from my primary server and upgrade it to an 8TB, so I will have a replacement disk by this evening. Thanks for your help as always, johnnie.
  11. So the issue is back. I swapped the disk locations and the rebuild went perfectly fine, no issues. Well, now it is monthly parity check time again and the same physical disk dropped out, so I guess it must be the disk. It just seems really odd that it can handle the entire rebuild process and all of those writes to the disk, but within 10 minutes of me starting a non-correcting check it resets and causes the array to drop it.
  12. Thanks for the breakdown. I did some thinking last night, and there isn't anything on this server that isn't either a mirror of my main server or backed up somewhere else in the cloud. I'll go with option 1 for now since I don't have a spare 4TB disk right now, and I'm fairly certain there were no writes, but I can't guarantee it... The rebuild is in progress; hopefully everything will go smoothly and that will be the end of it. Thanks for your help, johnnie. EDIT: Rebuild completed with zero issues, seems like it was just a fluke.
  13. All my drives are mounted in 5-bay Norco SS-500s, so swapping drive positions or SATA cables would be easiest for testing. What do you mean by "rebuild to the same"? I'm assuming the data on the disk is OK, unless the failure during the parity check does some sort of write to the disk? EDIT: I guess my main concern is first making sure the parity data is 100% correct before I go rebuilding over the disabled device, if that is what is recommended.
  14. I don't know of any specific ones to recommend, but the wiki has a section that should get you started in your search: https://wiki.unraid.net/Hardware_Compatibility#PCI_SATA_Controllers
  15. Basic server info:
      - Unraid Pro 6.6.1
      - The two HBAs I have installed are Dell H200s (SAS2008) flashed to the latest firmware, P20.00.07.00, following the instructions from Fireball3.
      - 6 SATA III ports on the mobo, all in use IIRC.
      - M/B: ASUSTeK COMPUTER INC. - M5A97 R2.0 | CPU: AMD FX™-6300 Six-Core @ 3500 | Memory: 16 GB (max. installable capacity 32 GB)
      This morning my backup server kicked off its monthly non-correcting parity check. About an hour and a half into the check I got several Pushbullet notifications about disk errors on disk 16, followed by a notification that the parity check had been canceled. I imagine when the OS marked the disk as failed it must have some logic to stop the parity check to avoid stressing the other disks... Either way, because it canceled the check I was able to capture my diagnostics zip file, which is attached.
      I haven't made any changes (hardware or software) to my backup server in almost 6 months, and it hasn't had a single error on any of its recent parity checks. No case openings, drive swaps, SATA cable moves, nothing. The drive is only a few years old and was pre-cleared at least twice when I first purchased it, before being installed.
      I shut the server down, pulled the offending drive, and attached it to my technician bench. I am running drive tests in SeaTools for Windows (it is a pre-IronWolf Seagate 4TB NAS drive), and so far the drive has passed the SMART check, the short DST, and the short generic test. It is currently a quarter of the way through the long generic test and so far hasn't shown any signs of a problem, so I am tempted to think this was some sort of fluke with the cabling or controller. I wanted to ask the community, though, before I go putting this drive back into service as if nothing happened. If I do end up putting it back into service, I will obviously want to run several more non-correcting parity checks to ensure there are consistently zero errors.
      EDIT: To clarify, I have not actually verified whether the offending disk was attached to one of the Dell H200s or running off the motherboard SATA controller. I thought I could tell from looking through the diagnostic report, but I may just have to physically verify it when I get home from work. The SMART report for the disk doesn't highlight anything bad (the seek error rate and raw read error rate are garbage stats, from what I have read online). void-diagnostics-20181001-0842.zip
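For anyone who wants to do the equivalent checks from the Unraid/Linux side instead of SeaTools, something like this works; the device name is just an example, substitute whatever the suspect disk enumerates as:

```bash
# Start the drive's built-in extended (long) self-test; it runs on the drive itself in the background.
smartctl -t long /dev/sdX

# Once it finishes, dump the SMART attributes and the self-test log to review the results.
smartctl -a /dev/sdX
```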
  16. Binhex's rTorrentVPN Docker container can already enable Flood; I use it personally and like it well enough. One of the container's arguments is "enable-flood", and once it is set to yes it will enable the Flood web interface on port 3000.
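From memory, the relevant bits of the container setup look roughly like this; treat the exact variable name, port, and image tag as assumptions and double-check them against binhex's template (the VPN settings and the other rTorrent/ruTorrent ports are omitted here for brevity):

```bash
# Rough sketch only -- verify the exact variable name and ports against the binhex template.
# ENABLE_FLOOD=yes turns on the Flood web UI, which is served on port 3000.
docker run -d \
  --name=rtorrentvpn \
  -e ENABLE_FLOOD=yes \
  -p 3000:3000 \
  binhex/arch-rtorrentvpn
```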
  17. This fixed everything for me too! Edit: OK, when I try to log in to the plugin web UI I get an error (see attached). The app is connecting over HTTP, and the server is running on port 80 with HTTPS set to no instead of auto. Everything in the app on my phone works except Docker and plugin viewing... which I assume I have to log in to get working.
  18. My issue has not resurfaced since moving my cards around, so it essentially boiled down to some sort of intermittent IRQ conflict, I'm assuming because of a shared PCIe bus. Do you have any free USB headers (for front case plugs or the like) on your board? Maybe those are causing the intermittent conflict? I'm honestly just grasping at straws for you, though. My only other suggestion is to get really vague with your Google terms. I think I found my solution by googling some combination of Ubuntu and the IRQ 16 error messages, figuring that since it's one of the most common OSes I was likely to find at least some direction to take my troubleshooting in.
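If it helps narrow things down, you can check what is actually sitting on IRQ 16 and whether the kernel has already complained about it since boot:

```bash
# Show the IRQ 16 row of the interrupt table (which driver(s) are attached and the per-CPU counts).
grep ' 16:' /proc/interrupts

# See if the kernel has logged "irq 16: nobody cared" / "Disabling IRQ #16" messages.
dmesg | grep -iE 'irq #?16\b'
```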
  19. I am going to run into this as well, so thanks for the adapter tip! I have a Node 804, and currently the only 6TB drives I have found that maintain the old screw layout are the non-Pro WD Red 6TB. Once I go to 8TBs I will have to make adapters like you did.
  20. I believe it is a lowercase L, for length (aka size). From the man page:
      -l, --length length
          Specifies the length of the range, in bytes.
  21. OK, sweet. As far as the price goes, is this a deal for an R510 with these specs? Or could I do better at a similar price point? I don't want to sink hundreds of dollars into a pre-owned server unless it's a smoking deal. My main requirements are 10+ drive bays, support for at least 64GB of RAM, and obviously a solid RAID controller that will play nice with Unraid, in case someone thinks of something better before the sale ends.
  22. I work in an IT shop, so once I get it going I will probably get it put into one of our racks here. Or, if not, get a small rack for my home office and start my own home lab lol. Yeah, I was curious if anyone had recent opinions of the H200. I did some searching and saw some threads on here, and most people say they are fine but limited to 16 disks? Not an issue for me since the server only has 12 bays. Those posts were from several years ago, though.
  23. https://www.ebay.com/itm/Dell-PowerEdge-R510-12-Bay-Server-2x2-67GHz-12-Cores-32GB-H200-12x-Trays/172577298188?hash=item282e68970c:g:21UAAOSwySlZ6O4D Is this a good deal for the hardware specs? Thinking about pulling the trigger on one and figured I would get some reassurance and inform the community! My backup server has Marvell controllers that, until the last few Unraid versions, hadn't given me any issues, but lately they have been really acting up. I figured rather than sink several hundred dollars into new RAID cards for not-that-great hardware (see VOID in my sig), I could just spend a bit more and do a whole overhaul. I will lose a few disk slots (16 down to 12), but the 4TB disks I am swapping in should make up for the difference.
  24. I may have just solved my IRQ 16 issue: several restarts, with USB devices plugged into ports on both the front and back, have not produced a new "IRQ 16 disabled" error. I found a post on the Ubuntu forums that suggested swapping PCIe cards to different slots fixed it for one user with a Gigabyte motherboard like mine. I had to come in to work on the server today anyway, as one of my drives was throwing errors at boot that got its link speed downgraded from 6Gbps to 3Gbps, so I figured what the hell, I might as well try it.
      Originally I had two PCIe x1 expansion cards plugged into the two x1 slots at the bottom of the board. I moved the bottom one off its x1 slot and put it into the x16 slot (I have no need for a GPU), and everything SEEMS to be happy now. IRQ 16 is still what it always was according to /proc/interrupts:
      root@Node:/home/user# cat /proc/interrupts
                  CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7
        0:          25        0        0        0        0        0        0        0  IR-IO-APIC    2-edge      timer
        5:           0        0        0        0        0        0        0        0  IR-IO-APIC    5-edge      parport0
        8:           4        0        0        0        0        0        0        0  IR-IO-APIC    8-edge      rtc0
        9:           0        0        0        0        0        0        0        0  IR-IO-APIC    9-fasteoi   acpi
       16:          40        0        0        0        0        0        0        0  IR-IO-APIC   16-fasteoi   ehci_hcd:usb1
       18:           0        0        0        0        0        0        0        0  IR-IO-APIC   18-fasteoi   i801_smbus
       23:          38        0        0        0        0        0        0        0  IR-IO-APIC   23-fasteoi   ehci_hcd:usb2
       24:           0        0        0        0        0        0        0        0  DMAR-MSI      0-edge      dmar0
       25:           0        0        0        0        0        0        0        0  DMAR-MSI      1-edge      dmar1
       26:       23814        0        0        0        0        0        0        0  IR-PCI-MSI  327680-edge   xhci_hcd
       27:      100794        0        0        0        0        0        0        0  IR-PCI-MSI  512000-edge   ahci[0000:00:1f.2]
       28:        4017        0        0        0        0        0        0        0  IR-PCI-MSI  524288-edge   ahci[0000:01:00.0]
       29:        2511        0        0        0        0        0        0        0  IR-PCI-MSI  2097152-edge  ahci[0000:04:00.0]
       30:      708529        0        0        0        0        0        0        0  IR-PCI-MSI  1572864-edge  eth0
      NMI:           0        0        0        0        0        0        0        0  Non-maskable interrupts
      LOC:      140729   218526   231275   187016   186757   154390   152016   148618  Local timer interrupts
      SPU:           0        0        0        0        0        0        0        0  Spurious interrupts
      PMI:           0        0        0        0        0        0        0        0  Performance monitoring interrupts
      IWI:           0        0        0        0        0        0        0        0  IRQ work interrupts
      RTR:           0        0        0        0        0        0        0        0  APIC ICR read retries
      RES:       20997     7530     3206     2167     1897     1142     1148      958  Rescheduling interrupts
      CAL:        2144     2680     3039     2932     2581     2553     2516     2791  Function call interrupts
      TLB:         675     1322     1617     1496     1328     1298     1357     1577  TLB shootdowns
      TRM:           0        0        0        0        0        0        0        0  Thermal event interrupts
      THR:           0        0        0        0        0        0        0        0  Threshold APIC interrupts
      DFR:           0        0        0        0        0        0        0        0  Deferred Error APIC interrupts
      MCE:           0        0        0        0        0        0        0        0  Machine check exceptions
      MCP:           5        5        5        5        5        5        5        5  Machine check polls
      HYP:           0        0        0        0        0        0        0        0  Hypervisor callback interrupts
      ERR:           0
      MIS:           0
      PIN:           0        0        0        0        0        0        0        0  Posted-interrupt notification event
      NPI:           0        0        0        0        0        0        0        0  Nested posted-interrupt event
      PIW:           0        0        0        0        0        0        0        0  Posted-interrupt wakeup event
      We will see if it continues to behave itself over the next week or so. TL;DR: try moving your PCIe cards around if you have other open slots; it may resolve your issue (YMMV).
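If you shuffle cards between slots like I did and want to confirm where everything landed without opening the case again, the PCI tree view is handy:

```bash
# Print the PCI topology as a tree so you can see which bridge/slot each controller now hangs off of.
lspci -tv
```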