johnsanc

Everything posted by johnsanc

  1. @JorgeB - Thanks so much for that link. I had read that post before, but after re-reading it carefully and checking, I see now that my 9207-8i + RES2SV240 is actually capable of supporting 16 drives at 275 MB/s. For some reason I had the diagrams for the LSI 2008 chipset and their associated speeds stuck in my head. That being said, it looks like some potential options are:

     12 drives:
     - Change my current setup to single link for my 12 drives in my main enclosure using my existing 9207-8i + RES2SV240
     - Route a SAS cable to my other enclosure
     - Add another RES2SV240 for another 12 drives using single link
     - This should result in only a slight bottleneck, but still very acceptable speeds without needing an extra HBA

     16 drives:
     - Add a 9207-8e + RES2SV240 in dual link for 275 MB/s for the additional 16 drives

     20 drives:
     - Add a 9207-8e + RES2SV240 in dual link for 275 MB/s for 16 of the drives
     - Route a cable from the internal expander out to the new enclosure for the additional 4 drives (since I'm only using 12 of its ports currently with a dual-link setup)

     So it sounds like getting a new HBA / expander pair would be a good way to go if I don't want to sacrifice any potential speed from my current setup.
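For anyone checking the math, here is a back-of-the-envelope sketch of the per-drive bandwidth figures above. It assumes one SAS2 lane carries roughly 600 MB/s; real-world throughput is lower after overhead, which is presumably why the cited per-drive figure is ~275 MB/s rather than the theoretical 300:

```python
# Rough SAS2 bandwidth math for an HBA + expander (assumed figure:
# one SAS2 lane ~ 600 MB/s; real-world throughput runs lower, hence
# the ~275 MB/s per-drive number cited above instead of 300).
LANE_MBPS = 600

def per_drive_mbps(lanes: int, drives: int) -> float:
    """Theoretical per-drive bandwidth for HDDs sharing `lanes` SAS2 lanes."""
    return lanes * LANE_MBPS / drives

# Dual link = two x4 wide ports = 8 lanes; single link = 4 lanes.
print(per_drive_mbps(8, 16))   # dual link, 16 drives   -> 300.0
print(per_drive_mbps(4, 12))   # single link, 12 drives -> 200.0
print(per_drive_mbps(8, 12))   # dual link, 12 drives   -> 400.0
```

The single-link 12-drive case lands around 200 MB/s per drive, which is why it only shows up as a slight bottleneck against modern HDD sequential speeds.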
  2. Yes, I should have clarified - I am only interested in the HBA / expander. My current setup is an Antec 1200 with 4x 3-in-5 drive cages and it works beautifully. It wasn't the most cost-effective storage solution for a lot of drives, but it was able to grow with my array over the years. And frankly, it just looks way better than a rack IMO. My goal is to create another matching tower just for the extra drives, using the same drive cages I use for my main enclosure. The problem is that I only have one PCIe 4.0 x16 slot available to use. So here's what I've deduced so far:

     - 12 drives = 9207-8e + RES2SV240 (dual link, basically mirroring what I have now internally; single link may also work with very little bottleneck)
     - 16 drives = not sure
     - 20 drives = not sure
  3. I know there are a lot of posts about choosing HBAs and expanders, so maybe this will help others as well who are googling trying to figure this out... What is the current cheapest way to add support for an additional _____ drives with zero bottlenecks? 20 HDDs? 16 HDDs? 12 HDDs?

     Assumptions:
     - One PCIe 4.0 x16 slot available
     - Two almost useless PCIe 2.0 x1 slots available
     - The extra drives will be housed in a separate spare tower case
     - Only HDDs will be connected to the HBA(s) / expander(s) - SSDs will be direct to motherboard SATA

     Background:
     - My main box right now holds 20 drives total and I have no space left (16 data, 2 parity, 2 cache)
     - I want to support the max that unRAID is capable of (28 data, 2 parity, at least 2 cache, a couple spare slots for unassigned devices)
     - I currently use my onboard SATA (8 drives) along with a 9207-8i + RES2SV240NC in dual link (12 drives)
     - The motherboard I am using is an ASRock X570 Creator and the 9207-8i is currently in a PCIe 4.0 x16 slot
     - The minimum I would need to support is 12 additional drives, but ideally I would like to support 16 or even 20 if it's not cost-prohibitive

     Any recommendations are appreciated.
  4. Hmmm... I'm going to have to dig deeper into my dockers. When the Docker service is disabled, I can clear the ARP table and I get the correct MAC address. If I start the Docker service and clear the ARP table again, then I get the weird random MAC address. EDIT: OK, the MAC address I am seeing in pfSense when I turn on the Docker service is the one from "shim-br0". I assume this is because I have "Host access to custom networks" enabled in my Docker settings. I believe I only needed this for Pi-hole, but I have since replaced that with pfBlockerNG. Once I disabled the custom networks I saw my correct MAC address in pfSense.
  5. I recently added a pfSense box to my network and assigned some static IP addresses. I see a ton of entries like this in my pfSense logs:

     arp: xx:xx:xx:xx:xx:xx attempts to modify permanent entry for 192.168.y.yyy on igb1

     It looks like something is changing my Unraid MAC address to the address represented by the xx's above. I thought maybe it was my Pi-hole docker, but I deleted that since I no longer need it. Upon each reboot I get a different random and unique MAC address for Unraid. Any ideas what could be causing this?
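One way to see which MAC is currently answering for a given IP is to read the kernel's ARP table. A minimal sketch (Linux-only; on a live box you would parse `/proc/net/arp`, whose columns are IP, HW type, flags, MAC, mask, and device - a sample is inlined here so the snippet is self-contained):

```python
# Parse the Linux ARP table into an IP -> MAC mapping, to spot which
# interface (e.g. shim-br0) is answering for the server's address.
def parse_arp(text: str) -> dict:
    entries = {}
    for line in text.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            ip, mac = fields[0], fields[3]
            entries[ip] = mac
    return entries

# Inlined sample of /proc/net/arp (addresses are made up):
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
192.168.1.100    0x1         0x2         aa:bb:cc:dd:ee:ff     *        br0
192.168.1.1      0x1         0x2         11:22:33:44:55:66     *        br0"""

print(parse_arp(SAMPLE))
# On the server itself: parse_arp(open("/proc/net/arp").read())
```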
  6. Forgot to follow up, but I got it working. I turned my old router into an AP and set up a pfSense box. I had forgotten to add the static route in pfSense. Once I added it, things worked perfectly.
  7. I tried adding my router IP and 8.8.8.8 to the Peer DNS Server setting, and it did not allow me to access anything aside from my LAN when using "Remote tunneled access". Any idea what the issue could be? EDIT: Apparently if I use NAT then I can access the internet using Remote tunneled access. Is there a way to make that work without the NAT setting set to "Yes"?
  8. I recently upgraded my parity drives to 12TB. For now, my largest data drives are 10TB. I noticed that during a parity check, the check progresses past the 10TB mark and processes the full 12TB even though there is no data to check against. Why does this happen? Wouldn't it be more efficient to stop after the last bit of data from the data drives? No support needed, just a general question I was pondering.
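For intuition on the question: parity is computed per sector, with data drives shorter than the parity drive treated as all zeros past their end, so beyond the largest data drive the P parity should simply be zero - presumably the check still reads that region to confirm it actually is. A toy sketch (XOR / P parity only; this is not Unraid's actual implementation):

```python
from functools import reduce

def p_parity(sectors):
    """XOR of one same-offset sector value from each data drive (P parity)."""
    return reduce(lambda a, b: a ^ b, sectors, 0)

# Three toy data "drives" of different lengths; short drives contribute
# zeros past their end, so parity past the largest data drive is zero.
drives = [[5, 7], [1], []]
length = 4                         # the parity "drive" is longer than all data
parity = [p_parity([d[i] if i < len(d) else 0 for d in drives])
          for i in range(length)]
print(parity)                      # trailing sectors past all data are 0
```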
  9. Just following up to confirm that upgrading to 6.9-beta1 seemed to have fixed the issue. Thank you all for your help and guidance as always.
  10. I am really struggling with this one and must have read through this entire thread 3 times now. Here is what I have so far:

      - Local server uses NAT: No
      - Local endpoint: my external IP : 51820
      - Peer type of access: Remote tunneled access
      - All local tunnel/peer settings are defaults
      - My Docker config is set to allow host access to custom networks
      - The Docker IPv4 custom network I have uses the same subnet
      - I forwarded port 51820 to my Unraid server's internal IP
      - I added a static route in my router:
        Destination IP: 10.253.0.0
        IP Subnet Mask: 255.255.255.0
        Gateway IP: Unraid internal IP address
        Metric: 2 (no idea what this is for, and Netgear's help is not helpful - supposedly it is the number of routers on the network?)

      Now, when I try to ping 10.253.0.1 from the command line it works:

      PING 10.253.0.1 (10.253.0.1): 56 data bytes
      64 bytes from 10.253.0.1: icmp_seq=0 ttl=64 time=1.303 ms
      64 bytes from 10.253.0.1: icmp_seq=1 ttl=64 time=2.949 ms
      64 bytes from 10.253.0.1: icmp_seq=2 ttl=64 time=2.096 ms
      64 bytes from 10.253.0.1: icmp_seq=3 ttl=64 time=2.886 ms
      64 bytes from 10.253.0.1: icmp_seq=4 ttl=64 time=3.213 ms
      64 bytes from 10.253.0.1: icmp_seq=5 ttl=64 time=2.095 ms

      When I try to ping 10.253.0.2 I get "Destination Host Unreachable" errors, but I can also see that the errors show the Redirect Host going to my Unraid server IP. I tried connecting with both my iPhone and the macOS WireGuard app, and both show the 5-second handshake timeout error. Anyone have any suggestions? I feel like I have to be missing something obvious.

      EDIT: I completely forgot about my piece of hot garbage AT&T Pace gateway for my fiber connection. Since AT&T's firmware update broke DMZ+ mode a year ago (still not fixed), I had most ports opened to my Netgear router... but the range ended at 50999 since AT&T has a few service ports reserved above that. I changed my WireGuard port to something in the range of what I had forwarded and it worked without a hitch.

      However, how do I access both my LAN and the internet at the same time on the VPN? Do I need to select a different "peer type of access"?

      EDIT 2:
      - Remote tunneled access = LAN access + no interwebs on the device I'm using to VPN in
      - Remote access to LAN = LAN access + interwebs
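Unrelated to the AT&T fix, but when entering the static route by hand it is easy to mistype a subnet or mask. A small sanity-check sketch using Python's `ipaddress` module - the 192.168.1.0/24 LAN here is an assumed example, substitute your own:

```python
import ipaddress

tunnel = ipaddress.ip_network("10.253.0.0/24")   # WireGuard tunnel subnet
lan = ipaddress.ip_network("192.168.1.0/24")     # example LAN subnet (assumption)

server_peer = ipaddress.ip_address("10.253.0.1") # Unraid's tunnel address
phone_peer = ipaddress.ip_address("10.253.0.2")  # remote peer's tunnel address

print(server_peer in tunnel)   # True: the static route covers the server end
print(phone_peer in tunnel)    # True: the static route covers the remote peer
print(tunnel.overlaps(lan))    # False: tunnel and LAN subnets must not collide
```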
  11. Yep, I'm going to kick off another parity check to make sure there are zero errors. It's not an Unraid problem per se, but doesn't the behavior above indicate that Unraid does not re-read the sync correction it just made to ensure it's valid? If not, it would be nice to have a "Parity Check with Validation" option.
  12. Another quick update: my parity check started firing off corrections at about the 9.25 TB mark, which is right about where I started getting the IO_PAGE_FAULT error the other day during my parity check. So, after this ordeal I am left with a couple of takeaways:

      - It's possible for Unraid to write bad parity, and there is nothing in the web UI that would indicate anything went wrong unless you look at the syslog.
      - The "bad parity writing" issue starts with the lovely AMD IO_PAGE_FAULT error. In my case there were a few XFS errors after this and my log was not flooded... but the parity was indeed incorrect for every sector after that point.

      So, although I think I have recovered from this, it's a bit concerning that this is apparently a scenario that can write bad parity without the user knowing. It could leave someone with a completely unprotected array, and they would not even know it until their next parity check.
  13. I do have a SAS HBA + expander as well: LSI LSI00301 (9207-8i) + Intel RES2SV240NC. Interesting about the memory - it is ECC and straight from my motherboard's QVL for RAM. Since I just upgraded to v6.9-beta1, I will let this parity check complete and monitor the logs for any more similar errors before I attempt to change any other settings.
  14. How do you know if an XFS check is "good" (with or without -n)? I don't see any kind of exit code in the syslog. I've attached the output of the XFS checks for the two disks using -vv (without the -n). xfs_check.txt
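As far as I know, `xfs_repair -n` signals its verdict through the process exit status rather than an explicit message in the output: per its man page, 0 means no corruption was detected and 1 means corruption was found. A small interpreter sketch (anything else is treated as "inspect the output yourself"):

```python
def interpret_xfs_check(status: int) -> str:
    """Map an xfs_repair -n exit status to a verdict (per the man page)."""
    if status == 0:
        return "clean: no corruption detected"
    if status == 1:
        return "corruption detected: run xfs_repair without -n"
    return f"unexpected status {status}: inspect the output"

# e.g. status = subprocess.run(["xfs_repair", "-n", "/dev/md9"]).returncode
print(interpret_xfs_check(0))
print(interpret_xfs_check(1))
```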
  15. I stopped the remainder of the check, upgraded to 6.9-b1, rebooted, and did an XFS check on disks 9 and 10. Nothing seemed to indicate any issues as far as I can tell. I am now running another correcting parity check.

      Phase 1 - find and verify superblock...
      Phase 2 - using internal log
              - zero log...
              - scan filesystem freespace and inode maps...
              - found root inode chunk
      Phase 3 - for each AG...
              - scan (but don't clear) agi unlinked lists...
              - process known inodes and perform inode discovery...
              - agno = 0
              - agno = 1
              - agno = 2
              - agno = 3
              - agno = 4
              - agno = 5
              - agno = 6
              - agno = 7
              - agno = 8
              - agno = 9
              - agno = 10
              - agno = 11
              - agno = 12
              - agno = 13
              - agno = 14
              - agno = 15
              - agno = 16
              - agno = 17
              - agno = 18
              - agno = 19
              - process newly discovered inodes...
      Phase 4 - check for duplicate blocks...
              - setting up duplicate extent list...
              - check for inodes claiming duplicate blocks...
              - agno = 1
              - agno = 8
              - agno = 13
              - agno = 15
              - agno = 4
              - agno = 0
              - agno = 2
              - agno = 7
              - agno = 6
              - agno = 10
              - agno = 11
              - agno = 12
              - agno = 14
              - agno = 3
              - agno = 17
              - agno = 19
              - agno = 16
              - agno = 18
              - agno = 5
              - agno = 9
      No modify flag set, skipping phase 5
      Phase 6 - check inode connectivity...
              - traversing filesystem ...
              - traversal finished ...
              - moving disconnected inodes to lost+found ...
      Phase 7 - verify link counts...
      No modify flag set, skipping filesystem flush and exiting.
  16. Thanks @johnnie.black - Once this check completes I may try to upgrade to v6.9-b1, do the XFS checks/repairs, then do another parity check. Do you think that is a good next step or would you recommend something else?
  17. Well, here is an update so far... The data disks are done with the parity check, but it's currently checking nothing because my parity drives are 12 TB and my largest data drive is 10 TB. It looks like I had an IO_PAGE_FAULT error, and then a few minutes later some XFS metadata errors, first on disk10 (which was still parity checking) and then later on disk9 (which was already done checking). I can still access those disks and they are not emulated. Looking back at my old logs, this same thing happened before the last time I got XFS errors in the log. In all cases the IO_PAGE_FAULT came from my "ASM1062 Serial ATA Controller", which is onboard. Also not sure if it's related, but I noticed in the logs that the XFS issues appeared shortly after 5:00 AM in both this run on 6/5 and the one on 6/2 (within one minute).

      So should I continue with another check? Or should I try an XFS repair on the two disks that have issues? Try to copy the data off and reformat those drives? Something else to try to fix whatever the controller issue is? Upgrade to the 6.9 beta for better X570 support? Any guidance on next steps is appreciated as always.

      UPDATE: This seems very much related to issues I was having before. X570 woes. I also noticed that I forgot to add "iommu=pt avic=1" to my syslinux.cfg for Unraid GUI mode, which I am currently using. tower-diagnostics-20200605-0905.zip
  18. I've attached my syslog from the syslog server covering the past couple of days. I've also attached my diagnostics as of this afternoon, which shows the parity corrections. As you can see below and in the logs attached, the sector value increments by 8 on every line:

      Jun 4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152832
      Jun 4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152840
      Jun 4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152848
      Jun 4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152856
      Jun 4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152864

      To summarize my various parity checks from the last few days, starting with the parity check I let make millions of corrections:

      Check #1 - 05/31 11:32:34 --> 06/02 15:20:39
      - Correcting; I let it finish
      - Constant sync corrections after 2 TB. It was during this check that I had two disks that were still green but were not accessible due to XFS errors. It was after this check that I did the XFS check/repairs on the two disks that were problematic.

      Check #2 - 06/02 17:43:21 --> 06/03 00:16:56
      - Non-correcting; canceled after the 2 TB mark
      - There were no sync errors past the 2 TB mark like there were before, which led me to believe that my last parity corrections were written correctly. There were some sync errors initially, which I expected due to the writes from the XFS repairs after the disks were kicked from the array.

      Check #3 - 06/03 00:17:45 --> 06/03 18:25:37
      - Correcting; stopped early
      - Constant sync corrections after about 5 TB. After stopping I did XFS checks on all disks; two different disks required repairs. I also reseated my expander card and checked disk connections.

      Check #4 - 06/03 19:55:33 --> CURRENTLY RUNNING
      - Correcting
      - Constant sync corrections after about 6 TB. I assume the corrections from Check #3 were good, since the only parity corrections in this check came after the point where I stopped Check #3.

      Based on this, the only thing I can deduce is that XFS repairs can somehow invalidate parity... or my Check #1 wrote completely incorrect parity after a certain point while the disks were green but the files were inaccessible due to XFS errors. I will let this finish and run another check without rebooting. It'll take a while if the parity corrections continue at ~50 MB/s for the next 3 TB or so. syslog.log tower-diagnostics-20200604-1346.zip
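The increment of 8 in those log lines lines up with 512-byte sectors being corrected in 4 KiB chunks (8 × 512 = 4096), so each line presumably represents one 4 KiB block. The sector number itself also converts to a byte offset that matches the "after about 6 TB" observation - a quick sanity check (assuming md reports sectors in 512-byte units):

```python
SECTOR_BYTES = 512                 # assumed: md logs sectors in 512-byte units

step = 8                           # increment between consecutive log lines
print(step * SECTOR_BYTES)         # 4096: each line covers one 4 KiB block

sector = 12860152832               # first corrected sector from the log above
offset_tb = sector * SECTOR_BYTES / 1e12
print(round(offset_tb, 2))         # ~6.58 TB, matching "after about 6 TB"
```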
  19. Yes, all disks are in the same position every time. In this case there are both P and Q sync corrections, both starting at the same place. In my earlier chain of events, the P corrections started happening first, then P+Q corrections a few TB later. It really doesn't make sense to me given my current understanding of how parity works. I do not have the diagnostics for the most recent run, but a few posts back I provided two diagnostics that should also exhibit the same excessive sync-correction behavior. I stopped my correcting parity check yesterday in the middle of the flood of sync corrections. I rechecked cables and rebooted, and the correcting parity check is running again. So far, 0 sync corrections. In a few hours it should reach the point where the sync corrections started yesterday. I am really curious to see what happens with the rest of this run. If I see that there is a range of time where excessive sync corrections happen and are then resolved, then I think that suggests something went awry and Unraid was making sync corrections it shouldn't have yesterday.
  20. Perhaps this is more of an academic or mathematical question, but what can cause a parity check to start doing parity corrections to every sector? That's what I don't understand. Take this chain of events:

      1. Do a correcting parity check, with no syslog errors - everything seems good at this point.
      2. Mount the array in maintenance mode to do XFS checks.
      3. Two disks get kicked from the array during the XFS repair.
      4. Do a New Config and trust parity.
      5. Reboot the server to reseat all drives and check cable connections.
      6. Start the array and do a correcting parity check.

      In this case the only writes would have been during the XFS repair, and I would have expected a few parity corrections because of it. How is it possible that the parity could have multiple terabytes of sync errors? It's simply not possible unless there is something else going on with how parity works that I haven't seen explained anywhere. There must be something that can invalidate parity past a certain point instead of simply on a sector-by-sector basis. The only other explanation is a hardware error that goes completely unreported in the syslog.
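To illustrate why whole-range corrections are so strange: P parity is computed independently per sector (an XOR across the drives at the same offset), so a single bad write should invalidate only the sectors it touched, never everything after it. A toy sketch (XOR / P parity only; Q parity uses Reed-Solomon math but is likewise per-sector):

```python
from functools import reduce

def xor(vals):
    """P parity for one sector offset: XOR across all data drives."""
    return reduce(lambda a, b: a ^ b, vals, 0)

# Two toy data drives and their per-sector P parity.
d1 = [3, 5, 7, 9]
d2 = [1, 1, 1, 1]
parity = [xor(s) for s in zip(d1, d2)]

# Corrupt one sector on one drive; only that sector's parity mismatches.
d1[1] ^= 0xFF
mismatches = [i for i, s in enumerate(zip(d1, d2)) if xor(s) != parity[i]]
print(mismatches)   # [1] - corruption is strictly per-sector
```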
  21. Good callout - They are all Seagate. I guess I can just ignore those metrics.
  22. I am running into serious issues as outlined here... and I think I may have gotten to the bottom of it. After about the 4th round of trying to parity check, I noticed nearly ALL of my drives have massive and increasing numbers for the following SMART attributes. I found it very odd that I was getting these errors across all SATA controllers:

      - Raw read error rate
      - Seek error rate
      - Hardware ECC recovered

      However, there are no errors in the syslog or anything else that would indicate an issue as far as I know. Were there actually errors that were suppressed due to the P+Q sync corrections? What can cause the 3 attributes above, and what should I do to fix it? Or is this normal?

      EDIT: Upon closer inspection, all but one of the disks are attached to my Intel RES2SV240NC RAID expander card + LSI LSI00301 (9207-8i).
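On the Seagate point raised in the reply: for these attributes, Seagate reportedly packs two numbers into the 48-bit raw value - roughly, the upper bits hold the actual error count and the lower 32 bits hold the total number of operations - which is why the raw number looks enormous while the normalized value stays healthy. A decoding sketch based on that commonly cited community interpretation (an assumption; Seagate does not document the layout officially):

```python
def decode_seagate_raw(raw: int) -> tuple:
    """Split a Seagate raw SMART value into (errors, total_operations).

    Layout is the commonly cited community interpretation, NOT an official
    Seagate specification: bits above 32 = error count, low 32 bits = the
    running count of seeks/reads performed.
    """
    errors = raw >> 32
    operations = raw & 0xFFFFFFFF
    return errors, operations

raw = (5 << 32) | 123456           # synthetic example: 5 errors, 123456 ops
print(decode_seagate_raw(raw))     # (5, 123456)
print(decode_seagate_raw(123456))  # (0, 123456): large raw value, zero errors
```

Under this reading, a huge and steadily increasing raw value with zero in the upper bits is just an operations counter, which is consistent with the advice to ignore these metrics on Seagate drives.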
  23. OK, another update. My non-correcting parity check went over the 2 TB mark with only a few sync errors. I stopped this check and restarted with a correcting parity check. Now, again deep into the check, after about 5 TB or so I see every sector triggering a parity correction. Again there are ZERO issues in the syslog. Does anyone have an explanation for this? It's concerning that I keep getting these sync corrections, but there is no info in the syslog that would suggest any issues whatsoever. I did notice that the parity corrections seem to coincide with network activity, which happened to be a download from my Nextcloud. Considering there are no errors in the syslog, it almost seems like the parity check/sync is very fragile and can be easily thrown off. Either that, or there is something weird going on with my SATA controller and it's returning incorrect data that goes unreported in the syslog. This one is a mystery, and any theories about what is happening are appreciated.
  24. The parity check crossed over the 10 TB mark, which is the size of my largest data drive right now. There are no more writes to either parity drive, which is what I would have expected. Once this completes in a few hours I will try a non-correcting parity check.

      UPDATE: After the correcting parity check completed, I immediately put the array into maintenance mode and did an XFS check on all of my disks. Two other disks had XFS errors. All 4 disks that had errors over the course of this ordeal were on my ASMedia controller; my parity disks are NOT on that controller. The XFS repair attempt caused the disks to be kicked from the array. I did another New Config and manually changed all disks back to "XFS" (no idea why Unraid did not remember that setting...). After powering down and reseating all drives on that controller, I booted back up and ran another XFS check on all drives - no issues this time. I finally started the array in normal mode, and a non-correcting parity check is currently running. There are a few corrections detected, but I expected those due to the XFS repair that was running while the disks were kicked from the array. I plan on letting this check run past the 2 TB mark since that's where things went awry last time... more to come.