Everything posted by timekiller

  1. Posting an update in case it's helpful to others. I believe I have this solved. I realized that the drive issues I was having were all on drives in a drive cage I bought to try and squeeze an extra drive into the server. I removed the cage, rearranged the drives to get everything mounted securely, and started yet another parity rebuild. This one took about 37 hours and completed successfully last night. The array has been online and stable since then. Hopefully that was the source of all my issues and I can move on. Thank you to everyone who gave advice, even if I didn't necessarily take all of it.
  2. Of the 24 drives, there are a grand total of 2 on splitters (not daisy chained, and not on the same line). Again, I don't believe this is causing the issue.
  3. A couple, just due to the runs I have. Not as many as you might be thinking. It's a 1,000 watt PSU so there are plenty of ports for peripherals. I really don't think the PSU or the way I have the power wired is the issue.
  4. Yup. My mistake. The data drive was marked unmountable, not disabled. The 2 parity drives were still online and the data on the unmountable drive was missing from the array. That was 3 weeks ago and I've been dealing with so many issues since then that I forgot the specific error for the drive. This doesn't answer my current problem though. New drives, new cables, new controller cards, and I still can't get a successful parity rebuild.
  5. I understand how it's supposed to work. Both parity drives were online. The disabled drive was not emulated. Don't know what else to tell you on that. Doesn't really matter since this was like 3 weeks ago and not relevant to my current issues.
  6. I have a very large array - 24 drives, all 10, 12, or 14 TB. A couple of weeks ago I started having issues with the parity drives getting disabled. I was told to ditch my Marvell-based controllers, which I did. I bought two LSI 9201-16i cards and installed them. I still had issues. I was told my firmware is very old, so I updated to the latest. I thought maybe the drives were faulty, so I bought new 14TB drives for parity. I've swapped my cables I don't even know how many times at this point. I moved the server out of my network closet to a cooler room. Every time I made a change I had to start the parity build over. With this many drives it takes several days to complete. Most of the time, it doesn't complete. It did complete once, and a few hours after completion a data drive was disabled, taking all its data offline. Yesterday I moved the server and updated the controller firmware. Now I have read errors on 3 drives and Unraid has disabled one of the parity drives. I'm pulling my hair out with frustration. I've swapped out all the hardware short of building a new server from scratch. I need to nail down the actual issue and get my server back. Please help! diag attached storage-diagnostics-20211121-1156.zip
  7. Firmware is updated, I also moved the server to another room because it was getting pretty hot where it was. Parity is rebuilding AGAIN 🤞
  8. Pulling my hair out with this server for weeks now. Was having a ton of issues that seemed to be caused by my Marvell-based SATA controllers. I took this community's advice and replaced both 16 port cards with LSI 9201-16i cards. I wound up having to rebuild parity, which finally finished yesterday with no errors and I thought I was in the clear. This morning I woke up to errors on both parity drives and both parity drives disabled. Now I assume I need to do a new config to trigger a parity rebuild AGAIN. But I need to fix the underlying problem. Both of these parity drives are brand new: I was having the same issue with another set of parity drives, so I pulled them and bought 2 new 14TB drives. I have also tried a number of SATA cables. I'm extremely frustrated at this point and just want my data reliably protected. At this point I have swapped out the controllers, the SATA cables, AND the parity drives. What is going on here? diag attached. storage-diagnostics-20211120-1059.zip
  9. yup, saw the call trace and my eyes skimmed right past the nfsd stuff - thanks!
  10. Thanks, at least it's not hardware related this time! So I can better diagnose in the future, where did you find this? I'm looking through the diagnostics file and don't see it.
  11. My desktop is Linux, so definitely need NFS. Never seen NFS cause the array to go offline before, any idea how this happened?
  12. So I finally took everyone here's advice and replaced my Marvell-based controller cards (IO Crest 16 Port) with 2 LSI 9201-16i cards. In addition I needed to shuffle some disks around, so when I installed the new cards I also wound up having to do a new config and start a parity rebuild. It's been running for about 33 hours and everything was going great until about 30 minutes ago when I got an error deleting a file. Investigation shows that I lost /mnt/user - "Transport endpoint is not connected". Interestingly, /mnt/user0 is still connected and the array is accessible from there. Of course all of my docker containers and shares use /mnt/user, so now the entire server is effectively offline. I stopped all my docker containers to hopefully avoid further issues there. I assume a reboot will fix this, but 1) I'd like to know what happened here, and 2) I don't want to have to restart the parity rebuild. There is currently an estimated 9 hours left and it appears to be running fine. Do I have any options beyond reboot and start over, or go 9 hours or more without my server? Diagnostics attached storage-diagnostics-20211119-0937.zip
  13. I'm open to recommendations for a replacement. I've asked for suggestions more than once, but haven't received a straight answer. I have 21 drives, so need 16 port cards.
  14. Another update: I restarted the array in maintenance mode so I could repair the now missing, emulated drive. I ran xfs_repair on /dev/mapper/md1 and it did its thing. What I did not expect is that xfs_repair moved every single file/directory on the drive to lost+found. This is especially confusing because running the same command on the real drive did not do this. Since the original drive is fine, I'm just going to do a new config and let parity get rebuilt. I realize I did not handle this the "right" way from the beginning, but I can't help but wonder whether, if I had run the repair against the emulated drive in the first place, I would now be forced to manually go through 10TB worth of lost+found files and manually move/rename them.
  15. Update: I powered down the server and removed the unmountable drive. After rebooting, Unraid didn't start the array and I had to tell it to start with a missing drive, as expected. However, the missing data is still missing. I attached the unmountable drive to my desktop (Linux). I opened the drive with cryptsetup and had to run xfs_repair before I could mount it, but the missing data is still there, so that's good. But now Unraid is in a state where it knows a drive is missing, but it is not emulating the missing data. I can rsync the entire drive back, but that will take a long while. I don't think there is any way around this though, since Unraid believes the missing drive was just waiting to be formatted. Unraid shows the missing drive, but is still labelling it as "Unmountable" and there is no directory for it under /mnt/. I feel like the only option now is to reinstall the drive and use the Tools->New Config option to construct a new array with all the drives in place. Anyone see another option here?
  16. Hello, last night I went to watch a movie on my server and plex was showing I had 6 movies (I have way more than six). After looking into it I noticed the entire array was offline. I rebooted the server (Yes I should have gotten a diagnostics snapshot first, but I didn't). When the server came up, the array was back online, and a plex rescan repopulated all my movies. This morning I woke up to many missing files and the Unraid UI showing a drive as "Unmountable: not mounted", and Unraid wants me to format the disk. I logged into the server and dmesg shows a bunch of these:

      [54431.657222] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [54433.274993] ata5.00: configured for UDMA/33
      [54433.275011] ata5: EH complete
      [54433.465251] ata5.00: exception Emask 0x10 SAct 0x1fc00 SErr 0x190002 action 0xe frozen
      [54433.465253] ata5.00: irq_stat 0x80400000, PHY RDY changed
      [54433.465254] ata5: SError: { RecovComm PHYRdyChg 10B8B Dispar }
      [54433.465256] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465258] ata5.00: cmd 60/40:50:90:2b:d9/05:00:09:00:00/40 tag 10 ncq dma 688128 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465258] ata5.00: status: { DRDY }
      [54433.465259] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465261] ata5.00: cmd 60/10:58:d0:30:d9/02:00:09:00:00/40 tag 11 ncq dma 270336 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465262] ata5.00: status: { DRDY }
      [54433.465262] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465264] ata5.00: cmd 60/40:60:e0:32:d9/05:00:09:00:00/40 tag 12 ncq dma 688128 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465264] ata5.00: status: { DRDY }
      [54433.465265] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465267] ata5.00: cmd 60/b0:68:20:38:d9/03:00:09:00:00/40 tag 13 ncq dma 483328 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465267] ata5.00: status: { DRDY }
      [54433.465268] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465270] ata5.00: cmd 60/40:70:d0:3b:d9/05:00:09:00:00/40 tag 14 ncq dma 688128 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465270] ata5.00: status: { DRDY }
      [54433.465271] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465273] ata5.00: cmd 60/40:78:10:41:d9/05:00:09:00:00/40 tag 15 ncq dma 688128 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465273] ata5.00: status: { DRDY }
      [54433.465274] ata5.00: failed command: READ FPDMA QUEUED
      [54433.465275] ata5.00: cmd 60/40:80:50:46:d9/05:00:09:00:00/40 tag 16 ncq dma 688128 in res 40/00:00:d0:30:d9/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
      [54433.465276] ata5.00: status: { DRDY }
      [54433.465278] ata5: hard resetting link

      This time I did take a diagnostic snapshot (attached). I rebooted the server and it came up in the same state - 1 drive is "unmountable" and the data on it is missing. Furthermore, Unraid is running a parity check (which I cancelled). What I can't figure out is: 1) Why isn't Unraid emulating the missing drive? 2) Why did Unraid restart the array if a drive is missing? 3) Is the data on the missing "unmountable" drive gone if the parity check started? My fear is that it started rewriting parity to align with the missing drive. How screwed am I? storage-diagnostics-20211031-1033.zip
  17. Sorry for the long post, but I am trying to get all the information into 1 post so there isn't a lot of back and forth. I am trying to upgrade my Unraid server with a new CPU/MOBO and I'm having a hell of a time. My existing hardware is a Core i7-6700k. I have a Ryzen 5800x CPU and I've been trying to find a motherboard to match with it that will support my existing hardware. Specifically I have (2) IO Crest 16 Port SATA cards, and (2) 1TB m.2 NVME drives for cache. My existing motherboard (MSI Z170A), which is like 5-6 years old at this point, is able to handle all this hardware without issue. I have now bought 2 motherboards (first board details and issues are in this thread). I returned that board and now I have a Gigabyte X570 UD which almost fits all my hardware requirements. I'm ok on PCI slots, but there is only 1 m.2 slot. Not a HUGE deal, I can buy a new 2TB m.2 drive to replace my (2) 1TB drives. The problem is that now this board will not show any drives from the SATA cards. I have tried several configurations, including removing the m.2 drive and only installing 1 SATA card. No matter what I do, I can not see any drives attached to any SATA controller on this board. The UEFI screen is very stripped down and does not have a section to see installed peripherals, so I can't even confirm that the card is recognized by the system. When the system boots, I do see a ton of errors related to ATA, so I'm not sure if my issue is at the bios level, or if there is something up with the UNRAID configs. I did pull a diagnostic dump and attached it to this post. Any help is appreciated. storage-diagnostics-20211009-0128.zip
  18. Correction. The slots are as follows:

      (1) PCIE x16 slot (PCIEX16)
      (1) PCIE x16 slot (PCIEX2)
      (1) PCIE x16 slot (PCIEX1)
      (1) PCIE x1 slot (PCIEX1)

      So the second PCIE x16 slot is actually only an x1, which might explain why Unraid is showing a bunch of "limiting drive to 3Gbps" lines on boot. Not sure if the mismatch of x2 and x1 for the two identical cards is causing the issue, but now that I know the slot is only x1, I don't want to use this board since half my drives would be bandwidth limited. Going to yank the board and return it for something better. Guessing I'll have to go to an X570 to get all the PCI lanes I need. Thanks everyone!
  19. I was able to swap the board back in, but added no m.2 drives. Still have the same problem. Only 1 SATA card is recognized. It shouldn't matter, but I have (2) IO Crest 16 Port SATA III PCIe 2.0 x2. The motherboard has (3) x16 slots, but only the top one is true x16. I have a GTX 1050 in that slot to help with video transcoding. The other 2 slots look like x4 based on the pins I am able to see. I'm consulting the manual to see if it tells me anything.
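
The bandwidth concern in posts 18 and 19 (a PCIe 2.0 x2 SATA card sitting in a slot that is only wired x1) can be put into rough numbers. The sketch below is only a back-of-the-envelope estimate, not a measurement: it assumes the link runs at PCIe 2.0 speed (5 GT/s per lane with 8b/10b encoding, about 500 MB/s per lane in each direction), that every drive on the card is transferring at once (as during a parity build), and that the link is shared evenly. The function name `per_drive_mb_per_s` and the count of 16 active drives per card are hypothetical, chosen for illustration; real throughput will be lower because of protocol overhead.

```python
# Back-of-the-envelope estimate of per-drive bandwidth behind one SATA card.
# Assumption: PCIe 2.0 link, 5 GT/s per lane with 8b/10b encoding,
# i.e. roughly 500 MB/s of raw bandwidth per lane in each direction.
PCIE2_MB_PER_S_PER_LANE = 500


def per_drive_mb_per_s(lanes: int, active_drives: int) -> float:
    """Even share of the card's PCIe 2.0 link when all drives transfer at once.

    This is a theoretical ceiling; it ignores PCIe and SATA protocol overhead.
    """
    if lanes <= 0 or active_drives <= 0:
        raise ValueError("lanes and active_drives must be positive")
    return lanes * PCIE2_MB_PER_S_PER_LANE / active_drives


if __name__ == "__main__":
    # Hypothetical worst case: all 16 ports of one card busy at the same time.
    for lanes in (2, 1):
        share = per_drive_mb_per_s(lanes, 16)
        print(f"x{lanes} link: at most {share:.2f} MB/s per drive")
```

Under these assumptions a fully populated card gets at most 62.5 MB/s per drive on an x2 link and 31.25 MB/s per drive on an x1 link, so the drives behind the x1 slot would have half the ceiling of the drives behind the x2 slot.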