wheel

Everything posted by wheel

  1. Following a lot of mid-pandemic work on my unRAID towers, I’ve reached a point where I’m pretty comfortable I’ve done all I can to guard against catastrophic failure: finally converted all my ReiserFS drives to XFS, got everything protected by dual parity, and resolved a bunch of temperature issues. One thing still bugs me, though: my two 21-drive towers are about a decade old (and the 13-drive tower about 7 years old), and the “well, unless your PSU fails and takes out everything at once” snippets I keep reading in unrelated threads, combined with the “PSU capacity drops by 10% or so per year, up to a point” adage, have me thinking I may be dancing on thin ice with all three of these PSUs.
     What gives me pause on replacing all three (or at least the pair of ~decade-old ones) immediately is the weird use case of unRAID (or maybe just mine specifically). All three towers were built for their unRAID WORM purpose, and none of their parts had any previous life. Am I being extremely paranoid, or is replacement at this point the prudent move? There have definitely been stretches (months, even years) where one tower or the other wasn’t powered on at all, or saw extremely minimal use (90% idle time when powered on). Could that usage pattern stretch the normal “danger zone” timeline for replacing a PSU? Or is it not enough to ease the bigger concern that built-up fan dust could congeal and overheat the PSU regardless of how long (and how hard) it’s actually been running? Any guidance on how concerned I should be (and how swiftly I should replace what I have) would be greatly appreciated!
     PSU/System Age Specifics (all drives 3.5”, 5700-7200 RPM):
       • The 2011 21-disk tower runs a Corsair Enthusiast Series TX650 ATX12V/EPS12V 80 Plus Bronze, purchased in 2011
       • The 2012 21-disk tower runs a Corsair Enthusiast Series TX650 ATX12V/EPS12V 80 Plus Bronze, purchased in 2012
       • The 2015 13-disk tower runs a Corsair RM Series 650 Watt ATX/EPS 80 Plus Gold (CP-9020054-NA RM650), purchased in 2015
  2. Damn, it does: all five drives are in the same 5x3 Norco SS500 hot swap rack module (from 2011, so... damn). Since the tower is four of those Norco SS500s stacked on top of each other, I'm going to need to find a basically identically-sized 5x3 hotswap cage replacement if something's dying in that one, and I'm not having any luck with a quick search this morning. Might start up a thread in the hardware section if replacement's my solution. I'm guessing that with the SS500 as the most likely culprit for power issues, there's no need for me to run extended SMART tests on the 4 drives throwing up errors, but are there any other preventative measures I can take while figuring out the hotswap cage replacement situation? My gut's telling me it's best just to keep the whole thing powered down for now, but that's a massive pain for family reasons. Thank you for confirming it's likely a power issue (connections feel way less likely considering the age involved, but I might try replacing the hotswap cage's cables first just to be safe). Any other suggestions to make sure I really need to replace this thing before I put the effort into replacing it would be really helpful!
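     (Note for anyone else landing here later: while I sort out the cage question, one cheap check from the unRAID terminal is pulling the SMART attributes that usually separate link/power problems from actual disk problems: UDMA CRC errors tend to point at cables or the backplane, while reallocated/pending sectors point at the disk itself. The sdb..sde names below are just placeholders for the four affected drives, so treat this as a rough sketch rather than a recipe:

       # compare link-vs-media SMART attributes on the four drives in that cage
       # (swap sdb..sde for the real device names shown on the Main page)
       for d in sdb sdc sdd sde; do
         echo "=== /dev/$d ==="
         smartctl -A /dev/$d | grep -iE 'UDMA_CRC|Reallocated_Sector|Current_Pending'
       done

     A climbing CRC count with flat reallocated/pending counts would back up the "it's the cage or its cables" theory.)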
  3. All four drives are Seagate SMR 8TB drives, which, considering the whole hard drive pricing thing going on for larger-sized drives, has me mildly concerned. It feels like a cabling issue, with 4 closely-related drives throwing up the same errors at the same time, but all of my drives are in 5-drive cages, so it's weird seeing 4 affected rather than 5 (though it could definitely be just the connection of those 4 drives to my LSI card; I've never seen this sort of multi-drive issue in almost 10 years of operation). Diagnostics attached, because I'm scared to touch a damn thing at this point until someone looks at what I've got going on. Thank you in advance for any guidance provided! Edit: just checked age on the drives, and three may have been purchased around the same time (around 2 years of power-on time), but one's less than a year old and definitely from a different purchase batch. Edit 2: Based on other threads I just checked, I went ahead and ran a short SMART test on each of the 4 affected drives. Updated diagnostics file attached. tower-diagnostics-20210513-1436.zip
  4. Possibly a random question with a stupid easy answer for a competent Linux head, but I’ve been searching for hours with no luck: is there an easy way in the GUI (or terminal) to determine which disks (by any unique identifier; I think I could reverse-engineer the info I need from there) are being read directly through the SATA ports on my motherboard vs. the ones plugged into my LSI cards?
     When I initially set up this box (with 4 stacked 5-in-3 Norco hotswap cages), I wasn’t paying attention to which cable ports on the back were associated with which drives (in left-to-right order), and comparing it to another box using the same Norco hotswap cages, I’ve realized the cages probably changed production between my building the two boxes (both sets are SS-500s, but they have different port layouts on the back and different light colors up front), so online instructions aren’t really helping now. My initial plan of just tracing the motherboard-connected plugs to the hotswap cage cable ports fell flat, and now I’m trying to determine which of these swap cage trays are connected directly to the board so I can use them for parity drives specifically (parity’s taking forever on this system and I’m following all the steps for even marginal improvement).
     Is there an easy way to see which disks are on SATA1/2/3/4 (the four ports I have on the motherboard) and which are running through the LSIs (16 of the 20)? Thanks for any help, and sorry if this is the dumbest question I’ve ever asked on here. Always appreciate the assistance!
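     (Follow-up for anyone searching later: the closest thing I’ve found to an answer from the terminal is udev’s by-path links, which encode the controller each disk hangs off of. The example output lines below are illustrative rather than copied from my box; the point is just that onboard SATA tends to show up as “-ata-N” entries and the LSI HBAs as “-sas-...” entries:

       # list disks by the controller path they're attached to (ignore the -partN links)
       ls -l /dev/disk/by-path/ | grep -v part
       #   ...-ata-1            -> ../../sdb   <- motherboard SATA port 1
       #   ...-sas-phy3-lun-0   -> ../../sdh   <- a port on one of the LSI cards

     Matching the sdX names against the serials on unRAID’s Main page should identify which trays are on the board’s own ports.)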
  5. I was kind of hoping that’d be the case, but felt like it’d be safest to check when playing with Parity on a massive array I haven’t moved to dual Parity yet. Thanks for the help!
  6. Same situation as OP, but I’m physically moving my parity disk to a slot currently holding a data disk. Just completed an unrelated parity check, so the timing seems perfect. Anything I need to do differently, or does the swap disks / New Config / re-order in GUI / trust parity approach work just as simply for (single) parity in 6.8.3? Thanks for any guidance!
  7. Yeah, I'm just reading tea leaves at this point and hoping there's something obvious I'm missing. I have at least two (could be three in a couple of days) theoretically fine 8TBs ready to roll, plus the original 6tb that was throwing up errors before the rebuild (which may have nothing to do with the disk, now). The GUI shows the rebuild ("Read-Check" listed) as paused. I'm guessing my next steps without a free slot to try are going to be:
       • Cancel the rebuild ("Read Check").
       • Stop the array, power down.
       • Place a disk (the old 6tb? a different 8tb?) into the Disk 12 slot.
       • Try a rebuild again today (since I'm guessing unraid trying to turn the old 6tb into an 8tb but failing mid-rebuild means I can't simply re-insert the old 6tb and have unraid automatically go back to the old configuration?).
     Any reasons why I shouldn't, other than the fact that I'm playing with fire again with another disk potentially dying while I'm doing all these rebuilds? I'm starting to think my only options are firedancing or waiting who knows how long for an appropriate hotswap cage replacement and crossing my fingers that I'll physically rebuild everything fine (and I'm almost more willing to lose a data disk's data than risk messing up my entire operation).
  8. Unfortunately not - it's an old box (first built in 2011, I want to say?), with four 5-slot Norco SS-500 hotswap cages stacked on each other in the front. Nothing ever really moves around behind the cages, and the only cable movement I can recall since I first built it was unplugging/replugging the cages' breakout cables when replacing the Marvell cards with LSIs back in December (and these issues with disk 12 started maybe a quarter of a year later). The hotswap cage containing Disk 12's slot is the second up from the bottom, and could be a massive pain to replace (presuming I can find a replacement of such an old model, or one that doesn't mess up the physical spacing of the other 3 hotswap cages). Edit 2: any chance it's significant that the rebuild stopped at *exactly* 6tb? Feels like a bizarre coincidence.
  9. Soooooo something may be up with the Disk 12 slot. That 6tb couldn't finish an extended SMART test, so I dropped what I was pretty sure was a fine 8TB (precleared and SMART-ok after being used in another box for a couple of years) into the slot for the rebuild. Had a choice between an SMR Seagate and a CMR WD and used the WD. The rebuild was, interestingly, exactly 75% complete (right at the 6tb mark) when the new 8tb in the Disk 12 slot started throwing up 1024 read errors and got disabled. My instinct's to throw another 8TB spare in the slot and try again, but something feels weird, so here's the diagnostics. Am I reaching a point where something's likely wrong with the hotswap cage and I'm going to need to buy a replacement for the whole thing? tower-diagnostics-20200605-0534.zip
  10. OK, running the extended test now - hate that the disk's consistently throwing up errors, and I need to replace a 6tb soon anyway, but I definitely don't want to throw out disks unnecessarily during what could be a weird economic time for getting new ones. Thanks for the quick response!
  11. The sync (vs disk) correcting parity check was a total brain fart on my end, and I'm hoping it turned out okay (no error messages but I'll go back to check the underlying data as soon as I can). I was just writing to disk 12 and the GUI threw up a read error, so I immediately pulled diagnostics to send here. I have a precleared 8tb spare ready to replace Disk 12's 6tb, and I'm leaning towards just shutting down and throwing that thing in there to start a Disk 12 rebuild/upgrade now - any reasons I shouldn't do that in terms of better-safe-than-sorry? Thanks for all the guidance! tower-diagnostics-20200604-1054.zip
  12. Weird Disk 12 happenings again. I had an unclean shutdown after someone accidentally hit the power button on the UPS that powers two unraid boxes. One booted back up and prompted me to parity check. The other (this one) weirdly gave me the option for a clean shutdown, which I took, then started it back up. No visible issues, but I felt paranoid, so I ran a non-correcting parity check before modifying any files: ~200 read errors on Disk 12. Then I ran a correcting parity check. I tried collecting diagnostics at every possible opportunity in case anything weird turns up that someone else might notice:
       • 5-27: right after the "unclean" / clean shutdown
       • 5-29: after the non-correcting parity check
       • 5-30: after the correcting parity check
     tower-diagnostics-20200530-2053.zip tower-diagnostics-20200529-2000.zip tower-diagnostics-20200527-1017.zip
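     (Side note in case it helps anyone collecting the same kind of paper trail: assuming the built-in diagnostics command is available, as it has been on the 6.8.x boxes I've used, these zips can also be grabbed straight from the terminal without touching the GUI:

       # collect a full diagnostics zip from the command line
       diagnostics
       # the zip lands on the flash drive under /boot/logs/,
       # named like tower-diagnostics-YYYYMMDD-HHMM.zip

     Handy right after an event like an unclean shutdown, before anything else clutters the syslog.)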
  13. Thought I'd update in case it helps anyone else searching threads: the 3.3V tape trick worked, so I'm not sure what the root problem was, but if anyone has these drives working in some SS-500s but not others, rest assured the tape trick should work on those other SS-500 cages.
  14. Nice. So the seeming disk-after-disk issues associated with slot #12 are probably just coincidental? Both the 166k error drive from March and the swiftly-disabled disk this month were pretty old (the latter being a white label I got maybe 4 years ago?), so it makes sense, but the recurrence of #12 issues definitely caught my attention in a single-parity setup.
  15. • 5/12 (diagnostics after the 299-sync-error non-correcting check)
      • 5/13 (diagnostics after the correcting check)
      • 5/14 (diagnostics after the final, non-correcting check)
     Hope these help figure out what's going on with the 12 slot (if anything!) tower-diagnostics-20200514-0549-FINAL-NONC-CHK.zip tower-diagnostics-20200513-0732-AFTER-CORR-CHK.zip tower-diagnostics-20200512-1054-AFTER-299ERROR-NONC-CHK.zip
  16. Added to the plan: extra diagnostics sets. I'll report back here with those in ~48 hours or so. Thanks a ton!
  17. That makes sense - trick is, I haven't run a correcting check since the one back in March described above. The check I ran after installing the replacement drive on Sunday/Monday was non-correcting, and that's the same one that's finishing up right now. It does sound like now's the time to run a correcting parity check, with a plan to run a non-correcting check after it (two checks total, starting this morning), to make sure I don't have a bigger issue specific to Disk 12's hotswap cage, given the consistent issues across disks that may or may not be coincidentally occurring there. (Really, really hope I don't need to replace a middle-of-the-tower hotswap cage in a pandemic, but that's technically easier than moving everything to a new build...) Thank you both for the help and guidance, JB & trurl!
  18. Sounds like a plan: check's almost done and about to start another one. Presuming it's best to run a non-correcting one to be safe - or should I run this one as correcting, then run another to see if new (vs additional) sync errors appear? Edit: the sync errors stopped growing after they hit 299. Looks like they've stayed stable there overnight and the check's almost done, so definitely a lower volume of errors than last time Disk 12 (or its hotswap slot) started going screwy.
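     (Footnote for the archives, since I keep second-guessing which box I ticked: the correcting vs. non-correcting choice can also be made from the terminal with unRAID's mdcmd, at least on the 6.8.x builds I've seen - treat the exact syntax below as something to double-check on your own version rather than gospel:

       # start a parity check WITHOUT writing corrections
       mdcmd check NOCORRECT
       # start a correcting parity check
       mdcmd check CORRECT
       # watch progress and the sync error counter
       mdcmd status | grep -E 'mdResync|sbSync'

     The GUI's correction checkbox is the same choice with a friendlier label.)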
  19. Back to the game - but this time, with a fully-updated unraid 6.8.3 diagnostics set! I was writing to the rebuilt Disk 12 last night when the disk disabled itself with write errors. I had a hot spare 6TB sitting around and I'll be out of town later this week, so I figured I'd go ahead and replace it now with a known-good 6TB. The rebuild seemed to go fine, and I'm running the non-correcting parity check now - bam, at some point it picked up 216 sync errors. Just jumped to 217 while I was typing this. None of the errors are associated with any specific disk on the main page, but they're showing up in the summary at the bottom. Diagnostics attached; should I stop the non-correcting parity check? Any new info in the diagnostics from the updated unraid version? Thanks for any help! Edit - 268 now, steadily growing a few errors at a time. tower-diagnostics-20200511-1913.zip
  20. That's what's been sitting in my cart - just figured I'd confirm that people have successfully taped drives and installed them in SS-500s that otherwise weren't recognizing them (specifically SS-500s, like OP had issues with in this thread) before I dropped quarantine funds on something for which I might not have any actual use.
  21. Actually haven't ordered any yet: was doing my research first, and had always been under the (possibly mistaken) belief that hotswap cages bypassed the 3.3V issue (mostly because they always worked in the one box).
  22. Having a similar problem, but a little stranger - two boxes, each with 4 Norco SS-500s running to a pair of genuine LSI 6Gbps SAS HBA 9211-8i cards (P20 firmware). One box (ECS A885GM-A2 / AMI BIOS, Phenom II X4 820) recognizes the WD80EMAZ drives totally fine. The other (Supermicro X8SIL / AMI BIOS, Core i3 540) doesn't recognize a single EMAZ, despite recognizing EFZXs just fine.
     My first instinct was "oh, I've finally run into the taping issue people have complained about for years," but this thread makes it seem like taping my EMAZ drives and putting them in the Norco SS-500s will simply lead to unraid not starting up, and that there's another solution (in the linked 2011 thread above). I've tried swapping the boot order in the BIOS, but that doesn't change anything as far as recognition goes.
     Am I making this pre-taping research more complicated than it should be? Is the boot order change what fixes the "unraid won't start" problem that'll occur once I apply tape to my EMAZ drives and place them into the Norco SS-500 hot swap cages? If so, any idea why one set of cages is giving me trouble while another isn't? I can't find any posted reason anywhere on the internet (yet) why the X8SIL motherboard would have specific problems with EMAZ drives, but that seems like the only difference (same wires, same cages, same case, same number of drives, except the non-EMAZ-accepting box also has a cache drive). Thanks for any clarification anyone can provide!
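     (Addendum in case it helps the next person comparing two boxes like this: before taping anything, it's worth confirming whether the HBA sees the EMAZ drives at all at the kernel level, since the 3.3V issue usually means the drive never spins up and so never enumerates. This is just a generic Linux-side sanity check, nothing unRAID-specific; the mpt2sas driver name matches the 9211-8i as far as I know, but double-check against your own hardware:

       # does the LSI HBA report the drives during boot?
       dmesg | grep -i mpt2sas | less
       # do the EMAZ drives show up as block devices at all?
       ls -l /dev/disk/by-id/ | grep -i WD80EMAZ

     If the drives don't appear in either place, it points at power/3.3V/backplane rather than anything a BIOS boot-order change could fix.)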
  23. Thanks a ton, itimpi - took those steps, rebooted fine, GUI loaded fine, array started fine! One thing seems weird, though: on the dashboard, my parity drive's now showing a reallocated sector count of 3. I'd *just* run an errorless parity check right after installing the IT-flashed LSI cards (to replace the Marvell ones) and before attempting the upgrade from 5.0.6 to 6.8.3. My instinct is to run another parity check right now to make sure the count stays constant at 3, but is there anything that could have happened during the upgrade (nothing was touched inside the box) to cause the reallocated sectors that I should be cautious about before running that parity check?
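     (For my own notes, and for anyone else watching a small reallocated count: the attribute is easy to keep an eye on from the terminal between parity checks. The /dev/sdX name below is a placeholder for the parity drive, so this is just a quick sketch rather than a replacement for the dashboard's SMART warnings:

       # snapshot the sector-health attributes for the parity drive
       smartctl -A /dev/sdX | grep -iE 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

     A value that holds steady at 3 across the next check or two is far less worrying than one that keeps climbing.)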