weirdcrap

Everything posted by weirdcrap

  1. Apparently I'm just going to have to un-monitor that SMART attribute. Disk 1 just keeps randomly hitting me with notifications for the raw read error rate even though it isn't actually incrementing at all.
  2. I ran a short SMART test yesterday, but I'll turn off drive sleep and run an extended one now that the parity check is finished. That's good to hear. I know it isn't one of the default monitored SMART attributes (I assume for reasons like this), but I had enabled it after coming across a recommendation in another thread, while troubleshooting some disk issues, that it's useful for drives from certain vendors. I've never seen a WD drive with anything but zero for that attribute, but I have Seagate drives in my other server that all report a very high number for it, so I don't monitor it on VOID. EDIT: Oh, and the correcting check completed with zero errors! Thanks Turl and JorgeB for helping me figure out it was the RAM. I think this is the first time I've ever had a computer issue that actually turned out to be a bad stick of RAM.
  3. It seems like the RAM has done the trick; it's ~75% through a correcting check with no errors and it hasn't hard locked or crashed yet. However, I just got a random notification from the server letting me know the raw read error rate on disk 1 is some ridiculous number:
28-09-2021 05:27 PM Unraid Disk 1 SMART health [1]
Warning [NODE] - raw read error rate is 65536 WDC_WD80EFAX-68KNBN0_VAJBBYUL (sdd)
Which is odd, because when I go to check the SMART stats in the GUI it says my raw read error rate is zero??? (See the smartctl sketch at the end of this listing for reading the raw value directly.)
  4. I was able to find the same kit NIB on eBay, so once it gets here I will swap it in and run a test with just the two new sticks to see if the panics stop.
  5. @JorgeB It finished the non-correcting check: 73 errors. Within a few hours of starting a new correcting check with the second set of RAM it has again hard locked and the server is unresponsive. So at this point I've tried two correcting checks and it kernel panics each time with this RAM. Should I assume it's bad and replace it? I find it interesting that it only happens during the correcting check. I had no problems with the first set of RAM.
  6. Well, it hard locked and crashed within a few hours of the second set being installed and a parity check started. I've got someone going over today to power it off and back on. If it continues to be unstable with this set of DIMMs, I wager a replacement is in order? EDIT: Unclean shutdown. I'm letting it run its non-correcting check; it's already found 73 errors. It was in the middle of importing a bunch of stuff from Sonarr, but that was all going to the cache drive (mover doesn't run till 3AM), so that wouldn't be the cause of these new parity errors, right? After the non-correcting check is finished, should I continue with the correcting checks?
  7. The first run corrected the same sector and the second reported zero errors, so that's a good sign.
Correcting check #1:
Sep 15 07:19:12 Node kernel: mdcmd (37): check
Sep 15 07:19:12 Node kernel: md: recovery thread: check P ...
Sep 15 15:01:07 Node kernel: md: recovery thread: P corrected, sector=8096460000
Sep 16 01:05:00 Node kernel: md: sync done. time=63948sec
Sep 16 01:05:00 Node kernel: md: recovery thread: exit status: 0
Sep 16 01:07:02 Node Parity Check Tuning: manual Correcting Parity Check finished (1 errors)
Sep 16 01:07:02 Node Parity Check Tuning: Elapsed Time 17 hr, 45 min, 48 sec, Runtime 17 hr, 45 min, 48 sec, Increments 1, Average Speed 125.1MB/s
Correcting check #2:
Sep 16 05:13:29 Node kernel: mdcmd (38): check
Sep 16 05:13:29 Node kernel: md: recovery thread: check P ...
Sep 16 23:06:51 Node kernel: md: sync done. time=64402sec
Sep 16 23:06:51 Node kernel: md: recovery thread: exit status: 0
Sep 16 23:07:01 Node Parity Check Tuning: manual Correcting Parity Check finished (0 errors)
Sep 16 23:07:01 Node Parity Check Tuning: Elapsed Time 17 hr, 53 min, 22 sec, Runtime 17 hr, 53 min, 22 sec, Increments 1, Average Speed 124.2MB/s
Swapped the set of DIMMs and testing again.
  8. Two DIMMs removed and first correcting check is running.
  9. Ah, it's a change in the 6.10 series that led to those options being removed; that makes sense. I'll have to wait on the upgrade for a bit, as I have some server instability right now that I want to resolve before I move to an RC version.
  10. My server was rebooted this evening and I went to check, and sure enough the FTP service was re-enabled. Is there somewhere besides Settings > FTP Server I should be disabling it so the setting persists between reboots?
  11. The server hard locked tonight. I have no idea what happened; I sent someone over there to check on it and they could get no output on the console screen, so I've got no logs or anything... No clue if it's related or not. It's running a non-correcting parity check now after a forced reboot. EDIT: Same exact sector as the last two parity checks:
md: recovery thread: P incorrect, sector=8096460000
  12. Really? Well damn, that makes diagnosing this significantly more inconvenient. I'll have to adjust Docker memory allocations and shut some less essential dockers down if I'm going to have to cut my RAM in half. I wasn't aware memtest was so flawed when it comes to small intermittent errors; I thought that was the whole point of the software? I'll have to get someone to do this Tuesday when the office is open again.
  13. Ok, so you would recommend my next step be memtest with a few passes?
  14. An update on this. The 3rd non-correcting check (92% completed) has flagged the exact same sector that the second correcting check supposedly already repaired???
2nd (correcting):
Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000
3rd (non-correcting):
Sep 2 21:08:17 Node kernel: md: recovery thread: P incorrect, sector=8096460000
So the sector is staying consistent now, which I assume makes RAM less likely. Is this a disk issue? I've never seen a correcting check not actually correct the parity mismatch before. Or did it correct it and that sector has changed again?
node-diagnostics-20210903-0552.zip
Update: Completed, just the one incorrect sector again. Should I run another parity check? Do something else?
  15. I may be out of date on my info here, but I recall that setting not persisting between reboots (at least in older versions of UnRAID), which is why I thought tips and tricks had a disable telnet & FTP option in it. Yeah, it is still listed in the OP, so it was there at one point:
  16. Did the disable FTP server option get removed? I noticed today that the built-in FTP server was running when I explicitly had it disabled....
  17. Good morning. On my latest non-correcting parity check on NODE I received 2 reported parity errors:
Sep 1 01:04:47 Node kernel: md: recovery thread: P incorrect, sector=682504056
Sep 1 03:37:49 Node kernel: md: recovery thread: P incorrect, sector=3443896592
There were none reported during last month's check, and there have been no unclean shutdowns or power outages (the server is on a UPS of adequate size). When the check finished I started another (what I thought was) non-correcting parity check to see if it hit the same sectors again. Well, I screwed up and started a correcting check instead 🥴 The correcting check is at about 75% and has gone way past the previously reported sectors. Interestingly, the second run has found and corrected only one error, in an entirely different sector:
Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000
I'm gonna let the check finish and run a 3rd non-correcting check to see if any new (or the same) sectors are reported. If everything comes back clean on the 3rd check, should I just consider them legitimate corrected errors and move on? If it isn't clean (with the reported sectors seemingly changing at random so far), should I assume there is a disk reporting bad data? How would I go about identifying the culprit? What other causes should I look at? (A small sketch of how single parity is computed is at the end of this listing.)
node-diagnostics-20210902-0816.zip
  18. Yes, you would need a single good backup of all 12TB of your QNAP data, as UnRAID will need to wipe and format all of your existing data disks.
  19. An update on this: moving the drive to a new port (SATA II, since that is all that was left open) seems to have resolved the issue so far. I have a parity check scheduled in 2 days, so that will give it a bit of an extra stress test, but it SEEMS to be solved... I'll have to put another drive on that SATA III port next and see if the issue continues on a different drive.
  20. Yeah, I'm trying to take a step-by-step approach to identify the issue. I changed SATA ports today; if it is still acting up we'll swap power connections. I had my tech confirm this is the one drive on a Molex to SATA adapter, due to the last SATA power plug not reaching. So that may be the culprit and I'll try replacing it next.
  21. 5.0.2 fixed my inability to add any servers to the app with the Object Object error.
  22. Didn't last long; it's back at it with just some read errors now. Last time, the disk dropped only a few days after errors like these, so I imagine it's going out again soon.
Jul 17 18:13:53 Node kernel: sd 1:0:0:0: [sdb] tag#26 ASC=0x11 ASCQ=0x4
Jul 17 18:13:53 Node kernel: sd 1:0:0:0: [sdb] tag#26 CDB: opcode=0x88 88 00 00 00 00 02 47 d0 00 d0 00 00 00 08 00 00
Jul 17 18:13:53 Node kernel: blk_update_request: I/O error, dev sdb, sector 9794748624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:13:53 Node kernel: md: disk5 read error, sector=9794748560
Jul 17 18:13:53 Node kernel: ata1: EH complete
Jul 17 18:14:01 Node sSMTP[7263]: Creating SSL connection to host
Jul 17 18:14:01 Node sSMTP[7263]: SSL connection using TLS_AES_256_GCM_SHA384
Jul 17 18:14:03 Node sSMTP[7263]: Sent mail for snip (221 2.0.0 closing connection b25sm7843287ios.36 - gsmtp) uid=0 username=root outbytes=819
Jul 17 18:14:34 Node kernel: ata1.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x0
Jul 17 18:14:34 Node kernel: ata1.00: irq_stat 0x40000008
Jul 17 18:14:34 Node kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 17 18:14:34 Node kernel: ata1.00: cmd 60/08:80:a8:89:2a/00:00:e9:00:00/40 tag 16 ncq dma 4096 in
Jul 17 18:14:34 Node kernel: res 41/40:00:a8:89:2a/00:00:e9:00:00/00 Emask 0x409 (media error) <F>
Jul 17 18:14:34 Node kernel: ata1.00: status: { DRDY ERR }
Jul 17 18:14:34 Node kernel: ata1.00: error: { UNC }
Jul 17 18:14:34 Node kernel: ata1.00: configured for UDMA/133
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=7s
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 Sense Key : 0x3 [current]
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 ASC=0x11 ASCQ=0x4
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 e9 2a 89 a8 00 00 00 08 00 00
Jul 17 18:14:34 Node kernel: blk_update_request: I/O error, dev sdb, sector 3911879080 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:14:34 Node kernel: md: disk5 read error, sector=3911879016
Jul 17 18:14:34 Node kernel: ata1: EH complete
Jul 17 18:14:59 Node kernel: ata1.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x0
Jul 17 18:14:59 Node kernel: ata1.00: irq_stat 0x40000008
Jul 17 18:14:59 Node kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 17 18:14:59 Node kernel: ata1.00: cmd 60/08:68:e0:75:cb/00:00:47:02:00/40 tag 13 ncq dma 4096 in
Jul 17 18:14:59 Node kernel: res 41/40:00:e0:75:cb/00:00:47:02:00/00 Emask 0x409 (media error) <F>
Jul 17 18:14:59 Node kernel: ata1.00: status: { DRDY ERR }
Jul 17 18:14:59 Node kernel: ata1.00: error: { UNC }
Jul 17 18:14:59 Node kernel: ata1.00: configured for UDMA/133
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=7s
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 Sense Key : 0x3 [current]
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 ASC=0x11 ASCQ=0x4
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 CDB: opcode=0x88 88 00 00 00 00 02 47 cb 75 e0 00 00 00 08 00 00
Jul 17 18:14:59 Node kernel: blk_update_request: I/O error, dev sdb, sector 9794450912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:14:59 Node kernel: md: disk5 read error, sector=9794450848
Jul 17 18:14:59 Node kernel: ata1: EH complete
I'll try moving it to a different SATA port on the mobo next and see if that makes any difference... (See the sector-offset sketch at the end of this listing for mapping the md sectors above back to device sectors.)
node-diagnostics-20210717-1829.zip
  23. Cables replaced, power cable seating checked, rebuilding to the same disk now. So far so good, fingers crossed that was it. EDIT: Rebuild completed successfully. I will monitor and report back if this becomes a problem again.
  24. It passed SMART. I believe this disk runs directly off the PSU power, so no splitters or anything. I can try having someone replace the SATA cable, but I am honestly terrified that if this is a cabling/power issue, swapping cables with another drive is going to drop a different disk from the array, breaking parity altogether. If I try to rebuild to the same disk and it drops off again, will UnRAID just re-disable the disk? Or will it break parity? This is a remote server, so infuriatingly I cannot troubleshoot this problem myself without the 6-hour round trip that I just made last weekend.
  25. I had disk5 in my array throw a random smattering of read errors earlier this week and wrote it off as nothing significant after the disk passed long and short SMART tests. Well, it just magically dropped off out of the blue, and it almost looks like the controller took the link down rather than the drive dying?
Jul 15 13:18:47 Node kernel:
Jul 15 13:19:21 Node kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 15 13:19:21 Node kernel: ata1.00: configured for UDMA/133
Jul 15 14:13:09 Node kernel: ata1: COMRESET failed (errno=-32)
Jul 15 14:13:09 Node kernel: ata1: reset failed (errno=-32), retrying in 8 secs
Jul 15 14:13:17 Node kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jul 15 14:13:19 Node kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jul 15 14:13:19 Node kernel: ata1.00: configured for UDMA/133
Jul 15 17:20:11 Node emhttpd: spinning down /dev/sdg
Jul 15 17:20:53 Node kernel: mdcmd (60): set md_write_method 0
Jul 15 17:20:53 Node kernel:
Jul 15 17:33:31 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:31 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:31 Node kernel: ata1.00: link offline, clearing class 1 to NONE
Jul 15 17:33:32 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:32 Node kernel: ata1.00: link offline, clearing class 1 to NONE
Jul 15 17:33:32 Node kernel: ata1.00: disabled
Jul 15 17:33:32 Node kernel: ata1.00: detaching (SCSI 1:0:0:0)
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Stopping disk
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
Jul 15 17:33:32 Node rc.diskinfo[12031]: SIGHUP received, forcing refresh of disks info.
Jul 15 17:33:38 Node kernel: ata1: link is slow to respond, please be patient (ready=0)
Jul 15 17:33:41 Node kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 15 17:33:41 Node kernel: ata1.00: ATA-9: WDC WD60EFRX-68L0BN1, WD-WX21DC74DH70, 82.00A82, max UDMA/133
Jul 15 17:33:41 Node kernel: ata1.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 32), AA
Jul 15 17:33:41 Node kernel: ata1.00: configured for UDMA/133
Jul 15 17:33:41 Node kernel: scsi 1:0:0:0: Direct-Access ATA WDC WD60EFRX-68L 0A82 PQ: 0 ANSI: 5
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] 4096-byte physical blocks
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Write Protect is off
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Mode Sense: 00 3a 00 00
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 15 17:33:41 Node kernel: sdm: sdm1
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Attached SCSI disk
Jul 15 17:33:41 Node rc.diskinfo[12031]: SIGHUP received, forcing refresh of disks info.
Jul 15 17:33:47 Node emhttpd: read SMART /dev/sdg
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743040
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743048
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743056
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743040
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743048
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743056
node-diagnostics-20210715-1742.zip
Is this a disk or controller/cabling problem? It appears to be the onboard Intel SATA controller. I had one other disk (disk9) on the same controller throw some read errors recently, but I went ahead and replaced it and I haven't seen any more. What are my options at this point? Since it failed a write test I'm going to have to rebuild on the same disk or a new one.
EDIT: OK, so taking another look at this, it looks like UnRAID now sees the same disk as /dev/sdm? So maybe this is a controller or cabling issue after all, since it "lost" and "found" the disk again? I'm probably going to get a new disk on order anyway, but it might be worth having someone check and re-seat the cables tomorrow. If the cabling seems good, should I try a rebuild on top of the old disk? The emulated content is present and accounted for, and I have another long SMART test of the drive running.
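A quick way to cross-check a notification like the one in item 3 against what the drive itself reports is to read SMART attribute 1 (Raw_Read_Error_Rate) straight from smartctl rather than relying on the GUI. The sketch below is a minimal Python example, assuming smartctl is installed and that /dev/sdd (the device named in that notification) is the disk of interest; the parsing is deliberately naive, and the meaning of the raw value varies by vendor, which is exactly why WD and Seagate drives report such different numbers for this attribute.

    import subprocess

    def raw_read_error_rate(device: str) -> str:
        """Return the raw value of SMART attribute 1 as printed by smartctl -A."""
        out = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            fields = line.split()
            # Attribute table rows start with the attribute ID; ID 1 is Raw_Read_Error_Rate.
            if fields and fields[0] == "1" and "Raw_Read_Error_Rate" in line:
                return fields[-1]  # last column is the raw value
        raise ValueError(f"attribute 1 not found for {device}")

    if __name__ == "__main__":
        # /dev/sdd is taken from the notification in item 3; adjust for your system.
        print(raw_read_error_rate("/dev/sdd"))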
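For context on the parity mismatches discussed in items 14 and 17: UnRAID's single parity (P) is, conceptually, a bitwise XOR of the corresponding sector on every data disk, and a check flags a sector whenever the stored parity no longer equals that XOR. The toy Python sketch below illustrates the idea, and why a single bit flipped in bad RAM while the data is passing through memory during the check can surface as a "mismatch" even though nothing on disk is wrong; it is only an illustration of the concept, not UnRAID's actual implementation.

    from functools import reduce

    def parity_of(sectors):
        """XOR the same-numbered sector from every data disk to get the P parity sector."""
        return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*sectors))

    # Toy 8-byte "sectors" from three data disks, plus the parity computed from them.
    data = [b"\x11\x22\x33\x44\x55\x66\x77\x88",
            b"\xaa\xbb\xcc\xdd\xee\xff\x00\x11",
            b"\x01\x02\x03\x04\x05\x06\x07\x08"]
    stored_parity = parity_of(data)

    # A clean re-read matches, so the check reports nothing for this sector.
    assert parity_of(data) == stored_parity

    # Flip one bit of one disk's data "in RAM" during the check and the same
    # on-disk contents now look like a parity mismatch.
    flipped = bytearray(data[0])
    flipped[3] ^= 0x01
    print("mismatch:", parity_of([bytes(flipped), data[1], data[2]]) != stored_parity)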
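One detail worth pulling out of the item 22 log: each "md: disk5 read error" sector is exactly 64 less than the "dev sdb" sector in the matching blk_update_request line (9794748560 vs 9794748624, and so on), which matches the start offset of the array partition on that disk. The Python sketch below just shows that mapping, using the three pairs from the log. The 64-sector offset is simply what these particular lines show, so confirm the partition start on your own disk (for example with fdisk -l) before relying on it to feed a device LBA to tools like smartctl or hdparm.

    # Map UnRAID md error sectors back to raw device LBAs.
    PARTITION_START = 64  # start sector of the array partition, as observed in the item 22 log

    def md_to_device_sector(md_sector: int, partition_start: int = PARTITION_START) -> int:
        """In these logs the md sector is relative to the array partition, not the whole device."""
        return md_sector + partition_start

    # The three (md sector, device sector) pairs reported in the item 22 log.
    pairs = [
        (9794748560, 9794748624),
        (3911879016, 3911879080),
        (9794450848, 9794450912),
    ]

    for md_sector, dev_sector in pairs:
        assert md_to_device_sector(md_sector) == dev_sector
        print(f"md sector {md_sector} -> device sector {dev_sector}")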