drawde

Members
  • Posts

    332

Everything posted by drawde

  1. not sure if there's a way to download a SMART self-test but got the following:

        Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
        # 1  Extended offline  Completed without error  00%        2756             -

    Looks good to me. Going to run a preclear just for peace of mind before adding back into action (possibly as a 2nd parity for now). Thanks!

    EDIT: hmm, doesn't seem to want to mount under unassigned devices so i can run a preclear on it.. getting the following:

        Jul 20 17:31:19 Tower unassigned.devices: Adding disk '/dev/sdt1'...
        Jul 20 17:31:19 Tower unassigned.devices: Mount drive command: /sbin/mount -t xfs -o rw,noatime,nodiratime '/dev/sdt1' '/mnt/disks/ST16000NM001G-2KK103_ZL2ATVRA'
        Jul 20 17:31:19 Tower kernel: XFS (sdt1): Mounting V5 Filesystem
        Jul 20 17:31:19 Tower kernel: XFS (sdt1): Log inconsistent (didn't find previous header)
        Jul 20 17:31:19 Tower kernel: XFS (sdt1): failed to find log head
        Jul 20 17:31:19 Tower kernel: XFS (sdt1): log mount/recovery failed: error -5
        Jul 20 17:31:19 Tower kernel: XFS (sdt1): log mount failed
        Jul 20 17:31:19 Tower unassigned.devices: Mount of '/dev/sdt1' failed: 'mount: /mnt/disks/ST16000NM001G-2KK103_ZL2ATVRA: can't read superblock on /dev/sdt1. '

    i tried xfs_repair -v /dev/sdt1 but it just started spamming .'s. I'll wait until parity is finished rebuilding on my other drive and try again later.
  2. Next time I will remember to capture the diagnostics immediately. However I also attached logs that go back to April, not sure if there is anything useful there, but you can see when the drive actually first started having issues yesterday.
  3.
        ID#  ATTRIBUTE_NAME       FLAGS   VALUE  WORST  THRESH  FAIL  RAW_VALUE
          1  Raw_Read_Error_Rate  POSR--  071    064    044     -     12246579
          7  Seek_Error_Rate      POSR--  080    060    045     -     92287026

    Thanks! Is that raw value the actual number of sectors? Are numbers this big consistent with the drive experiencing some failures and being disabled only once?
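    On the raw-value question: a commonly repeated community interpretation, which is an assumption here and is not confirmed anywhere in this thread, is that Seagate packs two counters into the 48-bit raw value of attributes 1 and 7, with the upper 16 bits counting errors and the lower 32 bits counting total operations. A minimal Python sketch under that assumption, using the two raw values quoted above (`split_seagate_raw` is a hypothetical helper name):

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit SMART raw value into (error_count, operation_count).

    ASSUMPTION: upper 16 bits = errors, lower 32 bits = operations.
    This layout is community lore, not something stated in the post.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Raw values copied from the SMART output in the post.
for name, raw in [("Raw_Read_Error_Rate", 12246579), ("Seek_Error_Rate", 92287026)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

    Both quoted values are smaller than 2^32, so under this reading the error half is zero and the large number would be an operation count rather than a count of bad sectors. Whether that reading applies to this particular drive is not established by the post.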
  4. man, i really am messing up lol.. the drive in question is ST16000NM001G-2KK103 (sdt), also the only 16tb drive.
  5. LOL sorry, yes i did have a question. Is the hdd likely dying or could it be something else? i was never good at deciphering the smart reports, any feedback or tips would be appreciated. hardware-wise all i've done so far is reseat the drive. but took it out of service already. It's only been a few months. drive is still attached and running an extended SMART currently. if smart comes up good was thinking of running a preclear or two on it.
  6. I messed up and forgot to get diagnostics before rebooting (doh!) and I'm currently rebuilding parity on a new drive. I've attached diagnostics anyways. I do have syslogs dating back to april that hopefully are helpful though. tower-diagnostics-20210719-2046.zip syslog.txt.log
  7. hello all, trying to move some data around and i'm getting the following error:

        I: 2021/03/15 23:16:14 core.go:710: Command Started: (src: /mnt/disk10) rsync -avPR -X "TV/Show" "/mnt/disk6/"
        I: 2021/03/15 23:16:15 core.go:767: command:retcode(0):exitcode(0)
        I: 2021/03/15 23:16:15 core.go:1028: Command Finished
        I: 2021/03/15 23:16:15 core.go:1041: Current progress: 100.00% done ~ 0s left (10792.87 MB/s)
        I: 2021/03/15 23:16:15 core.go:972: removing:(rm -rf "/mnt/disk10/TV/Neds Declassified School Survival Guide")
        W: 2021/03/15 23:16:15 shell.go:51: transferProgress::(rm: cannot remove '/mnt/disk10/TV/Show Guide/banner.jpg': Read-only file system)
        W: 2021/03/15 23:16:15 shell.go:51: transferProgress::(rm: cannot remove '/mnt/disk10/TV/Show Guide/fanart.jpg': Read-only file system)
        W: 2021/03/15 23:16:15 shell.go:99: transferProgress:: waitError: exit status 1
        W: 2021/03/15 23:16:15 core.go:985: Unable to remove source folder (/mnt/disk10/TV/Show): exit status 1

    it appears the move worked: the files exist on the new drive, but they are also still on the source drive.
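    The "Read-only file system" message means the delete was refused by the mount itself, not by file permissions. One possible explanation, which the post does not confirm, is that /mnt/disk10 had been remounted read-only. On Linux the mount options are listed in /proc/mounts, and a small Python sketch can pick out the read-only entries. The sample text below is hypothetical and only illustrates the file format; it is not taken from the server in question.

```python
def read_only_mounts(mounts_text: str) -> list[str]:
    """Return mount points whose option list contains 'ro'.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            result.append(fields[1])
    return result


# HYPOTHETICAL sample in /proc/mounts format (device names are made up).
SAMPLE = """\
/dev/md10 /mnt/disk10 xfs ro,noatime,nodiratime 0 0
/dev/md6 /mnt/disk6 xfs rw,noatime,nodiratime 0 0
"""

print(read_only_mounts(SAMPLE))
```

    On a live system the same function could be fed the contents of /proc/mounts instead of the sample string.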
  8. is this hitting million CPU suddenly for anyone else? had to stop it. every time i turn it on, my CPU usage goes through the roof.
  9. i'm having the same issue. thoughts? i've seen this in the past, i think OVMF doesn't work for me, as in other VMs i've also had to use seabios. EDIT: nevermind, seems the instructions have been updated. followed and working now.
  10. After my last parity check I'm seeing some read errors on one of my drives. I did an extended smart and they did not increase, and also it's been a few days since my parity check and everything seems to be working okay. anything of concern? tower-smart-20191003-2355.zip tower-diagnostics-20191004-0356.zip
  11. CPU temp missing all of a sudden for anyone else?
  12. hello all, had unraid reboot on me randomly today. i have a script that copies the log to another location every 5 minutes, didn't see anything in there. when it came back online, Fix Common Problems reported an issue:

        Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged.

    I didn't have mcelog installed, so I installed it. But I guess this only logs going forward? In any case, my logs from after the reboot (before mcelog was installed) showed the following messages:

        May 28 00:16:00 Tower kernel: smpboot: CPU0: AMD Ryzen 7 1700 Eight-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
        May 28 00:16:00 Tower kernel: Performance Events: Fam17h core perfctr, AMD PMU driver.
        May 28 00:16:00 Tower kernel: ... version: 0
        May 28 00:16:00 Tower kernel: ... bit width: 48
        May 28 00:16:00 Tower kernel: ... generic registers: 6
        May 28 00:16:00 Tower kernel: ... value mask: 0000ffffffffffff
        May 28 00:16:00 Tower kernel: ... max period: 00007fffffffffff
        May 28 00:16:00 Tower kernel: ... fixed-purpose events: 0
        May 28 00:16:00 Tower kernel: ... event mask: 000000000000003f
        May 28 00:16:00 Tower kernel: rcu: Hierarchical SRCU implementation.
        May 28 00:16:00 Tower kernel: smp: Bringing up secondary CPUs ...
        May 28 00:16:00 Tower kernel: x86: Booting SMP configuration:
        May 28 00:16:00 Tower kernel: .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13
        May 28 00:16:00 Tower kernel: mce: [Hardware Error]: Machine check events logged
        May 28 00:16:00 Tower kernel: mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: bea0000000000108
        May 28 00:16:00 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81654a1a MISC d012000101000000 SYND 4d000000 IPID 500b000000000
        May 28 00:16:00 Tower kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1559016932 SOCKET 0 APIC d microcode 8001126
        May 28 00:16:00 Tower kernel: #14 #15
        May 28 00:16:00 Tower kernel: smp: Brought up 1 node, 16 CPUs

    Is the zenstates fix still required for 6.7? I have my C-states in BIOS disabled already.

    tower-diagnostics-20190528-0431.zip
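    Two fields in the machine-check lines above can be read without mcelog. The sketch below is a partial decode, not a diagnosis: it extracts only the flag bits in the top of the bank status word (bits 57 to 63, which follow the standard x86 machine-check layout) and converts the TIME field, which the kernel records as seconds since the Unix epoch. The lower bits are a bank-specific error code that needs the vendor documentation, so they are left alone. `decode_status` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Flag bits in the upper part of an x86 machine-check bank status register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context may be corrupted
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits that are set in a status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


# Values copied from the log lines in the post.
status = 0xBEA0000000000108
logged_at = datetime.fromtimestamp(1559016932, tz=timezone.utc)

print(decode_status(status))
print(logged_at.isoformat())
```

    For the status word in the post this reports an uncorrected error with valid ADDR and MISC registers, and the timestamp falls on 2019-05-28 at 04:15:32 UTC, shortly before the syslog lines that replay it at boot. That would be consistent with the error having been recorded around the time of the unexpected reboot, though the post itself does not establish the cause.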
  13. how concerned should i be? just upgraded to 6.7 a few hours ago, not sure if related but saw these errors in my syslog after CA updated my dockers (usual on sunday AM).

        May 12 00:19:32 Tower rc.diskinfo[4610]: SIGHUP received, forcing refresh of disks info.
        May 12 00:19:32 Tower kernel: EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
        May 12 00:19:32 Tower kernel: Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
        May 12 00:19:32 Tower kernel: (Note that use of the override may cause unknown side effects.)
        May 12 00:19:37 Tower kernel: sd 13:0:0:0: [sdb] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
        May 12 00:19:37 Tower kernel: sd 13:0:0:0: [sdb] tag#4 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
        May 12 00:19:37 Tower kernel: print_req_error: I/O error, dev sdb, sector 3907028992
        May 12 00:19:37 Tower kernel: sd 13:0:3:0: [sde] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
        May 12 00:19:37 Tower kernel: sd 13:0:3:0: [sde] tag#1 CDB: opcode=0x88 88 00 00 00 00 01 d1 c0 be 00 00 00 00 08 00 00
        May 12 00:19:37 Tower kernel: print_req_error: I/O error, dev sde, sector 7814036992
        May 12 00:19:41 Tower kernel: sd 13:0:1:0: [sdc] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
        May 12 00:19:41 Tower kernel: sd 13:0:1:0: [sdc] tag#2 CDB: opcode=0x88 88 00 00 00 00 01 d1 c0 be 00 00 00 00 08 00 00
        May 12 00:19:41 Tower kernel: print_req_error: I/O error, dev sdc, sector 7814036992
        May 12 00:19:46 Tower kernel: sd 13:0:11:0: [sdm] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
        May 12 00:19:46 Tower kernel: sd 13:0:11:0: [sdm] tag#23 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
        May 12 00:19:46 Tower kernel: print_req_error: I/O error, dev sdm, sector 3907028992
        May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP received, forcing refresh of disks info.
        May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:46 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info. 
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP received, forcing refresh of disks info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:47 Tower rc.diskinfo[4610]: SIGHUP ignored - already refreshing disk info.
        May 12 00:19:48 Tower rc.diskinfo[4610]: SIGHUP received, forcing refresh of disks info.

    the webUI does not show any errors though next to any of the disks and all disks are green. the rc.diskinfo and the ECC disabled i've seen before. I don't have ECC ram so i don't think that's something i should care about, and the rc.diskinfo i believe is just the preclear plugin, which i'm not running so i'm not too worried about. mostly the print_req_error I/O error i'm concerned about. it's the same 2 sectors across 4 different drives?

        May 12 00:19:37 Tower kernel: print_req_error: I/O error, dev sdb, sector 3907028992
        May 12 00:19:46 Tower kernel: print_req_error: I/O error, dev sdm, sector 3907028992
        May 12 00:19:37 Tower kernel: print_req_error: I/O error, dev sde, sector 7814036992
        May 12 00:19:41 Tower kernel: print_req_error: I/O error, dev sdc, sector 7814036992

    tower-diagnostics-20190512-0508.zip
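    The sector numbers in the print_req_error lines can be cross-checked against the CDB bytes logged just before them. The sketch below decodes the two command formats that appear in the log, READ(10) (opcode 0x28) and READ(16) (opcode 0x88), using the standard SCSI field positions. `decode_read_cdb` is a hypothetical helper name, and the 512-byte sector size used for the byte offset is an assumption, since the post does not state it.

```python
def decode_read_cdb(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) for a READ(10) or READ(16) CDB.

    cdb_hex is the space-separated hex dump as printed by the kernel.
    """
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode == 0x28:  # READ(10): LBA in bytes 2-5, length in bytes 7-8
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    if opcode == 0x88:  # READ(16): LBA in bytes 2-9, length in bytes 10-13
        return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    raise ValueError(f"unsupported opcode 0x{opcode:02x}")


SECTOR_BYTES = 512  # ASSUMPTION: logical sector size, not stated in the post

# CDB dumps copied from the log lines in the post.
for cdb_hex in (
    "28 00 e8 e0 88 00 00 00 08 00",
    "88 00 00 00 00 01 d1 c0 be 00 00 00 00 08 00 00",
):
    lba, blocks = decode_read_cdb(cdb_hex)
    print(f"lba={lba} blocks={blocks} offset_tb={lba * SECTOR_BYTES / 1e12:.2f}")
```

    Both decoded LBAs match the sectors in the error lines, and each request is for 8 blocks. With 512-byte sectors the offsets come out at about 2.0 TB and 4.0 TB, which would be consistent with reads very close to the end of 2 TB and 4 TB drives, although the drive sizes are not given in the post.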
  14. Downloaded coppit's plg from github, edited the fields SimonF mentioned, copied the file to /boot/config/plugins and installed the plg via the webUI and snmp is working again on 6.7 (stable). I didn't install netsnmp prior as mentioned, i figured the plg would install it which it did. Haven't tried rebooting and see if it installs on boot yet though (but i don't see why it wouldn't?) If anything i can just install the new plg manually again.
  15. i thought something was wrong with my plex install since that was the docker that i kept having to restart. i wiped out all the databases and metadata and had it rescrape/rebuild everything. Doesn't seem to have solved the issue. i've also disabled c-states in my bios since apparently that's a thing for gen 1 ryzen.

    i'm starting to think that high cpu usage overall is causing docker to hang/become unresponsive at times. when my CPU utilization is high (nzbget downloading/unpacking something while plex is transcoding), all my dockers slow down almost to a halt. oddly i didn't have this issue on my phenom II (passmark 3.5k) compared to the ryzen which is 13.7k.

    to add some additional information, although not sure if important, i monitor my system with observium and the SNMP plugin, just to output a graph to a small status page i have. During these "lockups" there are gaps in data, i'm guessing because either the observium docker or SNMP plugin (or both) stops working along with most of my other dockers. when CPU goes down it will typically start again. no errors or anything in the syslog during these times. example:

    any thoughts? could it be the I/O that is causing issues? should i move my appdata to an SSD?
  16. no, i have mover currently only running once per month. been seeing these lockups every few days.
  17. hey all. i recently upgraded my CPU, mobo, ram, and SATA card(s) (basically almost everything aside from PSU and HDDs) from an old AMD Phenom II based system, with a jumble of random RAID/SATA controllers. It was working fine and was stable with the old hardware, but just felt like it was time for an upgrade.

    Every couple of days I'm seeing issues where CPU utilization will spike up and load averages are hitting high double digits (20-40+). Most things I'm running (dockers) still work, albeit very slowly. Usually I'm able to resolve it by starting/stopping some of the dockers at random. Eventually I'll just stop/start all of them and that almost always fixes it.

    The last time this happened, I ran docker stats first and none of them were using a particularly large amount of CPU: one of them was at 25% and the rest were in single digits. Restarting that docker resolved the issue, leading me to believe that is where the issue is.

    I haven't changed or added any dockers since changing hardware. The new hardware should be noticeably better than my old hardware but it seems like it's struggling to keep up with the load at certain times, whereas on the old hardware I've gone months and months without touching it at all. Any thoughts/suggestions?

    Current hardware:
      • Gigabyte AB-350 Gaming 3
      • AMD Ryzen 7 1700
      • G.SKILL NT Series 16GB
      • LSI 6Gbps SAS HBA LSI 9201-8i
      • IBM 46M0997 ServeRAID Expansion Adapter 16-Port SAS Expander
      • Norco RPC-4220 case/backplane

    tower-diagnostics-20190423-1415.zip
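    For scale, a load average is easier to judge when divided by the number of logical CPUs. On Linux the load average counts tasks blocked in uninterruptible I/O as well as tasks that are running or waiting for a CPU, so a high figure alongside low per-container CPU in docker stats can point at storage waits rather than compute. A minimal Python sketch of the arithmetic, assuming the 16 logical CPUs of a Ryzen 7 1700 (`load_per_cpu` is a hypothetical helper name):

```python
def load_per_cpu(load_average: float, logical_cpus: int) -> float:
    """Average number of running-or-blocked tasks per logical CPU."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return load_average / logical_cpus


LOGICAL_CPUS = 16  # ASSUMPTION: 8 cores / 16 threads on a Ryzen 7 1700

# The range of load averages mentioned in the post.
for load in (20.0, 40.0):
    print(f"load {load:.0f} -> {load_per_cpu(load, LOGICAL_CPUS):.2f} tasks per CPU")
```

    A value above 1.0 means more tasks are queued or blocked than there are CPUs to serve them. It does not by itself say whether they are waiting on CPU or on disk.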
  18. where can i set this? under scheduler settings i only have parity check and mover.
  19. There is also the SNMP plugin where you can send temp info to a nms like observium or librenms.
  20. for, i wanna say, the last 2 times i've upgraded unraid, i didn't get any notifications for the unraid plg like normal. has this been removed and we have to manually check for updates now?
  21. drawde

    UPS load @ 50%

    i guess i'll keep the battery and move it to another place just for like my laptop and a couple smaller devices. any recommendations for a little beefier UPS?
  22. yes definitely the battery is due to be replaced and is actually on order right now. apcupsd is reporting about 6 minutes @ 342 Watt (57.0%). i don't remember what the runtime was when i first got the UPS. i'm assuming i could expect a better runtime with a new battery?
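    The apcupsd figures above imply two numbers that help when sizing a replacement: the capacity the load percentage is measured against, and the energy the battery would deliver at this load. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and it treats the reported runtime as exact, which is a simplification because battery runtime does not scale linearly with load.

```python
def implied_capacity_watts(load_watts: float, load_percent: float) -> float:
    """Capacity that makes load_watts equal to load_percent of full load."""
    return load_watts / (load_percent / 100.0)


def energy_delivered_wh(load_watts: float, runtime_minutes: float) -> float:
    """Energy in watt-hours drawn over the reported runtime at this load."""
    return load_watts * runtime_minutes / 60.0


# Figures quoted from apcupsd in the post.
capacity = implied_capacity_watts(342, 57.0)
energy = energy_delivered_wh(342, 6)

print(f"implied capacity: {capacity:.0f} W")
print(f"energy at this load: {energy:.1f} Wh")
```

    The same two functions can be rerun after the new battery arrives to compare the delivered energy at a similar load.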
  23. I have a CyberPower CP1000PFCLCD UPS for my main desktop, unraid server and a network switch. I'm not sure about full load, but with my desktop at about half load and unraid rebuilding a drive, my UPS is reporting about 45-50% load and about a 6-8 minute runtime. Is this good enough or is it time to upgrade my UPS? I just need to survive quick power blips and have enough time to turn off my desktop and unraid server.