KDP

  1. The initial parity check ran, found a few thousand sync errors, and fixed them. The second parity check had no errors, and I haven't had a shutdown since.
  2. I am not too experienced in the use of syslog. I just have the UNRAID server reporting to a free program I downloaded called win-syslog. Maybe it is some kind of heartbeat?
  3. I woke up at 10 am and restarted the server from an off state. As previously mentioned, there's nothing but time entries from when I went to bed to when I started it up again. log_1166679578.log
  4. As of last night I am having random shutdowns, and I am just looking to see if I am troubleshooting in an efficient manner. I originally built my server over a decade ago. Two years ago I changed the case, power supply and fans. I am currently running UNRAID 6.11.5.
     Last night I was downloading a large series of files and adding them to my array when my server first shut down. When I turned it back on and logged in, I received the notice that the unassigned device that I use for all of my extracting had returned to normal temperature. I assumed that this was the reason the server shut down and continued on with my evening. My downloading and extracting completed without any further issues and I stopped worrying. About an hour later my server shut down again. I started it back up and did not see any notices that would have indicated what may have happened, so I turned on the syslog server to monitor further issues. A couple of hours later the server was still running and I went to bed.
     I woke up this morning and the server was powered down again. I looked in the syslog server and there is nothing but time entries since my last action before bed (I changed a share from public to private). I then looked into my IPMI while the server was down, and the only entries I have are months old (probably my last reboot prior to the current problem) and there is mention of FAN1:
     30009 2023/05/01 20:23:03 FAN 1 Fan Lower Non-Critical - Going Low - Asserted
     30010 2023/05/01 20:23:03 FAN 1 Fan Lower Critical - Going Low - Asserted
     30011 2023/05/01 20:23:03 FAN 1 Fan Lower Non-Recoverable - Going Low - Asserted
     30012 2023/05/01 20:24:36 FAN 1 Fan Lower Non-Recoverable - Going Low - Deasserted
     30013 2023/05/01 20:24:36 FAN 1 Fan Lower Critical - Going Low - Deasserted
     30014 2023/05/01 20:24:36 FAN 1 Fan Lower Non-Critical - Going Low - Deasserted
     This error does not appear for the two power-ups after the shutdowns. All other reports for temps and voltage are green (good).
     So I decided to open up the server, blow the dust out and make sure everything was seated, and everything seemed seated well. I started up the server again and watched the fans. One CPU fan looks abnormal at the low RPM on boot. It would spin smoothly, then hitch for a fraction of a second, then spin smoothly again. Once the server booted up and the RPMs increased, it ran fine. For peace of mind I will be ordering a replacement. I checked the IPMI again and see new entries which I assume are referencing the fan in question:
     30015 1970/01/14 19:47:42 FAN 1 Fan Lower Non-Critical - Going Low - Asserted
     30016 1970/01/14 19:47:42 FAN 1 Fan Lower Critical - Going Low - Asserted
     30017 1970/01/14 19:47:43 FAN 1 Fan Lower Non-Recoverable - Going Low - Asserted
     30018 1970/01/14 19:50:04 FAN 1 Fan Lower Non-Recoverable - Going Low - Deasserted
     30019 1970/01/14 19:50:55 FAN 1 Fan Lower Critical - Going Low - Deasserted
     30020 1970/01/14 19:50:55 FAN 1 Fan Lower Non-Critical - Going Low - Deasserted
     30021 1970/01/14 19:52:36 FAN 1 Fan Lower Non-Critical - Going Low - Asserted
     30022 1970/01/14 19:53:30 FAN 1 Fan Lower Non-Critical - Going Low - Deasserted
     I am not really sure whether the 1970 dates are just because it had not yet synced with an NTP server. For peace of mind again, I will be replacing the CMOS battery. However, looking at the time page in the IPMI, the time is now correct.
     I have included my diagnostic file, although it is post-boot and will likely give no useful information. The server is again running a parity check, as it does when an unclean shutdown has occurred, and everything is in the green both in the UNRAID dashboard and in the IPMI. Should I be looking at anything else or performing any other diagnostics?
     elvis-diagnostics-20230805-1120.zip
  5. I have already run a scrub and implemented the noted script. I believe that I dislodged a cable while in the case and am hoping after reseating them that this was indeed the issue. I just wanted a second opinion. Thank you for taking the time to take a look at everything!
  6. Diagnostics elvis-diagnostics-20230223-0852.zip
  7. Two days ago I swapped out some drives from my array. When I rebooted the server I started getting a lot of errors on one of the SSD drives in my cache pool and my docker image was corrupted. I went back into my server and made sure that all of the cables were secure on the SSD drives. I have not seen any errors in my log since and have rebuilt all my dockers. I have run btrfs dev stats /mnt/cache/ and it shows the following:
     [/dev/sdb1].write_io_errs 32262152
     [/dev/sdb1].read_io_errs 30052334
     [/dev/sdb1].flush_io_errs 90551
     [/dev/sdb1].corruption_errs 460309
     [/dev/sdb1].generation_errs 4267
     [/dev/sdc1].write_io_errs 0
     [/dev/sdc1].read_io_errs 0
     [/dev/sdc1].flush_io_errs 0
     [/dev/sdc1].corruption_errs 0
     [/dev/sdc1].generation_errs 0
     Those numbers have not changed at all for the last 24 hours that the system has been running. I also ran a SMART extended test and it shows:
     1 Raw read error rate 0x0032 100 100 050 Old age Always Never 0
     5 Reallocated sector count 0x0032 100 100 050 Old age Always Never 0
     9 Power on hours 0x0032 100 100 050 Old age Always Never 34814 (3y, 11m, 17d, 14h)
     12 Power cycle count 0x0032 100 100 050 Old age Always Never 34
     160 Unknown attribute 0x0032 100 100 050 Old age Always Never 0
     161 Unknown attribute 0x0033 100 100 050 Pre-fail Always Never 100
     163 Unknown attribute 0x0032 100 100 050 Old age Always Never 10
     164 Unknown attribute 0x0032 100 100 050 Old age Always Never 286621
     165 Unknown attribute 0x0032 100 100 050 Old age Always Never 2130
     166 Unknown attribute 0x0032 100 100 050 Old age Always Never 339
     167 Unknown attribute 0x0032 100 100 050 Old age Always Never 581
     168 Unknown attribute 0x0032 100 100 050 Old age Always Never 7000
     169 Unknown attribute 0x0032 100 100 050 Old age Always Never 92
     175 Program fail count chip 0x0032 100 100 050 Old age Always Never 0
     176 Erase fail count chip 0x0032 100 100 050 Old age Always Never 0
     177 Wear leveling count 0x0032 100 100 050 Old age Always Never 0
     178 Used rsvd block count chip 0x0032 100 100 050 Old age Always Never 0
     181 Program fail count total 0x0032 100 100 050 Old age Always Never 0
     182 Erase fail count total 0x0032 100 100 050 Old age Always Never 0
     192 Power-off retract count 0x0032 100 100 050 Old age Always Never 15
     194 Temperature celsius 0x0022 100 100 050 Old age Always Never 40
     195 Hardware ECC recovered 0x0032 100 100 050 Old age Always Never 278670
     196 Reallocated event count 0x0032 100 100 050 Old age Always Never 0
     197 Current pending sector 0x0032 100 100 050 Old age Always Never 0
     198 Offline uncorrectable 0x0032 100 100 050 Old age Always Never 0
     199 UDMA CRC error count 0x0032 100 100 050 Old age Always Never 1
     232 Available reservd space 0x0032 100 100 050 Old age Always Never 100
     241 Total lbas written 0x0030 100 100 050 Old age Offline Never 2296155
     242 Total lbas read 0x0030 100 100 050 Old age Offline Never 1118873
     245 Unknown attribute 0x0032 100 100 050 Old age Always Never 2687976
     Warning: ATA error count 0 inconsistent with error log pointer 1
     ATA Error Count: 0
     CR = Command Register [HEX]
     FR = Features Register [HEX]
     SC = Sector Count Register [HEX]
     SN = Sector Number Register [HEX]
     CL = Cylinder Low Register [HEX]
     CH = Cylinder High Register [HEX]
     DH = Device/Head Register [HEX]
     DC = Device Command Register [HEX]
     ER = Error register [HEX]
     ST = Status register [HEX]
     Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.
     Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
     When the command that caused the error occurred, the device was active or idle.
     After command completion occurred, registers were:
     ER ST SC SN CL CH DH
     -- -- -- -- -- -- --
     00 00 00 00 00 00 00
     Commands leading to the command that caused the error were:
     CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
     -- -- -- -- -- -- -- -- ---------------- --------------------
     e5 00 00 00 00 00 00 08 00:00:00.000 CHECK POWER MODE
     b0 d5 01 00 4f c2 00 08 00:00:00.000 SMART READ LOG
     b0 d1 01 01 4f c2 00 08 00:00:00.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
     ec 00 01 00 00 00 00 08 00:00:00.000 IDENTIFY DEVICE
     b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
     I will keep a close eye on it for a few more days, but can I chalk it up to my bumbling in the case, or should I be concerned about the drive?
  8. I just finished copying all my pages to text documents to paste over to the new wiki. I deleted and cleared the data from the old docker and reinstalled using "edge". Email seems to be working again. I have also restarted the docker a few times and it connects properly every time so far. I am about to start putting my pages back in and will let you know if I experience any future problems. Thanks for your time.
  9. This morning I have restarted the container about 10 times and it has worked every time, which is unusual, so I cannot get you an NGINX log. If you believe that the "edge" version is more stable, I am more than happy to try it. Does this mean I will lose all my current pages? If so, that is not a huge deal, as I only have about 20 or so and can simply copy them into text documents to paste into the new version. I am just looking for stability at this point (and a working email function).
  10. Here is the log file you requested. Thanks for taking the time to look at it. I restarted the docker and it fails to allow me to connect to the web interface. I then pulled the log file so it should hopefully show where the issue is coming from. mediawiki-json.log
  11. Will just the UnRaid logs be sufficient? There did not seem to be any logs inside the docker for MediaWiki. Any other information that I could include that would be helpful?
  12. I am having issues getting email to work. It worked once, for some reason, when I first set the docker up and authenticated my account, and it has not worked since. I am seeing the error "Unknown error in PHP's mail() function." All of the email settings match those of my standalone MediaWiki, and I have re-verified them as being accurate. An additional problem is that any time I restart the docker there is a 50/50 chance that I won't be able to access the MediaWiki webpage. I then have to delete and reinstall the docker, after which my wiki is back up.
  13. It apparently isn't the cable as I tried a known good one and ended up with the same problem. I just wish I had tested it when WD sent it to me years ago. Thanks for taking the time to assist me.
  14. Thanks so much for pointing that out Johnnie. I have never read a SMART report before and I was a bit overwhelmed. I was googling how to decipher the data and everything always talked about how to look at the values it outputs and what they mean so I completely overlooked a lot of the remaining output.
  15. I am currently in the process of upgrading my unraid server. I had 3 drives that were not in my current unraid box. I memtested my system for 36 hours with no problems. I attached the 3 drives and started the preclear plugin on them. One drive failed almost instantly. It wasn't a surprise, because it is an old WD Black 2TB and I couldn't remember where I got it from. Another is a WD Green 2TB that ran while I slept and was still running when I went to work. When I came home it had failed/aborted on pre-read. I had only just opened the factory package for this drive, which I believe was a warranty replacement for a previous failed drive. I have attached the preclear log and both the SMART long and short logs. The short test in unraid ran all the way to 90%, stayed there for a while, and finally stopped with "Errors occurred - Check SMART report". The long test did not make it past 10% and errored out as well. Is this drive just crapped out? It sucks that I didn't check it when I got it returned, because it's been sitting around for a couple of years. preclear.txt smart long.txt smart short.txt
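
For the cache pool question in post 7, the useful signal is whether the `btrfs dev stats` counters keep growing after the cables were reseated, not how large they already are. Below is a rough Python sketch of that comparison. It assumes the output is saved as text in the `[device].counter value` form quoted in that post; `parse_dev_stats` and `changed_counters` are hypothetical helper names, not part of Unraid or btrfs-progs.

```python
import re

# Matches lines such as "[/dev/sdb1].write_io_errs 32262152"
# (any amount of whitespace between the counter name and its value).
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Return {(device, counter): value} parsed from saved `btrfs dev stats` output."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[(match["dev"], match["counter"])] = int(match["value"])
    return stats


def changed_counters(before, after):
    """Return {(device, counter): (old, new)} for every counter that differs."""
    return {
        key: (before.get(key, 0), new)
        for key, new in after.items()
        if new != before.get(key, 0)
    }


# The values quoted in post 7, used here as the baseline snapshot.
SNAPSHOT = """\
[/dev/sdb1].write_io_errs 32262152
[/dev/sdb1].read_io_errs 30052334
[/dev/sdb1].flush_io_errs 90551
[/dev/sdb1].corruption_errs 460309
[/dev/sdb1].generation_errs 4267
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0
"""

if __name__ == "__main__":
    baseline = parse_dev_stats(SNAPSHOT)
    # In practice the second snapshot would come from a later run of the command;
    # comparing the baseline with itself simply reports no changes.
    print(changed_counters(baseline, baseline) or "no counters changed")
```

An empty result between two snapshots taken a day apart matches the "numbers have not changed" observation in the post. Since the counters are cumulative, it may be simpler to reset them once with `btrfs dev stats -z /mnt/cache` and then treat any non-zero value as new.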