Clobes

  1. Sep 3 04:09:13 cortana kernel: mce: [Hardware Error]: Machine check events logged
     Sep 3 04:09:13 cortana kernel: [Hardware Error]: Corrected error, no action required.
     Sep 3 04:09:13 cortana kernel: [Hardware Error]: CPU:1 (19:61:2) MC29_STATUS[Over|CE|-|-|-|-|UECC|-|Poison|-]: 0xd0072e0295011603
     Sep 3 04:09:13 cortana kernel: [Hardware Error]: IPID: 0x0b08f249ffe0f7ee
     Sep 3 04:09:13 cortana kernel: [Hardware Error]: cache level: L3/GEN, tx: INSN

     This keeps repeating in the syslog about every 5 minutes, give or take. Diag attached. It started right after I selected the replacement parity drive and started rebuilding the parity. My Googling so far has led me to believe that this is an actual hardware error. From the message itself, I am guessing it's an L3 cache error.

     I had just finished pre-clearing the new drive, a 70+ hour process. I stopped the array, deselected the existing 4TB parity drive and selected the new 18TB drive. At that point I restarted the array and stopped the parity rebuild to power down the server, remove the old parity drive, and slot the new one into the front HDD bay. I powered on the server and began the parity rebuild, at which point I started getting these MCE errors, which FCP notified me about.

     I don't know what that whole process has to do with L3 errors on my CPU, so I'm inclined to think it's just a coincidence, but I don't really know. I've been running this hardware for many months trouble free up until now. Thoughts? cortana-diagnostics-20230903-0404.zip
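For anyone who wants to read the raw status value in the log above rather than rely only on the kernel's bracketed summary, here is a rough Python sketch that pulls a few flag bits out of `0xd0072e0295011603`. The bit positions are an assumption based on the Linux kernel's x86 machine-check definitions and are not stated anywhere in the post; `decode_mca_status` and `STATUS_BITS` are made-up names for illustration. It only decodes flags and does not say which component is at fault.

```python
# Bit positions assumed from the Linux kernel's x86 MCE definitions;
# they are not taken from the post and should be checked against your kernel.
STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten (overflow)
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting was enabled
    "MiscV": 59,     # MISC register holds valid data
    "AddrV": 58,     # ADDR register holds valid data
    "PCC": 57,       # processor context corrupt
    "CECC": 46,      # correctable ECC error
    "UECC": 45,      # uncorrectable ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data was consumed
}


def decode_mca_status(status: int) -> dict[str, bool]:
    """Return a name -> bool map for the flag bits listed in STATUS_BITS."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


if __name__ == "__main__":
    flags = decode_mca_status(0xD0072E0295011603)
    for name, is_set in flags.items():
        print(f"{name:8s} {'set' if is_set else '-'}")
```

Under those assumed bit positions, the set flags line up with the `Over`, `CE` (UC clear), `UECC`, and `Poison` fields that the kernel printed.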
  2. Hi Reggie. I hope you're still having good uptime. Just wanted to update here since I was able to "fix" whatever was causing the shutdowns. My last post here in October said that things had gotten better, but they hadn't. The crashes kept happening after a short stint of not crashing. I continued adjusting settings both in the BIOS and in Unraid but it kept crashing.

     So, after that last post I got so sick of all of the issues that I decided to get rid of Unraid entirely, thinking that it was an Unraid issue (that didn't last long and I'm now happily back on Unraid). I moved all of my storage back to my old hardware, installed TrueNAS Core, got 10Gb NICs and a 10Gb switch, and decided to use the new hardware as a Proxmox host and access storage over iSCSI. That all seemed to be stable in terms of not crashing, but I wrestled with share/file permissions on TrueNAS for an entire ******* week and could not get a working configuration. That was one of the most miserable weeks of my life in terms of tech stuff. I was ready to sell everything, genuinely. I felt so completely stupid for not being able to figure out permissions when probably millions of other people before me were able to figure it out.

     At that point I missed the simplicity of Unraid, to a painful degree. I ripped all of my storage out of my old hardware again and put it back into the new system. By this point all of my drives had been wiped many times over, so I couldn't just pick back up where I left off on Unraid. I had to start completely from scratch, which it turns out was the best thing I could have done. I started from a fresh install of Unraid on a fresh USB flash drive, copied my data back over from my backup server, set up all my Docker containers, spun up a critical VM or two, and basically just waited at that point to see what would happen. About a week went by with no issues whatsoever.

     Since then, I have installed a 3090 and set up a gaming VM, which has been so reliable that I've daily driven it for close to a month now. I have about 50 or 60 hours of game time on this VM with no issues. Star Citizen, Destiny 2, Splitgate, Fallout 76... all running beautifully at 4K... well, except Star Citizen, but that's not the VM's fault. My server uptime is now at 26 days 13 hours. It would be longer but I had to shut down to install an NVMe drive 26 days ago.

     If you're still having issues and you've exhausted every other option available to you, I can say that reinstalling everything is worth a shot. I ultimately don't know what was wrong with my previous installation of Unraid, but the fact that it works fine on a fresh install tells me that it must've been something that I did on the old installation to make it crash on different hardware.
  3. @jmiller180 It's been terrible. Unraid 6.11.1. It's crashing daily now, sometimes multiple times daily. XMP/EXPO is disabled, performance modes are disabled. It's about as barebones/stock as it can be. I've turned off any Docker containers that I thought might be responsible (mainly Apache Guacamole because it was mentioned in the logs during most of the crashes). I've removed basically all of the plugins I was using. I've tried this and that. Nothing is helping. At this point I consider my server and the critical services I run on it non-functional, so this week while I have time off from work I will be scrapping everything except my data and either reinstalling Unraid from scratch or trying something else. Before anyone says it, I accept that this is my fault for opting to be on the "cutting edge" with Ryzen 7000. Even so, this seems ridiculous.
  4. This is what I've been doing. Sorry if I didn't make that clear. I have it set up as a local syslog server going to a cache-only share on Unraid. I haven't tried mirroring to the flash drive, though. Is it possible that it would catch more log entries on the USB boot drive than it would using a location on the array?
  5. My adventures in "Unraid Crashing Constantly" continue... As the title states, is there a way to get more complete logs than what the built-in syslog provides? I feel like it's not showing me everything related to what is causing the crashes. My feeling is that it just can't generate and write to the log file fast enough and that it's missing something at the very moment that the server dies, in much the same way that you wouldn't be able to write down what killed you if you suddenly got shot in the face. What other options do I have?
  6. How's uptime looking now on your end? My system unexpectedly, inexplicably, and without any changes made on my part decided it was finished with the daily crashing nonsense. So, when last I posted, I thought things were stable after removing the GPU, but it had only gone from multiple daily crashes to once per day. I believe it was Jorge who suggested changing power supply idle control settings and seeing if that helped. I didn't get a chance to do that right away because I was working 12+ hour days. Well, after that, and again with no changes made to anything, it stopped crashing. I've been up 4 days and 3 hours now without a sign of trouble. When it was doing the daily crash, it seemed like both times it happened right around 2pm, so I'm wondering if there was some scheduled task that was causing it, but it wouldn't have been anything that I scheduled. In my experience, every computer I've built has had bizarre stability issues at the very beginning that resolved themselves without any intervention on my part. I've always theorized that it's just cables/connections settling in or something like that. Maybe heat cycling the plastic insulation from normal use removes any tension they were under? I don't know, but it sounds smart.
  7. I don't currently have any PCI-e devices installed. Neither GPU nor NVMe devices. The only thing passed through is a SATA SSD to a VM. I did have another crash about 20 minutes ago now. The only thing that keeps showing up in the log is a segfault from Guacamole client, but that could just be because of whatever else is causing the crash, or happening during the crash. Haven't been around the server to try adjusting idle control settings yet though. Hopefully that helps.
  8. I'll give that a shot, thanks. I was hoping that AMD might have figured out how to make that not a necessity on Linux, but I guess not. Or just a new platform like you said. I guess I'll let y'all know after a few days of testing after adjusting idle control settings. Thanks again, Jorge!
  9. Hi Unraid Forum, I am once again asking for your technical support...

     Specs:
     CPU: Ryzen 9 7950X
     MB: ASRock X670E Taichi
     RAM: Corsair Vengeance DDR5 64GB (2x32GB) 5200MHz
     GPU: None. I removed it because I thought that was causing the crashes.*

     *I still think that it was causing the crashes at first. My server would crash every few hours until I removed the GPU (EVGA 1070). Syslog was showing something to the effect of "PCI device has fallen off the bus". After removing it, I got a solid two days without a crash. I thought that was the end of it. I was just gonna wait a while, maybe until the next Unraid or BIOS update in case it was a compatibility issue with a PCI-e 3.0 GPU on a 5.0 platform, then try running with a GPU installed again and see if anything had improved.

     Well, fast forward to about 30 minutes ago (at the time of writing this) and it crashed again. Not a whole lot is showing in the syslog right before the crash. In the attached log, timestamp "Oct 22 02:32:01" I believe is right before the crash, then it picks back up after the server reboots at "Oct 22 02:52:03" as shown in the log.

     As for what I was doing at the time of the crash: I was remoted into a VM running on the server using Guacamole, which is running as a Docker container also on the server. My session froze and I couldn't load the web GUI for a few minutes. When I could get to the GUI, it had evidently just rebooted.

     Now, perhaps this is just growing pains for this new AM5 platform and my only recourse is to sit and wait until more updates are available for Unraid, BIOS, etc. But I figured it couldn't hurt to ask for a second pair of eyes. The attached log file should contain all of the crashes that have happened since I upgraded to AM5. Thank you very much for your time. cortana-diagnostics-20221022-0256.zip syslog-192.168.1.103.log.zip
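Since the post above locates the crash by the jump between "Oct 22 02:32:01" and "Oct 22 02:52:03", one way to find every such jump in a long syslog is to scan for silent stretches between consecutive timestamps. The following is only a sketch: `find_gaps`, the 10-minute threshold, and the sample message text are assumptions for illustration, and it assumes the classic `Mon D HH:MM:SS` syslog prefix with the year supplied by the caller.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the log went silent for min_gap or longer."""
    gaps = []
    previous = None
    for line in lines:
        parts = line.split(None, 3)  # month, day, time, rest of the message
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped syslog line
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


if __name__ == "__main__":
    # Hypothetical sample lines; only the two timestamps quoted in the post are real.
    sample = [
        "Oct 22 02:31:40 cortana kernel: example message",
        "Oct 22 02:32:01 cortana kernel: example message",
        "Oct 22 02:52:03 cortana kernel: example message",
    ]
    for before, after in find_gaps(sample, year=2022):
        print(f"silent from {before} to {after} ({after - before})")
```

A gap only shows where logging stopped and resumed. It does not tell a crash apart from a clean shutdown, so each hit still has to be checked against the lines just before it.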
  10. Were you able to get this working? I find myself needing to rip CDs in a Windows VM. I tried a year or so ago but never could get it working.
  11. I guess that makes at least 2 of us on 7000 series on unRAID. I was also having reboot issues but I think it had something to do with my GPU (1070). The only major issues in the log were saying that the PCI device had fallen off the bus. I removed it from the server and deleted the Nvidia driver yesterday evening and it hasn't rebooted since then, where previously it barely made it a few hours without crashing. I have no evidence of this, but I'm wondering if maybe there's some weirdness somewhere having to do with a PCI-e 3.0 device running on a 5.0 platform, plus unRAID which probably hasn't received much in the way of Ryzen 7000 and/or PCI-e 5.0 optimizations. I may try adjusting PCI-e settings in the BIOS. Not sure if this has anything to do with your issue, but since this was the only post from a Ryzen 7000 owner I could find on the forums, I figured I'd chime in since there seem to be very few of us at this point in time. Hopefully it's helpful to anyone else that finds their way in here.
  12. JorgeB, thank you so much! I ran the scrub and it showed that everything was corrected, zero uncorrectable. I've got the script in place to check for and notify of any errors hourly. At this point, is there anything else I need to do to get Docker and Libvirt working again or do I just need to redo those, now with the peace of mind that cache errors are corrected? I'm still getting "Service failed to start" for both services, so I tried rebooting and disabling/reenabling both but no change there. Syslog attached in case that's helpful. It's still showing some concerning messages following the reboot and service restart.

      Apr 19 15:51:40 Cortana emhttpd: shcmd (126): /usr/local/sbin/mount_image '/mnt/user/system/libvirt/libvirt.img' /etc/libvirt 1
      Apr 19 15:51:40 Cortana kernel: BTRFS: device fsid 50562063-4cd9-473d-85b3-520dc684d738 devid 1 transid 1440 /dev/loop2 scanned by udevd (7284)
      Apr 19 15:51:40 Cortana kernel: BTRFS info (device loop2): using free space tree
      Apr 19 15:51:40 Cortana kernel: BTRFS info (device loop2): has skinny extents
      Apr 19 15:51:40 Cortana kernel: BTRFS error (device loop2): bad tree block start, want 33767424 have 0
      Apr 19 15:51:40 Cortana kernel: BTRFS error (device loop2): bad tree block start, want 33767424 have 0
      Apr 19 15:51:40 Cortana kernel: BTRFS warning (device loop2): couldn't read tree root
      Apr 19 15:51:40 Cortana root: mount: /etc/libvirt: wrong fs type, bad option, bad superblock on /dev/loop2, missing codepage or helper program, or other error.
      Apr 19 15:51:40 Cortana kernel: BTRFS error (device loop2): open_ctree failed
      Apr 19 15:51:40 Cortana root: mount error
      Apr 19 15:51:40 Cortana emhttpd: shcmd (126): exit status: 1

      Thank you again for your help with this. syslog-192.168.1.103.zip
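The post above mentions an hourly script that checks the cache pool for errors but does not show it. As a rough illustration of what such a check can look at (this is not the script referred to in the post), the sketch below parses text in the `[device].counter value` layout that `btrfs device stats` prints and reports any non-zero counters. `nonzero_counters` is a made-up name, and the device names and counts in the sample are invented.

```python
import re

# Matches lines such as "[/dev/sdb1].corruption_errs  0".
STAT_LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_counters(stats_text: str) -> dict[tuple[str, str], int]:
    """Map (device, counter) -> value for every counter greater than zero."""
    problems = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            problems[(match["device"], match["counter"])] = int(match["value"])
    return problems


if __name__ == "__main__":
    # Invented sample output; real device names and counts will differ.
    sample = """\
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  12
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
"""
    for (device, counter), value in nonzero_counters(sample).items():
        print(f"{device}: {counter} = {value}")
```

To run it against a live pool, the text would come from `btrfs device stats` pointed at the pool's mount point. Which path that is (for example `/mnt/cache`) is an assumption about the setup. Sending a notification on a non-empty result is left out here.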
  13. Was there any indication of what the underlying issue might be in the files I provided? If there's anything else that I can provide to help diagnose the issue, please do let me know.
  14. Thanks. Will this help with the Libvirt (service failed to start) issue as well or just the Docker issue? Should I be worried about the abundance of issues reported in the filesystem check? I'm worried that something else is messed up which caused the Docker image to become corrupt and will cause me to have to go through this process again in the future. Or is it the corrupt Docker image that has caused a cascade of issues, both with the Libvirt service and in the filesystem check, so that redoing it will resolve all of the above-mentioned issues? Sorry for the barrage of questions.
  15. Hi. I shut down my unRAID server to install a new storage drive. Booted back up, cleared the disk, formatted. All good. Then I went to my Docker tab and was met with "Docker Service failed to start", likewise for the VM tab. I checked the syslog and found some concerning messages relating to BTRFS. No write errors reported on the cache disks. SMART test came back clean. Check filesystem status did not look good, however. I've attached the output of that, alongside the diagnostics file.

      My first route of troubleshooting was to undo everything I had done regarding adding a new storage drive. I'm pretty sure that didn't really have anything to do with the issue, but both Docker and Libvirt were working just fine immediately prior to adding the storage drive, so whatever. I went through the "Shrink array" process as documented on the Wiki, cleared the new storage drive and removed it from the array. Everything on the storage side is still fine, but nothing has changed with regards to Docker and Libvirt; they are still broken.

      From my Google searches, the general consensus seems to be that I'll need to basically redo the cache and all that stuff, but I wanted to get a second opinion before I go any further and either break more stuff or go through the headache of unnecessarily redoing all of my Docker containers and virtual machines, which would be quite the inconvenience. But, if that's what I gotta do then so be it.

      From what I read, it seems like this is likely to happen again if there's a hardware issue with the cache drives. I have two of them, which I thought was supposed to be for redundancy, but clearly that hasn't helped me in this situation, so hopefully if there is an issue it's just one of them and I can still use the second one. Though, if there was a hardware issue I would have expected to see more/different errors. But I don't know what I'm looking for exactly. I've checked all cables, swapped SATA cables, etc. to no effect. Thank you for your time.

      cortana-diagnostics-20220416-2007.zip btrfs check status.txt