Everything posted by poeterdebier
-
nvme cache drive failure
Hi Jorge, I removed all data from the cache (moved it to the array), then removed and formatted the NVMe drives and recreated the pool. Scrub gives 0 errors and the pool device stats are all at 0. The server is operational without issues. I will monitor it closely over the coming period. I also learned some things and will take a slightly different approach to backing up files/settings in the future. I really appreciated the help and support. regards Piet
-
nvme cache drive failure
Could recreating the pool be done by: (1) moving all data to the array, (2) removing both cache devices from the pool and formatting them, (3) adding both NVMe drives back as cache? Or would I then be recreating the errors?
-
nvme cache drive failure
Hi Jorge, I decided to focus on disk1, so I did a file system check of disk1. I got a 'dirty log' error. I went ahead with 'zero log' and this fixed the file system corruption on disk1. I started the array after that and disk1 mounted without issues. I did another scrub of the cache pool but this kept giving 2056 uncorrectable errors. As you suggested, I started replacing/deleting some of the files that the syslog mentioned, and this removed some of the uncorrectable errors in the cache pool. Not all, unfortunately: 664 remain. Also, I cannot identify them the same way as the ISOs (which were listed with a file name). So regarding the 664 uncorrectable errors I am stuck at the moment. I was considering repairing the cache pool under 'check filesystem status', but the help section was very specific that this should only be done on the advice of a community expert. So I only did the read-only check (but forgot to take a screenshot). tower-diagnostics-20260102-2227.zip
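For anyone following along, the "files that the syslog mentioned" step can be sketched as a small script. This is only a sketch: it assumes the scrub warnings end with a `(path: ...)` field, and both the `corrupt_paths` helper and the sample lines are made up for illustration, not taken from the diagnostics.

```python
import re

# Assumption: affected files show up in the syslog with a trailing "(path: ...)" field.
PATH_RE = re.compile(r"\(path: (?P<path>.+)\)\s*$")


def corrupt_paths(lines):
    """Return the unique paths named in "(path: ...)" fields, in first-seen order."""
    seen = {}
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            seen.setdefault(match.group("path"), None)
    return list(seen)


# Hypothetical sample lines for illustration only.
sample = [
    "kernel: BTRFS warning (device nvme0n1p1): checksum error (path: isos/example.iso)",
    "kernel: BTRFS warning (device nvme0n1p1): checksum error (path: isos/example.iso)",
    "kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 7368, gen 4",
]
print(corrupt_paths(sample))
```

Lines without a path field are simply skipped, so errors that the log does not tie to a file name will not appear in the result.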
-
nvme cache drive failure
Completely missed that one (the path part). Thanks. I have already removed some of the files and the errors went down. I will continue tomorrow. Thanks for the help so far, to be continued. Happy New Year 🎆
-
nvme cache drive failure
So I did a repair on disk1 through the webUI (Check Filesystem Status section). I had to 'zero log' due to 'dirty log detected'. After that the file system corruption was fixed, so disk1 is currently back online. I did a scrub of the cache pool and that is still the same (2056 uncorrectable errors). I also reset the pool device stats (I assume that this is what you mean by 'reset pool stats'). At the moment I do not know what to do with the uncorrectable errors. The scrub should fix them if there is a correct copy available (I guess there isn't right now). I looked into the syslog to try to see which files are corrupt so I can remove/replace them, but the only 'corrupt' message I can find is:
BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 7368, gen 4
Do you have an idea how to identify the files that are corrupt, to get this uncorrectable error count back to zero?
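As a side note, the error line quoted above can be split into its counters with a few lines of Python. This is a minimal sketch; `parse_counters` is a hypothetical helper name, and the regular expression is written only against the one line quoted in this post. As far as I understand, these are per-device totals, so they show how many errors were counted but not which files are affected.

```python
import re

# Pattern written against the single log line quoted in the post.
COUNTER_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_counters(line):
    """Return the device name and its error counters, or None if the line does not match."""
    match = COUNTER_RE.search(line)
    if match is None:
        return None
    counters = {key: int(val) for key, val in match.groupdict().items() if key != "dev"}
    counters["dev"] = match.group("dev")
    return counters


line = (
    "BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: "
    "wr 0, rd 0, flush 0, corrupt 7368, gen 4"
)
print(parse_counters(line))
```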
-
nvme cache drive failure
Ok, thanks for the info. Will check the syslog/corrupt files etc. as soon as possible. Probably not tonight 😄. Regarding the file system check: is that something you recommend doing before replacing disk1 entirely? Or do you suspect something else caused disk1 to become corrupt (the intermittent problem you mentioned)? The reason I ask is that you want to have disk1 up and running as soon as possible, right? If a second disk fails then I will be losing data for sure. I am still pretty inexperienced in all this, so sorry for asking so many questions.
-
nvme cache drive failure
I did a scrub of the cache pool as Jorge suggested, swapping the DIMMs. I removed one 8GB DIMM of RAM and restarted the server (no issues, except for disk1).
Scrub result (aft stick):
I stopped the server and removed the 2nd stick, reinserted the first stick, and ran a scrub of the cache pool again.
Scrub result (fwd stick):
Apart from the 2056 uncorrectable errors?! they look the same.
-
nvme cache drive failure
Sorry, don't want to spam. But as an additional question regarding rebuilding disk1: should I run xfs_repair on disk1 first?
Dec 30 20:33:41 Tower kernel: XFS (md1p1): Mounting V5 Filesystem 8c6a785e-215a-4193-b49b-741cfb95f8c2
Dec 30 20:33:42 Tower kernel: XFS (md1p1): Corruption warning: Metadata has LSN (40:2960824) ahead of current LSN (40:2958096). Please unmount and run xfs_repair (>= v4.3) to resolve.
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount/recovery failed: error -22
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount failed
Dec 30 20:33:42 Tower root: mount: /mnt/disk1: wrong fs type, bad option, bad superblock on /dev/md1p1, missing codepage or helper program, or other error.
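To make the log excerpt above easier to read, here is a small sketch that pulls out the device the kernel asks to run xfs_repair on, together with the two LSN values it reports. The helper name `repair_requests` is made up, and reading the LSNs as two-number pairs is just how they appear in the message, not something checked against XFS internals.

```python
import re

# Pattern written against the "Corruption warning" line quoted in the post.
WARN_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Corruption warning: Metadata has LSN "
    r"\((?P<m_hi>\d+):(?P<m_lo>\d+)\) ahead of current LSN "
    r"\((?P<c_hi>\d+):(?P<c_lo>\d+)\)"
)


def repair_requests(lines):
    """Return one record per corruption warning: device plus both LSN pairs."""
    found = []
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            found.append({
                "dev": match.group("dev"),
                "metadata_lsn": (int(match.group("m_hi")), int(match.group("m_lo"))),
                "current_lsn": (int(match.group("c_hi")), int(match.group("c_lo"))),
            })
    return found


log = [
    "Dec 30 20:33:42 Tower kernel: XFS (md1p1): Corruption warning: Metadata has LSN "
    "(40:2960824) ahead of current LSN (40:2958096). Please unmount and run xfs_repair "
    "(>= v4.3) to resolve.",
    "Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount failed",
]
for record in repair_requests(log):
    print(record)
```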
-
nvme cache drive failure
Ok, that means scrubbing the cache before touching disk1. Could that possibly mean that disk1 will be mountable again? For example, can a memory error cause this 'unmountable: wrong or no file system'?
-
nvme cache drive failure
I have ordered a new HDD to replace disk 1 (if necessary). The disks were all bought around the same time (they are 5+ years old) and disk 2 failed in July 2025. I mean it started giving errors around that time. I am not sure exactly which errors anymore, as I just replaced the disk and rebuilt parity without issues. There was no reason (at that time) to start looking for other problems. If there is something else to do before replacing disk 1, please let me know. It makes sense to me to replace this disk before continuing, but I am not a community expert 😄 so yeah, what do I know. I am considering purchasing an extra NVMe just in case one of the current NVMe drives fails. Would you recommend adding it to the pool beforehand, or just waiting until an NVMe fails and then replacing it? So the first steps, unless someone has a better approach: (1) replace disk1 and rebuild, (2) run a parity check, (3) scrub with removed DIMMs as suggested by Jorge to see if there is a memory issue.
-
nvme cache drive failure
What exactly do you mean by file system repairs? Replacing disk1 or the NVMe?
-
nvme cache drive failure
I purchased the second SSD (1n1) a couple of years later than the 100% used one (0n1), so that likely explains the difference. I am not sure exactly when I purchased it, but it will be around 2024/2025, so around 4+ years younger. So, regardless of whether the error is coming from the memory or the NVMe, it looks like it is sensible to replace the NVMe with a new one (at least the 100% used one)? What would be the best way to keep an eye on this? A Docker app called Scrutiny, or just having a look at it periodically? I never received an error/warning/suggestion about the NVMe reaching its (calculated) lifespan. I will try this at the next opportunity just to make sure. I hope it is not the memory though, talk about a bad time to have your memory go bad. With the server now up and running (maybe temporarily), what would be your suggestions for backing up and data loss prevention? So far I have done the following: appdata backup and flash backup. Update: As I was typing this I noticed that I got a new error. I did not see that error before (related to this issue, I mean). I had a drive fail on me last year, but that has since been replaced by the Seagate. tower-diagnostics-20251230-1931.zip
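On the question of keeping an eye on wear, a periodic check could look roughly like the sketch below. It assumes the drive report is available as JSON (for example from `smartctl -j -a <device>`) and that the wear figure sits under `nvme_smart_health_information_log` / `percentage_used`; that layout is an assumption on my part. The `nvme_wear` helper, the 80% threshold and the sample report are hypothetical.

```python
import json


def nvme_wear(report, warn_at=80):
    """Return the reported wear percentage and whether it crosses the warning threshold.

    Assumed layout: report["nvme_smart_health_information_log"]["percentage_used"].
    Returns None when the field is missing.
    """
    health = report.get("nvme_smart_health_information_log", {})
    used = health.get("percentage_used")
    if used is None:
        return None
    return {"percentage_used": used, "warn": used >= warn_at}


# Hypothetical, trimmed-down report for a drive at 100% used.
sample = json.loads('{"nvme_smart_health_information_log": {"percentage_used": 100}}')
print(nvme_wear(sample))
```

Run from a scheduled job, a check like this could send a notification well before the drive reaches 100% used, which is the warning that was missing here.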
-
nvme cache drive failure
4 passes, no errors on Memtest. Restarted the server. The Docker image is up and running again. Scrubbed nvme0n1, see attached screenshot. Also, common problems found: Do I need to scrub the other NVMe as well? I have not yet started with the assistance mentioned in the 'suggested fix' from the common problems plugin. I would like to continue with community support first, which has been excellent. tower-diagnostics-20251230-1644.zip
-
nvme cache drive failure
Ok, have two passes done. No errors. Will let it run for a couple of hours more as I am going out of the house any way. Will keep you posted.
-
nvme cache drive failure
Will do straight away, thanks for the fast feedback. Just an extra question: let's say that some memory is bad, would that mean that the pool cannot be saved/repaired? That I have data loss? gr Piet.
-
nvme cache drive failure
Hi All, I was wondering if somebody could give me a hand with solving the errors I have with my server. It looks like one of my NVMe cache drives 'suddenly' failed. I am a bit at a loss how to proceed at this point. Any help would be much appreciated. I do have 2 cache drives (backup). I first noticed this because the Docker image failed to load (i.e. Sonarr was not working). I am also wondering (if it is indeed the NVMe drive) how to avoid this in the future, or at least be notified before failure, if this is possible of course. I started wondering about this because the SSD endurance is at 0%, but this is not something that I look at frequently. regards Pieter tower-diagnostics-20251230-1040.zip
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Hi Jorge, that is the syslog from just before I replaced the USB. I wanted to compare it with the syslog at this moment (19-12-2024) but I noticed that the syslog was disabled again, probably because the flash backup was from before I set up the syslog server. I enabled it again and will keep an eye on it in case I get errors again. I attached the diagnostics just in case (these are the diagnostics after the restart of the server with a new flash drive). gr Poet tower-diagnostics-20241219-1510.zip
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
So the server crashed again. At least this time it said that I had a corrupted flash drive, so no more searching. See attached syslog. I rebooted the server with a new USB, although this was just a generic 8GB USB that I had lying around. I know for sure it was not used before (so basically new). I guess this is a topic for somewhere else, but I do not understand why my USB sticks keep failing. I must have replaced around 6 of them over the past 4 years. Is it some setting that I got wrong? Something to do with the fact that I use a Ryzen? The last USB was a Verbatim 8GB - USB 2.0. Some tips (links) for this issue would be greatly appreciated. Is it a good idea to keep this syslog server running? Or is this just extra wear and tear on the cache? gr Poet syslog-192.168.2.10.log
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Hi, I found the 'power supply idle control' and set it to 'typical current idle', as mentioned in the link you provided in the last post. Let's see if this improves the situation. Does it make any difference if I postpone the parity check for now? gr Poet
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Memory test I did yesterday after restart. That passed without problems. Will take a look at the link you provided. Thanks. Will post if something happens again. I have to note that this is a 'new' occurrence. The server was running fine for quite some time in the past. gr Poet
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
It happened again sometime last night. Complete freeze of the system. No ping, no SSH. Nothing to see on the locally attached screen (frozen on the Unraid web GUI). Attached the syslog from the syslog server and the latest diagnostics. gr Poet syslog-192.168.2.10.log tower-diagnostics-20241217-0904.zip
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Will do. Thanks.
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
I decided to do a power cycle to get things working again. After the server was up I downloaded the diagnostics again. I also set up the syslog server, although I do not have access to a second server. I set up the syslog as described by Frank1940, using the third option. So if I need to do a power cycle again I will at least have something to look at. Attached the latest diagnostics. tower-diagnostics-20241216-1052.zip gr Poet
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Good morning, I just checked. The monitor connected to the server is completely black (you see the "_" in the top left corner). No response. I can still ping the server. I still get a connection refused error when I try:
SSH [email protected]
ssh: connect to host 192.168.2.10 port 22: Connection refused
I do not get the option to enter a password like you normally do.
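The difference between "connection refused" and no answer at all can be checked with a short probe script. This is a minimal sketch using only the Python standard library; `probe` is a made-up helper name. A "refused" result means the host itself answered but nothing is listening on that port, which matches a server that still pings while SSH is down.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no service accepted the connection on this port.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: host down, frozen, or traffic filtered.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call, using the address from the post:
#   probe("192.168.2.10", 22)
```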
-
Unraid getting stuck (no SSH, no GUI, dockers are working)
Hi Guys, I have been putting off posting a message for some time now because every time I thought it would be OK. But actually things are getting worse now. Occasionally the Unraid web GUI stops working. The dockers are usually not working anymore when that happens. Unfortunately I did not properly write down what happened and when. So I thought, because I am experiencing problems again at this very moment, let's get some help. This issue has happened a couple of times now, but every time the issue would resolve by itself after 1 or 2 days. Last week, however, the server got completely stuck (no ping, no SSH, no GUI etc.) so I did a hard reset. After that I took diagnostics (20241210), which I will post. These are not the diagnostics of the problem that is ongoing at this very moment. I just wanted to share something already because I think the issues are related. Status at this very moment (2024-12-15 20:00): dockers are working; web GUI is not working; shares are not accessible (SMB), but can be accessed through the Krusader docker; SSH is not working (getting a port 22: connection refused); I cannot get diagnostics because SSH and GUI are not working. I will post when I have access to them again, and tomorrow I will have access to the monitor attached to the server. If I remember correctly I started up with GUI last time. From past experience this 'issue' should disappear automatically. Thanks in advance for any help. This is driving me nuts, and because I am a sailor I do not always have (physical) access to my server. Of course I could ask the wife, but better not. gr Poet tower-diagnostics-20241210-0902.zip