-
Cache disk (pool member) going bad?
Thank you for this suggestion. I will try it and see if it helps. I've had this disk in my server for about 3 months and have never seen this issue. Is there any reason why it would start rearing its head now? The only things I've done differently with this machine recently (in the past week) are upgrading Unraid from 6.9 to 6.11 and playing around with VMs with GPU passthrough. I wouldn't think those would cause a drive to have power state issues, but I know that things that seem unrelated sometimes affect one another. Thank you for your help.
-
-
Cache disk (pool member) going bad?
I have a cache pool consisting of two SSDs:
ADATA SU760 (SATA 256GB)
ADATA SX6000PNP (NVMe 256GB)
and I'm wondering if my NVMe disk (the ADATA SX6000PNP) is about to go bad. Below is a summary of recent events.
I recently replaced my GPU (about 3 days ago) because of the loud idle fan noise of an AMD FirePro v3900 (I switched to an AMD/ATI Radeon HD 4550), booted my system, and continued with my day. The day after, I noticed my docker containers were all stopped, so without thinking I started them back up. Later I looked through my logs to determine why they had stopped, since the exit times listed on the containers showed them all stopping at about 4AM. It turned out this was due to my CA Backup / Restore plugin doing a backup of my appdata. Typically my appdata backups take about 30 minutes, but in this case the backup lasted 15.5 hours, ending at 7:27PM.
Once the backup job finally ended, I checked my system logs. About 1 hour and 15 minutes before the backup job finished (6:12PM), I could see the following entries:
Oct 3 18:12:52 Tower kernel: nvme nvme0: I/O 22 QID 4 timeout, aborting
Oct 3 18:12:52 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 18:13:51 Tower kernel: nvme nvme0: I/O 244 QID 6 timeout, aborting
Oct 3 18:13:51 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 18:13:53 Tower kernel: nvme nvme0: I/O 269 QID 2 timeout, aborting
Oct 3 18:13:53 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 18:20:51 Tower webGUI: Successful login user root from 192.168.4.121
Oct 3 18:34:18 Tower kernel: nvme nvme0: I/O 295 QID 2 timeout, aborting
Oct 3 18:34:18 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 18:35:09 Tower kernel: nvme nvme0: I/O 804 QID 1 timeout, aborting
Oct 3 18:35:09 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 18:43:36 Tower ool www[23554]: Successful logout user root from 192.168.4.121
Oct 3 19:25:11 Tower kernel: nvme nvme0: I/O 167 QID 7 timeout, aborting
Oct 3 19:25:11 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 19:25:20 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, aborting
Oct 3 19:25:20 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 19:25:42 Tower kernel: nvme nvme0: I/O 170 QID 7 timeout, aborting
Oct 3 19:25:42 Tower kernel: nvme nvme0: Abort status: 0x0
Oct 3 19:25:50 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, reset controller
Oct 3 19:26:11 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Oct 3 19:26:14 Tower kernel: nvme nvme0: 7/0/0 default/read/poll queues
Oct 3 19:26:44 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, disable controller
Oct 3 19:26:44 Tower kernel: I/O error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Oct 3 19:26:44 Tower kernel: I/O error, dev nvme0n1, sector 305988224 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: nvme nvme0: failed to mark controller live state
Oct 3 19:26:44 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Oct 3 19:26:44 Tower kernel: nvme0n1: detected capacity change from 500118192 to 0
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 4, rd 0, flush 1, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 5, rd 0, flush 1, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 6, rd 0, flush 1, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 6, rd 0, flush 2, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 7, rd 0, flush 2, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 8, rd 0, flush 2, corrupt 0, gen 0
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct 3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
The BTRFS errors kept coming, and eventually I was seeing lots of entries like the following (repeating over and over):
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1d0 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1d8 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1e0 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1e8 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1f0 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1f8 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee200 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee208 len 4096 err no 10
Oct 3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee210 len 4096 err no 10
At this point I decided to reboot the system. When it came back up, I noticed the ADATA SX6000PNP NVMe disk had a red status (I think it was unavailable) and the option at the bottom to start the array was greyed out, but I could check a box and bring the array back up without the NVMe disk in the pool.
I went ahead and chose the option to start it without the NVMe disk in the pool, and then I could see the following in my logs (repeating over and over):
Oct 3 20:59:35 Tower kernel: BTRFS info (device sdc1): found 7418 extents, stage: move data extents
Oct 3 20:59:36 Tower kernel: BTRFS info (device sdc1): found 7418 extents, stage: update data pointers
Oct 3 20:59:37 Tower kernel: BTRFS info (device sdc1): relocating block group 224467615744 flags data|raid1
Oct 3 21:00:49 Tower kernel: BTRFS info (device sdc1): found 7196 extents, stage: move data extents
Oct 3 21:00:51 Tower kernel: BTRFS info (device sdc1): found 7196 extents, stage: update data pointers
Oct 3 21:00:52 Tower kernel: BTRFS info (device sdc1): relocating block group 223393873920 flags data|raid1
Oct 3 21:01:58 Tower kernel: BTRFS info (device sdc1): found 7243 extents, stage: move data extents
Oct 3 21:02:00 Tower kernel: BTRFS info (device sdc1): found 7243 extents, stage: update data pointers
I let it run all night (in the logs I can see the BTRFS process ran until about 10:42, and I got a cache disk overheated alert for the ADATA SU760 SATA disk at one point) and this morning the logs seemed stable - of course my NVMe disk did not appear in Unraid anywhere. At this point I went ahead and uploaded my hardware profile in case there is anything to see there, and then I shut down my server.
I then started my server back up and went into the BIOS, where I could see the ADATA SX6000PNP NVMe disk was detected. I exited the BIOS and booted Unraid, and now the NVMe disk was present in Unraid, but this time it showed a blue status (in the pool section) with the message "All existing data on this device will be OVERWRITTEN when array is Started". I went ahead and started the array, and it once again shows the stop option greyed out and says "Disabled -- BTRFS operation is running". My assumption here is that it is rebuilding the cache pool.
Does all this indicate my ADATA SX6000PNP NVMe disk is about to go bad and that I should replace it? Or is this just an instance where the file system or partition table on the disk became corrupted for some reason (I think I read in another post that the BTRFS file system isn't the most stable)? Also, at this point I went ahead and uploaded another hardware profile.
Below are some more of my system specs:
CPU: AMD Ryzen 7 1700X
Mobo: B450 MSI Gaming Plus Max
RAM: 16GB of DDR4 (G.SKILL Ripjaws V Series 16GB, 2 x 8GB DDR4 3200)
GPU: AMD/ATI Radeon HD 4550 (was recently an AMD FirePro v3900, but I switched due to fan noise)
Thanks for any help anyone can provide. I can upload syslogs or other data if need be; just let me know if I need to scrub it for anything first. Again, thank you.
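For anyone wading through a syslog like the one above, a small script can condense it before posting. The following is only a minimal sketch in Python: the function name `summarize` and the `SAMPLE` excerpt are illustrative choices (not part of Unraid or any plugin), and the regular expressions assume the message formats exactly as quoted in this post.

```python
import re
from collections import Counter

# Patterns assume the kernel message formats quoted in the post above.
NVME_TIMEOUT = re.compile(r"nvme (nvme\d+): I/O \d+ QID (\d+) timeout, (.+)$")
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device \S+\): bdev (\S+) errs: "
    r"wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def summarize(lines):
    """Count NVMe timeouts per (controller, action) and keep the most
    recent BTRFS error counters seen for each block device."""
    timeouts = Counter()
    btrfs = {}
    for line in lines:
        line = line.strip()
        m = NVME_TIMEOUT.search(line)
        if m:
            timeouts[(m.group(1), m.group(3))] += 1
            continue
        m = BTRFS_ERRS.search(line)
        if m:
            values = [int(v) for v in m.groups()[1:]]
            btrfs[m.group(1)] = dict(zip(COUNTER_NAMES, values))
    return timeouts, btrfs


# A few lines copied from the log excerpt in this post.
SAMPLE = """\
Oct 3 19:25:42 Tower kernel: nvme nvme0: I/O 170 QID 7 timeout, aborting
Oct 3 19:25:50 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, reset controller
Oct 3 19:26:44 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, disable controller
Oct 3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 8, rd 0, flush 2, corrupt 0, gen 0
""".splitlines()

if __name__ == "__main__":
    timeouts, btrfs = summarize(SAMPLE)
    for (ctrl, action), count in sorted(timeouts.items()):
        print(f"{ctrl}: {count} x {action}")
    for dev, counters in btrfs.items():
        print(dev, counters)
```

To run it against a full log, pass an open file instead of `SAMPLE`, for example `summarize(open("syslog.txt"))`, where `syslog.txt` is a placeholder for wherever the log was saved. The summary only describes what was logged; it cannot tell a failing disk from a power-state or controller problem.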
-
Any way to power down GPU when not in use by VM?
Like the title says - is there any way to power down a GPU when it's not in use by a VM? (It has a small, annoying, noisy fan on my low profile card.)
My system specs:
CPU: AMD Ryzen 7 1700X
Mobo: B450 MSI Gaming Plus Max
RAM: 16GB of DDR4 (G.SKILL Ripjaws V Series 16GB, 2 x 8GB DDR4 3200)
GPU: AMD FirePro v3900
I recently upgraded from Unraid 6.9 to 6.11 and noticed my GPU never really turns off when my VM (with GPU passthrough) is stopped. The GPU will idle, but I can still hear its fan. I will admit this probably seems like a strange post, but my GPU is a low profile card (AMD FirePro v3900) and the small fan is a little annoying when idling. I sit next to the server all day, so I'm wondering if there is a way to power the GPU down when it's not in use. I have considered running my server headless, but it's nice to be able to test VMs and access the BIOS every once in a while without having to shut everything down and install the GPU.
What's strange is that when I was on v6.9 and booted up my server, the GPU would spin up and be very noisy. I could start my VM (configured with passthrough) and then force stop it, and the GPU would essentially shut itself down. I will admit the behavior I saw on v6.9 was probably a bug, as I was not able to get GPU passthrough to work properly. I figured that was because the drivers were not part of the kernel on v6.9, and my plan was to get a ROM for the GPU and see if that helped. Unfortunately, I never had the time to do that, so I'm not sure if it would have.
With v6.11, GPU passthrough works with my card (yay!), but when I stop my VM the GPU goes into an idle mode (with annoying fan noise), whereas in v6.9 it would power off completely. I will admit the working GPU passthrough and idling for this card in v6.11 are probably behaving as intended, and some bugs were probably worked out along the way - I appreciate all the work the devs do with Unraid.
But, is there a way to power off a GPU completely while the server is running and then power it back on when needed? (Maybe a command I can run?) Thank you for such a great product and all the help I've picked up from this forum. It is much appreciated!
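One avenue that might be worth testing is the Linux kernel's generic PCI sysfs interface, which lets root detach a device and later rescan the bus. The following is only a sketch in Python, under loud assumptions: the address `0000:0a:00.0` is hypothetical (real addresses can be listed with `lspci -D`), the helper names are made up for illustration, and detaching a device from the kernel does not guarantee that the slot loses power or that the fan stops. It defaults to a dry run that only prints what it would write.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")
GPU_ADDRESS = "0000:0a:00.0"  # hypothetical address; list real ones with `lspci -D`


def remove_path(address, root=SYSFS_PCI):
    """Sysfs file that detaches one PCI device from the kernel when '1' is written."""
    return root / "devices" / address / "remove"


def rescan_path(root=SYSFS_PCI):
    """Sysfs file that makes the kernel rescan the PCI bus when '1' is written."""
    return root / "rescan"


def write_one(target, dry_run=True):
    """Write '1' to a sysfs file, or only describe the write when dry_run is set."""
    if dry_run:
        return f"would write 1 to {target}"
    target.write_text("1")  # requires root privileges
    return f"wrote 1 to {target}"


if __name__ == "__main__":
    # Dry run by default: nothing is changed on the system.
    print(write_one(remove_path(GPU_ADDRESS)))  # detach the GPU
    print(write_one(rescan_path()))             # bring it back before starting the VM
```

If this were tried for real (with `dry_run=False`), the VM using the card would need to be stopped first, and the rescan would need to run before the VM is started again. Whether an older card like the FirePro v3900 actually quiets down after being detached is something only a test on the hardware can show.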
-
Upgraded CPU, mobo, RAM and Unraid (v. 6.9.2) is now crashing
I have been using Unraid since December 2021 on an old 1st gen Intel Core i7 (4c/8t) in an old Dell Precision T1500 with 12GB of DDR3 RAM. I never experienced a crash while using this hardware for Unraid.
I recently upgraded to an AMD Ryzen 7 1700X in a B450 MSI Gaming Plus Max with 16GB of DDR4 (G.SKILL Ripjaws V Series 16GB, 2 x 8GB DDR4 3200) and have had 2 crashes in under 24 hours. I am not doing any overclocking and do not have the XMP profile enabled, as I've read that you should never do that with Unraid.
After the first crash I enabled syslogging, but after the second crash there was nothing of relevance in the logs, as the last entry was from several hours before the crash happened. Both crashes happened after an extended period of idling.
I've been reading other posts here, and it sounds like the general consensus for 1st gen Ryzen processors (and sometimes 2nd and 3rd gen) is to disable C-states globally in the BIOS. Is this still the solution for this problem? My only concern with disabling C-states globally is that my processor would consume significantly more power, but maybe I'm wrong about that.
Should I disable C-states globally to resolve this issue? If I do, will it cause my processor to consume way more power? Thank you for any help.
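Before changing the BIOS setting, it may help to see which idle states the kernel is actually using. The sketch below, in Python, reads the standard Linux cpuidle sysfs files for one core; it assumes a cpuidle driver is loaded (otherwise the directory does not exist and the list comes back empty), and the function name `read_idle_states` is just an illustrative choice.

```python
from pathlib import Path

# Standard Linux cpuidle location for the first core; present only when
# a cpuidle driver is loaded.
CPUIDLE_ROOT = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def read_idle_states(root=CPUIDLE_ROOT):
    """Return one dict per idle state with its name, entry count, and
    total residency in microseconds, ordered by state index."""
    root = Path(root)
    if not root.is_dir():
        return []

    def read(path, default):
        return path.read_text().strip() if path.exists() else default

    states = []
    state_dirs = sorted(root.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in state_dirs:
        states.append({
            "state": state_dir.name,
            "name": read(state_dir / "name", ""),
            "usage": int(read(state_dir / "usage", "0")),
            "time_us": int(read(state_dir / "time", "0")),
        })
    return states


if __name__ == "__main__":
    for s in read_idle_states():
        print(f'{s["state"]:8} {s["name"]:10} entered {s["usage"]} times, '
              f'{s["time_us"] / 1e6:.1f} s total')
```

Comparing the output after a long idle period, before and after the BIOS change, gives a rough idea of how much time the cores spent in deeper states. It does not measure power draw; a wall-socket power meter would be needed for that.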
impact-trombone