l92 Posted January 25, 2020 (edited)
I keep having issues with devices in the cache pool disappearing from the system. This is a new server build from December. First I got a replacement M.2 NVMe drive to see if that would fix it, but it did not, and both of the other SSDs dropped multiple times. Next I had a replacement motherboard sent and just got it set up, and within 2 hours one of my SSDs dropped out again:
Jan 24 19:14:55 Media ntpd[1957]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized Jan 24 19:15:38 Media kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window] Jan 24 19:15:38 Media kernel: caller _nv000907rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs Jan 24 19:39:53 Media kernel: nvme nvme0: I/O 514 QID 4 timeout, aborting Jan 24 19:40:23 Media kernel: nvme nvme0: I/O 514 QID 4 timeout, reset controller Jan 24 19:40:53 Media kernel: nvme nvme0: I/O 22 QID 0 timeout, reset controller Jan 24 19:41:45 Media kernel: nvme nvme0: Device not ready; aborting reset Jan 24 19:41:45 Media kernel: nvme nvme0: Abort status: 0x7 Jan 24 19:41:57 Media nginx: 2020/01/24 19:41:57 [error] 6086#6086: *6391 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.0.103, server: , request: "POST /webGui/include/DashUpdate.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.0.7", referrer: "http://192.168.0.7/Dashboard" Jan 24 19:42:06 Media kernel: nvme nvme0: Device not ready; aborting reset Jan 24 19:42:06 Media kernel: nvme nvme0: Removing after probe failure status: -19 Jan 24 19:42:12 Media nginx: 2020/01/24 19:42:12 [error] 6086#6086: *6872 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.0.103, server: , request: "POST /webGui/include/DashUpdate.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", 
host: "192.168.0.7", referrer: "http://192.168.0.7/Dashboard" Jan 24 19:42:27 Media nginx: 2020/01/24 19:42:27 [error] 6086#6086: *6229 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.0.103, server: , request: "POST /webGui/include/DashUpdate.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.0.7", referrer: "http://192.168.0.7/Dashboard" Jan 24 19:42:27 Media kernel: nvme nvme0: Device not ready; aborting reset Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 104580456 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 97666840 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 96589704 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 97655216 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 136028264 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 97654272 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 97655440 
Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 104523792 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 96588568 Jan 24 19:42:27 Media kernel: print_req_error: I/O error, dev nvme0n1, sector 104555408 Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:27 Media kernel: nvme nvme0: failed to set APST feature (-19) Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 ### [PREVIOUS LINE REPEATED 2 TIMES] ### Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:27 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 ### [PREVIOUS 
LINE REPEATED 2 TIMES] ### Jan 24 19:42:27 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 ### [PREVIOUS LINE REPEATED 47 TIMES] ### Jan 24 19:42:32 Media kernel: btrfs_dev_stat_print_on_error: 1806 callbacks suppressed Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1759, rd 5, flush 53, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1760, rd 5, flush 53, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1761, rd 5, flush 53, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1761, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: btrfs_end_buffer_write_sync: 67 callbacks suppressed Jan 24 19:42:32 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1762, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1763, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1764, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1765, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:32 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1766, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1767, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1768, rd 5, flush 54, corrupt 0, gen 0 
Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1769, rd 5, flush 54, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1769, rd 5, flush 55, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1770, rd 5, flush 55, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1771, rd 5, flush 55, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1772, rd 5, flush 55, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1772, rd 5, flush 56, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1773, rd 5, flush 56, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1774, rd 5, flush 56, corrupt 0, gen 0 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost 
page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:37 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:37 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:39 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:39 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:41 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:41 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:42 Media kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 Jan 24 19:42:42 Media kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 3 Jan 24 19:42:42 Media kernel: btrfs_dev_stat_print_on_error: 42 callbacks suppressed
media-diagnostics-20200124-1948.zip
Edited January 25, 2020 by l92
l92 Posted January 25, 2020
After rebooting, adding the dropped disk back to the cache pool, and rebooting again, the other disk dropped out this time. I received the first missing-device notification at 7:43 PM and the second at 10:38 PM, approximately 2 hours after the first incident was resolved.
media-diagnostics-20200124-2243.zip
JorgeB Posted January 25, 2020
Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append":
nvme_core.default_ps_max_latency_us=0
Reboot and see if it makes a difference.
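After rebooting, it can help to confirm that the parameter actually reached the kernel. The following is a minimal Python sketch, not part of the original advice. It assumes the running command line is readable at /proc/cmdline and that the nvme_core module exposes its current value under /sys/module/nvme_core/parameters/; the function names and the sample boot strings in the usage note are illustrative only.

```python
import re

PARAM = "nvme_core.default_ps_max_latency_us"


def apst_latency_from_cmdline(cmdline):
    """Return the integer value of PARAM on a kernel command line, or None if absent.

    If the parameter appears more than once, the last occurrence is returned.
    """
    pattern = r"(?:^|\s)" + re.escape(PARAM) + r"=(\d+)(?=\s|$)"
    values = re.findall(pattern, cmdline)
    return int(values[-1]) if values else None


def read_text(path):
    """Read a small text file, returning None if it cannot be opened."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Value requested on the boot command line (None means it was not passed).
    cmdline = read_text("/proc/cmdline") or ""
    print("boot command line:", apst_latency_from_cmdline(cmdline))
    # Value the loaded module reports (assumed sysfs location, see lead-in).
    sysfs = "/sys/module/nvme_core/parameters/default_ps_max_latency_us"
    print("nvme_core module :", read_text(sysfs))
```

A value of 0 in both places suggests the setting was applied; None on the first line suggests the edit to the "append" line did not take effect on this boot.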
l92 Posted January 25, 2020
Thank you for the reply. I will report back with the results.
chaoskoala Posted January 25, 2020
To piggyback, I've suddenly started getting this problem too, but with SATA cache drives: one of my two cache drives just drops out at random.
JorgeB Posted January 26, 2020
13 hours ago, chaoskoala said: To piggyback, I've suddenly been getting this problem, but with SATA cache drives. Just randomly dropping one of two cache drives.
If they are SSDs it's most commonly a cable issue, unless they are M.2 devices, but without the diagnostics I can't say much more.
chaoskoala Posted January 28, 2020
This is what I'm seeing. This is my second set of SSDs, since at first I thought the SSDs were the problem. I've re-seated the cables and made sure that my cache drives are connected directly to the motherboard, not through the HBA. This is the second restart and I'm still getting this. The following is just a snippet of the disk log:
Jan 26 10:32:01 Starswirl emhttpd: import 30 cache device: (sdf) SPCC_Solid_State_Disk_A472079B020C00262699 Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): disk space caching is enabled Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): has skinny extents Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): bdev /dev/sdh1 errs: wr 263018, rd 23717, flush 2409, corrupt 0, gen 0 Jan 26 10:34:01 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73693298688 have 0 Jan 26 10:34:01 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73693331456 have 0 Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): enabling ssd optimizations Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): resizing devid 1 Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): new size for /dev/sdf1 is 128035643392 Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): resizing devid 2 Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): new size for /dev/sdh1 is 128035643392 Jan 26 10:34:12 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73693413376 wanted 69368 found 68134 Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693413376 (dev /dev/sdh1 sector 28417056) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693417472 (dev /dev/sdh1 sector 28417064) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693421568 (dev /dev/sdh1 sector 28417072) Jan 26 10:34:12 Starswirl kernel: BTRFS info 
(device sdf1): read error corrected: ino 0 off 73693425664 (dev /dev/sdh1 sector 28417080) Jan 26 10:34:12 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73693282304 have 0 Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693282304 (dev /dev/sdh1 sector 28416800) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693286400 (dev /dev/sdh1 sector 28416808) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693290496 (dev /dev/sdh1 sector 28416816) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693294592 (dev /dev/sdh1 sector 28416824) Jan 26 10:34:12 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73693315072 have 0 Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693315072 (dev /dev/sdh1 sector 28416864) Jan 26 10:34:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73693319168 (dev /dev/sdh1 sector 28416872) Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 176128 csum 0x8acff355 expected csum 0xb8b17d18 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73673310208 wanted 69339 found 68104 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 180224 csum 0xd11a6c45 expected csum 0xfe22cdb7 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 184320 csum 0x057f09ad expected csum 0x0702dba3 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 188416 csum 0xf937f2c0 expected csum 0x4dda44e4 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 176128 csum 0x8acff355 expected csum 0xb8b17d18 
mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 180224 csum 0xd11a6c45 expected csum 0xfe22cdb7 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 184320 csum 0x057f09ad expected csum 0x0702dba3 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 188416 csum 0xf937f2c0 expected csum 0x4dda44e4 mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 196608 csum 0x7fac8920 expected csum 0xf5a14b2a mirror 1 Jan 26 10:34:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 225280 csum 0xe58eeb77 expected csum 0x5323f039 mirror 1 Jan 26 10:34:13 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73675341824 wanted 69339 found 68104 Jan 26 10:34:13 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73675440128 have 0 Jan 26 10:34:13 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73690021888 wanted 69364 found 68131 Jan 26 10:34:13 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73692643328 have 0 Jan 26 10:34:16 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73925115904 wanted 67698 found 66117 Jan 26 10:34:16 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73924788224 wanted 67698 found 66113 Jan 26 10:36:12 Starswirl kernel: BTRFS error (device sdf1): bad tree block start, want 73675456512 have 0 Jan 26 10:36:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73675456512 (dev /dev/sdh1 sector 28381984) Jan 26 10:36:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73675460608 (dev /dev/sdh1 sector 28381992) Jan 26 10:36:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73675464704 (dev /dev/sdh1 
sector 28382000) Jan 26 10:36:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73675468800 (dev /dev/sdh1 sector 28382008) Jan 26 10:36:12 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157138944 csum 0xb4a334e4 expected csum 0xba584add mirror 1 Jan 26 10:36:12 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157138944 (dev /dev/sdh1 sector 2461520) Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157220864 csum 0xd944f41f expected csum 0x13533a17 mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157237248 csum 0x23f5ee5c expected csum 0x4e3b7d58 mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157290496 csum 0xded640ab expected csum 0x183e388b mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157274112 csum 0x7ce00715 expected csum 0x762b032e mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157278208 csum 0x50e30ace expected csum 0x070295be mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157253632 csum 0x2455a7a9 expected csum 0x627fc70a mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157249536 csum 0x746a5ce1 expected csum 0x5a44453c mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157220864 (dev /dev/sdh1 sector 25926976) Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157274112 csum 0x7ce00715 expected csum 0x762b032e mirror 1 Jan 26 10:55:27 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 157278208 csum 0x50e30ace expected csum 0x070295be mirror 1 Jan 26 10:55:27 Starswirl kernel: 
BTRFS info (device sdf1): read error corrected: ino 275 off 157237248 (dev /dev/sdh1 sector 25929592) Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157290496 (dev /dev/sdh1 sector 25697328) Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157249536 (dev /dev/sdh1 sector 25929616) Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157253632 (dev /dev/sdh1 sector 25925432) Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157274112 (dev /dev/sdh1 sector 25700232) Jan 26 10:55:27 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 157278208 (dev /dev/sdh1 sector 25700240) Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 60366848 csum 0x65eca540 expected csum 0x85b3eb5a mirror 2 Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 60366848 (dev /dev/sdh1 sector 30345872) Jan 26 17:09:13 Starswirl kernel: BTRFS error (device sdf1): parent transid verify failed on 73690988544 wanted 69368 found 68131 Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73690988544 (dev /dev/sdh1 sector 28412320) Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73690992640 (dev /dev/sdh1 sector 28412328) Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73690996736 (dev /dev/sdh1 sector 28412336) Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 0 off 73691000832 (dev /dev/sdh1 sector 28412344) Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 161538048 csum 0x81d538a7 expected csum 0xe92d2060 mirror 1 Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 
161538048 (dev /dev/sdh1 sector 2554376) Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 161529856 csum 0x98f94189 expected csum 0x5f2f3f20 mirror 1 Jan 26 17:09:13 Starswirl kernel: BTRFS info (device sdf1): read error corrected: ino 275 off 161529856 (dev /dev/sdh1 sector 2599296) Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 34754560 csum 0xcf367eb5 expected csum 0x97e1b66c mirror 2 Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 34770944 csum 0x0601253f expected csum 0x2ba6b50c mirror 2 Jan 26 17:09:13 Starswirl kernel: BTRFS warning (device sdf1): csum failed root 5 ino 275 off 34750464 csum 0x2b6854a2 expected csum 0x05cb278a mirror 2
starswirl-diagnostics-20200127-1737.zip
JorgeB Posted January 28, 2020
There are lots of read/write errors on cache2:
Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): bdev /dev/sdh1 errs: wr 263018, rd 23717, flush 2409, corrupt 0, gen 0
With SSDs this is usually a cable problem. Replace both cables and run a scrub; see here for more info.
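The "errs:" line quoted above is the one to watch after replacing the cables, since the counters should stop growing if the fix worked. As a rough illustration, the Python sketch below pulls the most recent counters per device out of a syslog excerpt. The regular expression is written against the log format shown in this thread, and the function name is hypothetical.

```python
import re

# Matches the per-device counter summary that BTRFS prints, as seen in this thread.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counters(log_text):
    """Return {device: {counter: value}} using the last summary seen for each device."""
    latest = {}
    for match in ERRS_RE.finditer(log_text):
        latest[match.group("dev")] = {name: int(match.group(name)) for name in FIELDS}
    return latest


if __name__ == "__main__":
    sample = (
        "Jan 26 10:34:01 Starswirl kernel: BTRFS info (device sdf1): "
        "bdev /dev/sdh1 errs: wr 263018, rd 23717, flush 2409, corrupt 0, gen 0"
    )
    for device, counters in latest_error_counters(sample).items():
        print(device, counters)
```

Comparing the result from two log excerpts taken a few hours apart shows whether a device is still accumulating write, read, or flush errors.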
l92 Posted January 30, 2020
Back to the original thread topic. I just noticed that I had a cache drive drop out again yesterday. The logs are attached. I would really appreciate any help fixing this, as it has been very frustrating swapping out all this hardware trying to resolve it.
media-diagnostics-20200130-1330.zip
l92 Posted January 30, 2020 (edited)
Even after 2 reboots the system would not let me assign the drive (it is not listed in the UI):
Jan 30 14:47:09 Media kernel: nvme nvme0: Device not ready; aborting initialisation Jan 30 14:47:09 Media kernel: nvme nvme0: Removing after probe failure status: -19
The lspci output shows that only one of the drives has loaded a driver:
01:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01) (prog-if 02 [NVM Express]) Subsystem: Realtek Semiconductor Co., Ltd. Device 5762 Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 54 NUMA node: 0 Region 0: Memory at f7e00000 (64-bit, non-prefetchable) [size=16K] Region 5: Memory at f7e04000 (32-bit, non-prefetchable) [size=8K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 512 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, 
TimeoutDis+, LTR+, OBFF Via message/WAKE# AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled AtomicOpsCtl: ReqEn- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable- Count=8 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00003000 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [148 v1] Device Serial Number 00-00-00-01-00-4c-e0-00 Capabilities: [158 v1] Secondary PCI Express <?> Capabilities: [178 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns Capabilities: [180 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=60us PortTPowerOnTime=60us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=32768ns L1SubCtl2: T_PwrOn=60us Kernel modules: nvme, rsnvme 04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01) (prog-if 02 [NVM Express]) Subsystem: Realtek Semiconductor Co., Ltd. 
Device 5762 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 32 NUMA node: 0 Region 0: Memory at f2100000 (64-bit, non-prefetchable) [size=16K] Region 5: Memory at f2104000 (32-bit, non-prefetchable) [size=8K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Via message/WAKE# AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled AtomicOpsCtl: ReqEn- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] 
MSI-X: Enable+ Count=8 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00003000 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [148 v1] Device Serial Number 00-00-00-01-00-4c-e0-00 Capabilities: [158 v1] Secondary PCI Express <?> Capabilities: [178 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns Capabilities: [180 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=60us PortTPowerOnTime=60us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=32768ns L1SubCtl2: T_PwrOn=60us Kernel driver in use: nvme Kernel modules: nvme, rsnvme
media-diagnostics-20200130-1457.zip
Edited January 30, 2020 by l92
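In the lspci output above, 04:00.0 has a "Kernel driver in use: nvme" entry while 01:00.0 lists only "Kernel modules". Spotting that difference by eye in a long dump is tedious, so here is a small Python sketch that does it. It assumes the usual lspci layout, where each device starts with its slot address in the first column and the detail lines are indented; the function name and the shortened sample are illustrative, not taken from the diagnostics.

```python
import re

# A device header starts in column 0 with a slot address such as "01:00.0"
# (optionally prefixed by a PCI domain such as "0000:").
SLOT_RE = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s")


def devices_without_driver(lspci_text):
    """Return the header lines of devices that have no 'Kernel driver in use' entry."""
    unbound = []
    header = None
    has_driver = False
    for line in lspci_text.splitlines():
        if SLOT_RE.match(line):
            # A new device begins: record the previous one if it had no driver.
            if header is not None and not has_driver:
                unbound.append(header)
            header = line.strip()
            has_driver = False
        elif "Kernel driver in use:" in line:
            has_driver = True
    if header is not None and not has_driver:
        unbound.append(header)
    return unbound


# Shortened sample in the same shape as the output posted above.
SAMPLE = """\
01:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01)
\tKernel modules: nvme, rsnvme
04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01)
\tKernel driver in use: nvme
\tKernel modules: nvme, rsnvme
"""

if __name__ == "__main__":
    for device in devices_without_driver(SAMPLE):
        print("no driver bound:", device)
```

The function only reports which devices lack a bound driver; it does not say why, so the kernel log around the probe failure is still needed for that.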
JorgeB Posted January 31, 2020
9 hours ago, l92 said: Back to the original thread topic. I just noticed that yesterday I had a cache drive drop out again. Attached are the logs. I would really appreciate any help to fix this as it has been very frustrating switching out all this hardware trying to resolve this media-diagnostics-20200130-1330.zip
If disabling power states didn't help, it's likely a hardware/BIOS problem, with either the board or the NVMe device.
l92 Posted February 3, 2020
I am not convinced that it is a problem with the hardware. Should I be seeing logs like this when I have set nvme_core.default_ps_max_latency_us=0?
Feb 2 00:04:04 Media kernel: nvme nvme0: failed to set APST feature (-19)
JorgeB Posted February 3, 2020
7 hours ago, l92 said: I am not convinced that it is a problem with the hardware.
I don't necessarily mean bad hardware; I mean some hardware/BIOS compatibility issue with Linux. But if disabling APST and/or updating the BIOS/firmware doesn't help, I don't know what else to suggest, except using different hardware.
l92 Posted February 7, 2020
What are the recommended NVMe drive brands? Samsung, WD, Crucial? Or are there any brands I should avoid? I appreciate any help. This has been quite frustrating, and I'd like to avoid it with the replacement drives I need to buy.
JorgeB Posted February 7, 2020
Samsung are probably the most common and usually work well. Avoid ADATA; many users have had trouble with those.