
I think my boot USB may be dying. Can someone confirm?



Hi guys, just wondering if someone can confirm my suspicion here.

 

I recently built a new Unraid NAS, and it's been running great for a few weeks now.

At least until 2 nights ago, when the system randomly became unresponsive.

I could still access mounted shares just fine, but the web UI, SSH, and my Docker apps were all unresponsive.

In the end, I had to do a hard reset. This prompted me to finally get remote syslog up and running, as well as Telegraf.

So when it happened again last night, I actually got some useful information.

 

This is what the syslog shows immediately before the lockup. (sdc is my boot USB)
 

May 21 20:53:07 Data kernel: sd 2:0:0:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
May 21 20:53:07 Data kernel: sd 2:0:0:0: [sdc] tag#0 Sense Key : 0x3 [current] 
May 21 20:53:07 Data kernel: sd 2:0:0:0: [sdc] tag#0 ASC=0x11 ASCQ=0x0 
May 21 20:53:07 Data kernel: sd 2:0:0:0: [sdc] tag#0 CDB: opcode=0x28 28 00 00 43 72 12 00 00 f0 00
May 21 20:53:07 Data kernel: critical medium error, dev sdc, sector 4420114 op 0x0:(READ) flags 0x84700 phys_seg 30 prio class 2
May 21 21:03:50 Data kernel: sd 2:0:0:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
May 21 21:03:50 Data kernel: sd 2:0:0:0: [sdc] tag#0 Sense Key : 0x3 [current] 
May 21 21:03:50 Data kernel: sd 2:0:0:0: [sdc] tag#0 ASC=0x11 ASCQ=0x0 
May 21 21:03:50 Data kernel: sd 2:0:0:0: [sdc] tag#0 CDB: opcode=0x28 28 00 00 47 6f 1a 00 00 f0 00
May 21 21:03:50 Data kernel: critical medium error, dev sdc, sector 4681498 op 0x0:(READ) flags 0x84700 phys_seg 30 prio class 2
May 21 21:08:19 Data kernel: sd 2:0:0:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
May 21 21:08:19 Data kernel: sd 2:0:0:0: [sdc] tag#0 Sense Key : 0x3 [current] 
May 21 21:08:19 Data kernel: sd 2:0:0:0: [sdc] tag#0 ASC=0x11 ASCQ=0x0 
May 21 21:08:19 Data kernel: sd 2:0:0:0: [sdc] tag#0 CDB: opcode=0x28 28 00 00 44 f2 ea 00 00 f0 00
May 21 21:08:19 Data kernel: critical medium error, dev sdc, sector 4518634 op 0x0:(READ) flags 0x84700 phys_seg 30 prio class 2
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: *387236 open socket #24 left in connection 11
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: *387297 open socket #25 left in connection 17
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: *387219 open socket #5 left in connection 20
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: *387245 open socket #6 left in connection 21
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: *387242 open socket #28 left in connection 23
May 21 21:37:04 Data nginx: 2023/05/21 21:37:04 [alert] 18639#18639: aborting
May 21 21:43:53 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:43:53 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:43:53 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:43:53 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:43:53 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:43:53 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:43:53 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:43:53 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:43:53 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:43:53 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:44:05 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:44:05 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:44:05 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:44:05 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:44:05 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:44:05 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:44:05 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:44:05 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
May 21 21:44:05 Data kernel: SQUASHFS error: xz decompression failed, data probably corrupt
May 21 21:44:05 Data kernel: SQUASHFS error: Failed to read block 0x807744: -5
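
For anyone reading these lines later, the sense data can be decoded by hand. Below is a minimal Python sketch, assuming the standard SCSI meanings (sense key 0x3 is MEDIUM ERROR, ASC/ASCQ 0x11/0x00 is an unrecovered read error, opcode 0x28 is READ(10)). The names `decode_read10` and `describe_sense` are made up for illustration and are not part of any tool.

```python
# Sketch only: decode the CDB and sense values printed by the kernel above.
# The lookup tables cover just the codes that appear in this log.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_read10(cdb_hex: str) -> dict:
    """Decode a 10-byte READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        # Bytes 2..5 hold the starting logical block address (big-endian).
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8 hold the transfer length in blocks (big-endian).
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


def describe_sense(key: int, asc: int, ascq: int) -> str:
    """Turn the numeric sense key and ASC/ASCQ pair into readable text."""
    key_text = SENSE_KEYS.get(key, f"sense key {key:#x}")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC {asc:#x} / ASCQ {ascq:#x}")
    return f"{key_text}: {detail}"


if __name__ == "__main__":
    print(decode_read10("28 00 00 43 72 12 00 00 f0 00"))
    print(describe_sense(0x3, 0x11, 0x0))
```

The decoded LBA of the first CDB is 4420114, the same sector the kernel names in the "critical medium error" line that follows it, so the device itself is reporting that it could not read those blocks.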

 

The Telegraf data seems to support this. The last thing it shows is a sharp spike in iowait.

 

I initially assumed this was caused by a Docker container I installed a few hours before the first lockup, but this data seems to point squarely at my boot USB.

The USB is a Cruzer Fit 16GB which worked flawlessly in my previous unraid NAS for several years. 

 

The first time this happened, the first thing I did was create a flash backup, so in the worst case I can recover. I'm just looking for a second opinion.

 

I have attached my syslog and diagnostics from immediately after the last lockup.

data-diagnostics-20230521-2249.zip syslog-10.0.0.133.log

2 minutes ago, itimpi said:

Not sure why you think that suggests a problem with the flash drive?   The snippet you posted refers to sdc whereas the flash drive is sda.   It COULD be the flash drive but until you get something pointing more directly to it that may be a hasty conclusion.

The boot drive is definitely sdc.


I currently have two other flash drives installed, which are using sda and sdb. One of those is my dummy array (I'm using a raidz2 pool as my main storage).
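
One way to settle which device letter is the boot flash is to look at what is mounted at /boot, which is where Unraid mounts the flash drive. Below is a minimal Python sketch that parses text in the /proc/mounts format. The function name `device_for_mountpoint` and the sample lines in the usage part are hypothetical and are not taken from the posted diagnostics.

```python
# Sketch only: find which block device backs a given mount point by parsing
# text in the /proc/mounts format (device, mount point, fs type, options...).
from typing import Optional


def device_for_mountpoint(mounts_text: str, mountpoint: str = "/boot") -> Optional[str]:
    """Return the device mounted at `mountpoint`, or None if nothing is."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            return fields[0]
    return None


if __name__ == "__main__":
    # On a live system, read the real table instead of a sample string.
    with open("/proc/mounts") as handle:
        print(device_for_mountpoint(handle.read()))
```

Device letters can change between boots when drives are added or removed, so a diagnostics file captured at a different time may not agree with the running system.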

On 5/22/2023 at 2:57 PM, itimpi said:

I was going by the diagnostics posted, which had a flash drive as sda, showed a 4TB drive as sdc, and listed no other flash drives. Not sure why they do not agree. If you are correct, it may well be the flash drive.

I think the sdc errors were a bit of a red herring. 

I have done some more investigation, and it looks like the issue is the last Docker container I added.
It seems to have a memory leak or something, because it slowly consumes more and more RAM until the system eventually locks up.
The thing that threw me off is Telegraf: it shows around a gig of RAM free, but apparently that's not the case.
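
For anyone chasing a similar slow leak, it can help to compare the kernel's own numbers directly. Below is a minimal Python sketch that parses /proc/meminfo text and reports MemFree next to MemAvailable (the kernel's estimate of memory usable by new processes without swapping) and Shmem. Which of these fields the Telegraf graph was plotting is not known from this thread, and the function names `parse_meminfo` and `summarize` are made up for illustration.

```python
# Sketch only: read the kernel's memory counters from /proc/meminfo text.
# Values in /proc/meminfo are reported in kB.
def parse_meminfo(text: str) -> dict:
    """Parse 'Key:   12345 kB' lines into a {key: value_in_kB} dict."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def summarize(values: dict) -> dict:
    """Convert the fields of interest to MiB for easier reading."""
    return {
        "free_mib": values["MemFree"] // 1024,
        "available_mib": values["MemAvailable"] // 1024,
        "shmem_mib": values.get("Shmem", 0) // 1024,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(summarize(parse_meminfo(handle.read())))
```

Logging this once a minute alongside per-container memory usage would show whether available memory is trending toward zero before a lockup.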
