CLI shows disk offline but GUI shows disk fine


dexdiman
Solved by JorgeB


One of my disks keeps going offline, but it's not reflected in the GUI or the logs. The disk shows up fine in the GUI with a green indicator saying the device is active. It passes SMART, and the disk logs show no errors. If I open the CLI, type "mc", and navigate to /mnt/ to see the disks, the disk shows up red (?diskX), and if I browse the disk via the GUI it says "No listing: Too many files".

 

This has been happening a couple of times a day. If I stop and start the array, the disk comes back fine until it eventually goes "down" again. I first discovered this when Sonarr stopped downloading new episodes one day. When I investigated, the Sonarr logs were full of array read/write errors. NZBGet also has tons of read/write errors. Usually when a disk goes down, its data is emulated and at least still "visible", but this is totally different: the data is gone from the array until I stop and start the array and the disk starts working again.

 

I've tried using a different SAS cable (internal and external) and plugging it into a different port on my HBA card. My disks are in a JBOD.

[Screenshot: disk_cli.png]

[Screenshot: disk_gui.png]


Over the weekend I did some testing.

 

Replaced internal SAS cable - same problem

Replaced external SAS cable - same problem

Plugged into another HBA - same problem

Swapped with a hot spare - same problem

 

Currently, it's rebuilding the data onto the hot spare, and the issue is still happening. The GUI shows the drive rebuilding fine, but the CLI says the drive is offline.

[Screenshot: Coruscant Main page (Firefox), 2022-09-20 13:20:05]

[Screenshot: bash --login terminal (Coruscant), 2022-09-20 13:25:41]

[Screenshot: Coruscant Main page (Firefox), 2022-09-20 13:27:41]

[Attachment: coruscant-diagnostics-20220920-1328.zip]

6 hours ago, trurl said:

Those screenshots show the disk is invalid, because it is rebuilding. Diagnostics show disk7 is mounted and 78% full. In what way is it "offline"?

 

 

That's the problem. The GUI and logs all say the drive is perfectly fine and responding but...

 

If I try to view the contents of the drive via the GUI, I see this:

[Screenshot: Coruscant Browse page (Firefox), 2022-09-20 20:23:57]

 

If I try to access the drive via Windows Explorer, I can't:

[Screenshot: Windows "Network Error" dialog, 2022-09-20 20:25:12]

 

Everything on that drive is gone from the array (it isn't being emulated or anything). For example, my movies share is missing everything stored on disk7 while it's down.

 

Number of files when disk7 is working:

[Screenshot: movies share file count, 2022-09-20 20:31:41]

 

Number of files when disk7 is down:

[Screenshot: movies share file count, 2022-09-20 20:27:21]

 

I don't have any screenshots, but Sonarr, Radarr, and NZBGet all show read/write errors and stop working when disk7 goes down. Nextcloud has issues as well.

 

The only reliable way to get disk7 working again after it goes down is to stop and start the array. Then disk7 comes back and works for an unpredictable amount of time before it goes down again. An easy, reliable way to tell whether it has gone down is to open the CLI, type mc, and navigate to /mnt to see whether disk7 is red with a ? in front of it. Even the rebuild shows what I posted previously: the GUI and logs say disk7 is fine and being rebuilt, but the CLI shows a red ?disk7 and nothing on it is accessible.
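Checking in mc by hand works, but the same test could be scripted. The sketch below is a hypothetical helper (unreadable_disks is not an Unraid tool) that tries to list every path matching /mnt/disk* and reports the ones that fail. It assumes that a mount whose filesystem has failed raises an OSError (typically EIO) when read, which matches the "can't browse it" symptom described above.

```python
import glob
import os


def unreadable_disks(pattern="/mnt/disk*"):
    """Return {path: error} for every path matching pattern that cannot be listed.

    Assumption: a healthy disk mount lists normally, while a mount whose
    filesystem has failed raises OSError (typically EIO) when read.
    """
    bad = {}
    for path in sorted(glob.glob(pattern)):
        try:
            os.listdir(path)
        except OSError as exc:
            # Record the OS error text, e.g. "Input/output error".
            bad[path] = exc.strerror or str(exc)
    return bad
```

Run periodically, any non-empty result would flag a disk that has dropped out even while the GUI still shows it green.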

On 9/21/2022 at 1:12 AM, JorgeB said:
Sep 19 23:36:23 Coruscant kernel: XFS (md7): Internal error rec.ir_free != frec->ir_free || rec.ir_freecount != frec->ir_freecount at line 1558 of file fs/xfs/libxfs/xfs_ialloc.c.  Caller xfs_dialloc_ag_update_inobt+0xd0/0x121 [xfs]

 

Check filesystem on disk7.
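The kernel line quoted above was the real clue, even though nothing in the GUI flagged it. As a rough sketch (not an Unraid feature), a few lines of Python can scan syslog text for XFS messages like it and count them per device. Only the "Internal error" pattern comes from this thread; the extra "corruption" and "shut down" keywords are assumptions about other XFS messages worth catching.

```python
import re
from collections import Counter

# "Internal error" comes from the log line in this thread; the corruption and
# shutdown keywords are assumptions about other XFS messages worth flagging.
XFS_PROBLEM = re.compile(
    r"XFS \((?P<dev>[\w.-]+)\): .*?(?:Internal error|[Cc]orruption|[Ss]hut(?:ting)? ?down)"
)


def xfs_problem_devices(lines):
    """Count XFS error lines per device (e.g. md7) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XFS_PROBLEM.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

You could feed it the syslog from the diagnostics zip, or the live syslog on the server (for example /var/log/syslog, path assumed); any device it reports, such as md7 here, is a candidate for a filesystem check.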

I ran that on disk7, and today disk7 hasn't had any issues. I know I've run it before and it didn't fix the issue then, so whatever. I'm just glad it's fixed. Thanks.

1 hour ago, dexdiman said:

I ran that on disk7, and today disk7 hasn't had any issues. I know I've run it before and it didn't fix the issue then, so whatever. I'm just glad it's fixed. Thanks.

I wonder if perhaps last time you forgot to remove the -n option, so the disk got checked but not repaired?
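To illustrate the difference: xfs_repair's -n flag is a no-modify mode that only reports problems, so a check run with -n never fixes anything. The sketch below builds both command lines and runs a check-only pass, relying on the exit codes documented in the xfs_repair man page (in -n mode, 1 means corruption was detected). The device path /dev/md7 is an assumption taken from the "XFS (md7)" log line, and the filesystem must be unmounted (on Unraid, the array started in maintenance mode) before running either command.

```python
import subprocess


def xfs_repair_cmd(device, check_only=True):
    """Build an xfs_repair command line for an unmounted XFS device.

    check_only=True adds -n (no-modify mode): problems are reported, not fixed.
    check_only=False omits -n, which is the run that actually repairs.
    """
    cmd = ["xfs_repair"]
    if check_only:
        cmd.append("-n")
    cmd.append(device)
    return cmd


def corruption_found(device):
    """Run a no-modify check; in -n mode, exit status 1 means corruption was detected."""
    result = subprocess.run(xfs_repair_cmd(device, check_only=True))
    return result.returncode == 1


# Hypothetical usage (device path is an assumption):
#   if corruption_found("/dev/md7"):
#       subprocess.run(xfs_repair_cmd("/dev/md7", check_only=False), check=True)
```

The key point is the second call: only the run without -n changes anything on disk.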

On 9/22/2022 at 9:20 PM, itimpi said:

I wonder if perhaps last time you forgot to remove the -n option, so the disk got checked but not repaired?

Yeah, maybe. Who knows. I work in IT infrastructure and see this kind of thing all the time: a user tries a bunch of troubleshooting, they call me, I try the same things they did, and one of them fixes the issue. LOL.
