SMART test won't run and General Protection Faults


Solved by JorgeB


I was in the middle of a scheduled parity check when, 66% of the way through, it encountered 2112 errors on one of the array drives and paused. I ran a short SMART test, which the drive passed, then tried several times to run an extended test. I may have accidentally cancelled it once, but on the next attempt I saw it reach 50%, and afterwards it looked as if it had never been run.

 

Right now, if I click to run a SMART test, nothing happens and nothing is shown in the log. Additionally, the drive with the red X is also displayed under Unassigned Devices.

 

When I check the logs, I see loads of errors like the ones below:

Mar  7 21:03:37 Tower kernel: traps: lsof[7196] general protection fault ip:14fe13e634ee sp:6690f0b91e1b046c error:0 in libc-2.36.so[14fe13e4b000+16b000]
Mar  7 21:12:23 Tower kernel: traps: lsof[29820] general protection fault ip:151d585114ee sp:86baf5ee546675bc error:0 in libc-2.36.so[151d584f9000+16b000]
Mar  7 21:24:35 Tower kernel: traps: lsof[29136] general protection fault ip:152f3c6314ee sp:152a85e9f914a6b9 error:0 in libc-2.36.so[152f3c619000+16b000]
Mar  7 21:41:46 Tower kernel: traps: lsof[3668] general protection fault ip:14fd3f0824ee sp:780278895a35ec7 error:0 in libc-2.36.so[14fd3f06a000+16b000]
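For anyone comparing these lines, a minimal Python sketch of how the fault location can be read out of each one. It assumes the bracketed part after the library name is the mapping's base address and size, so that `ip` minus the base gives the offset inside libc. The names `TRAP_RE`, `fault_offset` and `LOG_LINES` are made up for this illustration.

```python
import re

# Matches kernel "traps:" lines in the form shown in the excerpts above.
TRAP_RE = re.compile(
    r"traps: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\] general protection fault "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<err>[0-9a-f]+) "
    r"in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (process, module, offset of ip inside the module), or None."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    # Assumption: the value before '+' is the load address of the mapping.
    offset = int(m["ip"], 16) - int(m["base"], 16)
    return m["proc"], m["module"], offset

# The four lines quoted in this post, copied verbatim.
LOG_LINES = [
    "Mar  7 21:03:37 Tower kernel: traps: lsof[7196] general protection fault ip:14fe13e634ee sp:6690f0b91e1b046c error:0 in libc-2.36.so[14fe13e4b000+16b000]",
    "Mar  7 21:12:23 Tower kernel: traps: lsof[29820] general protection fault ip:151d585114ee sp:86baf5ee546675bc error:0 in libc-2.36.so[151d584f9000+16b000]",
    "Mar  7 21:24:35 Tower kernel: traps: lsof[29136] general protection fault ip:152f3c6314ee sp:152a85e9f914a6b9 error:0 in libc-2.36.so[152f3c619000+16b000]",
    "Mar  7 21:41:46 Tower kernel: traps: lsof[3668] general protection fault ip:14fd3f0824ee sp:780278895a35ec7 error:0 in libc-2.36.so[14fd3f06a000+16b000]",
]

for line in LOG_LINES:
    proc, module, offset = fault_offset(line)
    print(f"{proc} faulted in {module} at offset {offset:#x}")
```

Under that assumption, all four lines resolve to the same offset (0x184ee), so lsof appears to fault at the same place in libc each time and the differing absolute addresses only reflect where the library was loaded. That does not by itself say what is causing the fault.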

 

Any thoughts on what might be causing any of this? Attached are my diagnostics.

tower-diagnostics-20230307-2048.zip

5 hours ago, JorgeB said:

Disk4 dropped offline. It's not logged as a disk problem; it looks more like a power/connection problem.

 

A few quick follow-up questions:

 

Do I have to shut down the server to deal with this?

 

If not, should I stop the parity check, take the array offline, check the connections, and restart the array?

Or what are the next steps?

 

 

9 hours ago, JorgeB said:

I would recommend at least checking the cables, both power and SATA, or ideally replacing them to rule them out.

 

The drive rebuild is underway, with no issues there.

 

I went through my logs and found another instance of this error.

 

Mar  8 15:09:36 Tower kernel: traps: lsof[26299] general protection fault ip:14ec3865f4ee sp:9d0015fe713527f1 error:0 in libc-2.36.so[14ec38647000+16b000]

 

Is this unrelated? If so, I'm happy to discuss it separately.

On 3/8/2023 at 8:31 AM, JorgeB said:

I would recommend at least checking the cables, both power and SATA, or ideally replacing them to rule them out.

 

The data rebuild completed; however, I received a yellow (warning) notification saying it had finished successfully, but with the description 'Canceled'. From the archived notices:

10-03-2023 02:21	Unraid Data-Rebuild	Notice [TOWER] - Data-Rebuild finished (0 errors)	Canceled	warning

 

However, when I go into the system log, I see this:

Mar 10 02:20:47 Tower kernel: md: sync done. time=130479sec
Mar 10 02:20:47 Tower kernel: md: recovery thread: exit status: 0
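As a cross-check, a small Python sketch that reads the two syslog lines above and turns them into a duration and exit status. It assumes the year is 2023 (taken from the notification date, since syslog lines carry no year) and that `time=` is the elapsed time in seconds as reported by the md driver. The helper names `sync_duration` and `exit_status` are made up for this illustration.

```python
import re
from datetime import datetime, timedelta

# The two syslog lines quoted above, copied verbatim.
SYNC_DONE = "Mar 10 02:20:47 Tower kernel: md: sync done. time=130479sec"
EXIT_LINE = "Mar 10 02:20:47 Tower kernel: md: recovery thread: exit status: 0"

ASSUMED_YEAR = 2023  # assumption: syslog omits the year; taken from the notice

def sync_duration(line):
    """Return the reported sync time as a timedelta, or None if absent."""
    m = re.search(r"md: sync done\. time=(\d+)sec", line)
    return timedelta(seconds=int(m.group(1))) if m else None

def exit_status(line):
    """Return the recovery thread's exit status as an int, or None if absent."""
    m = re.search(r"md: recovery thread: exit status: (-?\d+)", line)
    return int(m.group(1)) if m else None

# The first 15 characters of a syslog line hold the timestamp.
finished = datetime.strptime(f"{ASSUMED_YEAR} {SYNC_DONE[:15]}", "%Y %b %d %H:%M:%S")
duration = sync_duration(SYNC_DONE)

print("reported duration:", duration)             # 1 day, 12:14:39
print("implied start:", finished - duration)      # 2023-03-08 14:06:08
print("exit status:", exit_status(EXIT_LINE))     # 0
```

On these two lines, the kernel reports a sync of roughly 36 hours that ended with exit status 0, one minute or so before the 02:21 notification. The implied start time only holds if the reported time counts wall-clock time without gaps, which is an assumption here.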

 

Is the 'Canceled' notice an error or a bug? I'm not sure how to reconcile the two.
