
4 drive failures in a month, fixed drives/array but log now keeps filling up


Clayton


Hey, curious if I have a controller failure. I have had 4 disk failures this month.

The first two were disabled in the array and unmountable. I replaced the drives and rebuilt the array. When it finished, the two disks were still unmountable. I used the filesystem check and ended up deleting the master log file, and then I was back up and running. I still have the two original disks.

Fast forward a week or two and I lost two different disks. This time they were only disabled, but mountable. I removed them from the array and added them back in after moving them to a different drive bay. That is now fixed after the rebuild.

Today my log space keeps filling up. I see errors that I think are related to my pair of cache drives. All the problematic disks and my cache drives use an expansion/RAID card in JBOD mode. I forget the name off the top of my head. It's a 12Gb/s card with 8 SATA connections. Since my last reboot I'm only having issues with my SSDs, and those are the only disks on that card now. The rest of my drives are on the SATA headers on my system board.

Would someone be able to give any insight into whether this is a card failure?

darktower-diagnostics-20211029-1307.zip


Looks more like a board/PCIe problem:

 

Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2: AER: Uncorrected (Fatal) error received: 0000:00:02.2
Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2:   device [8086:2f06] error status/mask=00000020/00000000
Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2:    [ 5] SDES                   (First)
Oct 29 00:47:47 DarkTower kernel: mpt3sas_cm0: PCI error: detected callback, state(2)!!
Oct 29 00:47:48 DarkTower kernel: pcieport 0000:00:02.2: AER: Root Port link has been reset
Oct 29 00:47:48 DarkTower kernel: mpt3sas_cm0: PCI error: resume callback!!
Oct 29 00:47:48 DarkTower kernel: pcieport 0000:00:02.2: AER: device recovery successful

 

Try using a different PCIe slot if available.
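To check whether the errors stop after moving or reseating the card, it can help to count the AER entries in the syslog per PCIe device. The following is a minimal Python sketch, not an Unraid tool: the regular expression is written against the log lines quoted above, and the function name `count_aer_errors` is made up for illustration. Other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in this thread, e.g.
# "... kernel: pcieport 0000:00:02.2: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, ..."
AER_LINE = re.compile(
    r"kernel: \S+ (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)

def count_aer_errors(lines):
    """Count PCIe bus errors per (device address, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            key = (match["bdf"], match["severity"].strip(), match["layer"].strip())
            counts[key] += 1
    return counts

# Two lines copied from the log excerpt above; only the second is a bus error line.
sample = [
    "Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2: AER: Uncorrected (Fatal) error received: 0000:00:02.2",
    "Oct 29 00:47:47 DarkTower kernel: pcieport 0000:00:02.2: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)",
]

for (bdf, severity, layer), n in count_aer_errors(sample).items():
    print(f"{bdf}  {severity}  {layer}  x{n}")
```

On a live system the same function could be fed the lines of the syslog file instead of `sample`. If the count for the root port (here `0000:00:02.2`) keeps growing after the card is moved to another slot, the slot itself is less likely to be the cause.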


reseated the card last night, same slot. went through a mover cycle and so far no log entries. looking good so far but will not count my chickens until it goes a few days to a week.

 

Kind of a random question, since I had so many drive failures that were unlikely to be real drive failures, but I replaced at least two of them anyway: if I have a failure in the future on a 12TB drive, can it be replaced by a 10TB drive if the 12TB is not used in full, i.e. it holds under 10TB?

22 minutes ago, Clayton said:

if I have a failure in the future on a 12TB drive, can it be replaced by a 10TB drive if the 12TB is not used in full, i.e. it holds under 10TB?

You can only rebuild to a disk of the same size or larger. The entire disk is rebuilt, regardless of contents. Parity doesn't know anything about filesystems, only bits.
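A toy model may make the "only bits" point concrete. The Python sketch below assumes simple single XOR parity over equal-sized disks, with a few bytes standing in for whole drives. The helper names `xor_blocks` and `rebuild` are made up for illustration and are not part of Unraid.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_disks, parity):
    """Recover a missing disk by XOR-ing the parity with every surviving disk."""
    return xor_blocks(surviving_disks + [parity])

# Four bytes stand in for a whole disk. The trailing zeros on disk1 play the
# role of "unused" space: parity still covers them like any other position.
disk1 = bytes([0x12, 0x34, 0x00, 0x00])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
parity = xor_blocks([disk1, disk2])

# Pretend disk1 failed and reconstruct it from disk2 and parity.
rebuilt = rebuild([disk2], parity)
print(rebuilt == disk1, len(rebuilt))
```

The rebuild walks every position of the missing disk, used or not, because nothing in the calculation knows where the filesystem keeps its data. The replacement therefore needs room for every one of those positions, which is why a smaller disk cannot take the place of a larger one.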
