Is my SSD in need of replacement?


KDP

Posted (edited)

Two days ago I swapped out some drives in my array. When I rebooted the server, I started getting a lot of errors on one of the SSDs in my cache pool, and my Docker image was corrupted. I went back into the server and made sure that all of the cables to the SSDs were secure. I have not seen any errors in my log since then and have rebuilt all my Docker containers. I ran btrfs dev stats /mnt/cache/ and it shows the following:

 

[/dev/sdb1].write_io_errs    32262152
[/dev/sdb1].read_io_errs     30052334
[/dev/sdb1].flush_io_errs    90551
[/dev/sdb1].corruption_errs  460309
[/dev/sdb1].generation_errs  4267
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
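For anyone reading this output later: a minimal Python sketch of how these counters could be parsed to pick out the device that logged errors. The function names and the regular expression are illustrative assumptions, not part of btrfs-progs; the sample text is the output pasted above.

```python
import re

# Matches lines such as "[/dev/sdb1].write_io_errs    32262152".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<count>\d+)$")


def parse_dev_stats(text):
    """Return {device: {counter_name: value}} from `btrfs dev stats` text."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["name"]] = int(match["count"])
    return stats


def devices_with_errors(stats):
    """List the devices that have at least one non-zero counter."""
    return [dev for dev, counters in stats.items() if any(counters.values())]


SAMPLE = """\
[/dev/sdb1].write_io_errs    32262152
[/dev/sdb1].read_io_errs     30052334
[/dev/sdb1].flush_io_errs    90551
[/dev/sdb1].corruption_errs  460309
[/dev/sdb1].generation_errs  4267
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
"""

if __name__ == "__main__":
    print(devices_with_errors(parse_dev_stats(SAMPLE)))
```

These counters are cumulative, so a large number on its own only says that errors happened at some point, not that they are still happening.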

 

Those numbers have not changed at all in the 24 hours the system has been running since. I also ran a SMART extended test, and it shows:

 

#	Attribute name	Flag	Value	Worst	Threshold	Type	Updated	Failed	Raw value
1	Raw read error rate	0x0032	100	100	050	Old age	Always	Never	0
5	Reallocated sector count	0x0032	100	100	050	Old age	Always	Never	0
9	Power on hours	0x0032	100	100	050	Old age	Always	Never	34814 (3y, 11m, 17d, 14h)
12	Power cycle count	0x0032	100	100	050	Old age	Always	Never	34
160	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	0
161	Unknown attribute	0x0033	100	100	050	Pre-fail	Always	Never	100
163	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	10
164	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	286621
165	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	2130
166	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	339
167	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	581
168	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	7000
169	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	92
175	Program fail count chip	0x0032	100	100	050	Old age	Always	Never	0
176	Erase fail count chip	0x0032	100	100	050	Old age	Always	Never	0
177	Wear leveling count	0x0032	100	100	050	Old age	Always	Never	0
178	Used rsvd block count chip	0x0032	100	100	050	Old age	Always	Never	0
181	Program fail count total	0x0032	100	100	050	Old age	Always	Never	0
182	Erase fail count total	0x0032	100	100	050	Old age	Always	Never	0
192	Power-off retract count	0x0032	100	100	050	Old age	Always	Never	15
194	Temperature celsius	0x0022	100	100	050	Old age	Always	Never	40
195	Hardware ECC recovered	0x0032	100	100	050	Old age	Always	Never	278670
196	Reallocated event count	0x0032	100	100	050	Old age	Always	Never	0
197	Current pending sector	0x0032	100	100	050	Old age	Always	Never	0
198	Offline uncorrectable	0x0032	100	100	050	Old age	Always	Never	0
199	UDMA CRC error count	0x0032	100	100	050	Old age	Always	Never	1
232	Available reservd space	0x0032	100	100	050	Old age	Always	Never	100
241	Total lbas written	0x0030	100	100	050	Old age	Offline	Never	2296155
242	Total lbas read	0x0030	100	100	050	Old age	Offline	Never	1118873
245	Unknown attribute	0x0032	100	100	050	Old age	Always	Never	2687976
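As a rough aid for reading a table like this, here is a hedged Python sketch that pulls out the raw values of the attributes usually checked first for media or link trouble. It assumes tab-separated rows with ten columns (ID, name, flag, value, worst, threshold, type, updated, failed, raw value), as pasted above; the `WATCHED_IDS` selection and the function names are illustrative choices, not an official list. Attribute 199 (UDMA CRC error count) generally counts transfer errors on the link between drive and controller, which is why it is often tied to cabling.

```python
# Attribute IDs from the table above: reallocated sectors/events, pending and
# uncorrectable sectors, and UDMA CRC errors. This selection is an assumption.
WATCHED_IDS = {5, 196, 197, 198, 199}


def parse_smart_rows(text):
    """Return {attribute_id: (name, raw_value)} from tab-separated rows."""
    rows = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip header lines, blanks, and anything unexpected
        # The raw column may carry a suffix, e.g. "34814 (3y, 11m, 17d, 14h)".
        raw_value = int(fields[9].split()[0])
        rows[int(fields[0])] = (fields[1], raw_value)
    return rows


def nonzero_watched(rows):
    """Return {attribute_id: raw_value} for watched attributes above zero."""
    return {
        attr_id: raw
        for attr_id, (_name, raw) in rows.items()
        if attr_id in WATCHED_IDS and raw > 0
    }


SAMPLE = (
    "5\tReallocated sector count\t0x0032\t100\t100\t050\tOld age\tAlways\tNever\t0\n"
    "9\tPower on hours\t0x0032\t100\t100\t050\tOld age\tAlways\tNever\t34814 (3y, 11m, 17d, 14h)\n"
    "197\tCurrent pending sector\t0x0032\t100\t100\t050\tOld age\tAlways\tNever\t0\n"
    "198\tOffline uncorrectable\t0x0032\t100\t100\t050\tOld age\tAlways\tNever\t0\n"
    "199\tUDMA CRC error count\t0x0032\t100\t100\t050\tOld age\tAlways\tNever\t1\n"
)

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_rows(SAMPLE)))
```

On the rows shown, only attribute 199 is above zero among the watched IDs, with a raw value of 1.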

 

Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  e5 00 00 00 00 00 00 08      00:00:00.000  CHECK POWER MODE
  b0 d5 01 00 4f c2 00 08      00:00:00.000  SMART READ LOG
  b0 d1 01 01 4f c2 00 08      00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  ec 00 01 00 00 00 00 08      00:00:00.000  IDENTIFY DEVICE
  b0 d5 01 01 4f c2 00 08      00:00:00.000  SMART READ LOG

 

I will keep a close eye on it for a few more days, but can I chalk it up to my bumbling around in the case, or should I be concerned about the drive?

Edited by KDP
Posted

The server was rebooted after the problem, so we can't see what happened. Most often, when a device drops offline like that, it's a cable problem. I suggest replacing the cables to rule that out, then running a scrub. Also see here for better pool monitoring.
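The monitoring approach referred to in this reply is not reproduced in the thread. As a stand-in, here is a minimal Python sketch of one way pool monitoring could work: take a snapshot of the per-device error counters, take another later, and report only the counters that grew. The function name and the snapshot layout ({device: {counter: value}}) are assumptions for illustration, and the "later" snapshot in the second example is hypothetical data, not something reported here.

```python
def counter_increases(previous, current):
    """Return {(device, counter): delta} for every counter that grew."""
    increases = {}
    for dev, counters in current.items():
        for name, value in counters.items():
            # A device or counter missing from the older snapshot counts as 0.
            delta = value - previous.get(dev, {}).get(name, 0)
            if delta > 0:
                increases[(dev, name)] = delta
    return increases


if __name__ == "__main__":
    # Two of the counters reported in the first post.
    earlier = {"/dev/sdb1": {"write_io_errs": 32262152, "read_io_errs": 30052334}}

    # Unchanged counters: nothing to report.
    print(counter_increases(earlier, earlier))

    # Hypothetical later snapshot in which write errors grew by 10.
    later = {"/dev/sdb1": {"write_io_errs": 32262162, "read_io_errs": 30052334}}
    print(counter_increases(earlier, later))
```

Comparing snapshots rather than looking at absolute values matters here because the counters persist across reboots until they are reset.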

Posted

I have already run a scrub and implemented the noted script. I believe that I dislodged a cable while working in the case, and I am hoping that reseating the cables fixed the issue. I just wanted a second opinion. Thank you for taking the time to look at everything!
