Recent corruption issues



I never had data corruption issues like I have had recently.  So far I have had to run reiserfsck against three different disks in the last month or so, and on one of those disks I have had to run it three times.  I would like to take a proactive stance and run reiserfsck against all of my drives.  My questions...

 

1.  Is this a bad thing to do?

2.  Can I run more than one instance of reiserfsck at a time (in separate SSH sessions)?  Optimally, I would like to run 12 instances.  Is this a seriously bad idea?

 

EDIT:  reiserfsck finished checking disk4.  Should it have found something to fix?  This is two days in a row that this disk has gone read-only.

 

EDIT#2:  Syslog can be found here:  https://onedrive.live.com/redir?resid=73ECE0E19A13499E%211865

 

root@unRAID:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Thu Dec  4 08:55:59 2014
###########
Replaying journal: Trans replayed: mountid 518, transid 291069, desc 5914, len 1, commit 5916, next tr
Trans replayed: mountid 518, transid 291070, desc 5917, len 1, commit 5919, next trans offset 5902
Trans replayed: mountid 518, transid 291071, desc 5920, len 1, commit 5922, next trans offset 5905
Trans replayed: mountid 518, transid 291072, desc 5923, len 1, commit 5925, next trans offset 5908
Trans replayed: mountid 518, transid 291073, desc 5926, len 1, commit 5928, next trans offset 5911
Trans replayed: mountid 518, transid 291074, desc 5929, len 1, commit 5931, next trans offset 5914
Trans replayed: mountid 518, transid 291075, desc 5932, len 35, commit 5968, next trans offset 5951
Trans replayed: mountid 518, transid 291076, desc 5969, len 59, commit 6029, next trans offset 6012
Trans replayed: mountid 518, transid 291077, desc 6030, len 36, commit 6067, next trans offset 6050
Trans replayed: mountid 518, transid 291078, desc 6068, len 26, commit 6095, next trans offset 6078
Trans replayed: mountid 518, transid 291079, desc 6096, len 15, commit 6112, next trans offset 6095
Trans replayed: mountid 518, transid 291080, desc 6113, len 38, commit 6152, next trans offset 6135
Trans replayed: mountid 518, transid 291081, desc 6153, len 36, commit 6190, next trans offset 6173
Trans replayed: mountid 518, transid 291082, desc 6191, len 20, commit 6212, next trans offset 6195
Trans replayed: mountid 518, transid 291083, desc 6213, len 38, commit 6252, next trans offset 6235
Trans replayed: mountid 518, transid 291084, desc 6253, len 34, commit 6288, next trans offset 6271
Trans replayed: mountid 518, transid 291085, desc 6289, len 33, commit 6323, next trans offset 6306
Trans replayed: mountid 518, transid 291086, desc 6324, len 44, commit 6369, next trans offset 6352
Trans replayed: mountid 518, transid 291087, desc 6370, len 64, commit 6435, next trans offset 6418
Trans replayed: mountid 518, transid 291088, desc 6436, len 32, commit 6469, next trans offset 6452
Trans replayed: mountid 518, transid 291089, desc 6470, len 43, commit 6514, next trans offset 6497
Trans replayed: mountid 518, transid 291090, desc 6515, len 61, commit 6577, next trans offset 6560
Trans replayed: mountid 518, transid 291091, desc 6578, len 73, commit 6652, next trans offset 6635
Trans replayed: mountid 518, transid 291092, desc 6653, len 48, commit 6702, next trans offset 6685
Trans replayed: mountid 518, transid 291093, desc 6703, len 142, commit 6846, next trans offset 6829
Trans replayed: mountid 518, transid 291094, desc 6847, len 36, commit 6884, next trans offset 6867
Trans replayed: mountid 518, transid 291095, desc 6885, len 34, commit 6920, next trans offset 6903
Trans replayed: mountid 518, transid 291096, desc 6921, len 164, commit 7086, next trans offset 7069
Trans replayed: mountid 518, transid 291097, desc 7087, len 39, commit 7127, next trans offset 7110
Trans replayed: mountid 518, transid 291098, desc 7128, len 153, commit 7282, next trans offset 7265
Trans replayed: mountid 518, transid 291099, desc 7283, len 30, commit 7314, next trans offset 7297
Trans replayed: mountid 518, transid 291100, desc 7315, len 31, commit 7347, next trans offset 7330
Trans replayed: mountid 518, transid 291101, desc 7348, len 33, commit 7382, next trans offset 7365
Trans replayed: mountid 518, transid 291102, desc 7383, len 30, commit 7414, next trans offset 7397
Trans replayed: mountid 518, transid 291103, desc 7415, len 34, commit 7450, next trans offset 7433
Trans replayed: mountid 518, transid 291104, desc 7451, len 30, commit 7482, next trans offset 7465
Trans replayed: mountid 518, transid 291105, desc 7483, len 31, commit 7515, next trans offset 7498
Trans replayed: mountid 518, transid 291106, desc 7516, len 32, commit 7549, next trans offset 7532
Trans replayed: mountid 518, transid 291107, desc 7550, len 27, commit 7578, next trans offset 7561
Trans replayed: mountid 518, transid 291108, desc 7579, len 279, commit 7859, next trans offset 7842
Trans replayed: mountid 518, transid 291109, desc 7860, len 30, commit 7891, next trans offset 7874
Trans replayed: mountid 518, transid 291110, desc 7892, len 27, commit 7920, next trans offset 7903
Trans replayed: mountid 518, transid 291111, desc 7921, len 28, commit 7950, next trans offset 7933
Trans replayed: mountid 518, transid 291112, desc 7951, len 27, commit 7979, next trans offset 7962
Trans replayed: mountid 518, transid 291113, desc 7980, len 27, commit 8008, next trans offset 7991
Trans replayed: mountid 518, transid 291114, desc 8009, len 31, commit 8041, next trans offset 8024
Trans replayed: mountid 518, transid 291115, desc 8042, len 30, commit 8073, next trans offset 8056
Trans replayed: mountid 518, transid 291116, desc 8074, len 28, commit 8103, next trans offset 8086
Trans replayed: mountid 518, transid 291117, desc 8104, len 27, commit 8132, next trans offset 8115
Trans replayed: mountid 518, transid 291118, desc 8133, len 26, commit 8160, next trans offset 8143
Trans replayed: mountid 518, transid 291119, desc 8161, len 22, commit 8184, next trans offset 8167
Trans replayed: mountid 518, transid 291120, desc 8185, len 296, commit 290, next trans offset 273
Trans replayed: mountid 518, transid 291121, desc 291, len 26, commit 318, next trans offset 301
Trans replayed: mountid 518, transid 291122, desc 319, len 270, commit 590, next trans offset 573
Trans replayed: mountid 518, transid 291123, desc 591, len 57, commit 649, next trans offset 632
Trans replayed: mountid 518, transid 291124, desc 650, len 65, commit 716, next trans offset 699
Trans replayed: mountid 518, transid 291125, desc 717, len 57, commit 775, next trans offset 758
Trans replayed: mountid 518, transid 291126, desc 776, len 57, commit 834, next trans offset 817
Trans replayed: mountid 518, transid 291127, desc 835, len 48, commit 884, next trans offset 867
Trans replayed: mountid 518, transid 291128, desc 885, len 65, commit 951, next trans offset 934
Trans replayed: mountid 518, transid 291129, desc 952, len 35, commit 988, next trans offset 971
Trans replayed: mountid 518, transid 291130, desc 989, len 28, commit 1018, next trans offset 1001
Trans replayed: mountid 518, transid 291131, desc 1019, len 28, commit 1048, next trans offset 1031
Trans replayed: mountid 518, transid 291132, desc 1049, len 26, commit 1076, next trans offset 1059
Trans replayed: mountid 518, transid 291133, desc 1077, len 442, commit 1520, next trans offset 1503
Trans replayed: mountid 518, transid 291134, desc 1521, len 186, commit 1708, next trans offset 1691
Trans replayed: mountid 518, transid 291135, desc 1709, len 60, commit 1770, next trans offset 1753
Trans replayed: mountid 518, transid 291136, desc 1771, len 28, commit 1800, next trans offset 1783
Trans replayed: mountid 518, transid 291137, desc 1801, len 28, commit 1830, next trans offset 1813
Trans replayed: mountid 518, transid 291138, desc 1831, len 25, commit 1857, next trans offset 1840
Trans replayed: mountid 518, transid 291139, desc 1858, len 23, commit 1882, next trans offset 1865
Trans replayed: mountid 518, transid 291140, desc 1883, len 28, commit 1912, next trans offset 1895
Trans replayed: mountid 518, transid 291141, desc 1913, len 25, commit 1939, next trans offset 1922
Trans replayed: mountid 518, transid 291142, desc 1940, len 28, commit 1969, next trans offset 1952
Trans replayed: mountid 518, transid 291143, desc 1970, len 25, commit 1996, next trans offset 1979
Trans replayed: mountid 518, transid 291144, desc 1997, len 15, commit 2013, next trans offset 1996
Trans replayed: mountid 518, transid 291145, desc 2014, len 48, commit 2063, next trans offset 2046
Trans replayed: mountid 518, transid 291146, desc 2064, len 208, commit 2273, next trans offset 2256
Trans replayed: mountid 518, transid 291147, desc 2274, len 213, commit 2488, next trans offset 2471
Trans replayed: mountid 518, transid 291148, desc 2489, len 34, commit 2524, next trans offset 2507
Trans replayed: mountid 518, transid 291149, desc 2525, len 33, commit 2559, next trans offset 2542
Trans replayed: mountid 518, transid 291150, desc 2560, len 35, commit 2596, next trans offset 2579
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 82 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 385273
        Internal nodes 2378
        Directories 2910
        Other files 50679
        Data block pointers 382367072 (9812 of them are zero)
        Safe links 0
###########
reiserfsck finished at Thu Dec  4 09:35:13 2014
###########
root@unRAID:~#

 

(on a side note...I will NEVER NEVER NEVER EVER again buy a Western Digital green drive!)

 

John


Look for something common between the drives. Controller, wiring, backplane.

Capture smart reports for each drive in question.

I would suggest disabling the spindown timers and doing the smart long test for drives you feel are questionable.

This will provide a level of confidence for the drive itself.

Remember to disable the spin down timers.

I would suggest stopping the array or stopping use of the array while these drives are in test mode so as not to taint anything.

 

After the long test compare the starting smart report with the ending smart report (post here if you want another set of eyes).
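For example, something like this would capture the before/after reports around a long test (just a sketch; /dev/sdX is the drive under test and the file names are made up):

smartctl -a /dev/sdX > /boot/smart_before.txt     # starting SMART report
smartctl -t long /dev/sdX                         # kick off the extended self-test (prints an ETA)
# ...wait for the test to complete, then:
smartctl -a /dev/sdX > /boot/smart_after.txt      # ending SMART report
diff /boot/smart_before.txt /boot/smart_after.txt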

 

You're looking for pending sectors, reallocated sectors (Reallocated Sector Ct), and offline uncorrectable.

There are others, but these come to mind for now.


OK...here is what I see for disk4 (the one that experienced corruption 2 days in a row).  BTW...I don't spin down any of my drives.

 

[screenshot: SMART attributes for disk4]

 

I am running an extended smart test now (array is in maintenance mode...is that OK?) and will post the results when finished.  I'm at work now so I can't check cabling.


The drive looks fine so far.

Make sure you disable spin down timers for the drive being tested.

If you have the browser open on the smart test window, fire up another tab or window.

I'm not sure what the effect is of disconnecting the browser window that issued the test.

 

 

You should also post a full syslog, as there may be other ATA errors not being picked up (which would point towards a controller/bus/cabling issue).


See above...I'm not using spindown.

 

You can actually leave the smart test window and come back to it later to see the progress.

 

Should I post the syslog after this test finishes (with all of the read only file system errors) or after a fresh reboot?

 

John


Post the syslog with the corruptions logged (the read-only file system errors).

Don't reboot if you don't have to.


I still have not had a chance to run an extended smart test on this drive and it went read-only again.

 

Rather than continue to mess with it, I am going to swap it out with a spare drive that I have.  My question is do I need to run reiserfsck on the old drive prior to replacing it (thinking that parity also needs to be repaired)?

 

John


You cannot 'repair' parity except by doing a parity sync.

 

You could simply rebuild onto your spare disk.  Since a rebuild does not fix any file system corruption that was already present, it can still be worth doing a reiserfsck check after the rebuild.  This approach also keeps the original disk intact in case any issues arise and you need to try and recover data from it.


Confirming what the others have said, there does not appear to be anything physically wrong with the drive, and you have completed one reiserfsck check against it without issues, but it can't hurt to do another.  If it too is fine, then I'm afraid you may have hit a bug in the ReiserFS, perhaps caused by that same unwarranted refactoring that caused the big data corruption bug recently.  A rebuild here is not going to help you, because it will be an exact copy of the current Reiser file system on Disk 4.  I'd recommend adding that spare drive to your array, perhaps formatted as XFS, then copying everything from Disk 4 to the new one.


This is exactly what I was going to say.

 

Did you ever run unRAID 6.0 beta6 and/or beta7? Those two versions had the reiserfs bug (in the kernel, not from unRAID) which toasted the filesystem. I had data corruption and kernel locks all the time, whether running Mover or even just writing data to /mnt/user. I'm slowly moving all of my data from reiserfs to XFS and feel pretty confident that it's a reiserfs issue, since I've never had one issue copying data to any newly formatted XFS drive. So far I've copied about 15 TB from reiserfs to XFS using the following command under screen (so I can close the PuTTY session): "rsync -arv --stats --remove-source-files --progress /mnt/disk5/ /mnt/disk7/" (some of the arguments might be redundant, but I've had really good success this way).
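In case it's useful, the screen part of that looks roughly like this (the session name and disk numbers are just examples; adjust to your own disks):

screen -S rfs2xfs     # start a detachable session so the copy survives closing PuTTY
rsync -arv --stats --remove-source-files --progress /mnt/disk5/ /mnt/disk7/
# detach with Ctrl-a d; reattach later with:
screen -r rfs2xfs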

 

Here's my post with a ton of information and determining it's a reiserfs issue: http://lime-technology.com/forum/index.php?topic=35788.0


It's interesting that there are no reiserfs warnings before the drive goes readonly.  Only SHFS failures.

 

 

Also noticed one of the cache drives has issues with some files.

Dec  4 08:55:16 unRAID emhttp: KINGSTON_SVP100S2128G_X0AY100YY1BK (sdb) 125034840

...

 

Dec  4 07:08:34 unRAID kernel: BTRFS error (device sdb1): csum failed ino 3482 off 12765528064 csum 4102380536 expected csum 2230946545

Dec  4 07:08:36 unRAID kernel: BTRFS error (device sdb1): csum failed ino 3482 off 12765528064 csum 3941297839 expected csum 2230946545

Dec  4 07:08:41 unRAID kernel: BTRFS error (device sdb1): csum failed ino 3482 off 982966272 csum 3486317105 expected csum 440998271

 

These inodes need to be reviewed, as BTRFS is reporting checksum errors.  To map an inode number back to a file name, run find against the mount point of the cache filesystem (not the raw device):

find <cache mount point> -inum <inode> -ls
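For example, for the first inode in those errors (assuming the cache pool is mounted at /mnt/cache):

find /mnt/cache -inum 3482 -ls    # show the file that owns inode 3482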

 

 

Dec  4 07:42:21 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (5) Input/output error

Dec  4 07:42:21 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system


I never ran the version of unRAID that had the reiserfs bug.  My introduction to v6 was beta10a.

 

BTW...I had another hard crash last night.  Of course, I forgot to leave the syslog open so I can't see what happened.  When I brought the server back online this morning, DISK4 had actually been disabled this time.  I inserted the original disk and am performing a rebuild now.  I'm done with reiserfs.  Once this rebuild is finished, I am going to start the process of moving all drives to XFS.

 

RE: the cache drive corruption...I have had that for some time and just deal with it.  Unfortunately, the corruption is in a qcow2 image, so I will most likely need to rebuild that VM.  Honestly, I think I may also ditch BTRFS and use separate XFS drives for cache.

 

Questions about data migration...I'll use razor's rsync command above.

 

-  Does writing to a DISK share also write to parity?

-  How do you prevent anything being written to the source disk during the copy so nothing is missed?  Is it just do the best you can to manage?

-  Once the data is copied from DISKx to DISKy, do you delete the contents of DISKx and then use that disk as the next target for the migration (after it has been formatted to XFS)?

 

John


I'm not convinced it's totally a reiserfs issue.  The syslog has no reiserfs messages being posted (or at least they are not being caught in any syslog).

A reiserfs issue shouldn't make the disk go offline. A PSU or faulty wiring/controller would.

 

Was a new disk added recently? Could there be an issue with the PSU?

 

As suggested, a memtest should be done, as it could reveal a memory issue that could corrupt buffers and/or filesystem structures.


Two different disks, with new cables and new backplanes (I moved the guts to a new chassis).  At this point the only thing I can say is consistent is that it is always DISK4 (in both the old and new chassis) that has experienced the failure.  But that may just be because it is the disk with the most free space, so it is getting pretty much all of the writes.

 

I'll definitely reboot/memtest once this rebuild finishes.

 

Thanks guys...I really appreciate it!

 

John


Try moving it to a different slot and/or backplane.

I've seen issues where a heavily used drive would cause intermittent issues due to vibration; however, there are no ATA failure messages either.

It's crucial that you have something monitoring the syslog as any ATA or Reiserfs issues should be displayed.


It has been in 3 different slots so far.


Questions about data migration...I'll use razor's rsync command above.

 


I would do this very slowly. Since disk4 is showing issues, I would try and move that data elsewhere with rsync.

Once your data is safe, I might take it out of the array or exclude it from any write operations on the shares tab.

 

Then do badblocks in readonly mode.  See if it reports anything.

That will read the drive from start to finish directly without any filesystem access methods. Just a raw device read.
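A minimal sketch, with /dev/sdX standing in for the suspect drive (-s shows progress, -v reports any bad blocks found):

badblocks -sv /dev/sdX    # read-only, non-destructive surface scan of the raw device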

 

I don't expect anything to be physically wrong: there are no ATA errors, no md sector errors, and no reiserfs errors reported in that syslog.

That is, unless we've missed them all.

 

Your questions

-  Does writing to a DISK share also write to parity?  Yes

 

-  How do you prevent anything being written to the source disk during the copy so nothing is missed?  Is it just do the best you can to manage?

  you can do the rsync multiple times; it will only copy new files, and once it stops reporting new files you know you've rsynced everything

  you should probably stop any processes that are writing to the array while doing this and/or set the disk to be excluded from the user share in the shares tab

 

-  Once the data is copied from DISKx to DISKy, do you delete the contents of DISKx and then use that disk as the next target for the migration (after it has been formatted to XFS)?

 

  I wouldn't do anything with these questionable disks until a smart long test and/or badblocks in readonly mode was done.

  Once I was totally convinced the drive is healthy, I might do a pre-clear and add it to the array again, then format it as XFS.

 

  I'm also really anal about making sure drives are healthy: I run them through the 4-pass badblocks write mode to ensure that the drive is going to be error free.

 

When adding a drive to the array, I do the following (a rough sketch of the commands is below the list):

 

1. capture smart log

2. conveyance test (This is for new drives that have recently been shipped)

3. smart long test (firmware surface scan)

4. capture smart log again (review for pending/reallocated/lba errors)

5. badblocks in read/write mode (4 passes: 0xaa, 0x55, 0xff, 0x00), which sets the stage for preclear

6. smart long test again so it marks it in the self test log.

7. capture smart log again.

8. preclear to write signature.
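
Roughly, the commands behind those steps look like this (a sketch only; /dev/sdX is the drive being qualified and the log file names are made up):

smartctl -a /dev/sdX > smart_1.txt     # 1. capture smart log
smartctl -t conveyance /dev/sdX        # 2. conveyance test (new/recently shipped drives)
smartctl -t long /dev/sdX              # 3. smart long test (firmware surface scan)
smartctl -a /dev/sdX > smart_2.txt     # 4. capture smart log again; review pending/reallocated counts
badblocks -wsv /dev/sdX                # 5. write-mode badblocks; -w cycles the 0xaa,0x55,0xff,0x00 patterns (DESTROYS all data)
smartctl -t long /dev/sdX              # 6. smart long test again so it lands in the self-test log
smartctl -a /dev/sdX > smart_3.txt     # 7. capture smart log again
# 8. then preclear to write the unRAID signature (the community preclear script)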

 

preclear does the equivalent of all of this; however, I like to use the various bit patterns of badblocks (that's just me being extra careful).

[Yes I've been able to weed out marginal drives with this procedure]

This takes days to do, which is why I usually do it on a separate HP MicroServer.

 

It's also prudent to review the syslog for ATA errors.

That is, the kernel can do retries and thus not report a failed block to badblocks.

 

If the output of badblocks does not count any errors, then error recovery was successful.

However, if bad blocks were counted, the drive is not a worthy candidate for use in a RAID array.

 

 


 

 

You may not need to go through my lengthy procedure or a full pre-clear if badblocks in read-only mode is good and smart long test is error free.

In that case you may leave the drive in the array, exclude it from the usershares and do the read tests.

 

 

I don't know exactly how to switch from reiserfs to XFS without clearing the drive to an unformatted status.

Perhaps someone else will have advice on that part.


It's interesting that there are no reiserfs warnings before the drive goes readonly.  Only SHFS failures.

Dec  4 07:42:21 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

Dec  4 07:42:21 unRAID kernel: REISERFS error (device md4): vs-4080 _reiserfs_free_block: block 446052105: bit already cleared

Dec  4 07:42:21 unRAID kernel: REISERFS (device md4): Remounting filesystem read-only

Dec  4 07:42:21 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/nzbget.lock (30) Read-only file system

Dec  4 07:42:21 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Phineas.and.Ferb.S01E40-41.720p.HDTV.x264-DVSKY.#308/3606.out.tmp (30) Read-only file system

The Reiser corruption error was hiding in the middle of a bunch of shfs errors.  The order of the shfs and Reiser errors struck me too, but when I thought about it more, the shfs errors appear to all be related to the file system being read-only, and as all of these errors are in the same second, I had to conclude that it's just a case of syslog line ordering.  The shfs module must be faster at logging its errors than the Reiser module is.  The shfs errors had to occur *after* the Reiser system had remounted the drive as read-only.

 

I did see the Btrfs checksum errors, but could not think of a connection.  A memory issue is a good idea, but this is the second time I've seen that 'bit already cleared' error here, and johnodon's issues seem repeatable.  While that doesn't rule out a memory issue, the lack of randomness would seem to point to a non-memory issue.  A full memory test is certainly warranted though, to rule it out.

 

I never ran the version of unRAID that had the reiserfs bug.  My introduction to v6 was beta10a.

I think this may be a different bug.  That one behaved differently and was patched.  The guy who refactored the Reiser file system seems to be a great guy, with a lot of experience and respect, but I cannot see how his changes could possibly be justified.  This was a stable file system, relied upon by many, and in its sunset years.  Making those changes moved it to beta status, and to be considered unstable for years, years it did not have.  I really feel some consideration should be made to revert Reiser support to before the refactoring, if that is possible (it may not be possible due to kernel changes).


Thanks RobJ, I had missed this; I couldn't actually download it directly, it showed up in some kind of viewer.

 

Dec  4 07:42:21 unRAID kernel: REISERFS error (device md4): vs-4080 _reiserfs_free_block: block 446052105: bit already cleared

Dec  4 07:42:21 unRAID kernel: REISERFS (device md4): Remounting filesystem read-only

These errors confirm it's similar to razorslinky's issue.

This is not a good sign; I would suggest getting your data off these suspect drives.

 

Perhaps scan the syslogs with

grep 'REISERFS error' /var/log/syslog

for other suspect drives.
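
And if you want a proactive read-only pass over every array disk (per the question at the top of the thread), something like this would do it. It's just a sketch: it assumes the array is started in maintenance mode so the md devices exist but nothing is mounted, and that disks 1-12 map to /dev/md1 through /dev/md12:

for n in $(seq 1 12); do
  echo "=== /dev/md$n ==="
  echo Yes | reiserfsck --check /dev/md$n   # read-only check; 'Yes' answers the confirmation prompt
done

Running them one at a time keeps the checks from fighting each other for I/O.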

 

 

This causes me great concern, as it's been stated that the versions known to have the corruption issue were never used, yet corruption is still occurring.


In all of my screwing around, I lost all of the data on that drive.  I think the culprit was that I lost power during a rebuild.  When all was said and done, the rebuilt drive had no data on it.

 

Oh well...life lesson.  At least I know that all I lost was media (movies/tv) and know exactly which ones.  The good news...I have since put that drive back into service, did a full smart report, rebuilt all of my VMs and docker containers (all on cache drive).  I let NZBGet download overnight (about 25GB) and I did not see a single corruption issue.  All of that data was written to that problematic drive.

 

Obviously, I will watch it closely and at the first sign of trouble I'll pull the plug and come back here.  :)

 

All I ask is that LT provide a file system migration strategy...I am still very leery of RFS.

 

John

