Recent corruption issues



Sigh...I no sooner wrote the above and got this:

 

Dec 10 07:09:04 unRAID kernel: docker0: port 4(veth0a2f6fa) entered forwarding state
Dec 10 07:09:04 unRAID kernel: docker0: port 4(veth0a2f6fa) entered forwarding state
Dec 10 07:09:19 unRAID kernel: docker0: port 4(veth0a2f6fa) entered forwarding state
Dec 10 07:22:40 unRAID kernel: perf interrupt took too long (2606 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (5) Input/output error
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID kernel: REISERFS error (device md4): vs-4080 _reiserfs_free_block: block 463857954: bit already cleared
Dec 10 08:28:36 unRAID kernel: REISERFS (device md4): Remounting filesystem read-only
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_open: open: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
Dec 10 08:28:36 unRAID shfs/user: shfs_truncate: truncate: /mnt/disk4/Downloads/usenet/inter/Constantine.S01E07.Blessed.Are.the.Damned.1080p.WEB-DL.DD5.1.H.264-ECI-NZBgeek.#26/1673.out.tmp (30) Read-only file system
[... log truncated ...]
Dec 10 08:38:36 unRAID php: /usr/bin/docker stop NZBGet
Dec 10 08:38:37 unRAID shfs/user: shfs_unlink: unlink: /mnt/disk4/Downloads/.fuse_hidden000000e600000002 (30) Read-only file system
Dec 10 08:38:37 unRAID php: NZBGet
Dec 10 08:38:37 unRAID php: 
Dec 10 08:38:37 unRAID avahi-daemon[1984]: Withdrawing workstation service for veth6d788d5.
Dec 10 08:38:37 unRAID kernel: docker0: port 2(veth6d788d5) entered disabled state
Dec 10 08:38:37 unRAID kernel: device veth6d788d5 left promiscuous mode
Dec 10 08:38:37 unRAID kernel: docker0: port 2(veth6d788d5) entered disabled state

 

OK...I am at your mercy guys.  At this point I have only stopped NZBGet and I am running a reiserfsck --check now.  Once that is done, can I move that 25GB off of DISK4 to another disk and just remove that disk from the array?  That way I can fire things up again and see if the errors persist on another disk.

 

EDIT#1:  First pass of reiserfsck --check /dev/md4.  I am going to run it again.

 

root@unRAID:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Dec 10 08:47:57 2014
###########
Replaying journal: Trans replayed: mountid 21, transid 727, desc 7357, len 21, commit 7379, next trans offset 7362
Trans replayed: mountid 21, transid 728, desc 7380, len 20, commit 7401, next trans offset 7384
Trans replayed: mountid 21, transid 729, desc 7402, len 35, commit 7438, next trans offset 7421
Trans replayed: mountid 21, transid 730, desc 7439, len 34, commit 7474, next trans offset 7457
Trans replayed: mountid 21, transid 731, desc 7475, len 33, commit 7509, next trans offset 7492
Trans replayed: mountid 21, transid 732, desc 7510, len 29, commit 7540, next trans offset 7523
Trans replayed: mountid 21, transid 733, desc 7541, len 25, commit 7567, next trans offset 7550
Trans replayed: mountid 21, transid 734, desc 7568, len 35, commit 7604, next trans offset 7587
Trans replayed: mountid 21, transid 735, desc 7605, len 33, commit 7639, next trans offset 7622
Trans replayed: mountid 21, transid 736, desc 7640, len 34, commit 7675, next trans offset 7658
Trans replayed: mountid 21, transid 737, desc 7676, len 31, commit 7708, next trans offset 7691
Trans replayed: mountid 21, transid 738, desc 7709, len 33, commit 7743, next trans offset 7726
Trans replayed: mountid 21, transid 739, desc 7744, len 31, commit 7776, next trans offset 7759
Trans replayed: mountid 21, transid 740, desc 7777, len 33, commit 7811, next trans offset 7794
Trans replayed: mountid 21, transid 741, desc 7812, len 31, commit 7844, next trans offset 7827
Trans replayed: mountid 21, transid 742, desc 7845, len 31, commit 7877, next trans offset 7860
Trans replayed: mountid 21, transid 743, desc 7878, len 32, commit 7911, next trans offset 7894
Trans replayed: mountid 21, transid 744, desc 7912, len 22, commit 7935, next trans offset 7918
Trans replayed: mountid 21, transid 745, desc 7936, len 34, commit 7971, next trans offset 7954
Trans replayed: mountid 21, transid 746, desc 7972, len 37, commit 8010, next trans offset 7993
Trans replayed: mountid 21, transid 747, desc 8011, len 35, commit 8047, next trans offset 8030
Trans replayed: mountid 21, transid 748, desc 8048, len 35, commit 8084, next trans offset 8067
Trans replayed: mountid 21, transid 749, desc 8085, len 34, commit 8120, next trans offset 8103
Trans replayed: mountid 21, transid 750, desc 8121, len 24, commit 8146, next trans offset 8129
Trans replayed: mountid 21, transid 751, desc 8147, len 15, commit 8163, next trans offset 8146
Trans replayed: mountid 21, transid 752, desc 8164, len 36, commit 8201, next trans offset 8184
Trans replayed: mountid 21, transid 753, desc 8202, len 31, commit 42, next trans offset 25
Trans replayed: mountid 21, transid 754, desc 43, len 28, commit 72, next trans offset 55
Trans replayed: mountid 21, transid 755, desc 73, len 28, commit 102, next trans offset 85
Trans replayed: mountid 21, transid 756, desc 103, len 410, commit 514, next trans offset 497
Trans replayed: mountid 21, transid 757, desc 515, len 27, commit 543, next trans offset 526
Trans replayed: mountid 21, transid 758, desc 544, len 30, commit 575, next trans offset 558
Trans replayed: mountid 21, transid 759, desc 576, len 26, commit 603, next trans offset 586
Trans replayed: mountid 21, transid 760, desc 604, len 30, commit 635, next trans offset 618
Trans replayed: mountid 21, transid 761, desc 636, len 28, commit 665, next trans offset 648
Trans replayed: mountid 21, transid 762, desc 666, len 28, commit 695, next trans offset 678
Trans replayed: mountid 21, transid 763, desc 696, len 28, commit 725, next trans offset 708
Trans replayed: mountid 21, transid 764, desc 726, len 25, commit 752, next trans offset 735
Trans replayed: mountid 21, transid 765, desc 753, len 30, commit 784, next trans offset 767
Trans replayed: mountid 21, transid 766, desc 785, len 24, commit 810, next trans offset 793
Trans replayed: mountid 21, transid 767, desc 811, len 29, commit 841, next trans offset 824
Trans replayed: mountid 21, transid 768, desc 842, len 25, commit 868, next trans offset 851
Trans replayed: mountid 21, transid 769, desc 869, len 25, commit 895, next trans offset 878
Trans replayed: mountid 21, transid 770, desc 896, len 24, commit 921, next trans offset 904
Trans replayed: mountid 21, transid 771, desc 922, len 21, commit 944, next trans offset 927
Trans replayed: mountid 21, transid 772, desc 945, len 27, commit 973, next trans offset 956
Trans replayed: mountid 21, transid 773, desc 974, len 27, commit 1002, next trans offset 985
Trans replayed: mountid 21, transid 774, desc 1003, len 26, commit 1030, next trans offset 1013
Trans replayed: mountid 21, transid 775, desc 1031, len 27, commit 1059, next trans offset 1042
Trans replayed: mountid 21, transid 776, desc 1060, len 30, commit 1091, next trans offset 1074
Trans replayed: mountid 21, transid 777, desc 1092, len 30, commit 1123, next trans offset 1106
Trans replayed: mountid 21, transid 778, desc 1124, len 29, commit 1154, next trans offset 1137
Trans replayed: mountid 21, transid 779, desc 1155, len 27, commit 1183, next trans offset 1166
Trans replayed: mountid 21, transid 780, desc 1184, len 17, commit 1202, next trans offset 1185
Trans replayed: mountid 21, transid 781, desc 1203, len 345, commit 1549, next trans offset 1532
Trans replayed: mountid 21, transid 782, desc 1550, len 25, commit 1576, next trans offset 1559
Trans replayed: mountid 21, transid 783, desc 1577, len 353, commit 1931, next trans offset 1914
Trans replayed: mountid 21, transid 784, desc 1932, len 24, commit 1957, next trans offset 1940
Trans replayed: mountid 21, transid 785, desc 1958, len 15, commit 1974, next trans offset 1957
Trans replayed: mountid 21, transid 786, desc 1975, len 16, commit 1992, next trans offset 1975
Trans replayed: mountid 21, transid 787, desc 1993, len 25, commit 2019, next trans offset 2002
Trans replayed: mountid 21, transid 788, desc 2020, len 22, commit 2043, next trans offset 2026
Trans replayed: mountid 21, transid 789, desc 2044, len 36, commit 2081, next trans offset 2064
Trans replayed: mountid 21, transid 790, desc 2082, len 22, commit 2105, next trans offset 2088
Trans replayed: mountid 21, transid 791, desc 2106, len 23, commit 2130, next trans offset 2113
Trans replayed: mountid 21, transid 792, desc 2131, len 35, commit 2167, next trans offset 2150
Trans replayed: mountid 21, transid 793, desc 2168, len 34, commit 2203, next trans offset 2186
Trans replayed: mountid 21, transid 794, desc 2204, len 37, commit 2242, next trans offset 2225
Trans replayed: mountid 21, transid 795, desc 2243, len 35, commit 2279, next trans offset 2262
Trans replayed: mountid 21, transid 796, desc 2280, len 39, commit 2320, next trans offset 2303
Trans replayed: mountid 21, transid 797, desc 2321, len 42, commit 2364, next trans offset 2347
Trans replayed: mountid 21, transid 798, desc 2365, len 33, commit 2399, next trans offset 2382
Trans replayed: mountid 21, transid 799, desc 2400, len 31, commit 2432, next trans offset 2415
Trans replayed: mountid 21, transid 800, desc 2433, len 34, commit 2468, next trans offset 2451
Trans replayed: mountid 21, transid 801, desc 2469, len 413, commit 2883, next trans offset 2866
Trans replayed: mountid 21, transid 802, desc 2884, len 34, commit 2919, next trans offset 2902
Trans replayed: mountid 21, transid 803, desc 2920, len 21, commit 2942, next trans offset 2925
Trans replayed: mountid 21, transid 804, desc 2943, len 27, commit 2971, next trans offset 2954
Trans replayed: mountid 21, transid 805, desc 2972, len 443, commit 3416, next trans offset 3399
Trans replayed: mountid 21, transid 806, desc 3417, len 42, commit 3460, next trans offset 3443
Trans replayed: mountid 21, transid 807, desc 3461, len 33, commit 3495, next trans offset 3478
Trans replayed: mountid 21, transid 808, desc 3496, len 31, commit 3528, next trans offset 3511
Trans replayed: mountid 21, transid 809, desc 3529, len 25, commit 3555, next trans offset 3538
Trans replayed: mountid 21, transid 810, desc 3556, len 13, commit 3570, next trans offset 3553
Trans replayed: mountid 21, transid 811, desc 3571, len 11, commit 3583, next trans offset 3566
Trans replayed: mountid 21, transid 812, desc 3584, len 25, commit 3610, next trans offset 3593
Trans replayed: mountid 21, transid 813, desc 3611, len 13, commit 3625, next trans offset 3608
Trans replayed: mountid 21, transid 814, desc 3626, len 10, commit 3637, next trans offset 3620
Trans replayed: mountid 21, transid 815, desc 3638, len 51, commit 3690, next trans offset 3673
Trans replayed: mountid 21, transid 816, desc 3691, len 399, commit 4091, next trans offset 4074
Trans replayed: mountid 21, transid 817, desc 4092, len 234, commit 4327, next trans offset 4310
Trans replayed: mountid 21, transid 818, desc 4328, len 192, commit 4521, next trans offset 4504
Trans replayed: mountid 21, transid 819, desc 4522, len 28, commit 4551, next trans offset 4534
Trans replayed: mountid 21, transid 820, desc 4552, len 25, commit 4578, next trans offset 4561
Trans replayed: mountid 21, transid 821, desc 4579, len 53, commit 4633, next trans offset 4616
Trans replayed: mountid 21, transid 822, desc 4634, len 28, commit 4663, next trans offset 4646
Trans replayed: mountid 21, transid 823, desc 4664, len 27, commit 4692, next trans offset 4675
Trans replayed: mountid 21, transid 824, desc 4693, len 14, commit 4708, next trans offset 4691
Trans replayed: mountid 21, transid 825, desc 4709, len 29, commit 4739, next trans offset 4722
Trans replayed: mountid 21, transid 826, desc 4740, len 26, commit 4767, next trans offset 4750
Trans replayed: mountid 21, transid 827, desc 4768, len 24, commit 4793, next trans offset 4776
Trans replayed: mountid 21, transid 828, desc 4794, len 25, commit 4820, next trans offset 4803
Trans replayed: mountid 21, transid 829, desc 4821, len 18, commit 4840, next trans offset 4823
Trans replayed: mountid 21, transid 830, desc 4841, len 18, commit 4860, next trans offset 4843
Trans replayed: mountid 21, transid 831, desc 4861, len 26, commit 4888, next trans offset 4871
Trans replayed: mountid 21, transid 832, desc 4889, len 25, commit 4915, next trans offset 4898
Trans replayed: mountid 21, transid 833, desc 4916, len 28, commit 4945, next trans offset 4928
Trans replayed: mountid 21, transid 834, desc 4946, len 25, commit 4972, next trans offset 4955
Trans replayed: mountid 21, transid 835, desc 4973, len 26, commit 5000, next trans offset 4983
Trans replayed: mountid 21, transid 836, desc 5001, len 25, commit 5027, next trans offset 5010
Trans replayed: mountid 21, transid 837, desc 5028, len 28, commit 5057, next trans offset 5040
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 111 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 7565
        Internal nodes 49
        Directories 326
        Other files 7853
        Data block pointers 6624725 (3442 of them are zero)
        Safe links 0
###########
reiserfsck finished at Wed Dec 10 08:51:17 2014
###########

 

EDIT#2:  2nd reiserfsck pass:

 

root@unRAID:~# reiserfsck --check /dev/md4
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Dec 10 08:54:00 2014
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 7565
        Internal nodes 49
        Directories 326
        Other files 7853
        Data block pointers 6624725 (3442 of them are zero)
        Safe links 0
###########
reiserfsck finished at Wed Dec 10 08:56:46 2014
###########

 

John


In all of my screwing around, I lost all of the data on that drive.  I think the culprit was that I lost power during a rebuild.  When all was said and done, the rebuilt drive had no data on it.

 

Oh well...life lesson.  At least I know that all I lost was media (movies/tv) and know exactly which ones.  The good news...I have since put that drive back into service, ran a full SMART report, and rebuilt all of my VMs and docker containers (all on the cache drive).  I let NZBGet download overnight (about 25GB) and I did not see a single corruption issue.  All of that data was written to that problematic drive.

 

Obviously, I will watch it closely and at the first sign of trouble I'll pull the plug and come back here.  :)

 

All I ask is that LT provide a file system migration strategy...I am still very leery of RFS.

 

John

 

While the drive may have been in the rebuild process, if parity was up to date, then it was rebuilding with the data that was already present on the drive, which means recovery could have been possible.

We've seen the worst of conditions, where people assign a data drive to the parity slot and stop the parity sync, and they were still able to recover data with a full reiserfs scan.

 

You would have to do some research on this through the forums and the internet.

Commands that come to mind are:

reiserfsck --scan-whole-partition --rebuild-tree

reiserfsck --rebuild-sb /dev/sda1

 

However, don't run them blind.

Many times you will see me suggest that people use ddrescue to copy a corrupt drive to a working drive.

This way you can tinker and repair. Then once you know it's all good, remove the bad drive from service or reformat it.
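
For illustration only (the device names here are placeholders, and this assumes the failing disk is being worked on outside the array), the ddrescue-then-repair approach looks roughly like this:

# /dev/sdX = failing source disk, /dev/sdY = spare disk of equal or larger size (both placeholders)
ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.map
# run the read-only check against the copy's partition, not the original
reiserfsck --check /dev/sdY1
# only if --check tells you to, run --rebuild-tree, and only ever against the copy

That way the original stays untouched while you experiment on the copy.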

 

With so much going on with your system, I don't even know where to begin.


I had not yet run the memtest.  I am going to do that right now.

 

Stop writing to the machine until you know for 100% sure the memory is good.

Do what you can to copy your data off the suspect drive.

 

I really don't want to try and migrate to XFS until LT provides a clear path to do so.

Did you post a request for help on this (I have not seen it yet), or have you reached out to limetech directly?

 

 

 

 

 


I had not yet run the memtest.  I am going to do that right now.

 

Stop writing to the machine until you know for 100% sure the memory is good.

Do what you can to copy your data off the suspect drive.

 

I really don't want to try and migrate to XFS until LT provides a clear path to do so.

Did you post a request for help on this (I have not seen it yet), or have you reached out to limetech directly?

 

memtest is running.

 

I actually started to write the post and then got pulled away by my kids.  My fear is that I will be told that it will be part of the documentation and to wait for it.  :S


I really don't want to try and migrate to XFS until LT provides a clear path to do so.

There is not really much of a strategy about this!

 

The steps are:

  • Copy the data on the drive to another drive (either on the array or external to the array).
  • Stop the array, and click on the disk settings to change the format to the desired one (presumably XFS).
  • Start the array.  The disk will now show as unformatted and offer the format option.  Format the disk and it is now ready to have data put back onto it.
  • Copy back the data onto the disk.
  • Repeat for each disk in turn.
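
From the console, a bare-bones sketch of the copy-off/copy-back might look like this (paths are just examples, and disk2 stands in for whichever drive has the free space):

rsync -av /mnt/disk4/ /mnt/disk2/disk4_backup/
rsync -avc /mnt/disk4/ /mnt/disk2/disk4_backup/    # second pass with checksums to verify
# stop the array, switch disk4 to XFS, start the array and format it, then:
rsync -av /mnt/disk2/disk4_backup/ /mnt/disk4/
rsync -avc /mnt/disk2/disk4_backup/ /mnt/disk4/    # verify again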


I really don't want to try and migrate to XFS until LT provides a clear path to do so.

There is not really much of a strategy about this!

 

The steps are:

  • Copy the data on the drive to another drive (either on the array or external to the array).
  • Stop the array, and click on the disk settings to change the format to the desired one (presumably XFS).
  • Start the array.  The disk will now show as unformatted and offer the format option.  Format the disk and it is now ready to have data put back onto it.
  • Copy back the data onto the disk.
  • Repeat for each disk in turn.

 

Thank you for that!

 

I'll do this when the memtest completes (which I imagine will take quite a bit of time).

 

Let me ask you guys this...

 

I don't mind being a guinea pig.  Since I know that this issue will pop up fairly quickly and I won't have much data at risk, should I take this opportunity to prove/disprove that this is an RFS issue?  What I mean is, I will put the same drive back into service in the same slot, but this time as XFS, and continue with what I was doing (nzbget downloading, etc.).

 

Hell, I may even just sacrifice the 25GB of stuff that I DL'd since it was really only TV episodes and metadata and not even copy it off.

 

Also, RE: memtest...is one pass enough?

 

John


I was about to respond with an almost identical post, but itimpi beat me to it!  There really is no other method; there is no way I will EVER trust an in-place conversion.

 

The only other points I would add are that you should verify all copies, both to and from; and secondly it will all go much faster if parity is turned off (just an option to consider, undesirable for many).


I really don't want to try and migrate to XFS until LT provides a clear path to do so.

There is not really much of a strategy about this!

 

The steps are:

  • Copy the data on the drive to another drive (either on the array or external to the array).
  • Stop the array, and click on the disk settings to change the format to the desired one (presumably XFS).
  • Start the array.  The disk will now show as unformatted and offer the format option.  Format the disk and it is now ready to have data put back onto it.
  • Copy back the data onto the disk.
  • Repeat for each disk in turn.

 

Thank you for that!

 

I'll do this when the memtest completes (which I imagine will take quite a bit of time).

 

Let me ask you guys this...

 

I don't mind being a guinea pig.  Since I know that this issue will pop up fairly quickly and I won't have much data at risk, should I take this opportunity to prove/disprove that this is an RFS issue?  What I mean is, I will put the same drive back into service in the same slot, but this time as XFS, and continue with what I was doing (nzbget downloading, etc.).

 

Hell, I may even just sacrifice the 25GB of stuff that I DL'd since it was really only TV episodes and metadata and not even copy it off.

 

John

 

If it were me, I would copy the data; rsync to another disk is going to be faster than redownloading.

I would also convert the disk to XFS and beat it up to see what was going to happen.

 

However, if you want to go through the whole process as if it were live and re-download and work on the XFS drive, I would say yes, try to utilize the drive.  From what I saw of the SMART data, it was good.

 

You can post the SMART data again for review. I would also do a SMART long test before committing any data.

What puzzles me is that on reboot the drive red balled. That leads me to think there is still a hardware issue somewhere.

 

Keep in mind that a power issue, i.e. a marginal PSU, could cause intermittent issues.  Consider that if memory was starved of sufficient power for a split second, bits could flip, and as a metadata structure is written out to disk, the metadata would be incorrect.

 

You can attempt to test this using badblocks, or by running a scrub on the free space of the suspect disk, perhaps after converting to XFS.
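
For example (sdX is a placeholder, and note the -w test is destructive, so only run it on a drive that holds nothing you care about):

badblocks -wsv /dev/sdX     # destructive write/read/compare pass over the whole disk
badblocks -nsv /dev/sdX     # non-destructive read-write mode, slower but preserves existing data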


 

If it were me, I would copy the data; rsync to another disk is going to be faster than redownloading.

I would also convert the disk to XFS and beat it up to see what was going to happen.

 

 

Is there any concern with the data still being corrupted even though reiserfsck said it was not?  This was part of the reason I was thinking of starting with a clean drive.  Also, if I start from scratch I will be duplicating my process that has produced the issue.

 

RE:  PSU.  Not that it could not be an issue but at least I can say I don't have a junk one.  I have a SeaSonic X-850.

 

John

OK...here is what I see for disk4 (the one that experienced corruption 2 days in a row).  BTW...I don't spin down any of my drives.
In all of my screwing around, I lost all of the data on that drive.  I think the culprit was that I lost power during a rebuild. 

 

Inadequate power leads to all sorts of intermittent issues.  If the memtest is successful, then at least you know the chips are good.

 

After that I would suggest putting the machine on a UPS and/or spinning down unused drives.

 

If it were my machine I would set unRAID to be in maintenance mode and spin all the drives down and up a few times to see if issues crop up.


 

If it were me, I would copy the data; rsync to another disk is going to be faster than redownloading.

I would also convert the disk to XFS and beat it up to see what was going to happen.

 

 

Is there any concern with the data still being corrupted even though reiserfsck said it was not?  This was part of the reason I was thinking of starting with a clean drive.  Also, if I start from scratch I will be duplicating my process that has produced the issue.

 

RE:  PSU.  Not that it could not be an issue but at least I can say I don't have a junk one.  I have a SeaSonic X-850.

 

John

 

 

Start from scratch then. Duplicate the exact same scenario as before, just on XFS; if you still have issues, look at hardware.


and secondly it will all go much faster if parity is turned off (just an option to consider, undesirable for many).

 

Do you do this by simply unassigning the parity drive?

 

 

Yes, but then you do not have parity any more and you will have to do a parity sync.

 

Guessing you are not a fan of this option weebo.


and secondly it will all go much faster if parity is turned off (just an option to consider, undesirable for many).

 

Do you do this by simply unassigning the parity drive?

 

 

Yes, but then you do not have parity any more and you will have to do a parity sync.

 

Guessing you are not a fan of this option weebo.

Do you keep current full backups? Do you care if you lose the data currently on the server? You are having issues already, and are contemplating moving a bunch of data around, which is risky as is, because it's easy to accidentally overwrite something you need, or forget to move something you want before erasing the source. Removing parity protection may speed things up, but at an added level of risk.

and secondly it will all go much faster if parity is turned off (just an option to consider, undesirable for many).

 

Do you do this by simply unassigning the parity drive?

 

Yes, but then you do not have parity any more and you will have to do a parity sync.

 

Guessing you are not a fan of this option weebo.

 

At this point in time, it doesn't matter all that much, especially if you've had a power failure. A parity check should be done anyway.

Reformatting from reiserfs to XFS and then letting your download application repopulate on the drive in question doesn't really need to go all that fast.

 

Even when I move terabytes of data, I don't sacrifice parity for the sake of copying it faster.  I generally get from 20-60MB/s when copying from disk to disk on the same machine.  If you are willing to give up parity for the sake of converting reiserfs to XFS in a drive-hopping fashion, it would go much easier, at the risk of not being able to rebuild a drive.

 

It all depends on how much data you have, what you are willing to lose if there's a problem, whether you have backups, and whether you have a method to check the integrity of the copies.

 

These days, before I migrate a drive, whether it's a drive-to-drive copy or a rebuild/expand operation, I always:

  • Do an md5sum of the whole drive.
  • Do whatever operation it is to move the data.
  • Check the copy is 100% accurate.

If I'm moving it back:

  • Do the rsync again.
  • Check the copy is 100% accurate.

 

Even when I do an rsync over the network, I md5sum the directory tree I plan to move first, just to be on the safe side.
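
As a rough example of that kind of checksum pass (paths are made up):

cd /mnt/disk4 && find . -type f -exec md5sum {} + > /boot/disk4.md5
# ...copy the data, then verify from the destination side:
cd /mnt/disk2/disk4_backup && md5sum -c /boot/disk4.md5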

 

I just learned of a new lazy way to use rsync as a poor man's checksum (just last night).

 

rsync -ac -v source destination

This does an rsync from source to destination and uses a checksum to make sure the files match instead of mtime/size.

It will take longer.

After that I do another just to make sure they match 100%.

 

rsync -ac -v source destination

 

There are usually no more copies.

If there are, you need to know why, as something changed out from under you.

 

If I need to clear out the source directory and there are no new copies, I do the rsync one more time with --remove-source-files.

 

rsync -ac -v --remove-source-files source destination

Again, this will take time, but it will remove files from source that match the destination.

 

I found this out by accident last night.

 

I had two of the same directories on two different drives and I wanted to merge them.

The outcome was that files from the second directory were rsynced correctly, and any prior files in directory 2 that matched directory 1 were removed. This saved me tons of md5sum/compare/scripted removes.

I would suggest you practice it on a few test directories before you use it to copy whole drives.
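
Something along these lines, with throwaway directories, is enough to see the behaviour (paths are made up):

mkdir -p /mnt/cache/rsync_test/src /mnt/cache/rsync_test/dst
echo one > /mnt/cache/rsync_test/src/a.txt
echo two > /mnt/cache/rsync_test/src/b.txt
rsync -ac -v /mnt/cache/rsync_test/src/ /mnt/cache/rsync_test/dst/
rsync -ac -v --remove-source-files /mnt/cache/rsync_test/src/ /mnt/cache/rsync_test/dst/
find /mnt/cache/rsync_test/src -type f     # should come back empty once the move completed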


 

rsync -ac -v --remove-source-files source destination

Again, this will take time, but it will remove files from source that match the destination.

 

 

In my case, I guess I don't see the advantage of the --remove-source-files switch if the plan is to format the source drive after the copy is completed.  If anything, I think I would avoid this switch just in case something goes horribly wrong during what is essentially a "move" operation (like the destination drive crapping the bed).  That way the source drive is left intact.

 

Aside from the checksum switches above, is my thought process correct?

 

John


As long as you do the two rsyncs with the -c parameter, you do not need --remove-source-files.

 

-c, --checksum              skip based on checksum, not mod-time & size

 

The first rsync compares and copies.

The second rsync compares/verifies and copies changed files.

The third rsync compares/verifies, copies changed files, and removes the source.

 

You don't need to do the third one if you are going to erase the whole drive with a format.

 

Usually after the third one I will do a find down the tree looking for leftover files.

Then I usually clean up the directories with:

 

find source -depth -type d -empty -ls

(look at all the directories)

find source -depth -type d -empty -delete

(which deletes all the empty directories, including the source)


I believe there is a serious bug in unRAID: if a write is performed to a disk in the protected array and that write fails, causing the disk to red ball, the write is not properly performed on the "simulated" disk. As a result, the data being copied is corrupted. I've seen this symptom twice.

 

But it could be worse. If the failed write was a write to a sensitive part of the housekeeping area of the drive, and not a data sector, then the corrupted write could cause significantly more damage to the simulated disk - enough to corrupt the drive and cause it to be mounted read-only. This would occur no matter which FS were on the drive.

 

Could this have happened?


I believe there is a serious bug in unRAID: if a write is performed to a disk in the protected array and that write fails, causing the disk to red ball, the write is not properly performed on the "simulated" disk. As a result, the data being copied is corrupted. I've seen this symptom twice.

 

But it could be worse. If the failed write was a write to a sensitive part of the housekeeping area of the drive, and not a data sector, then the corrupted write could cause significantly more damage to the simulated disk - enough to corrupt the drive and cause it to be mounted read-only. This would occur no matter which FS were on the drive.

 

Could this have happened?

 

I don't think that specific scenario applies here, as there were no high level write errors reported or red balls, but the general case may be applicable to some degree.  I did think about that when I saw the Reiser corruption discovered, and wondered what would happen to the attempted writes, whether there might be a possible data corruption issue here..., but then I dropped it and forgot about it!

 

It's different here in that the corruption occurs first, resulting in read-only status, which only then results in failed writes.  These failed writes appear to me to only be visible at this lower level though, as no write errors are reported from higher levels, so it is possible that there is an incomplete write AND it is possible that a higher level function will assume the write succeeded and modify parity accordingly, creating parity error(s).

 

So in your scenario a write failure potentially causes corruption, and in the current scenario a corruption causes a write failure which then potentially causes additional corruption.  But then again, there is no direct evidence that new corruption was created here, so there may not be any, and I could be off base.

 

[Warning, long winded explanation] Some may wonder what I mean by 'high level write errors'.  Data reads and writes and file management are initiated by high level programs, and the calls are translated by mid-level functions into changes in fixed size buffers, which are then translated by low level routines into reads and writes of sectors.  If we use the example of a bad sector (typically called a 'media error'), you can see in the syslog at the lowest level the exception handler kick in to deal with it.  It will call for retries and resets, until the read or write is successful, or it gives up.  It's often successful, so the higher level functions and highest level program receive a success signal, and are none the wiser.  In fact, there is such a disconnect that they have no idea that any issue occurred, and the only way they could detect it is if they timed the operation and determined it took longer than normal.  If however the low level routine gives up on the operation, then a failure code is returned, and typically you would see the high level program announce a read or write error.  UnRAID usually reports the error in the syslog, with the name of the drive that is involved, and that is what I call a 'high level' error.  The syslog may contain both the low level errors and the resulting high level errors.


Well, I do have an update.  Of course with what I am about to say I am going to completely jinx myself like I did this morning.

 

I cancelled my memtest and plan on running it overnight.  Since that time, I have run reiserfsck --check on the suspect DISK4 twice (shown above).  Since I had only ~27GB on that drive, I rsync'd it to my cache pool.  I then stopped the array, formatted DISK4 as XFS, and rsync'd the ~27GB back to it (parity disk still online).  I did not receive a single error in the syslog.

 

I fired up NZBGet and started downloading.  Everything downloaded is going to DISK4 (now XFS) since it has the most space by far.  4 hours later and another ~27GB, not a single error.

 

Here are the last log entries from 4 hours ago when I mounted the SNAP disk:

 

Dec 10 11:52:33 unRAID kernel: sdl:
Dec 10 11:52:57 unRAID kernel: sdl:
Dec 10 11:55:17 unRAID kernel: sdl: sdl1
Dec 10 11:56:25 unRAID kernel: XFS (sdl1): Mounting V4 Filesystem
Dec 10 11:56:25 unRAID kernel: XFS (sdl1): Ending clean mount

 

Not a blip on the radar since then.

 

In parallel, I formatted the other suspect drive (that also gave me errors previously) as XFS and mounted it using SNAP.  I have now rsync'd ~1TB of the 1.8TB of data that is on DISK1 (RFS) to it.  No errors.  Once it is done, I will format DISK1 as XFS and send everything on the SNAP disk back to it.

 

Again, I probably just pissed on the third rail, but I'll let things continue and see what happens.  But as of right now, NZBGet and NZBDrone are happily downloading/moving/renaming files.

 

John
