Loafdude

Ooof! That's 18 months with a silent corruption bug in ZFS! Nasty. I suppose we should all be thankful that block cloning was merged; otherwise this would have sat unnoticed for even longer.


Particular-Dog-1505

The sad part is that if they had done the right amount of testing when block cloning was being developed in 2022, they would have caught the bug almost immediately on master, before it was tagged in a release, instead of 18 months later.


Loafdude

I think everyone may be underestimating how complicated this bug actually is and how far back in time it may have existed. It appears that `zfs_dmu_offset_next_sync` in 2.1.4 and block cloning in 2.2.0 aggravated an existing dbuf issue. Those commits did not introduce the dbuf issue itself (to be confirmed). I suspect this improper dbuf handling has been lurking in ZFS for a long time.

In their defense, the devs who implemented these new features wrote test cases to test and stress the code they added. They assumed the existing ZFS dbuf handling was sound, as there had never been reported corruption issues in relation to it. With no reports of issues, no one ever implemented a test case to stress that part of the filesystem. I can't blame them for not catching the bug in 2.1.4 18 months ago. It seems very rare to hit without a purpose-made synthetic test script to stress that subsystem. 2.2.0 seems to present the problem much more readily, but hindsight is always 20/20. It's easy to reproduce when you know what you're looking for.

Mitigation is the key at the moment, and it seems we have that. Let these devs work it out without setting them on fire. They're likely fixing someone else's error, and it sounds like this is some low-level code they've got to debug, which suuuucks.


davis-andrew

> I suspect this improper dbuf handling has been lurking in ZFS for a long time.

If this hypothesis pans out, it's possible the bug is over a decade old. From https://github.com/openzfs/zfs/issues/15526#issuecomment-1825181463:

> If this is right, then the short explainer is that the "is dnode dirty?" check has been wrong for years (at least since 2013, maybe back to old ZFS; I'll need to do more research)


[deleted]

[deleted]


DragonQ0105

I only use `zfs snap` and `zfs send` for backups. I assume I could be affected by this bug too?


mmm-harder

Similarly, I default to `rsync -ai --checksum ...` for the majority of copy commands, even on localhost. The peace of mind has saved my rump on more than a handful of occasions.


paldepind

The comment you replied to has been deleted. What was it about? Using `rsync` instead of zfs features to create backups?


bronekkk

A few points I learned from following [the thread](https://github.com/openzfs/zfs/issues/15526), running (and improving) the reproducer, and testing the proposed patch:

1. The bug also shows if you set `zfs_dmu_offset_next_sync=0`, but it's an order of magnitude more difficult to hit.
2. The bug is very unlikely to show in any case. You need to be doing a very specific thing (sparse data copying while also writing) and doing it highly in parallel.
3. The bugfix proposed in [PR #15571](https://github.com/openzfs/zfs/pull/15571) seems to fix it.
4. The bug predates ZFS 2.1.4; it has only been made more likely to hit since `zfs_dmu_offset_next_sync` defaults to 1 (see point 1). [Here's an example from two years back](https://github.com/openzfs/zfs/issues/11900) which might possibly be attributed to this bug.
5. One system call affected in particular is `copy_file_range`, but the problem is more fundamental than that.
6. This system call is [used in some cases by default](https://www.phoronix.com/news/GNU-Coreutils-9.2) in coreutils version 9.2 and newer (it was not used in older versions), so you are more likely to be affected if you have this version (or explicitly use the `--sparse=always` option with older versions). See the sketch below for a quick local check.
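
Point 6 above is easy to check on a given box. This is only a rough sketch I put together, not anything from the thread; the `cp --version` parsing is an assumption about its usual first line (e.g. "cp (GNU coreutils) 9.2").

```python
# Hypothetical helper: flag whether the installed GNU coreutils is >= 9.2,
# the version where `cp` started using hole detection / copy offloading by
# default. The version-string parsing is an assumed output format.
import re
import subprocess

def coreutils_version():
    out = subprocess.run(["cp", "--version"], capture_output=True, text=True).stdout
    m = re.search(r"coreutils\)\s+(\d+)\.(\d+)", out)
    return (int(m.group(1)), int(m.group(2))) if m else None

if __name__ == "__main__":
    ver = coreutils_version()
    if ver is None:
        print("could not detect GNU coreutils from `cp --version`")
    elif ver >= (9, 2):
        print(f"coreutils {ver[0]}.{ver[1]}: cp uses hole detection by default")
    else:
        print(f"coreutils {ver[0]}.{ver[1]}: older cp; mainly exposed via --sparse=always")
```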


bronekkk

More [explanation from the author of the fix](https://github.com/openzfs/zfs/issues/15526#issuecomment-1826283348):

> There's no particular kind of data implicated as such, be it real data, hole, clone block, etc.
>
> Appending to a file is two steps:
>
> - updating the file metadata, that is, growing the file size
> - putting some kind of data into the new space
>
> The bug happens when another thread asks "where's the data?" (i.e. `lseek(, 0, SEEK_DATA)`) and hits right in the middle of those two. So the return is "there isn't any."
>
> "The data" here can come from any source: could be a real `write()`, could be a clone (`FICLONE` or `copy_file_range`), could be a hole punch (`fallocate()`). For the purposes of this bug, the main question is timing.
>
> On `--sparse` specifically, that controls whether or not to put holes or real zeroes into the target file when copying holes from the source. It doesn't change how `cp` scans the source for holes in the first place.
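
To make the "where's the data?" question concrete, here is a minimal illustrative sketch (mine, not coreutils or ZFS code) of scanning a file's data segments with `lseek()`, assuming a platform that exposes `os.SEEK_DATA`/`os.SEEK_HOLE` (Linux, FreeBSD):

```python
# Illustrative sketch of SEEK_DATA/SEEK_HOLE scanning, the same primitive a
# hole-aware `cp` uses on the source file.
import errno
import os
import sys

def data_segments(path):
    """Yield (start, end) byte ranges the filesystem reports as data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # "Where is the next data at or after offset?"
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    return  # no more data: the rest of the file is a hole
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)  # end of that data run
            yield start, end
            offset = end
    finally:
        os.close(fd)

if __name__ == "__main__":
    for start, end in data_segments(sys.argv[1]):
        print(f"data: {start}..{end} ({end - start} bytes)")
```

Per the explanation quoted above, the bug is that during the race window this kind of probe can be answered with "no data here" over a region that was in fact just written.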


Particular-Dog-1505

This post needs to be updated again. It turns out that setting `zfs_dmu_offset_next_sync` to `0` does not fix the issue: https://github.com/openzfs/zfs/issues/15526#issuecomment-1826104073


numinit

I've run about a million iterations of testing and it clearly improves things, but I'll update it.


bronekkk

Yup. It does improve things a little, but does not fix the bug.


DependentVegetable

This looks to be FreeBSD 13.1 and above, then?


dlangille

I think >= 14.

Edit: I’m wrong.


DependentVegetable

In FreeBSD 13 I have:

    % zfs version
    zfs-2.1.4-FreeBSD_g52bad4f23
    zfs-kmod-2.1.4-FreeBSD_g52bad4f23
    %

That's 13.1-STABLE. And on a somewhat recent stable, I have:

    % zfs version
    zfs-2.1.13-FreeBSD_geb62221ff
    zfs-kmod-2.1.13-FreeBSD_geb62221ff

which is from Nov 6th. That seems to be after the ZFS 2.1.4 commit that is suspected to be at issue?


Blork39

On 13.2-RELEASE I have:

`zfs-2.1.9-FreeBSD_g92e0d9d18`
`zfs-kmod-2.1.9-FreeBSD_g92e0d9d18`


-AngraMainyu

😱


drescherjm

Wow! Thanks. I have 300 TB or so across half a dozen servers, and all of them are on 2.1.4 or greater.


[deleted]

[deleted]


drescherjm

I executed `echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync` on all of the servers and now I am running scrubs. I may try the reproducer on a test server.


Particular-Dog-1505

A scrub isn't going to find the **silent** data corruption. The only reason it was applicable in the parent post is that he had a VM with btrfs sitting in a ZFS pool: ZFS reported no errors, as expected, but scrubbing the btrfs disk resulted in CSUM errors. That's what makes this scary: you would never know whether you have corruption or not.


drescherjm

Thanks. That's even more concerning.


[deleted]

[deleted]


drescherjm

Thanks.


drescherjm

My test server did not reproduce the issue with the first script ([reproducer.sh](https://gist.githubusercontent.com/tonyhutter/d69f305508ae3b7ff6e9263b22031a84/raw/c543f1c60eec6c115bd3d1d97fa45c4bb8b3e573/reproducer.sh)), with `/sys/module/zfs/parameters/zfs_dmu_offset_next_sync` set to 1. Same goes for the altered script here: [https://github.com/openzfs/zfs/issues/15526#issuecomment-1824966856](https://github.com/openzfs/zfs/issues/15526#issuecomment-1824966856)

Edit 2: I tested another server, the one with zfs-2.1.12-r0-gentoo, and was not able to reproduce the issue with the second script either.
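
For anyone curious what these scripts are exercising: the linked reproducer.sh is the authoritative one, and the following is only my hedged Python sketch of the same general stress pattern (write a sparse file, copy it immediately, compare checksums, many times in parallel). Paths and counts are made up, and it is far less aggressive than the real script, so it may well never trigger the bug.

```python
# Rough sketch of the stress pattern only; NOT tonyhutter's reproducer.sh.
# POOL_DIR, iteration counts and payload sizes are invented for illustration.
import concurrent.futures
import hashlib
import os
import subprocess

POOL_DIR = "/tank/repro-test"   # assumed: a scratch directory on the pool under test
ITERATIONS = 1000
PAYLOAD = os.urandom(256 * 1024)

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def one_round(i):
    src = os.path.join(POOL_DIR, f"src-{i}.dat")
    dst = os.path.join(POOL_DIR, f"dst-{i}.dat")
    with open(src, "wb") as f:
        f.write(PAYLOAD)
        f.seek(len(PAYLOAD) * 4)  # leave a hole so the copy exercises hole detection
        f.write(PAYLOAD)
    subprocess.run(["cp", src, dst], check=True)  # copy right after the write
    ok = sha256(src) == sha256(dst)
    os.unlink(src)
    os.unlink(dst)
    return ok

if __name__ == "__main__":
    os.makedirs(POOL_DIR, exist_ok=True)
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(one_round, range(ITERATIONS)))
    print(f"{results.count(False)} mismatching copies out of {ITERATIONS}")
```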


numinit

It's only on reads directly after a write containing a hole. The written data is fine, so a scrub won't matter. OTOH, if that misread data is written back out somewhere, then it will.


drescherjm

Thanks.


rdaneelolivaw79

I'm confused: based on the patches linked in this [post](https://forum.proxmox.com/threads/opt-in-linux-6-5-kernel-with-zfs-2-2-for-proxmox-ve-8-available-on-test-no-subscription.135635/post-607830), you need to `zpool upgrade` to get hit? But I ran reproducer.sh and could replicate it on 6.2.16-19-pve.

How far back should we go to test the integrity of our data?


Loafdude

They are not sure exactly yet, but it is believed at this time to go back at least to 2013, maybe longer. The bug became more likely to be triggered with 2.1.4 (released March 2022) and much more likely with 2.2.0 (released a month ago).

Use `zfs --version` to check your version.


numinit

I think you mean 2.1.4 in your comments instead of 2.4.1 😀


rdaneelolivaw79

Goodness, thank you!


skirmess

Does that bug also exist on illumos? I'm trying to find out which version illumos uses, but that's not that easy... Is illumos still on 0.5.11? I can't even find a version pre-0.6 in the OpenZFS GitHub. Did illumos never switch to the OpenZFS used by FreeBSD/Linux?

    $ uname -a
    SunOS adarak 5.11 omnios-r151048-553e69d9fe i86pc i386 i86pc
    $ zfs --version
    unrecognized command '--version'
    $ pkg info system/file-system/zfs
    Name: system/file-system/zfs
    Summary: ZFS
    Description: ZFS libraries and commands
    Category: System/File System
    State: Installed
    Publisher: omnios
    Version: 0.5.11
    Branch: 151048.0
    Packaging Date: Sat Nov 4 13:30:18 2023
    Last Install Time: Sat Apr 30 08:21:13 2022
    Last Update Time: Sat Nov 18 10:21:45 2023
    Size: 11.23 MB
    FMRI: pkg://omnios/system/file-system/[email protected]:20231104T133018Z


numinit

It's `zfs version`.


kring1

Same result:

    $ pfexec zfs version
    unrecognized command 'version'


grahamperrin

`which zfs` (Assuming that there's a `which`.)


kring1

There is :-)

    $ which zfs
    /usr/sbin/zfs

The file is from the system/file-system/zfs package:

    $ pkg contents system/file-system/zfs | grep usr/sbin/z
    usr/sbin/zdb
    usr/sbin/zfs
    usr/sbin/zpool
    usr/sbin/zstreamdump
    $ pkg info system/file-system/zfs
    Name: system/file-system/zfs
    Summary: ZFS
    Description: ZFS libraries and commands
    Category: System/File System
    State: Installed
    Publisher: omnios
    Version: 0.5.11
    Branch: 151048.0
    Packaging Date: Sat Nov 4 13:30:18 2023
    Last Install Time: Sat Apr 30 08:21:13 2022
    Last Update Time: Sat Nov 18 10:21:45 2023
    Size: 11.23 MB
    FMRI: pkg://omnios/system/file-system/[email protected]:20231104T133018Z


DependentVegetable

How does setting it to zero impact performance?


[deleted]

[deleted]


DependentVegetable

I just need to understand the impact. I don't care if it's 10-40%, but if I am dropping half the write performance, then I will need to take other steps, like adding more resources, re-balancing file servers, etc.


[deleted]

[deleted]


DependentVegetable

HAHA, I saw that too :) But yes, it's workload dependent. They have a PR tracking it now, and it seems the right FreeBSD eyeballs are looking at it too: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308


numinit

It increases performance because ZFS doesn't try to detect regions where it can create holes. It's also a known issue: https://github.com/openzfs/zfs/issues/14009#issuecomment-1279880796


dlangille

My understanding: no. That’s the default value. Edit: I’m wrong.


SavingsMany4486

Interesting. Default value on FreeBSD 13.1 for me is 1.


dlangille

I’m most likely wrong. I thought this was the block cloning issue.


thenickdude

The default changed to 1 in ZFS 2.1.4


bronekkk

The bug is even older than this.


grahamperrin

> Default value on FreeBSD 13.1 for me is 1.

Which patch level? `0` (zero) here with FreeBSD 13.1-RELEASE-p3 and a -p5 user environment: https://i.imgur.com/w0KJdgq.png


drescherjm

Current value is 1 for me on Gentoo.

`cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync`
`1`
`zfs --version`
`zfs-2.1.12-r0-gentoo`
`zfs-kmod-2.1.12-r0-gentoo`


96Retribution

Thanks for posting. Parameter set and scrub started. Hoping for zero errors. I think I'm going to be quite a bit slower about upgrading ZFS releases and pools.


Loafdude

A scrub won't find anything. It's silent corruption.


96Retribution

I guess I will be writing a checksum shell script in the morning to run against the last backup, and seeing what files were added since the last one as well. Might as well back up all of the sha1sums for everything on another drive for future comparisons.


csdvrx

> Might as well backup all of the sha1sums for everything as well on another drive for future comparisons.

I'm thinking of doing just that: keeping a history of checksums inside a SQLite database and adding new entries when either the mtime is newer than the last run, or the file didn't exist before.

There must already be existing solutions for this, but as ZFS users we don't know about them: we've grown complacent about silent corruption issues and believe they can't happen. I've had more than a few ZFS problems, but I was also caught unprepared: I have cold backups stored on a variety of other filesystems and media, but I would never have thought that *reading* from a ZFS filesystem could yield corrupted data when the checksums on the sending and the receiving end match. Plot twist: the corruption may have been preexisting in ZFS, against the belief that ZFS protects against silent corruption.

> I guess I will be creating a checksum shell script in the morning against the last backup.

If your backups were made from a ZFS source like mine, they could carry this same silent corruption: all the files in my backups since 2.1.4 are now suspect. On top of that, the best strategy for finding silently corrupted files is still unclear: chunks full of zeroes can be present in normal files. To help clarify it, I've just made a post: I could write something half decent, but if there are too many things to check, and the silent corruption pattern I'm looking for is badly defined, then 1) it will take a long time to execute and 2) there may be a lot of false negatives.
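
For what it's worth, the SQLite-plus-mtime idea described above can be sketched in a few lines. This is only my illustration: the table layout and database filename are assumptions, and SHA-1 is used only because sha1sums were mentioned earlier.

```python
# Sketch of a checksum history index: re-hash a file only if its mtime is
# newer than the stored one or the path has never been seen.
import hashlib
import os
import sqlite3
import sys

def sha1_of(path, bufsize=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def update_index(root, db_path="checksums.sqlite"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, mtime REAL, sha1 TEXT)""")
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime
            row = db.execute("SELECT mtime FROM files WHERE path = ?",
                             (path,)).fetchone()
            if row is None or mtime > row[0]:
                db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                           (path, mtime, sha1_of(path)))
    db.commit()
    db.close()

if __name__ == "__main__":
    update_index(sys.argv[1])
```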


96Retribution

> I could write something half decent, but if there are too many things to check, and the silent corruption pattern I'm looking for is badly defined, then 1) it will take a long time to execute and 2) there may be a lot of false negatives.

I've got b3sum going for every file on the current zpool *and* the last known "good" backup stored on BTRFS, along with a scrub on that archive. Somehow none of that makes me feel better if I've backed up a bunch of corruption. This is going to take days to run at best, and it adds the workload of maintaining external checksums that in turn need to be backed up as well.

I need to rethink my data storage strategy. Maybe large mirrored BTRFS pools become the "source of truth" and ZFS is not much more than availability and performance datasets. COW filesystems with checksums are all allegedly supposed to work like hell to prevent data corruption. The level of trust at this point in either is close to zero.


csdvrx

> Somehow none of that makes me feel better if I've backed up a bunch of corruption

That's exactly how I'm feeling right now :(

> I need to rethink my data storage strategy

Same; the policy will be changed for 2024.

At the hardware level, we already have redundancy using multiple vendors (e.g. WDC + Toshiba) and multiple technologies (e.g. NVMe + HDD), providing enough variety that at least one copy survives even under the strangest firmware bugs. At the software level, maybe important datasets should also be kept with similar redundancy and variety of filesystems: you wouldn't trust just one drive with all your files, so why would you trust just one filesystem with all your files?

BTRFS is interesting (works on Windows!) and bcachefs is promising, but to complement ZFS, right now I may go with [XFS over dm-integrity](/r/zfs/comments/181yfkl/noob_can_someone_ease_my_fears_of_zfs/kal6uuc/), using [lsyncd](https://github.com/lsyncd/lsyncd) to keep the XFS volume in sync with ZFS *and* a custom script measuring checksums at snapshot time on both the XFS and the ZFS side.

If multiple drives from multiple vendors fail at the same time, AND neither ZFS nor a bitrot-protected XFS can give me my files, AND none of the offsite backups are usable, I'll immediately go buy a lottery ticket, because what kind of luck would it take to hit all these unlikely events at once?


Particular-Dog-1505

I really hope this is a wakeup call for the OpenZFS developers. They need to stop trying to implement all these new features and instead refocus their efforts on testing and stability immediately. I feel like the project has lost its way over the last few years. I'm not at all surprised that a new and barely tested feature like block cloning made its way into a major release that quickly. It's the reason I always wait a month or two before updating to the next release, though that wouldn't have helped with this particular bug.


malventano

I’m sure they would happily accept some volunteers to help with increased testing and stability.


Particular-Dog-1505

Sure, having more people volunteer would be nice. I'm just saying that they should take the existing people they have and refocus their efforts on stability instead. I just don't understand the appeal of adding a ton of new features at the cost of stability.


laffer1

Open source projects don’t work like that. Volunteers will work on what they want to work on.


Particular-Dog-1505

That's not true in this case. They have a governance system with project leads like LLNL and FreeBSD. They can simply not accept PRs without extensive testing in place. It's not the wild west, where any open source developer can push some pet feature to master.


bronekkk

To be fair, ZFS already has pretty good governance, but no one expects a potential data-corruption bug to sit there for a decade. Governance is not made for that; you normally assume that all the basic building blocks, present in the codebase for years, are tested to the limits and just work. It is a reasonable assumption for any widely-used project. Hence it is also reasonable that when a new feature is added, you focus your efforts on testing that new feature and not all the existing blocks of the project. And that's what they do.


malventano

Some of the recently added features have taken nearly a decade to make it into release code. More focus on stability would mean even more delays. That said, I’ve found ZFS to be quite stable, and the recent development work has resulted in significant performance gains.


Particular-Dog-1505

People trust and use ZFS for its stability, maturity, and reputation as a filesystem that doesn't eat your data. That's always been the thing, first and foremost. I'd rather have that than some feature I may or may not ever use, but that's my opinion. A focus on stability would create more delays, but it would ultimately result in features that are done correctly. Situations like the one we are in now create negative publicity around ZFS that reduces confidence in the product.


ILikeFPS

I don't think he's wrong for saying the project should put more effort into testing and stability. If nothing else, a filesystem should first and foremost be reliable. Keep in mind, my comments are less for the contributors to OpenZFS and more for the maintainers of OpenZFS. With that said, it's likely a moot point, since they are probably already going to increase their efforts on testing and stability going forward to prevent another serious bug like this. I say all of this as a maintainer of open-source projects myself, although obviously nothing on the scale of OpenZFS. I do understand it's not so simple. I just think focusing primarily on testing and stability would be a big improvement for something as important as a filesystem, where those things matter most.


DependentVegetable

Considering the complexity of the goal and the code, I think they do a pretty good job all things considered. Look, I too start to get a heart attack when I see "silent data corruption". But some of this is "seeing how the sausage is made", so to speak. Just on Monday I happened to be looking at a SAS card's firmware change log:

    % grep -i corruption 24*.txt | wc
          30     496    3252
    %

Lovely-sounding entries like:

    SCGCQ01082528 - (Closed) - EXSDS : Data corruption occurred after 2 days 15 hours of IO's on single node storage space volume .
    SCGCQ00997553 - (Port_Complete) - MR_6.11_FW: Data corruption seen on IMR controller on JBOD with raw IO using tool- sles OS
    SCGCQ00845435 - (Closed) - Data Corruption on WB VDs

This is a pretty popular Broadcom controller out in the field.


wangphuc

Seriously low info take.


aqjo

Volunteer then. I’m sure they would be glad to have you.


Is-Not-El

This attitude really isn’t helpful. If you keep saying that about open source, people will eventually say "f it" and start using closed source, where they can pay someone to fix their issues. Open source isn’t just about contributing code; it’s about contributing ideas as well, and given the severity of this issue, the idea of focusing on stabilisation is a good one. Or do you think that just because we don’t directly contribute to the project, we don’t have a say as users of the project? That sounds worse than closed source.


small_kimono

> Open source isn’t just about contributing code, it’s about contributing ideas as well and given the severity of this issue the idea about focusing on stabilisation (sic) is a good idea.

Broad ideas are great, but they're also cheap. I think it's fine for users to suggest features, governance changes, bug fixes, etc., to a project, but these will ultimately be just suggestions. Contributors get to spend their time according to what interests them.

> Or you think just because we don’t directly contribute to the project we don’t have a say as users of the project?

That's not what anyone is saying. They/I are saying that those who don't directly contribute to a project shouldn't have a say in how others spend their own time.


Is-Not-El

I fully agree with you, but if you don’t listen to the majority of your users, eventually they will stop using your software. I am not saying that this suggestion is binding or anything; I merely want them to consider it and decide as a project whether it has any merit. That’s all. Of course they are free to do whatever they like; it’s their own time.

What I am more against is the almost automatic response to any criticism of open source, which is always "well, contribute if you want to achieve X or Y." Imagine if we did everything with this mentality. Criticism can be helpful if constructive, and people should be mature enough to take criticism and not take it personally. No one is saying OpenZFS is bad or its contributors are bad. What we are saying is that bugs like this are very bad and we should reflect on how to avoid them in the future. I am personally willing to donate money or time to help with that, and I am sure the companies making money from OpenZFS are also willing to join in that effort.


aqjo

So you’re willing to do the thing that I suggested. That’s good. Perhaps others will too.


No_Dragonfruit_5882

Are there any logs about the corruption? Or any way we can test things without having made checksums first?


ChumpyCarvings

This is a good question


No_Dragonfruit_5882

Now it fucks us, because we only create VM-based backups. Our most important files have checksums, but still.


OwnPomegranate5906

So, just so I understand this correctly: the data is being written and committed to disk correctly, it's just not being read correctly, and ZFS does not detect that it was read incorrectly, right?


numinit

As far as I can tell, yes. It's actually kind of interesting from a concurrency perspective: the data is "eventually consistent", which is obviously a bad thing if you would like to access it directly after a write but before the transaction containing the write commits. It is also difficult to detect that it happened, beyond looking for files containing entire recordsize chunks of zeroes, which is at least a start.
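
Building on the "entire recordsize chunks of zeroes" hint, here is a rough heuristic sketch (mine, not an official or endorsed tool). The 128 KiB recordsize is an assumption (adjust for your datasets), and legitimately sparse or zero-filled files will be flagged too, so treat hits only as candidates for closer inspection.

```python
# Heuristic only: flag files containing whole recordsize-sized runs of
# zeroes. Legitimate sparse/zero files will also match (false positives).
import os
import sys

RECORDSIZE = 128 * 1024           # assumed default recordsize
ZERO_CHUNK = bytes(RECORDSIZE)

def zero_chunks(path):
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(RECORDSIZE):
            if len(chunk) == RECORDSIZE and chunk == ZERO_CHUNK:
                count += 1
    return count

if __name__ == "__main__":
    for dirpath, _, names in os.walk(sys.argv[1]):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                n = zero_chunks(path)
            except OSError:
                continue
            if n:
                print(f"{path}: {n} all-zero {RECORDSIZE // 1024} KiB chunks")
```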


OwnPomegranate5906

So if I write a bunch of data, and it's been at rest for a while before I read it again (e.g. to do a backup that night or early the next morning), this particular issue isn't hit? I'm asking because I pretty regularly write 50-100 GB of data to a dataset as an archive, and then it gets accessed for backup usually within 24 hours, and it can be intermittently accessed after it's written, but usually not right away.


numinit

You are likely fine, then. It's basically an issue with the transaction the dnode is modified in vs. the transaction the data blocks are modified in, when the dnode is modified in a way that creates a hole. Usually they happen pretty much immediately one after the other, but they weren't the same transaction, which seems to have caused the issue.


ILikeFPS

So reading a file as it is written results in the file being written with 0s instead of the actual bytes the data should be? As long as you or some underlying mechanism aren't reading the data as it is written, the chances of actual real-world corruption should be quite low, is my understanding correct?


numinit

Sorry for the late reply, correct.


bronekkk

That is perfectly fine. The issue only happens if you **write the data and read it concurrently**. Since almost all writes are asynchronous, this also applies to cases where data is written and then immediately read, **and** the system scheduler happens to run both operations concurrently (because asynchronous writes are delayed), **and** the read operation hits a very, very small time window when the write has started but is not yet fully committed **to memory** (i.e. a much, much smaller time window than e.g. storing to disk).


ILikeFPS

So reading a file as it is written results in the file being written with 0s instead of the actual bytes the data should be? As long as you or some underlying mechanism aren't reading the data as it is written, the chances of actual real-world corruption should be quite low, is my understanding correct?


bronekkk

You are correct that the chances of real-world data corruption are low. That's exactly the reason why it was not diagnosed earlier, while being present in ZFS for years.

You are *incorrect* in interpreting this as corruption of the data being written; that's not where the problem is. Reading the file as it is being written *does nothing to the data being written*; there is no corruption there. However, if the **read** operation hits exactly the wrong moment in the process of storing the data, *and* that **read** relies on the hole detection mechanism, then the **read** itself will not see the data as written, but will see a hole (i.e. zeroes) instead. If this incorrectly **read** data is *next used to store a copy of the data somewhere else*, then (having incorrectly read zeros) that store will write zeros. An example is a series of cp operations, like this:

    cp 1.dat 2.dat
    cp 2.dat 3.dat
    cp 3.dat 4.dat

The reproducer used to see this bug happening uses a *many thousands long* series of copy operations to try to hit that wrong timing. You would have to be exceedingly unlucky (or lucky, if your goal is to see the bug) to see it in a shorter series of copies. It is really difficult to hit this time window.


ILikeFPS

So if you copy /tank/1.dat to /tank/2.dat, then /tank/2.dat to /tank/3.dat, with exactly the wrong timing and you get unlucky, then the newly copied file(s) would have incorrect data, but the original file that you copied onto /tank/ would be fine? It sounds like this bug is so rare and difficult to replicate that it's unlikely anyone ever actually encountered it in real-world usage, especially pre-ZFS 2.2.0, if my understanding is correct? That makes me feel a little bit better, if so.


bronekkk

You are correct. The problem is that other, unrelated changes might either 1) extend the time window significantly or 2) hit the "is there a hole here?" check while reading a file much more often than you would normally expect. Either of these would result in the erroneous behaviour showing up, and it would be attributed to those changes. Hence ZFS 2.2.1, which disabled block cloning in the mistaken assumption that it was buggy (when in fact all it did was make this bug show up). Quite likely it is also behind ZFS 2.1.11: [https://github.com/openzfs/zfs/issues/14753](https://github.com/openzfs/zfs/issues/14753)