WARNING: this is very much a work in progress.

A bunch of filesystems are keeping their objects pinned in dcache.
It got started as a neat hack in ramfs, but since then it has spread
to many other places.  As it grew, the scope went very far past the
original, and at the moment it's not doing well.

Original approach was to have a controlled dentry leak - e.g. ->mkdir()
returns with refcount of dentry bumped by 1, which prevents its eviction.
Conversely, ->rmdir() decrements the refcount by 1.  Anything not removed
by umount time is taken out by a special kill_anon_super() analogue -
kill_litter_super().

--------------------------------------------------------------------------------

Originally it was intended for normal filesystem semantics and access
patterns.  That didn't last - in quite a few cases we have kernel-initiated
creation and removal of objects there.  Some try to simulate what the
normal syscalls are doing, some do not even bother and very few do that
well.

In particular, rmdir(2) and unlink(2) know how to deal with something
mounted on the victim in another namespace, but e.g. write(2) has no idea
that writing this string to that file might do an equivalent of rm -rf
on something entirely different.  Filesystems are on their own there.

Worse, rmdir(2) assumes that it should only succeed when directory is
empty; unfortunately, e.g. configfs userland ABI expects to have successful
rmdir(2) on _nonempty_ directories.  That leads to all kinds of fun,
because they allow mkdir() in the bowels of that subtree, and require
rmdir() on the original to fail if any mkdir() _under_ it had been done.
There's no way for syscalls to be aware of that shite; it has to be handled
by the filesystem itself, and the way it's done is really not possible to
describe in printable terms.

For even more fun, creating a subtree for their mkdir(2) can fail halfway
through.  At which point we need to take out everything that had been
added... and prevent e.g. an open() that wandered into that subtree while
it was being built.

configfs is probably the worst case, but there's a lot of PITA in other
users.

--------------------------------------------------------------------------------

A saner infrastructure would be useful.

One problem is that there's no indication of a specific increment or
decrement being related to the controlled leak in question.  For ramfs
proper it wasn't a problem - we have one dget() in ->mkdir() et al. and
one dput() in ->rmdir() et al.  For kernel-initiated operations it's
harder to keep track of.

Proposed approach to that part of the mess: have those "controllably
leaked" dentries marked as such.

  * New flag: DCACHE_PERSISTENT
  * d_make_persistent(): dget() and set flag
  * d_make_discardable(): dput() and clear; eventually - scream if
    called for dentry without that flag
  * simple_rmdir() and simple_unlink() call d_make_discardable() instead
    of dput() (that's what "eventually" above is about)
  * collecting the victims for kill_litter_super() - skip ones that have
    the flag
  * shrink_dcache_for_umount() - if flag is set, clear and decrement
    refcount.  Note that kill_litter_super() proceeds to call
    kill_anon_super(), which will call shrink_dcache_for_umount()
  * d_alloc_persistent(): what it says, allocate and mark persistent.
    Typical use is for kernel-initiated creation - when we know that
    there can be no object with such name.  Disposing of those on failure
    exits should be done with d_make_discardable() instead of dput().
  * start_creating_persistent(): the analogue of the above when we do
    *NOT* know if the name is unique or acceptable.  Parent must be
    locked exclusive for that.

That allows dealing with filesystems one by one.  Once a filesystem does
make sure to maintain the persistency flags, it can switch to using
kill_anon_super().

Another part of the mess: open-coded attempts to remove object(s).

  * simple_recursive_removal() is there, but it's underused.
  * new variant: locked_recursive_removal(), for the case when parent
    is already locked.

Both of those take care of d_invalidate(), etc., as well as encapsulate
walking the directory tree.  A bunch of (badly) open-coded instances out
there...

The series I've got (#untested.persistency) is at 58 commits at the moment
and it's very much a work in progress.  Current diffstat:
 54 files changed, 621 insertions(+), 1136 deletions(-)

Filesystems that remain to be converted (all with interesting problems):
	drivers/usb/gadget/function/f_fs.c
	drivers/usb/gadget/legacy/inode.c
	fs/configfs/mount.c
	security/apparmor/apparmorfs.c

--------------------------------------------------------------------------------

What to do about building subtrees and atomicity issues?

I have something that I hope would be a usable approach; hoped to
discuss it (applied to configfs) with hch here, but...

What it boils down to is that ->mkdir() is allowed to splice another
dentry in place of the one it had been given and return success with the
original dentry unhashed and left negative.  We need that for e.g.
nfs_mkdir(), and callers of vfs_mkdir() are already dealing with such a
possibility.  And that allows us to do the following:

  * build the subtree unattached to anything else in dcache.
  * if that succeeds,
	d_splice_alias(root_of_new_tree->d_inode, argument_of_mkdir);
	return 0
  * if building the subtree has failed, we can dissolve it safely -
    nobody could have reached it via dcache lookups.

That kills the half-arsed attempts to block opens, etc. in a half-built
tree.  Another thing it allows us to kill is the Lovecraftian games
configfs_rmdir() has to pull with its "is this non-empty subtree empty
enough for us and do we have an ongoing attempt of mkdir() in the bowels
of that subtree we need to wait for in case it fails and unrolls" logic.

hch is not here, unfortunately, so configfs parts are going to wait -
I've some questions there, and it'll have to be done over email.

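
To make the "build out of tree, then splice it in" idea above concrete,
here is a minimal userspace sketch.  It is NOT kernel code and does not use
dcache at all: struct node, build_detached(), toy_mkdir() and lookup() are
hypothetical names invented for this illustration, and the single
node_add() call merely stands in for the d_splice_alias() step.  The only
point is that a failure halfway through the build never leaves anything
visible to lookups.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy userspace model of a directory tree; NOT kernel code. */
struct node {
	const char *name;
	struct node *parent;	/* NULL while the subtree is detached */
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
};

static struct node *node_new(const char *name)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n)
		n->name = name;
	return n;
}

/* Link n under parent; this is what makes n reachable by lookup(). */
static void node_add(struct node *parent, struct node *n)
{
	n->parent = parent;
	n->sibling = parent->child;
	parent->child = n;
}

/* Free n and everything below it. */
static void tree_free(struct node *n)
{
	while (n->child) {
		struct node *c = n->child;

		n->child = c->sibling;
		tree_free(c);
	}
	free(n);
}

static struct node *lookup(struct node *dir, const char *name)
{
	for (struct node *c = dir->child; c; c = c->sibling)
		if (!strcmp(c->name, name))
			return c;
	return NULL;
}

/*
 * Build a root with nr children, attached to nothing.  fail_at (an
 * assumption of this sketch) simulates an allocation failure at that
 * child index; pass -1 for no simulated failure.
 */
static struct node *build_detached(const char *name, int nr, int fail_at)
{
	struct node *root = node_new(name);

	if (!root)
		return NULL;
	for (int i = 0; i < nr; i++) {
		struct node *c = (i == fail_at) ? NULL : node_new("attr");

		if (!c) {
			/* Nobody could have seen it; dissolve safely. */
			tree_free(root);
			return NULL;
		}
		node_add(root, c);
	}
	return root;
}

/* Returns 0 on success, -1 on failure with parent left untouched. */
static int toy_mkdir(struct node *parent, const char *name, int nr, int fail_at)
{
	struct node *root = build_detached(name, nr, fail_at);

	if (!root)
		return -1;
	node_add(parent, root);	/* the single publication point */
	return 0;
}
```

A caller would create a top-level node with node_new(), call toy_mkdir()
on it, and see the new subtree through lookup() only if the whole build
succeeded.
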
I'll probably do such a conversion (build out of tree, then splice it in)
for cases like nfsctl and rpc_pipe; should be enough to shake a sane set
of helpers out.  Hopefully will have something in that direction by
tomorrow...  spufs is another candidate, but that one I really can't test.

--------------------------------------------------------------------------------

Random observations:

  * d_alloc_name() will have no callers left.  Remove?
  * instead of playing with always_delete(), simple_lookup() and friends
    probably should just set DCACHE_DONTCACHE
  * securityfs probably needs to be unconditionally kern_mount'ed and to
    hell with simple_pin_fs() games.  The same does *not* apply to debugfs,
    though.  Hell knows, perhaps we want a kern_mount variant that would
    *NOT* contribute to module refcount...  Not sure how that would work,
    though - we certainly want user mounts to pin the filesystem, and
    "the last user mount shuts down" is not a convenient thing to hang
    anything on...
  * apparmorfs locking is an atrocity.  ANYONE who unlocks and relocks
    parent in ->mkdir()/->rmdir() is wrong, but here they mix their own
    locks in between ->i_rwsem of nested directories - and take an arseload
    of those.  I tried to talk about that with some of the apparmor folks,
    got nowhere...
  * simple_recursive_removal() probably needs to be taught about a mix
    of persistent and non-persistent dentries; the latter need
    d_invalidate() (and callback called for them), but that's it.  OTOH,
    it can be expressed as "if not marked persistent,
    d_make_persistent(victim)" in the callback; result will be the same.
    Can't do that until all callers mark everything properly, though...
  * configfs "refcounting":
	symlink or directory that was mkdir'ed - pinned once.
	Other subdirectories created by the same mkdir - pinned twice.
	Regular files - not pinned.
	register_subsystem stuff - similar to mkdir (one for root of
	subtree, two for all other subdirectories).
	register_group - apparently all pinned twice.
    What should register_group do if adding into a mkdir'ed subtree -
    fuck knows, the only example thereof is bogus beyond belief.
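
For reference, the DCACHE_PERSISTENT bookkeeping proposed earlier can be
modelled in a few lines of userspace C.  This is a toy sketch, NOT kernel
code: struct toy_dentry and the toy_*() helpers are hypothetical stand-ins
for the real dentry, d_make_persistent(), d_make_discardable() and the
per-dentry part of shrink_dcache_for_umount().  Having
toy_make_discardable() refuse an unpaired drop (rather than just warn) is
this sketch's own choice of what "scream" might mean.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of a pinned dentry; NOT kernel code. */
struct toy_dentry {
	int count;		/* reference count */
	bool persistent;	/* stands in for DCACHE_PERSISTENT */
};

static void toy_dget(struct toy_dentry *d)
{
	d->count++;
}

static void toy_dput(struct toy_dentry *d)
{
	assert(d->count > 0);
	d->count--;
}

/* Take the "controlled leak" reference and mark it as such. */
static void toy_make_persistent(struct toy_dentry *d)
{
	toy_dget(d);
	d->persistent = true;
}

/*
 * Drop the leak reference.  Returns false and does nothing if the
 * dentry was not marked persistent (the "scream" case in this model).
 */
static bool toy_make_discardable(struct toy_dentry *d)
{
	if (!d->persistent)
		return false;
	d->persistent = false;
	toy_dput(d);
	return true;
}

/* Umount-time cleanup: if still marked, clear the flag and drop the pin. */
static void toy_shrink_for_umount(struct toy_dentry *d)
{
	if (d->persistent) {
		d->persistent = false;
		toy_dput(d);
	}
}
```

With the flag in place every pin has exactly one matching drop, whether it
comes from an explicit removal or from umount, which is what would let a
converted filesystem rely on plain kill_anon_super().
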