WARNING: this is very much a work in progress.

	A bunch of filesystems are keeping their objects pinned in
dcache.  It started as a neat hack in ramfs, but has since spread to
many other places.  As it spread, the scope grew very far past the
original, and at the moment it's not doing well.

	The original approach was to have a controlled dentry leak -
e.g. ->mkdir() returns with the refcount of the dentry bumped by 1,
which prevents its eviction.  Conversely, ->rmdir() decrements the
refcount by 1.  Anything not removed by umount time is taken out by a
special kill_anon_super() analogue - kill_litter_super().
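
	A toy userspace model of that leak, for illustration only.
The toy_* names are made up here and none of this is kernel code; it
tracks nothing but the count, which is enough to show why the object
survives after the caller of ->mkdir() has dropped its own reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a dentry; hypothetical, not the kernel structure. */
struct toy_dentry {
	int refcount;	/* references currently held */
	bool positive;	/* true once an object is attached */
};

/* Caller-side reference, as a lookup would take one. */
void toy_get(struct toy_dentry *d)
{
	d->refcount++;
}

/* Caller drops its reference. */
void toy_put(struct toy_dentry *d)
{
	assert(d->refcount > 0);
	d->refcount--;
}

/* ramfs-style ->mkdir(): attach the object and leak one reference. */
void toy_mkdir(struct toy_dentry *d)
{
	d->positive = true;
	d->refcount++;		/* the controlled leak */
}

/* ->rmdir(): detach the object and give the leaked reference back. */
void toy_rmdir(struct toy_dentry *d)
{
	d->positive = false;
	toy_put(d);
}

/* In this model only a dentry nobody holds may be evicted. */
bool toy_evictable(const struct toy_dentry *d)
{
	return d->refcount == 0;
}
```

	Between toy_mkdir() and toy_rmdir() the count never reaches
zero, no matter how many balanced toy_get()/toy_put() pairs come and go.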

--------------------------------------------------------------------------------

	Originally it was intended for normal filesystem semantics
and access patterns.  That didn't last - in quite a few cases we have
kernel-initiated creation and removal of objects there.  Some try to
simulate what the normal syscalls are doing, some do not even bother
and very few do that well.

	In particular, rmdir(2) and unlink(2) know how to deal with
something mounted on the victim in another namespace, but e.g. write(2)
has no idea that writing this string to that file might do the
equivalent of rm -rf on something entirely different.  Filesystems are
on their own there.

	Worse, rmdir(2) assumes that it should only succeed when the
directory is empty; unfortunately, e.g. the configfs userland ABI
expects rmdir(2) to succeed on _nonempty_ directories.  That leads to
all kinds of fun, because they allow mkdir() in the bowels of that
subtree, and require rmdir() on the original to fail if any mkdir()
_under_ it had been done.  There's no way for syscalls to be aware of
that shite; it has to be handled by the filesystem itself, and the way
it's done is really not possible to describe in printable terms.

	For even more fun, creating a subtree for their mkdir(2) can
fail halfway through.  At which point we need to take out everything
that had been added... and prevent e.g. an open() that wandered into
that subtree while it was being built.

	configfs is probably the worst case, but there's a lot of PITA
in other users.

--------------------------------------------------------------------------------

	A saner infrastructure would be useful.

	One problem is that nothing indicates whether a specific
increment or decrement is related to the controlled leak in question.
For ramfs proper it wasn't a problem - we have one dget() in ->mkdir()
et al. and one dput() in ->rmdir() et al.  For kernel-initiated
operations it's harder to keep track of.

	Proposed approach to that part of the mess: have those
"controllably leaked" dentries marked as such.

	* New flag: DCACHE_PERSISTENT
	* d_make_persistent(): dget() and set flag
	* d_make_discardable(): dput() and clear; eventually - scream if
		called for dentry without that flag
	* simple_rmdir() and simple_unlink() call d_make_discardable()
		instead of dput() (that's what "eventually" above is
		about)
	* collecting the victims for kill_litter_super() - skip ones that
		have the flag
	* shrink_dcache_for_umount() - if flag is set, clear and decrement
		refcount.  Note that kill_litter_super() proceeds to call
		kill_anon_super(), which will call shrink_dcache_for_umount()
	* d_alloc_persistent(): what it says, allocate and mark persistent.
		Typical use is for kernel-initiated creation - when we
		know that there can be no object with such name.
		Disposing of those on failure exits should be done with
		d_make_discardable() instead of dput().
	* start_creating_persistent(): the analogue of the above when
		we do *NOT* know if the name is unique or acceptable.
		Parent must be locked exclusive for that.

	That allows dealing with filesystems one by one.  Once a
filesystem reliably maintains the persistency flags, it can switch to
using kill_anon_super().
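
	The flag discipline above, as a userspace sketch.  Everything
named toy_* or TOY_* is hypothetical and only mirrors the bullet points;
the real helpers operate on struct dentry under the proper locks.  One
assumption is made loudly: here toy_make_discardable() refuses to touch
an unmarked dentry, which models the "eventually - scream" behaviour
rather than the transitional one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_PERSISTENT 0x1u	/* stands in for DCACHE_PERSISTENT */

struct toy_dentry {
	int refcount;
	unsigned int flags;
};

/* d_make_persistent() analogue: take a reference and set the flag. */
void toy_make_persistent(struct toy_dentry *d)
{
	d->refcount++;
	d->flags |= TOY_PERSISTENT;
}

/*
 * d_make_discardable() analogue: clear the flag and drop the reference.
 * Assumption of this sketch: an unmarked dentry is reported and left alone.
 */
bool toy_make_discardable(struct toy_dentry *d)
{
	if (!(d->flags & TOY_PERSISTENT)) {
		fprintf(stderr, "toy: dentry is not marked persistent\n");
		return false;
	}
	assert(d->refcount > 0);
	d->flags &= ~TOY_PERSISTENT;
	d->refcount--;
	return true;
}

/*
 * Umount-time sweep: for every marked dentry clear the flag and drop
 * the pin.  Returns how many dentries are still referenced afterwards,
 * i.e. the ones that leaked for some other reason.
 */
int toy_shrink_for_umount(struct toy_dentry *v, int n)
{
	int busy = 0;

	for (int i = 0; i < n; i++) {
		if (v[i].flags & TOY_PERSISTENT) {
			v[i].flags &= ~TOY_PERSISTENT;
			v[i].refcount--;
		}
		if (v[i].refcount != 0)
			busy++;
	}
	return busy;
}
```

	The point of the flag is visible in the sweep: a pin taken by
toy_make_persistent() is accounted for, while any other stray reference
still shows up as busy.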

	Another part of the mess: open-coded attempts to remove
object(s).
	* simple_recursive_removal() is there, but it's underused.
	* new variant: locked_recursive_removal(), for the case when
	  parent is already locked.  Both of those take care of
	  d_invalidate(), etc., as well as encapsulate walking directory
	  tree.  A bunch of (badly) open-coded instances out there...

	The series I've got (#untested.persistency) is at 58 commits
at the moment and it's very much a work in progress.

	Current diffstat: 54 files changed, 621 insertions(+), 1136 deletions(-)

	Filesystems that remain to be converted (all with interesting problems):
drivers/usb/gadget/function/f_fs.c
drivers/usb/gadget/legacy/inode.c
fs/configfs/mount.c
security/apparmor/apparmorfs.c

--------------------------------------------------------------------------------

	What to do about building subtrees and atomicity issues?

	I have something that I hope would be a usable approach;
hoped to discuss it (applied to configfs) with hch here, but...

	What it boils down to is that ->mkdir() is allowed to
splice another dentry in place of the one it had been given and return
success with the original dentry unhashed and left negative.  We need
that for e.g. nfs_mkdir(), and callers of vfs_mkdir() are already
dealing with that possibility.

	And that allows us to do the following:
* build the subtree unattached to anything else in dcache.
* if that succeeds,
	d_splice_alias(root_of_new_tree->d_inode, argument_of_mkdir);
	return 0
* if building the subtree has failed, we can dissolve it safely -
  nobody could have reached it via dcache lookups.
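
	The shape of those steps, as a single-threaded userspace sketch.
All toy_* names are hypothetical; the plain pointer store in toy_mkdir()
merely stands in for the d_splice_alias() step, and the fail_at
parameter exists only to simulate a failure halfway through the build.
It is not a model of dcache locking.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_node {
	char name[16];
	struct toy_node *child;		/* first child */
	struct toy_node *sibling;	/* next entry in the same directory */
};

struct toy_node *toy_new(const char *name)
{
	struct toy_node *n = calloc(1, sizeof(*n));

	if (n)
		snprintf(n->name, sizeof(n->name), "%s", name);
	return n;
}

/* Free a node, everything below it and all its later siblings. */
void toy_free_tree(struct toy_node *n)
{
	while (n) {
		struct toy_node *next = n->sibling;

		toy_free_tree(n->child);
		free(n);
		n = next;
	}
}

int toy_count_children(const struct toy_node *n)
{
	int count = 0;

	for (const struct toy_node *c = n->child; c; c = c->sibling)
		count++;
	return count;
}

/* Build a subtree that is not linked to anything; dissolve it on failure. */
struct toy_node *toy_build(const char *name, int nkids, int fail_at)
{
	struct toy_node *root = toy_new(name);

	if (!root)
		return NULL;
	for (int i = 0; i < nkids; i++) {
		char kid_name[16];
		struct toy_node *kid;

		snprintf(kid_name, sizeof(kid_name), "kid%d", i);
		kid = (i == fail_at) ? NULL : toy_new(kid_name);
		if (!kid) {
			/* nobody else can see the subtree yet */
			toy_free_tree(root);
			return NULL;
		}
		kid->sibling = root->child;
		root->child = kid;
	}
	return root;
}

/* mkdir model: only a completely built subtree ever becomes reachable. */
int toy_mkdir(struct toy_node *parent, const char *name, int nkids,
	      int fail_at)
{
	struct toy_node *sub;

	assert(parent != NULL);
	sub = toy_build(name, nkids, fail_at);
	if (!sub)
		return -1;
	sub->sibling = parent->child;
	parent->child = sub;		/* the "splice" */
	return 0;
}
```

	A failed toy_mkdir() leaves the parent exactly as it was, so
there is nothing to unroll and nothing for a lookup to stumble into.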

	That kills the half-arsed attempts to block opens, etc. in a
half-built tree.  Another thing it lets us kill is the Lovecraftian
games configfs_rmdir() has to pull with its "is this non-empty subtree
empty enough for us and do we have an ongoing attempt of mkdir() in the
bowels of that subtree we need to wait for in case it fails and unrolls"
logic.

	hch is not here, unfortunately, so configfs parts are going to
wait - I've some questions there, and it'll have to be done over email.

	I'll probably do such a conversion (build out of tree, then
splice it in) for cases like nfsctl and rpc_pipe; should be enough to
shake out a sane set of helpers.  Hopefully will have something in that
direction by tomorrow...  spufs is another candidate, but that one I
really can't test.

--------------------------------------------------------------------------------

	Random observations:
* d_alloc_name() will have no callers left.  Remove?
* instead of playing with always_delete(), simple_lookup() and friends
  probably should just set DCACHE_DONTCACHE
* securityfs probably needs to be unconditionally kern_mount'ed and
  to hell with simple_pin_fs() games.  The same does *not* apply to
  debugfs, though.  Hell knows, perhaps we want a kern_mount variant
  that would *NOT* contribute to module refcount...  Not sure how
  that would work, though - we certainly want user mounts to pin the
  filesystem, and "the last user mount shuts down" is not a convenient
  thing to hang anything on...
* apparmorfs locking is an atrocity.  ANYONE who unlocks and relocks
  parent in ->mkdir()/->rmdir() is wrong, but here they mix their
  own locks in between ->i_rwsem of nested directories - and take
  an arseload of those.  I tried to talk about that with some of
  apparmor folks, got nowhere...
* simple_recursive_removal() probably needs to be taught about
  a mix of persistent and non-persistent dentries; the latter need
  d_invalidate() (and callback called for them), but that's it.
  OTOH, it can be expressed as "if not marked persistent,
  d_make_persistent(victim)" in the callback; result will be
  the same.  Can't do that until all callers mark everything
  properly, though...
* configfs "refcounting": symlink or directory that was mkdir'ed -
  pinned once.  Other subdirectories created by the same mkdir -
  pinned twice.  Regular files - not pinned.  register_subsystem
  stuff - similar to mkdir (one for root of subtree, two for all
  other subdirectories).   register_group - apparently all
  pinned twice.  What should register_group do if adding into
  mkdir'ed subtree - fuck knows, the only example thereof
  is bogus beyond belief.