On Fri, Oct 17, 2014 at 09:53:46AM +0200, Miklos Szeredi wrote: > Ugh, no, we don't need that d_drop(), it's just an optimization. I was simply > not thinking about bind mounts. > > The reason I thought it useful to drop the dentry is that after a non-dir is > copied up we don't need the lower dentry reference any more, that dentry can be > freed. But setting ->lowerdentry to NULL is going to cause all sorts of > headaches: we assume it remains non-NULL after it's been set to non-NULL. > > So removing that d_drop should be fine, it just holds that lowerdentry ref until > the overlayfs dentry is gotten rid of (which it won't if there's a bind mount on > it, but I don't think that's a big deal). > > The d_drop() calls in ovl_rename2() are also in this category and need to be > removed. > > With the attached patch the above test passes fine. OK, folded... Next question: ovl_clear_empty() doesn't verify that udir == upper->d_parent.d_inode. Result: if some joker renames that thing in upper layer, we get vfs_rename() (and, possibly, vfs_rmdir() or vfs_unlink()) called with dentry _not_ being a child of directory we have locked. Subsequent breakage depends on the fs type, but in principle sky's the limit - ->rmdir() et.al. have every right to assume that victim->d_name is stable, etc. In other places like that you check such asserts explicitly and fail wtih ESTALE; is there any reason for not doing the same here? Related one: you unlock wdir (as part of unlock_rename()), only to relock it again after ovl_cleanup_whiteouts(). Which has lovely potential for upper getting moved someplace else, so we need to recheck the same assert again (that's where rmdir/unlink part of the problem comes from). Why? To relieve the contention on s_vfs_rename_mutex and wdir->i_mutex? Next one: ovl_do_remove() looks at the victim's type (lower vs. upper vs. mixed) before taking any locks in upper layer. What happens if it races with copyup of victim? And no, ->i_mutex on victim in overlayfs itself is not enough to exclude all causes of copyup - I can see at least one (open() for write on a regular file). That particular case is harmless (overwriting rename into place will actually DTRT here), but at the very least it needs comments re "why there's no case where such race would be possible and not harmless". I'm skipping a lot of really obvious ones; I'll push a branch with fixes for those folded as vfs.git#experimental-ovl later today (probably after I get some sleep, though), Other non-obvious stuff: * setattr() does truncated-copy first, moves it in position, then does notify_change() stripping SUID/SGID. In theory that's exploitable. I hadn't looked into reorganizing it yet, but it seems that having truncating copyup handle attribute changes *before* move would be a good idea. * rmdir() of something you should be allowed to remove (empty subdir of your directory) fails with EINVAL if you don't have permissions to open it (including the really lovely case when it's *yours* and had been just chmoded to 111). Reason: you try to open the fscker to verify it's empty, doing that with current creds. That, actually, is one of the reasons why whiteouts ought to be fs primitives - you only need that kind of games because rmdir() on underlying fs treats your poor-man-whiteout as "something making our directory non-empty". And for any BSD-style whiteouts that's really trivial to check - in fact, for FFS derivatives the normal check for directory being empty will work just fine (they are represented as negative entries with special dt_type value). Anyway, rmdir() failure is a bug - if you insist on using that crap as whiteout substitutes, at least make sure to get appropriate creds. * we really, _really_ need some sanity checks on the nature of layers. As it is, you can ask for an overlay with procfs as one of the layers. Which will blow up big way. Interactions with NFS will also be interesting. Or with FUSE, for that matter... BTW, am I misreading your readdir.c, or do you really reread the sucker every time you open a directory? With each opened struct file getting big fat cache built in memory, nevermind that they will be identical? Looks like an OOM bait, that... I'm still digging through that code, but it really looks that way...