1) The first time we need an io_context for a task, we get it allocated with refcount 1 and cached in task->io_context; early in do_exit() that reference is dropped and current->io_context is reset to NULL. However, we do exit_mm() and exit_files() _after_ that. And both can generate IO on behalf of our process, leading to a new allocation of io_context, leaving a reference to it in ->io_context. This time there'll be nothing to drop it, AFAICS. I.e. we get a leak of io_context and a leak of the structures dangling off it (the list of cfq_io_context, for instance). A userland model of this ordering bug is sketched after (6) below.

2) When a queue gets cfq set up as its elevator, we get a cfq_data allocated for it. We have cfqd->queue set to our queue and pinned down; it's never modified until cfqd dies, and the queue remains pinned down until then. At the same time queue->elevator->elevator_data is set to cfqd and pins it down. It's never modified and remains pinned down until we get to elevator_exit(). Which happens only when the last reference to the queue goes away or when we explicitly switch elevators. IOW, we get a leak.

3) When we feed a request to cfq, we try to find a cfq_io_context attached to current->io_context with cic->key == the cfq_data of the queue. If it doesn't exist, we allocate it, set its ->key to the cfq_data of the queue and pin the cfq_data down. That pointer is never modified until the cic gets freed. It's _NEVER_ dropped - there is no matching decrement of the refcount on cfq_data. Another leak.

4) We destroy these cfq_io_context when the io_context dies. They are never removed until that point. And they retain a reference to cfq_data in ->cfqd *and* to the queue - in ->cfqd->queue. That queue is not freed, all right - the leak in (2) takes care of that. If the driver decides that the queue should be killed (e.g. on rmmod) it will do blk_cleanup_queue(), which will do nothing since we still have references to it. *HOWEVER*, queue->queue_lock is a different story. It will get freed. Normally that wouldn't be a big deal (there's no IO left on the queue), but... at do_exit() time we call exit_io_context(), which triggers cfq_exit_io_context(), which triggers cfq_exit_single_io_context() for each cfq_io_context we've got on it. And that's where the shit hits the fan:

static void cfq_exit_single_io_context(struct cfq_io_context *cic)
{
        struct cfq_data *cfqd = cic->cfqq->cfqd;
        request_queue_t *q = cfqd->queue;

        WARN_ON(!irqs_disabled());

        spin_lock(q->queue_lock);

        if (unlikely(cic->cfqq == cfqd->active_queue)) {
                __cfq_slice_expired(cfqd, cic->cfqq, 0);
                cfq_schedule_dispatch(cfqd);
        }

        cfq_put_queue(cic->cfqq);
        cic->cfqq = NULL;

        spin_unlock(q->queue_lock);
}

and we do spin_lock() on a spinlock that might have been freed days ago. Remember that a cfq_io_context stays around until the process exits; if some IO on a device that has since gone away was done on our behalf a week ago, it will still be there. To make things even funnier, we have interrupts disabled here.

5) sysfs allows holding a reference to queue and elevator without affecting the queue refcount and lifetime. Set the default iosched to anything other than cfq, so that the leak in (2) won't prevent freeing the queue. Then do the following:

        exec 42<.../queue/nr_requests
        have the device removed (rmmod, whatever)
        exec 42<&-

and the final close decrements a refcount in freed memory and reads from it => reading from freed memory.

6) Switching iosched doesn't prevent somebody else from asking for another switch while this one is going on. Breaks in all sorts of fun ways...
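FWIW, the ordering bug in (1) is easy to model outside the kernel. Here is a minimal userland sketch - all names are invented for illustration, this is not the actual kernel code; the obvious cure is to drop the context only after the last thing that can generate IO:

#include <stdio.h>
#include <stdlib.h>

struct io_context {
        int refcount;
};

static struct io_context *task_ioc;     /* models task->io_context */

/* models get_io_context(): allocate lazily, cache in the task */
static struct io_context *get_io_context(void)
{
        if (!task_ioc) {
                task_ioc = calloc(1, sizeof(*task_ioc));
                task_ioc->refcount = 1;
        }
        return task_ioc;
}

static void put_io_context(struct io_context *ioc)
{
        if (--ioc->refcount == 0)
                free(ioc);
}

/* models generating IO on behalf of the process */
static void do_io(void)
{
        get_io_context();
}

int main(void)
{
        do_io();                        /* first IO: context allocated, cached */

        /* early in do_exit(): drop the cached reference */
        put_io_context(task_ioc);
        task_ioc = NULL;

        do_io();                        /* exit_mm()/exit_files() still generate
                                         * IO: fresh io_context with refcount 1,
                                         * and nothing is left to drop it => leak */
        printf("leaked io_context, refcount %d\n", task_ioc->refcount);
        return 0;
}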
7) ioprio_set() can race with cfq_get_queue(). It's not only possible to miss a new cfq_queue and have it left with the old ioprio, it's possible to get list_for_each() called while another CPU does list_add(). Which is considerably nastier, albeit harder to hit... The reason, of course, is that while cfq_set_request() on its own doesn't need any locking of the cic list (it's process-synchronous and works only with one's own io_context), ioprio_set() is done to other tasks.

8) elv_unregister() doesn't bother with task_lock(); it can race with exit_io_context() freeing task->io_context under it...

9) We have at most one cfq_io_context for a given process and a given queue. We bother with cfq_get_queue() once per cfq_io_context; after we'd set ->cfqq we won't call it again. So if the first operation from our process on a given queue is a write done while the process doesn't have PF_SYNCWRITE set, we'll get the cfq_queue for (that queue, CFQ_KEY_ASYNC, task->ioprio). It will be stored in ->cfqq of the created cfq_io_context and that's it - after that _everything_ (reads, sync writes) for that queue will go to the same cfq_queue. Looks very odd...

10) There's an unpleasant problem with the async queue. Suppose we have 69 processes, originally with the same ->ioprio. All do async writes. All end up with their cfq_io_context pointing to the same cfq_queue; so far so good. Now think what'll happen when we do ioprio_set(2) in one of them. It will get to that queue and happily change its ->ioprio and ->ioprio_class. Oops - we've just bumped the ioprio for async writes of all the other processes...

11) OK, sometimes we boost a cfq_queue's ioprio. Somebody does a hash lookup while the ioprio of an async queue is elevated. What, are they going to be stuck with the lowered ioprio when we go back?

12) Suppose a process has talked both to as-iosched and cfq-iosched queues. We have killed the latter (or switched it to a different iosched). Now all cfq_data, cfq_queue and cfq_request are freed; all remaining cfq_io_context are dummies and hold no pointers (->key and ->cfqq are NULL). The process in question has called exit(); there are some pending requests in the bowels of as-iosched, but the io_context is already detached from the task and is just waiting for the IO to finish - it will be freed at that point. And that's when somebody tries to rmmod cfq. elv_unregister() walks through all tasks and knocks their ->cic out. Except that this io_context is not there anymore - it's detached, and the only references to it are held by the as-iosched requests in flight. So elv_unregister() happily completes and the module is unloaded. Eventually as-iosched is done with it and we get to as_put_io_context(arq) ---> put_io_context(arq->io_context) ---> the last reference goes away and we call ioc->cic->dtor(ioc->cic) - i.e. cfq's destructor, which used to be in the module we'd just removed.

13) There's a narrower race between cfq_exit_io_context() and cfq_exit() - the former can get called in the middle of the latter _and_ last until past the end of rmmod.

14) On top of that, we have rmmod as-iosched knocking out ->io_context->cic of processes that are using cfq right now. And vice versa, of course...

15) More of the same: elv_unregister() leaves task->io_context->set_ioprio as-is... FWIW, the whole idea of ->set_ioprio looks bogus - it points to an iosched method, and the only reason it works at all is that only cfq has it non-NULL.

16) And the fun just keeps coming: the failure exit of blk_init_queue_node() has blk_cleanup_queue(q); followed by freeing q. The thing is, we hold the only reference to q at that point, so it's a double-free. (Modeled in the sketch right below.)
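To make (16) concrete, here's the same pattern in miniature - invented names again, and running this trips the allocator's double-free detection, which is the point:

#include <stdlib.h>

struct queue {
        int refcount;
};

/* models blk_put_queue(): free on the last reference */
static void queue_put(struct queue *q)
{
        if (--q->refcount == 0)
                free(q);
}

/* models blk_cleanup_queue(): all it does is drop a reference */
static void cleanup_queue(struct queue *q)
{
        queue_put(q);
}

int main(void)
{
        struct queue *q = calloc(1, sizeof(*q));
        q->refcount = 1;                /* we hold the only reference */

        /* the failure exit in question: */
        cleanup_queue(q);               /* refcount 1 -> 0, q is freed here */
        free(q);                        /* ...and freed again: double-free */
        return 0;
}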
17) Double elevator_put() if elevator_switch() fails in elv_register_queue(): we do elevator_exit(), followed by an explicit elevator_put(). The former does elevator_put() itself... (and yes, that's the final batch of refcounting fixes - the elevator_t ones).

18) Same pile: use-after-free at the very end of elevator_switch(): we print elevator_type->name after having done elevator_put(). (Modeled after (19) below.)

19) One more: the lack of a proper ->owner on elevator attributes means that we could cause interesting problems with rmmod.
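And the read-after-free in (18), again as a minimal userland sketch (invented struct layout; elevator_put() here only models the real one - ASan or a poisoning allocator will flag the final printf):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct elevator_type {
        int refcount;
        char name[16];
};

/* models elevator_put(): free on the last reference */
static void elevator_put(struct elevator_type *e)
{
        if (--e->refcount == 0)
                free(e);
}

int main(void)
{
        struct elevator_type *e = calloc(1, sizeof(*e));
        e->refcount = 1;
        strcpy(e->name, "cfq");

        elevator_put(e);                        /* may free e right here */
        printf("switched to %s\n", e->name);    /* use after free */
        return 0;
}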