10 years ago  ext4: enable "punch hole" functionality
Allison Henderson [Wed, 25 May 2011 11:41:50 +0000 (07:41 -0400)]
ext4: enable "punch hole" functionality

This patch adds three new routines: "ext4_punch_hole",
"ext4_ext_punch_hole", and "ext4_ext_check_cache".

fallocate has been modified to call ext4_punch_hole when the punch hole
flag is passed.  At the moment, we only support punching holes in
extents, so this routine is pretty much a wrapper for the
ext4_ext_punch_hole routine.

The ext4_ext_punch_hole routine first completes all outstanding writes
with the associated pages, and then releases them.  The block-unaligned
data is zeroed, and all blocks in between are punched out.
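
The split between zeroed and punched ranges can be sketched in user
space as plain arithmetic.  This is only an illustration, not the
kernel code: plan_hole() and struct hole_plan are hypothetical names,
and the block size is assumed to be a non-zero power of two.

```c
#include <stdint.h>

/* How a byte range divides into partial-block edges and whole blocks. */
struct hole_plan {
    uint64_t head_zero_len;   /* bytes to zero before the first whole block */
    uint64_t first_block;     /* index of the first whole block to punch */
    uint64_t nr_blocks;       /* number of whole blocks to punch out */
    uint64_t tail_zero_len;   /* bytes to zero after the last whole block */
};

/* Hypothetical helper; blocksize must be a non-zero power of two. */
struct hole_plan plan_hole(uint64_t offset, uint64_t len, uint64_t blocksize)
{
    struct hole_plan p = {0, 0, 0, 0};
    uint64_t end = offset + len;                        /* exclusive end */
    uint64_t first_aligned = (offset + blocksize - 1) & ~(blocksize - 1);
    uint64_t last_aligned = end & ~(blocksize - 1);

    if (first_aligned > last_aligned) {
        /* The hole starts and ends inside one block: zeroing only. */
        p.head_zero_len = len;
        return p;
    }
    p.head_zero_len = first_aligned - offset;
    p.first_block = first_aligned / blocksize;
    p.nr_blocks = (last_aligned - first_aligned) / blocksize;
    p.tail_zero_len = end - last_aligned;
    return p;
}
```

For example, a hole at offset 1000 of length 10000 with 4096-byte
blocks zeroes 3096 bytes, punches one whole block, and zeroes a further
2808 bytes.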

The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
except it accepts an ext4_ext_cache parameter instead of an ext4_extent
parameter.  This routine is used by ext4_ext_punch_hole to check
whether a block in a hole has been cached.  The ext4_ext_cache
parameter is necessary because the members of the ext4_extent structure
are not large enough to hold a 32 bit value.  The existing
ext4_ext_in_cache routine has become a wrapper to this new function.

[ext4 punch hole patch series 5/5 v7]

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Mingming Cao <>
10 years ago  ext4: add "punch hole" flag to ext4_map_blocks()
Allison Henderson [Wed, 25 May 2011 11:41:46 +0000 (07:41 -0400)]
ext4: add "punch hole" flag to ext4_map_blocks()

This patch adds a new flag to ext4_map_blocks() that specifies the
given range of blocks should be punched out.  Extents are first
converted to uninitialized extents before they are punched
out. Because punching a hole may require that the extent be split, it
is possible that the splitting may need more blocks than are
available.  To deal with this, the use of reserved blocks is enabled to
allow the split to proceed.

The routine then returns the number of blocks successfully
punched out.

[ext4 punch hole patch series 4/5 v7]

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Mingming Cao <>
10 years ago  ext4: punch out extents
Allison Henderson [Wed, 25 May 2011 11:41:43 +0000 (07:41 -0400)]
ext4: punch out extents

This patch modifies the truncate routines to support hole punching.
Below is a brief summary of the patch's changes:

- Added end param to ext4_ext_rm_leaf
        This function has been modified to accept an end parameter
        which enables it to punch holes in leaves instead of just
        truncating them.

- Implemented the "remove head" case in the ext4_remove_blocks routine
        This routine is used by ext4_ext_rm_leaf to remove the tail
        of an extent during a truncate.  The new ext4_ext_rm_leaf
        routine will now also use it to remove the head of an extent in the
        case that the hole covers a region of blocks at the beginning
        of an extent.

- Added "end" param to ext4_ext_remove_space routine
        This function has been modified to accept an end parameter, which
        is passed through to ext4_ext_rm_leaf.

[ext4 punch hole patch series 3/5 v6]

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: add new function ext4_block_zero_page_range()
Allison Henderson [Wed, 25 May 2011 11:41:32 +0000 (07:41 -0400)]
ext4: add new function ext4_block_zero_page_range()

This patch modifies the existing ext4_block_truncate_page() function,
which was used by the truncate code path and which zeroes out
block-unaligned data, by adding a new length parameter, and renames it
to ext4_block_zero_page_range().  This function can now be used to zero
out the head of a block, the tail of a block, or the middle of a block.
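
As a rough user-space analogy (not the kernel implementation), zeroing
the head, tail, or middle of a block comes down to a start offset and a
length within that block.  zero_block_range() below is a hypothetical
helper, and clamping the range at the end of the block is an assumption
made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero `length` bytes starting at byte `from` inside one block buffer.
 * The range is clamped so it never runs past the end of the block.
 * Returns the number of bytes actually zeroed.
 */
size_t zero_block_range(unsigned char *block, size_t blocksize,
                        size_t from, size_t length)
{
    if (from >= blocksize)
        return 0;                      /* nothing inside this block */
    if (length > blocksize - from)
        length = blocksize - from;     /* clamp to the block's end */
    memset(block + from, 0, length);
    return length;
}
```

With from == 0 this zeroes the head of a block, with a length that
reaches the end of the block it zeroes the tail, and anything else
zeroes a region in the middle.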

The ext4_block_truncate_page() function is now a wrapper to

[ext4 punch hole patch series 2/5 v7]

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Mingming Cao <>
10 years ago  ext4: add flag to ext4_has_free_blocks
Allison Henderson [Wed, 25 May 2011 11:41:26 +0000 (07:41 -0400)]
ext4: add flag to ext4_has_free_blocks

This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks.  This will allow a
punch hole to proceed even if the disk is full.  Punching a hole may
require additional blocks to first split the extents.
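
The effect of such a flag can be sketched as follows.  This is a
simplified model, not the real ext4_has_free_blocks(): the flag name
ALLOC_USE_RESERVED, the function name and its parameters are all
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALLOC_USE_RESERVED 0x1u   /* hypothetical flag name */

/*
 * Return true if `nblocks` can be allocated.  Without the flag, the
 * reserved pool is off limits; with it, the request may dip into it.
 */
bool has_free_blocks(int64_t free_blocks, int64_t dirty_blocks,
                     int64_t reserved_blocks, int64_t nblocks,
                     unsigned int flags)
{
    int64_t usable = free_blocks - dirty_blocks;

    if (!(flags & ALLOC_USE_RESERVED))
        usable -= reserved_blocks;     /* keep the reserve untouched */
    return usable >= nblocks;
}
```

On a nearly full filesystem the same request fails without the flag
and succeeds with it, which is the behaviour a punch hole needs when
splitting an extent.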

Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:


[ext4 punch hole patch series 1/5 v7]

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Mingming Cao <>
10 years ago  ext4: reserve inodes and feature code for 'quota' feature
Aditya Kali [Tue, 24 May 2011 23:00:39 +0000 (19:00 -0400)]
ext4: reserve inodes and feature code for 'quota' feature

I am working on a patch to add quota as a built-in feature for the ext4
filesystem. The implementation is based on the design given at
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.

Signed-off-by: Aditya Kali <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: add support for multiple mount protection
Johann Lombardi [Tue, 24 May 2011 22:31:25 +0000 (18:31 -0400)]
ext4: add support for multiple mount protection

Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated are displayed.

Signed-off-by: Andreas Dilger <>
Signed-off-by: Johann Lombardi <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: ensure f_bfree returned by ext4_statfs() is non-negative
Kazuya Mio [Tue, 24 May 2011 22:30:07 +0000 (18:30 -0400)]
ext4: ensure f_bfree returned by ext4_statfs() is non-negative

I found an issue where the number of free blocks goes negative:
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks, because the number of dirty blocks can be bigger than
the number of free blocks in the following two cases.

            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
                            used + ei->i_allocated_meta_blocks);

            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.
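
A minimal sketch of the clamping idea, assuming f_bfree is derived by
subtracting the dirty-block counter from the free-block counter.
statfs_bfree() is a hypothetical stand-alone helper, not the code in
ext4_statfs().

```c
#include <stdint.h>

/*
 * The two counters are updated at different moments, so their
 * difference can be transiently negative; report zero in that case.
 */
uint64_t statfs_bfree(int64_t free_blocks, int64_t dirty_blocks)
{
    int64_t bfree = free_blocks - dirty_blocks;

    return bfree > 0 ? (uint64_t)bfree : 0;
}
```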

Signed-off-by: Kazuya Mio <>
10 years ago  ext4: protect bb_first_free in ext4_trim_all_free() with group lock
Lukas Czerner [Tue, 24 May 2011 22:28:07 +0000 (18:28 -0400)]
ext4: protect bb_first_free in ext4_trim_all_free() with group lock

We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do the right thing is really simple, so let's do
it.

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
Lukas Czerner [Tue, 24 May 2011 22:16:27 +0000 (18:16 -0400)]
ext4: only load buddy bitmap in ext4_trim_fs() when it is needed

Currently we call ext4_mb_load_buddy() for every block group we go
through in ext4_trim_fs(), in many cases just to find out that there
is not enough free space to be bothered with. As Amir Goldstein
suggested, we can use the bb_free information directly from
ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and instead
gets the ext4_group_info via ext4_get_group_info() and uses the bb_free
information directly from that. This avoids an unnecessary call to load
the buddy in the case where the group does not have enough free space to
trim. Loading the buddy is now moved to ext4_trim_all_free().
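
The idea of checking the cheap counter before doing the expensive load
can be sketched like this.  The struct and function below are
simplified, hypothetical stand-ins; only the bb_free field name is
taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for ext4_group_info. */
struct group_info {
    uint32_t bb_free;          /* free blocks in this group */
};

/* Count the groups for which the buddy bitmap would actually be loaded. */
size_t groups_needing_buddy(const struct group_info *groups,
                            size_t ngroups, uint32_t minlen)
{
    size_t loads = 0;

    for (size_t i = 0; i < ngroups; i++) {
        if (groups[i].bb_free < minlen)
            continue;          /* too little free space: skip the load */
        loads++;               /* here the trim routine would load the buddy */
    }
    return loads;
}
```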

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: Fix comment to match the code in jbd2__journal_start()
Eryu Guan [Tue, 24 May 2011 21:09:58 +0000 (17:09 -0400)]
jbd2: Fix comment to match the code in jbd2__journal_start()

jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.

Signed-off-by: Eryu Guan <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: fix waiting and sending of a barrier in ext4_sync_file()
Jan Kara [Tue, 24 May 2011 16:00:54 +0000 (12:00 -0400)]
ext4: fix waiting and sending of a barrier in ext4_sync_file()

jbd2_log_start_commit() returns 1 only when we really start a
transaction.  But we also need to wait for a transaction when the
commit is already running.  Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).

Also we have to be more careful with sending of a barrier because when
transaction is being committed in parallel to ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for fsync (note that not every
data writeout needs to trigger metadata changes thus commit of some
metadata changes can be running while other data is still written
out). So use the jbd2_trans_will_send_data_barrier() helper to detect
the common cases when we can be sure the barrier will be issued by the
commit code, and issue the barrier ourselves in the remaining cases.

Reported-by: Edward Goggin <>
Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: Add function jbd2_trans_will_send_data_barrier()
Jan Kara [Tue, 24 May 2011 15:59:18 +0000 (11:59 -0400)]
jbd2: Add function jbd2_trans_will_send_data_barrier()

Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device.  The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.

Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: fix sending of data flush on journal commit
Jan Kara [Tue, 24 May 2011 15:52:40 +0000 (11:52 -0400)]
jbd2: fix sending of data flush on journal commit

In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit.  In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).

Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.
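
The fix can be pictured as a sticky flag on the transaction.  The
sketch below is a simplified model with hypothetical struct, field and
function names; devices are represented by plain integers.

```c
#include <stdbool.h>

/* Simplified stand-in for a journal transaction. */
struct transaction {
    bool need_data_flush;      /* set once any inode is filed to it */
};

/* Called when an inode is added to the transaction's inode list. */
void file_inode(struct transaction *t)
{
    t->need_data_flush = true;
}

/*
 * At commit time the decision no longer depends on the inode list
 * still being populated, only on the flag and on whether the data
 * lives on a different device than the journal.
 */
bool commit_needs_data_flush(const struct transaction *t,
                             int fs_dev, int journal_dev)
{
    return t->need_data_flush && fs_dev != journal_dev;
}
```

Because the flag is never cleared when an inode is reclaimed, the flush
is still sent even if the inode list is empty by the time the commit
starts.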

Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
Yongqiang Yang [Tue, 24 May 2011 15:36:58 +0000 (11:36 -0400)]
ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly

To get delayed-extent information, ext4_ext_fiemap_cb() looks up the
page cache, so it collects information starting from a page's
head block.

If blocksize < pagesize, the beginning blocks of a page may lie
before the request range. So ext4_ext_fiemap_cb() should proceed
ignoring them, because they have been handled before. If no mapped
buffer in the range is found in the 1st page, we need to look up
the 2nd page, otherwise delayed-extents after a hole will be ignored.

Without this patch, xfstests 225 will hang on ext4 with 1K block size.

Reported-by: Amir Goldstein <>
Signed-off-by: Yongqiang Yang <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: use truncate_setsize() unconditionally
Theodore Ts'o [Mon, 23 May 2011 19:13:02 +0000 (15:13 -0400)]
ext4: use truncate_setsize() unconditionally

In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
before calling vmtruncate().  This caused any allocated but unwritten
blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
flag to be dropped.  This was done to make sure that
EOFBLOCKS_FL would not be cleared while still leaving blocks past
i_size allocated.  This was not necessary, since ext4_truncate()
guarantees that blocks past i_size will be dropped, even in the case
where truncate() has increased i_size before calling ext4_truncate().

So fix this by removing the EOFBLOCKS_FL special case treatment in
ext4_setattr().  In addition, use truncate_setsize() followed by a
call to ext4_truncate() instead of using vmtruncate().  This is more
efficient since it skips the call to inode_newsize_ok(), which has
been checked already by inode_change_ok().  This is also a win in
the case where EOFBLOCKS_FL is set since it avoids calling
ext4_truncate() twice.

Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: Fix the wrong calculation of t_max_wait in update_t_max_wait
Tao Ma [Mon, 23 May 2011 01:45:26 +0000 (21:45 -0400)]
jbd2: Fix the wrong calculation of t_max_wait in update_t_max_wait

t_max_wait was added in commit 8e85fb3f to indicate how long we
were waiting for a new transaction to start. In commit 6d0bf005,
it was moved to another function named update_t_max_wait to
avoid a build warning. The problem is that the original 'ts' was
initialized at the start of function start_this_handle, so
t_max_wait could be calculated correctly, whereas with
this change ts is initialized within update_t_max_wait and t_max_wait
can never be calculated right.

This patch moves the initialization of ts back to the beginning
of start_this_handle and passes it to update_t_max_wait so
that it can be calculated right while the build warning is still avoided.
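
The shape of the fix can be sketched as follows.  The names and the
plain integer timestamps are hypothetical simplifications; the point is
only that the start timestamp is taken by the caller and passed in,
rather than sampled inside the update function.

```c
#include <stdint.h>

/* Simplified stand-in for the per-transaction statistics. */
struct wait_stats {
    uint64_t max_wait;         /* longest wait seen so far */
};

/*
 * `start_ts` is sampled by the caller when it begins waiting.  If it
 * were sampled here instead, now - start_ts would always be near zero.
 */
void update_max_wait(struct wait_stats *st, uint64_t start_ts, uint64_t now)
{
    uint64_t waited = now - start_ts;

    if (waited > st->max_wait)
        st->max_wait = waited;
}
```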

Cc: Jan Kara <>
Signed-off-by: Tao Ma <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Eric Sandeen <>
10 years ago  ext4: fix unbalanced up_write() in ext4_ext_truncate() error path
Eric Gouriou [Mon, 23 May 2011 01:33:00 +0000 (21:33 -0400)]
ext4: fix unbalanced up_write() in ext4_ext_truncate() error path

ext4_ext_truncate() should not invoke up_write(&EXT4_I(inode)->i_data_sem)
when ext4_orphan_add() returns an error, as it hasn't performed a
down_write() yet. This trivial patch fixes this by moving the up_write()
invocation above the out_stop label.
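
The control flow after the fix can be modelled in user space with a
counter standing in for the semaphore.  Everything here (struct rw_sem,
sem_down_write(), sem_up_write(), truncate_sketch()) is a hypothetical
illustration; only the out_stop label name comes from the description
above.

```c
#include <stdbool.h>

/* Stand-in for i_data_sem: counts how often the write lock is held. */
struct rw_sem {
    int write_depth;
};

void sem_down_write(struct rw_sem *s) { s->write_depth++; }
void sem_up_write(struct rw_sem *s)   { s->write_depth--; }

int truncate_sketch(struct rw_sem *sem, bool orphan_add_fails)
{
    int err = 0;

    if (orphan_add_fails) {
        err = -1;
        goto out_stop;         /* the lock has not been taken yet */
    }

    sem_down_write(sem);
    /* ... remove the extents past the new size ... */
    sem_up_write(sem);         /* released before the label */

out_stop:
    /* ... stop the journal handle; no unlock happens here ... */
    return err;
}
```

With the unlock placed above the label, both the success path and the
early error path leave the lock depth balanced.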

Signed-off-by: Eric Gouriou <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: count hits/misses of extent cache and expose in sysfs
Vivek Haldar [Mon, 23 May 2011 01:24:16 +0000 (21:24 -0400)]
ext4: count hits/misses of extent cache and expose in sysfs

The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: make ext4_split_extent() handle error correctly
Yongqiang Yang [Mon, 23 May 2011 00:49:12 +0000 (20:49 -0400)]
ext4: make ext4_split_extent() handle error correctly

Signed-off-by: Yongqiang Yang <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Mingming Cao <>
10 years ago  ext4: don't show mount options in /proc/mounts if there is no journal
Theodore Ts'o [Sun, 22 May 2011 20:12:35 +0000 (16:12 -0400)]
ext4: don't show mount options in /proc/mounts if there is no journal

After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: fix possible use-after-free in ext4_remove_li_request()
Lukas Czerner [Fri, 20 May 2011 17:55:29 +0000 (13:55 -0400)]
ext4: fix possible use-after-free in ext4_remove_li_request()

We need to take a reference to the s_li_request after we take the mutex,
because it might have been freed in the meantime, resulting in access to
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Eric Sandeen <>
10 years ago  ext4: fix the mount option "init_itable=n" to work as expected for n=0
Lukas Czerner [Fri, 20 May 2011 17:55:16 +0000 (13:55 -0400)]
ext4: fix the mount option "init_itable=n" to work as expected for n=0

For some reason, when we set the mount option "init_itable=0" it
behaves as if we had set init_itable=20, which is not right at all.
Basically when we set it to zero we are telling the lazyinit thread not
to wait between zeroing the inode tables (except for cond_resched()), so
this commit fixes that and removes the unnecessary condition.  The 'n'
should also be properly used on remount.

When n is not set at all, the default multiplier
EXT4_DEF_LI_WAIT_MULT is used instead.
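
The distinction between "n omitted" and "n=0" can be sketched with a
small user-space parser.  parse_init_itable() is a hypothetical helper,
not the ext4 option parser, and the value of DEF_LI_WAIT_MULT is a
placeholder chosen for the example.

```c
#include <stdlib.h>
#include <string.h>

#define DEF_LI_WAIT_MULT 10u   /* placeholder default for this sketch */

/*
 * Parse "init_itable" or "init_itable=n".  Returns 0 and stores the
 * multiplier on success, -1 if the option is something else or
 * malformed.  An explicit n=0 is kept as 0, not replaced by the default.
 */
int parse_init_itable(const char *opt, unsigned int *mult)
{
    static const char name[] = "init_itable";
    size_t n = sizeof(name) - 1;
    char *end;
    unsigned long v;

    if (strncmp(opt, name, n) != 0)
        return -1;
    if (opt[n] == '\0') {              /* no "=n": use the default */
        *mult = DEF_LI_WAIT_MULT;
        return 0;
    }
    if (opt[n] != '=' || opt[n + 1] < '0' || opt[n + 1] > '9')
        return -1;
    v = strtoul(opt + n + 1, &end, 10);
    if (*end != '\0')
        return -1;
    *mult = (unsigned int)v;           /* zero is a valid, explicit value */
    return 0;
}
```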

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
Reported-by: Eric Sandeen <>
10 years ago  ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
Lukas Czerner [Fri, 20 May 2011 17:49:51 +0000 (13:49 -0400)]
ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()

For some reason we have been waiting for the lazyinit thread to start in
ext4_run_lazyinit_thread(), but this is not needed since it was just
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since they are not used anymore.

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Eric Sandeen <>
10 years ago  ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
Lukas Czerner [Fri, 20 May 2011 17:49:04 +0000 (13:49 -0400)]
ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread

In order to make lazyinit eat approx. 10% of io bandwidth at max, we
sleep between zeroing each single inode table. For that purpose
we use a timer which wakes up the thread when it expires. It is set
via add_timer(), and this may cause trouble in the case that the thread
has been woken up earlier and in the next iteration we call add_timer()
on a still running timer, hence hitting BUG_ON in add_timer(). We could
fix that by using mod_timer() instead; however, we can use
schedule_timeout_interruptible() for waiting and hence simplify
things a lot.

This commit replaces the old "waiting mechanism" with a simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
no longer need the li_wait_daemon wait queue and others, so get rid
of them.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <>
Signed-off-by: "Theodore Ts'o" <>
Reviewed-by: Eric Sandeen <>
10 years ago  ext4: wait for writeback to complete while making pages writable
Darrick J. Wong [Wed, 18 May 2011 17:55:20 +0000 (13:55 -0400)]
ext4: wait for writeback to complete while making pages writable

In order to stabilize pages during disk writes, ext4_page_mkwrite must
wait for writeback operations to complete before making a page
writable.  Furthermore, the function must return locked pages, and
recheck the writeback status if the page lock is ever dropped.  The
"someone could wander in" part of this patch was suggested by Chris

Signed-off-by: Darrick J. Wong <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: clean up some wait_on_page_writeback calls
Darrick J. Wong [Wed, 18 May 2011 17:53:20 +0000 (13:53 -0400)]
ext4: clean up some wait_on_page_writeback calls

wait_on_page_writeback already checks the writeback bit, so callers of it
needn't do that test.

Signed-off-by: Darrick J. Wong <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: don't warn about mnt_count if it has been disabled
Tao Ma [Wed, 18 May 2011 17:29:57 +0000 (13:29 -0400)]
ext4: don't warn about mnt_count if it has been disabled

Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to
zero, and mount it for the first time, we will get the warning:

maximal mount count reached, running e2fsck is recommended

It is really misleading. So change the check so that it won't warn in
that case.
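
A sketch of the corrected check, assuming a maximum mount count of zero
or below means the check is disabled.  max_mount_count_reached() is a
hypothetical helper written for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Warn only when a positive limit exists and has been reached.  A
 * limit of zero (or a negative value) means the check is disabled.
 */
bool max_mount_count_reached(uint16_t mnt_count, int16_t max_mnt_count)
{
    return max_mnt_count > 0 && mnt_count >= (uint16_t)max_mnt_count;
}
```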

Signed-off-by: Tao Ma <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: ext4_ext_convert_to_initialized bug found in extended FSX testing
Allison Henderson [Mon, 16 May 2011 14:11:09 +0000 (10:11 -0400)]
ext4: ext4_ext_convert_to_initialized bug found in extended FSX testing

This patch addresses bugs found while testing punch hole
with the fsx test.  The patch corrects the number of blocks
that are zeroed out while splitting an extent, and also corrects
the return value to return the number of blocks split out, instead
of the number of blocks zeroed out.

This patch has been tested in addition to the following patches:
[Ext4 punch hole v7]
[XFS Tests Punch Hole 1/1 v2] Add Punch Hole Testing to FSX

The test ran successfully for 24 hours.

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: fix oops in ext4_quota_off()
Amir Goldstein [Mon, 16 May 2011 13:59:13 +0000 (09:59 -0400)]
ext4: fix oops in ext4_quota_off()

If quota is not enabled when ext4_quota_off() is called, we must not
dereference quota file inode since it is NULL.  Check properly for

This fixes a bug in commit 21f976975cbe (ext4: remove unnecessary
[cm]time update of quota file), which was merged for 2.6.39-rc3.

Reported-by: Amir Goldstein <>
Signed-off-by: Amir Goldstein <>
Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: don't dereference null pointer when make_indexed_dir() fails
Allison Henderson [Sun, 15 May 2011 04:19:41 +0000 (00:19 -0400)]
ext4: don't dereference null pointer when make_indexed_dir() fails

Fix for a null pointer bug found while running punch hole tests.

Signed-off-by: Allison Henderson <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: remove alloc_semp
Amir Goldstein [Tue, 10 May 2011 01:52:36 +0000 (21:52 -0400)]
ext4: remove alloc_semp

After taking care of all group init races, all that remains is to
remove alloc_semp from ext4_allocation_context and ext4_buddy structs.

Signed-off-by: Amir Goldstein <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches
Amir Goldstein [Tue, 10 May 2011 01:49:42 +0000 (21:49 -0400)]
ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches

After online resize which adds new groups, some of the groups
in a buddy page may be initialized and uptodate, while other
(new ones) may be uninitialized.

The indication for init of new block groups is when ext4_mb_init_cache()
is called with an uptodate buddy page. In this case, initialized groups
on that buddy page must be skipped when initializing the buddy cache.

Signed-off-by: Amir Goldstein <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: synchronize ext4_mb_init_group() with buddy page lock
Amir Goldstein [Tue, 10 May 2011 01:48:13 +0000 (21:48 -0400)]
ext4: synchronize ext4_mb_init_group() with buddy page lock

The old routines ext4_mb_[get|put]_buddy_cache_lock(), which used
to take grp->alloc_sem for all groups on the buddy page have been
replaced with the routines ext4_mb_[get|put]_buddy_page_lock().

The new routines take both buddy and bitmap page locks to protect
against concurrent init of groups on the same buddy page.

The GROUP_NEED_INIT flag is tested again under page lock to check
if the group was initialized by another caller.

Signed-off-by: Amir Goldstein <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: implement ext4_add_groupblocks() by freeing blocks
Amir Goldstein [Tue, 10 May 2011 01:40:01 +0000 (21:40 -0400)]
ext4: implement ext4_add_groupblocks() by freeing blocks

The old implementation used to take grp->alloc_sem and set the
GROUP_NEED_INIT flag, so that the buddy cache would be reloaded.

The new implementation updates the buddy cache by freeing the added
blocks and making them available for use, so there is no need to
reload the buddy cache and there is no need to take grp->alloc_sem.

Signed-off-by: Amir Goldstein <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: remove unneeded ext4_journal_get_undo_access
Theodore Ts'o [Mon, 9 May 2011 14:58:45 +0000 (10:58 -0400)]
ext4: remove unneeded ext4_journal_get_undo_access

The block allocation code used to use jbd2_journal_get_undo_access as
a way to make changes that wouldn't show up until the commit took
place.  The new multi-block allocation code has its own way of
preventing newly freed blocks from getting reused until the commit
takes place (it avoids updating the buddy bitmaps until the commit is
done), so we don't need to use jbd2_journal_get_undo_access(), which
has extra overhead compared to jbd2_journal_get_write_access().

There was one last vestigial use of ext4_journal_get_undo_access() in
ext4_add_groupblocks(); change it to use ext4_journal_get_write_access()
and then remove the ext4_journal_get_undo_access() support.

Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: move ext4_add_groupblocks() to mballoc.c
Amir Goldstein [Mon, 9 May 2011 14:46:41 +0000 (10:46 -0400)]
ext4: move ext4_add_groupblocks() to mballoc.c

In preparation for the next patch, the function ext4_add_groupblocks()
is moved to mballoc.c, where it could use some static functions.

Signed-off-by: Amir Goldstein <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: remove redundant #ifdef in super.c
Amerigo Wang [Mon, 9 May 2011 14:30:41 +0000 (10:30 -0400)]
ext4: remove redundant #ifdef in super.c

There is already an #ifdef CONFIG_QUOTA some lines above,
so this one is totally useless.

Signed-off-by: WANG Cong <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: remove redundant check for first_not_zeroed in ext4_register_li_request
Tao Ma [Mon, 9 May 2011 14:28:41 +0000 (10:28 -0400)]
ext4: remove redundant check for first_not_zeroed in ext4_register_li_request

We have checked first_not_zeroed == ngroups already above, so remove
this redundant check.

sbi->s_li_request = NULL above is also removed since it is NULL

Cc: Lukas Czerner <>
Signed-off-by: Tao Ma <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: use s_inodes_per_block directly in __ext4_get_inode_loc
Tao Ma [Mon, 9 May 2011 14:26:41 +0000 (10:26 -0400)]
ext4: use s_inodes_per_block directly in __ext4_get_inode_loc

In __ext4_get_inode_loc, we calculate inodes_per_block every time by
EXT4_BLOCK_SIZE(sb) / EXT4_INODE_SIZE(sb).  AFAICS, this function is a
hot path for ext4, so we'd better use s_inodes_per_block directly
instead of calculating every time.
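
The lookup arithmetic with a precomputed inodes-per-block value can be
sketched like this.  The structs and locate_inode() are simplified,
hypothetical stand-ins for the superblock info and the inode location;
the inode table's starting block is left out.

```c
#include <stdint.h>

/* Values computed once at mount time in this sketch. */
struct sb_info {
    uint32_t inodes_per_group;
    uint32_t inodes_per_block;  /* block size / inode size, cached */
    uint32_t inode_size;
};

struct inode_loc {
    uint32_t group;             /* block group holding the inode */
    uint32_t block_in_table;    /* block index within the inode table */
    uint32_t offset;            /* byte offset within that block */
};

/* Inode numbers start at 1, so `ino` must be non-zero. */
struct inode_loc locate_inode(const struct sb_info *sbi, uint32_t ino)
{
    struct inode_loc loc;
    uint32_t idx = (ino - 1) % sbi->inodes_per_group;

    loc.group = (ino - 1) / sbi->inodes_per_group;
    loc.block_in_table = idx / sbi->inodes_per_block;   /* no division of sizes here */
    loc.offset = (idx % sbi->inodes_per_block) * sbi->inode_size;
    return loc;
}
```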

Signed-off-by: Tao Ma <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: use EXT4FS_DEBUG instead of EXT4_DEBUG in fsync.c
Tao Ma [Mon, 9 May 2011 14:25:54 +0000 (10:25 -0400)]
ext4: use EXT4FS_DEBUG instead of EXT4_DEBUG in fsync.c

We have EXT4FS_DEBUG for some old debug and CONFIG_EXT4_DEBUG
for the new mballoc debug, but there isn't any EXT4_DEBUG.

As CONFIG_EXT4_DEBUG seems to be only used in mballoc, use
EXT4FS_DEBUG in fsync.c.

[ It doesn't really matter; although I'm including this commit for
  consistency's sake.  The whole point of the #ifdef's is to disable
  the debugging code.  In general you're not going to want to enable
  all of the code protected by EXT4FS_DEBUG at the same time.  -- Ted ]

Signed-off-by: Tao Ma <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: only print the debugging information for tid wraparound once
Theodore Ts'o [Sun, 8 May 2011 23:37:54 +0000 (19:37 -0400)]
jbd2: only print the debugging information for tid wraparound once

If we somehow wrap, we don't want to keep printing the warning message
over and over again.

Signed-off-by: "Theodore Ts'o" <>
10 years ago  jbd2: Fix forever sleeping process in do_get_write_access()
Jan Kara [Sun, 8 May 2011 23:09:53 +0000 (19:09 -0400)]
jbd2: Fix forever sleeping process in do_get_write_access()

In do_get_write_access() we wait on the BH_Unshadow bit for the buffer
to leave the shadow state. The waking code in
journal_commit_transaction() has a bug because it does not issue a
memory barrier after the buffer is moved from the shadow state and
before wake_up_bit() is called. Thus a waitqueue check can happen
before the buffer is actually moved from the shadow state and the
waiting process may never be woken. Fix the problem by issuing a
proper barrier.

Reported-by: Tao Ma <>
Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: reimplement convert and split_unwritten
Yongqiang Yang [Tue, 3 May 2011 16:25:07 +0000 (12:25 -0400)]
ext4: reimplement convert and split_unwritten

Reimplement ext4_ext_convert_to_initialized() and
ext4_split_unwritten_extents() using ext4_split_extent()

Signed-off-by: Yongqiang Yang <>
Signed-off-by: "Theodore Ts'o" <>
Tested-by: Allison Henderson <>
10 years ago  ext4: add ext4_split_extent_at() and ext4_split_extent()
Yongqiang Yang [Tue, 3 May 2011 16:23:07 +0000 (12:23 -0400)]
ext4: add ext4_split_extent_at() and ext4_split_extent()

Add two functions: ext4_split_extent_at(), which splits an extent into
two extents at given logical block, and ext4_split_extent() which
splits an extent into three extents.
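
The two operations can be modelled on a bare (start, length) pair.
This is a user-space illustration only: struct extent and both
functions are hypothetical, and building the three-way split from two
single cuts is a design choice of the sketch.

```c
#include <stdint.h>

/* A run of logical blocks: [start, start + len). */
struct extent {
    uint32_t start;
    uint32_t len;
};

/* Cut `ex` at block `at`.  Returns 1 and fills out[2] if `at` lies
 * strictly inside the extent, otherwise returns 0. */
int split_extent_at(struct extent ex, uint32_t at, struct extent out[2])
{
    if (at <= ex.start || at >= ex.start + ex.len)
        return 0;
    out[0] = (struct extent){ ex.start, at - ex.start };
    out[1] = (struct extent){ at, ex.start + ex.len - at };
    return 1;
}

/* Isolate [start, start + len) inside `ex`, producing up to three
 * extents.  The range is assumed to lie within `ex`.  Returns the
 * number of pieces written to out[]. */
int split_extent(struct extent ex, uint32_t start, uint32_t len,
                 struct extent out[3])
{
    struct extent pair[2], tail = { 0, 0 };
    int have_tail = 0, n = 0;

    if (split_extent_at(ex, start + len, pair)) {   /* cut off the tail */
        ex = pair[0];
        tail = pair[1];
        have_tail = 1;
    }
    if (split_extent_at(ex, start, pair)) {         /* cut off the head */
        out[n++] = pair[0];
        ex = pair[1];
    }
    out[n++] = ex;                                  /* the requested range */
    if (have_tail)
        out[n++] = tail;
    return n;
}
```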

Signed-off-by: Yongqiang Yang <>
Signed-off-by: "Theodore Ts'o" <>
Tested-by: Allison Henderson <>
10 years ago  ext4: add a function merging extents right and left
Yongqiang Yang [Tue, 3 May 2011 15:45:29 +0000 (11:45 -0400)]
ext4: add a function merging extents right and left

1) Rename ext4_ext_try_to_merge() to ext4_ext_try_to_merge_right().

2) Add a new function ext4_ext_try_to_merge() which tries to merge
   an extent both left and right.

3) Use the new function in ext4_ext_convert_unwritten_endio() and

Signed-off-by: Yongqiang Yang <>
Tested-by: Allison Henderson <>
10 years ago  ext4: fix deadlock in ext4_symlink() in ENOSPC conditions
Jan Kara [Tue, 3 May 2011 15:12:58 +0000 (11:12 -0400)]
ext4: fix deadlock in ext4_symlink() in ENOSPC conditions

ext4_symlink() cannot call __page_symlink() with a transaction open.
__page_symlink() calls ext4_write_begin(), which can wait for a
transaction commit if we are running out of space, thus causing a
deadlock. Also error recovery in ext4_truncate_failed_write() does not
account for the transaction being already started (although I'm not
aware of any particular deadlock here).

Fix the problem by stopping a transaction before calling
__page_symlink() (we have to be careful and put inode to orphan list
so that it gets deleted in case of crash) and starting another one
after __page_symlink() returns for addition of symlink into a

Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: Fix fs corruption when make_indexed_dir() fails
Jan Kara [Tue, 3 May 2011 15:05:55 +0000 (11:05 -0400)]
ext4: Fix fs corruption when make_indexed_dir() fails

When make_indexed_dir() fails (e.g. because of ENOSPC) after it has
allocated a block for the index tree root, we did not properly mark all
changed buffers dirty.  This led to only some of these buffers being
written out and thus effectively corrupting the directory.

Fix the issue by marking all changed data dirty even in the
failure case.

Signed-off-by: Jan Kara <>
Signed-off-by: "Theodore Ts'o" <>
10 years ago  ext4: set extents flag when migrating file to use extents
Theodore Ts'o [Tue, 3 May 2011 13:34:42 +0000 (09:34 -0400)]
ext4: set extents flag when migrating file to use extents

Fix a typo that was introduced in commit 07a038245b (in 2.6.36) which
caused the extents flag not to be set at the conclusion of converting
an inode to use extents.

Reported-by: Peter Uchno <>
Signed-off-by: "Theodore Ts'o" <>
10 years agojbd2: fix fsync() tid wraparound bug
Theodore Ts'o [Sun, 1 May 2011 22:16:26 +0000 (18:16 -0400)]
jbd2: fix fsync() tid wraparound bug

If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated.  If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in

J_ASSERT(journal->j_running_transaction != NULL);

It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync().  But
with the right workload, it can.

We fix this by replacing the use of tid_geq() with an equality test,
since there's only one transaction id that it is valid for us to
wait on until it is committed: namely, that of the currently running
transaction (if it exists).
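
To illustrate why the ordered comparison is fragile, here is a small
userspace sketch.  tid_geq() below follows the signed-difference style
of the jbd2 helper; should_wait() is a hypothetical name standing in
for the equality test, not the actual kernel code.

```c
#include <stdint.h>

typedef uint32_t tid_t;

/* Wraparound-aware "x >= y": the unsigned difference is reinterpreted
 * as a signed 32-bit value, so the answer is only meaningful while the
 * two ids are less than 2**31 commits apart. */
static int tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/* Hypothetical form of the fixed check: only the tid of the currently
 * running transaction is worth waiting on. */
static int should_wait(tid_t datasync_tid, tid_t running_tid)
{
	return datasync_tid == running_tid;
}
```

With a stale tid of 10 and a running tid more than 2**31 commits
later, tid_geq(stale, running) is true even though the stale
transaction committed long ago, while the equality test correctly
rejects it.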

Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: remove obsolete mount options from ext4's documentation
Theodore Ts'o [Sun, 1 May 2011 22:14:26 +0000 (18:14 -0400)]
ext4: remove obsolete mount options from ext4's documentation

The block reservation code from ext3 was removed long ago...

Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: remove dead code in ext4_has_free_blocks()
Shaohua Li [Sun, 1 May 2011 22:11:18 +0000 (18:11 -0400)]
ext4: remove dead code in ext4_has_free_blocks()

percpu_counter_sum_positive() never returns a negative value.

Signed-off-by: Shaohua Li <>
Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: ignore errors when issuing discards
Theodore Ts'o [Sat, 30 Apr 2011 17:47:24 +0000 (13:47 -0400)]
ext4: ignore errors when issuing discards

This is an effective revert of commit a30eec2a8: "ext4: stop issuing
discards if not supported by device".  The problem is that there are
some devices that may return errors in response to a discard request
sometimes but not others.  (One example would be a hybrid dm device
which concatenates an SSD and an HDD device).

By this logic, I also removed the error checking from ext4's FITRIM
code; so that an error from a discard will not stop the FITRIM from
trying to trim the rest of the file system.

Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: don't set PageUptodate in ext4_end_bio()
Curt Wohlgemuth [Sat, 30 Apr 2011 17:26:26 +0000 (13:26 -0400)]
ext4: don't set PageUptodate in ext4_end_bio()

In the bio completion routine, we should not be setting
PageUptodate at all -- it's set at sys_write() time, and is
unaffected by success/failure of the write to disk.

This can cause a page corruption bug when the file system's
block size is less than the architecture's VM page size.

If we have only written a single block, we might end up
setting the page's PageUptodate flag, indicating that the page
is completely read into memory, which may not be true.
This could cause subsequent reads to get bad data.
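
As a rough illustration of the block-size versus page-size problem,
consider a page tracked with one validity bit per block.  The bitmask
representation and the function name are assumptions made for this
sketch; the kernel tracks this state through buffer heads and page
flags.

```c
#include <assert.h>

/* One bit per block in the page; bit i set means block i holds valid
 * data.  The page as a whole is up to date only if every block is. */
static int page_fully_uptodate(unsigned int block_mask,
			       unsigned int blocks_per_page)
{
	unsigned int all;

	assert(blocks_per_page > 0 && blocks_per_page < 32);
	all = (1u << blocks_per_page) - 1;
	return (block_mask & all) == all;
}
```

With 1K blocks in a 4K page, completing a write of one block leaves
three blocks whose contents were never read in, so marking the whole
page up to date at that point would be wrong.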

This commit also takes the opportunity to clean up error
handling in ext4_end_bio(), and remove some extraneous code:

   - fixes ext4_end_bio() to set AS_EIO in the
     page->mapping->flags on error, which was left out by
     mistake.  This is needed so that fsync() will
     return an error if there was an I/O error.
   - remove the clear_buffer_dirty() call on unmapped
     buffers for each page.
   - consolidate page/buffer error handling in a single

Signed-off-by: Curt Wohlgemuth <>
Signed-off-by: "Theodore Ts'o" <>
Reported-by: Jim Meyering <>
Reported-by: Hugh Dickins <>
Cc: Mingming Cao <>
10 years agoext4: check for ext[23] file system features when mounting as ext[23]
Theodore Ts'o [Mon, 18 Apr 2011 21:29:14 +0000 (17:29 -0400)]
ext4: check for ext[23] file system features when mounting as ext[23]

Provide better emulation for ext[23] mode by enforcing that the file
system does not have any unsupported file system features as defined
by ext[23] when emulating the ext[23] file system driver when
CONFIG_EXT4_USE_FOR_EXT23 is defined.

This causes the file system type information in /proc/mounts to be
correct for the automatically mounted root file system.  This also
means that "mount -t ext2 /dev/sda /mnt" will fail if /dev/sda
contains an ext3 or ext4 file system, just as one would expect if the
original ext2 file system driver were in use.
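
The kind of check involved can be sketched as a simple mask test.  The
feature bits and the *_SUPPORTED sets below are made-up values for
illustration only; the real definitions live in the ext2/ext3/ext4
headers.

```c
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_FILETYPE	0x01u
#define FEAT_JOURNAL	0x02u
#define FEAT_EXTENTS	0x04u

/* Hypothetical per-driver sets of supported features. */
#define EXT2_SUPPORTED	(FEAT_FILETYPE)
#define EXT3_SUPPORTED	(FEAT_FILETYPE | FEAT_JOURNAL)

/* A mount is allowed only if the file system uses no feature outside
 * the set supported by the driver being emulated. */
static int features_ok(uint32_t fs_features, uint32_t supported)
{
	return (fs_features & ~supported) == 0;
}
```

Under these assumptions a file system with the journal bit set passes
the ext3 check but fails the ext2 one, which is the behaviour the
"mount -t ext2" example above describes.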

Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: release page cache in ext4_mb_load_buddy error path
Yang Ruirui [Sat, 16 Apr 2011 23:17:48 +0000 (19:17 -0400)]
ext4: release page cache in ext4_mb_load_buddy error path

Add missing page_cache_release in the error path of ext4_mb_load_buddy

Signed-off-by: Yang Ruirui <>
Signed-off-by: "Theodore Ts'o" <>
10 years agoLinux 2.6.39-rc3 v2.6.39-rc3
Linus Torvalds [Tue, 12 Apr 2011 00:21:51 +0000 (17:21 -0700)]
Linux 2.6.39-rc3

10 years agoMerge branch 'for-linus' of git://
Linus Torvalds [Mon, 11 Apr 2011 22:48:57 +0000 (15:48 -0700)]
Merge branch 'for-linus' of git://

* 'for-linus' of git://
  xfs: use proper interfaces for on-stack plugging
  xfs: fix xfs_debug warnings
  xfs: fix variable set but not used warnings
  xfs: convert log tail checking to a warning
  xfs: catch bad block numbers freeing extents.
  xfs: push the AIL from memory reclaim and periodic sync
  xfs: clean up code layout in xfs_trans_ail.c
  xfs: convert the xfsaild threads to a workqueue
  xfs: introduce background inode reclaim work
  xfs: convert ENOSPC inode flushing to use new syncd workqueue
  xfs: introduce a xfssyncd workqueue
  xfs: fix extent format buffer allocation size
  xfs: fix unreferenced var error in xfs_buf.c

Also, applied patch from Tony Luck that fixes ia64:
  xfs_destroy_workqueues() should not be tagged with __exit
in the branch before merging.

10 years agoxfs_destroy_workqueues() should not be tagged with __exit
Luck, Tony [Mon, 11 Apr 2011 19:06:12 +0000 (12:06 -0700)]
xfs_destroy_workqueues() should not be tagged with __exit

ia64 throws away .exit sections for the built-in CONFIG case, so routines
that are used in other circumstances should not be tagged as __exit.

Signed-off-by: Tony Luck <>
Reviewed-by: Christoph Hellwig <>
Signed-off-by: Alex Elder <>
Signed-off-by: Linus Torvalds <>
10 years agoMerge branch 'for_linus' of git://
Linus Torvalds [Mon, 11 Apr 2011 22:45:47 +0000 (15:45 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/tytso/ext4

* 'for_linus' of git://
  ext4: fix data corruption regression by reverting commit 6de9843dab3f
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section

10 years agoMerge branch 'for-2.6.39' of git://
Linus Torvalds [Mon, 11 Apr 2011 22:45:17 +0000 (15:45 -0700)]
Merge branch 'for-2.6.39' of git://

* 'for-2.6.39' of git://
  nfsd4: fix oops on lock failure
  nfsd: fix auth_domain reference leak on nlm operations

10 years agoMerge branch 'spi/merge' of git://
Linus Torvalds [Mon, 11 Apr 2011 22:44:38 +0000 (15:44 -0700)]
Merge branch 'spi/merge' of git://

* 'spi/merge' of git://
  dt/fsldma: fix build warning caused by of_platform_device changes
  spi: Fix race condition in stop_queue()
  gpio/pch_gpio: Fix output value of pch_gpio_direction_output()
  gpio/ml_ioh_gpio: Fix output value of ioh_gpio_direction_output()
  gpio/pca953x: fix error handling path in probe() call

10 years agopci: fix PCI bus allocation alignment handling
Linus Torvalds [Mon, 11 Apr 2011 17:53:11 +0000 (10:53 -0700)]
pci: fix PCI bus allocation alignment handling

In commit 13583b16592a ("PCI: refactor io size calculation code") Ram
had a thinko in the refactorization of the code: the end result used the
variable 'align' for the bus alignment, but the original code used

Since then, another use of that 'align' variable got introduced by
commit c8adf9a3e873 ("PCI: pre-allocate additional resources to devices
only after successful allocation of essential resources.")

Fix both of those uses to use 'min_align' as they should.

Daniel Hellstrom <>
Acked-by: Ram Pai <>
Acked-by: Jesse Barnes <>
Signed-off-by: Linus Torvalds <>
10 years agoMerge git://
Linus Torvalds [Mon, 11 Apr 2011 14:27:24 +0000 (07:27 -0700)]
Merge git://git./linux/kernel/git/davem/net-2.6

* git:// (34 commits)
  net: Add support for SMSC LAN9530, LAN9730 and LAN89530
  mlx4_en: Restoring RX buffer pointer in case of failure
  mlx4: Sensing link type at device initialization
  ipv4: Fix "Set rt->rt_iif more sanely on output routes."
  MAINTAINERS: add entry for Xen network backend
  be2net: Fix suspend/resume operation
  be2net: Rename some struct members for clarity
  pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev
  dsa/mv88e6131: add support for mv88e6085 switch
  ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2)
  be2net: Fix a potential crash during shutdown.
  bna: Fix for handling firmware heartbeat failure
  can: mcp251x: Allow pass IRQ flags through platform data.
  smsc911x: fix mac_lock acquision before calling smsc911x_mac_read
  iwlwifi: accept EEPROM version 0x423 for iwl6000
  rt2x00: fix cancelling uninitialized work
  rtlwifi: Fix some warnings/bugs
  p54usb: IDs for two new devices
  wl12xx: fix potential buffer overflow in testmode nvs push
  zd1211rw: reset rx idle timer from tasklet

10 years agodt/fsldma: fix build warning caused by of_platform_device changes
Ira W. Snyder [Thu, 7 Apr 2011 17:33:03 +0000 (10:33 -0700)]
dt/fsldma: fix build warning caused by of_platform_device changes

Commit 000061245a6797d542854106463b6b20fbdcb12e, "dt/powerpc:
Eliminate users of of_platform_{,un}register_driver" forgot to convert
the type of structure passed into platform_device_register() when it
was converted from of_platform_device_register. Fix it.

Signed-off-by: Ira W. Snyder <>
Signed-off-by: Grant Likely <>
10 years agoext4: fix data corruption regression by reverting commit 6de9843dab3f
Theodore Ts'o [Mon, 11 Apr 2011 02:30:07 +0000 (22:30 -0400)]
ext4: fix data corruption regression by reverting commit 6de9843dab3f

Revert commit 6de9843dab3f2a1d4d66d80aa9e5782f80977d20, since it
caused a data corruption regression with BitTorrent downloads.  Thanks
to Damien for discovering and bisecting to find the problem commit.

Reported-by: Damien Grassart <>
Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: Allow indirect-block file to grow the file size to max file size
Kazuya Mio [Mon, 11 Apr 2011 02:06:36 +0000 (22:06 -0400)]
ext4: Allow indirect-block file to grow the file size to max file size

We can create a 4402345721856-byte file with indirect block mapping.
However, if we grow an indirect-block file to that size with ftruncate(),
we see an ext4 warning. This patch fixes the problem.

How to reproduce:
# dd if=/dev/zero of=/mnt/mp1/hoge bs=1 count=0 seek=4402345721856
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000221428 s, 0.0 kB/s
# tail -n 1 /var/log/messages
Nov 25 15:10:27 test kernel: EXT4-fs warning (device sda8): ext4_block_to_path:345: block 1074791436 > max in inode 12
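
The two numbers above are consistent with the usual indirect-block
arithmetic, assuming a 4096-byte block size and 4-byte block pointers
(the block size is an assumption; it is not stated in this message).
A small sketch of the calculation, with hypothetical function names:

```c
#include <stdint.h>

/* Blocks addressable through 12 direct pointers plus single, double
 * and triple indirect blocks, with 4-byte block pointers. */
static uint64_t indirect_max_blocks(uint64_t block_size)
{
	uint64_t ptrs = block_size / 4;	/* pointers per indirect block */

	return 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
}

/* The same limit expressed in bytes. */
static uint64_t indirect_max_bytes(uint64_t block_size)
{
	return indirect_max_blocks(block_size) * block_size;
}
```

For a block size of 4096 this gives 1074791436 blocks, the block
number in the warning, and 4402345721856 bytes, the seek offset used
with dd.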

Signed-off-by: Kazuya Mio <>
Signed-off-by: "Theodore Ts'o" <>
10 years agoext4: allow an active handle to be started when freezing
Yongqiang Yang [Mon, 11 Apr 2011 02:06:07 +0000 (22:06 -0400)]
ext4: allow an active handle to be started when freezing

ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen.  Otherwise a deadlock can easily occur; one
such scenario is shown below.

     freeze         |       truncate
                    |  ext4_ext_truncate()
    freeze_super()  |   starts a handle
    sets s_frozen   |
                    |  ext4_ext_truncate()
                    |  holds i_data_sem
  ext4_freeze()     |
  waits for updates |
                    |  ext4_free_blocks()
                    |  calls dquot_free_block()
                    |  dquot_free_blocks()
                    |  calls ext4_dirty_inode()
                    |  ext4_dirty_inode()
                    |  tries to start an active
                    |  handle
                    |  blocks due to s_frozen
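
The rule that breaks this cycle can be sketched as a small predicate.
struct handle and must_wait_for_thaw() are hypothetical names used
only for this illustration.

```c
#include <stddef.h>

struct handle { int refs; };	/* hypothetical stand-in for a journal handle */

/* Returns 1 if the caller must wait for the file system to thaw before
 * starting a handle, 0 if it may proceed.  A task that already holds
 * an active handle is never blocked, because the freeze itself is
 * waiting for that handle to finish. */
static int must_wait_for_thaw(const struct handle *current_handle, int frozen)
{
	if (current_handle != NULL)
		return 0;	/* nested start on an already active handle */
	return frozen;
}
```

In the scenario above, the truncate path already holds a handle when
ext4_dirty_inode() asks for one, so under this rule it proceeds and
the freeze can complete once the handle is stopped.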

Signed-off-by: Yongqiang Yang <>
Signed-off-by: "Theodore Ts'o" <>
Reported-by: Amir Goldstein <>
Reviewed-by: Jan Kara <>
Reviewed-by: Andreas Dilger <>
10 years agoext4: sync the directory inode in ext4_sync_parent()
Curt Wohlgemuth [Mon, 11 Apr 2011 02:05:31 +0000 (22:05 -0400)]
ext4: sync the directory inode in ext4_sync_parent()

ext4 has taken the stance that, in the absence of a journal,
when an fsync/fdatasync of an inode is done, the parent
directory should be sync'ed if this inode entry is new.
ext4_sync_parent(), which implements this, does indeed sync
the dirent pages for parent directories, but it does not
sync the directory *inode*.  This patch fixes this.

Also now return error status from ext4_sync_parent().

I tested this using a power fail test, which panics a
machine running a file server getting requests from a
client.  Without this patch, on about every other test run,
the server is missing many, many files that had been synced.
With this patch, on > 6 runs, I see zero files being lost.

Google-Bug-Id: 4179519
Signed-off-by: Curt Wohlgemuth <>
Signed-off-by: "Theodore Ts'o" <>
10 years agonet: Add support for SMSC LAN9530, LAN9730 and LAN89530
Steve Glendinning [Mon, 11 Apr 2011 01:59:27 +0000 (18:59 -0700)]
net: Add support for SMSC LAN9530, LAN9730 and LAN89530

This patch adds support for SMSC's LAN9530, LAN9730 and LAN89530 USB
ethernet controllers to the existing smsc95xx driver by adding
their new USB VID/PID pairs.

Signed-off-by: Steve Glendinning <>
Signed-off-by: David S. Miller <>
10 years agoMerge branch 'for-linus' of git://
Linus Torvalds [Sun, 10 Apr 2011 16:56:10 +0000 (09:56 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://
  ALSA: hda - Don't query connections for widgets have no connections
  ALSA: HDA: Fix single internal mic on ALC275 (Sony Vaio VPCSB1C5E)
  ALSA: hda - HDMI: Fix MCP7x audio infoframe checksums
  ALSA: usb-audio: define another USB ID for a buggy USB MIDI cable
  ALSA: HDA: Fix dock mic for Lenovo X220-tablet
  ASoC: format_register_str: Don't clip register values
  ASoC: PXA: Fix oops in __pxa2xx_pcm_prepare
  ASoC: zylonite: set .codec_dai_name in initializer

10 years agonfsd4: fix oops on lock failure
J. Bruce Fields [Mon, 28 Mar 2011 07:15:09 +0000 (15:15 +0800)]
nfsd4: fix oops on lock failure

Lock stateid's can have access_bmap 0 if they were only partially
initialized (due to a failed lock request); handle that case in

------------[ cut here ]------------
kernel BUG at fs/nfsd/nfs4state.c:380!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
Modules linked in: nfs fscache md4 nls_utf8 cifs ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss sunrpc ipv6 ppdev parport_pc parport pcnet32 mii pcspkr microcode i2c_piix4 BusLogic floppy [last unloaded: mperf]

Pid: 1468, comm: nfsd Not tainted 2.6.38+ #120 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<e24f180d>] EFLAGS: 00010297 CPU: 0
EIP is at nfs4_access_to_omode+0x1c/0x29 [nfsd]
EAX: ffffffff EBX: dd758120 ECX: 00000000 EDX: 00000004
ESI: dd758120 EDI: ddfe657c EBP: dd54dde0 ESP: dd54dde0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process nfsd (pid: 1468, ti=dd54c000 task=ddc92580 task.ti=dd54c000)
 dd54ddf0 e24f19ca 00000000 ddfe6560 dd54de08 e24f1a5d dd758130 deee3a20
 ddfe6560 31270000 dd54df1c e24f52fd 0000000f dd758090 e2505dd0 0be304cf
 dbb51d68 0000000e ddfe657c ddcd8020 dd758130 dd758128 dd7580d8 dd54de68
Call Trace:
 [<e24f19ca>] free_generic_stateid+0x1c/0x3e [nfsd]
 [<e24f1a5d>] release_lockowner+0x71/0x8a [nfsd]
 [<e24f52fd>] nfsd4_lock+0x617/0x66c [nfsd]
 [<e24e57b6>] ? nfsd_setuser+0x199/0x1bb [nfsd]
 [<e24e056c>] ? nfsd_setuser_and_check_port+0x65/0x81 [nfsd]
 [<c07a0052>] ? _cond_resched+0x8/0x1c
 [<c04ca61f>] ? slab_pre_alloc_hook.clone.33+0x23/0x27
 [<c04cac01>] ? kmem_cache_alloc+0x1a/0xd2
 [<c04835a0>] ? __call_rcu+0xd7/0xdd
 [<e24e0dfb>] ? fh_verify+0x401/0x452 [nfsd]
 [<e24f0b61>] ? nfsd4_encode_operation+0x52/0x117 [nfsd]
 [<e24ea0d7>] ? nfsd4_putfh+0x33/0x3b [nfsd]
 [<e24f4ce6>] ? nfsd4_delegreturn+0xd4/0xd4 [nfsd]
 [<e24ea2c9>] nfsd4_proc_compound+0x1ea/0x33e [nfsd]
 [<e24de6ee>] nfsd_dispatch+0xd1/0x1a5 [nfsd]
 [<e1d6e1c7>] svc_process_common+0x282/0x46f [sunrpc]
 [<e1d6e578>] svc_process+0xdc/0xfa [sunrpc]
 [<e24de0fa>] nfsd+0xd6/0x115 [nfsd]
 [<e24de024>] ? nfsd_shutdown+0x24/0x24 [nfsd]
 [<c0454322>] kthread+0x62/0x67
 [<c04542c0>] ? kthread_worker_fn+0x114/0x114
 [<c07a6ebe>] kernel_thread_helper+0x6/0x10
Code: eb 05 b8 00 00 27 4f 8d 65 f4 5b 5e 5f 5d c3 83 e0 03 55 83 f8 02 89 e5 74 17 83 f8 03 74 05 48 75 09 eb 09 b8 02 00 00 00 eb 0b <0f> 0b 31 c0 eb 05 b8 01 00 00 00 5d c3 55 89 e5 57 56 89 d6 8d
EIP: [<e24f180d>] nfs4_access_to_omode+0x1c/0x29 [nfsd] SS:ESP 0068:dd54dde0
---[ end trace 2b0bf6c6557cb284 ]---

The trace route is:

 -> nfsd4_lock()
   -> if (lock->lk_is_new) {
     -> alloc_init_lock_stateid()

        3739: stp->st_access_bmap = 0;

   ->if (status && lock->lk_is_new && lock_sop)
     -> release_lockowner()
      -> free_generic_stateid()
       -> nfs4_access_bmap_to_omode()
          -> nfs4_access_to_omode()

        380: BUG();   *****

This problem was introduced by 0997b173609b9229ece28941c118a2a9b278796e.
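
A simplified userspace sketch of the failure and of the guard follows.
It treats the access value as a plain NFSv4 share access number rather
than the kernel's bitmap, returns -1 where the kernel would hit BUG(),
and release_access() is a hypothetical helper, not the nfsd function.

```c
#include <fcntl.h>

/* Share access values from the NFSv4 protocol. */
#define NFS4_SHARE_ACCESS_READ	1
#define NFS4_SHARE_ACCESS_WRITE	2
#define NFS4_SHARE_ACCESS_BOTH	3

/* Map share access to an open mode; returns -1 for the case where the
 * kernel version would hit BUG(). */
static int access_to_omode(unsigned int access)
{
	switch (access & NFS4_SHARE_ACCESS_BOTH) {
	case NFS4_SHARE_ACCESS_READ:
		return O_RDONLY;
	case NFS4_SHARE_ACCESS_WRITE:
		return O_WRONLY;
	case NFS4_SHARE_ACCESS_BOTH:
		return O_RDWR;
	}
	return -1;
}

/* Hypothetical release helper: skip the mode lookup entirely when the
 * stateid never got any access bits.  Returns 1 if access was dropped. */
static int release_access(unsigned int access)
{
	if (access == 0)
		return 0;	/* partially initialized: nothing to put */
	return access_to_omode(access) >= 0;
}
```

The point of the guard is that a zero access value is a legitimate
state after a failed lock request, so the release path has to tolerate
it instead of asserting.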

Reported-by: Mi Jinlong <>
Tested-by: Mi Jinlong <>
Signed-off-by: J. Bruce Fields <>
10 years agoMerge git://
Linus Torvalds [Sat, 9 Apr 2011 20:23:50 +0000 (13:23 -0700)]
Merge git://

* git://
  mtd: atmel_nand: use CPU I/O when buffer is in vmalloc(ed) region
  mtd: atmel_nand: modify test case for using DMA operations
  mtd: atmel_nand: fix support for CPUs that do not support DMA access
  mtd: atmel_nand: trivial: change DMA usage information trace
  mtd: mtdswap: fix printk format warning

10 years agoMerge branch 'fix/hda' into for-linus
Takashi Iwai [Sat, 9 Apr 2011 08:05:53 +0000 (10:05 +0200)]
Merge branch 'fix/hda' into for-linus

10 years agoMerge branch 'fix/asoc' into for-linus
Takashi Iwai [Sat, 9 Apr 2011 08:05:30 +0000 (10:05 +0200)]
Merge branch 'fix/asoc' into for-linus

10 years agoMerge branch 'bugfixes' of git://
Linus Torvalds [Fri, 8 Apr 2011 18:47:35 +0000 (11:47 -0700)]
Merge branch 'bugfixes' of git://

* 'bugfixes' of git://
  NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC
  NFS: Fix a signed vs. unsigned secinfo bug
  Revert "net/sunrpc: Use static const char arrays"

10 years agosignal.c: fix erroneous syscall kernel-doc
Randy Dunlap [Fri, 8 Apr 2011 17:53:46 +0000 (10:53 -0700)]
signal.c: fix erroneous syscall kernel-doc

Fix erroneous syscall kernel-doc comments in kernel/signal.c.

Reported-by: Matt Fleming <>
Signed-off-by: Randy Dunlap <>
Signed-off-by: Linus Torvalds <>
10 years agoMerge branch 'for-linus' of git://
Linus Torvalds [Fri, 8 Apr 2011 14:36:14 +0000 (07:36 -0700)]
Merge branch 'for-linus' of git://

* 'for-linus' of git://
  [S390] compile fix for latest binutils
  [S390] cio: prevent purging of CCW devices in the online state
  [S390] qdio: fix init sequence
  [S390] Fix parameter passing for smp_switch_to_cpu()
  [S390] oprofile s390: prevent stack corruption

10 years agoMerge branch 'for_linus' of git://
Linus Torvalds [Fri, 8 Apr 2011 14:35:17 +0000 (07:35 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://
  quota: Don't write quota info in dquot_commit()
  ext3: Fix writepage credits computation for ordered mode

10 years agoxfs: use proper interfaces for on-stack plugging
Christoph Hellwig [Wed, 30 Mar 2011 11:05:09 +0000 (11:05 +0000)]
xfs: use proper interfaces for on-stack plugging

Add proper blk_start_plug/blk_finish_plug pairs for the two places where
we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and
xfs_buf_iowait, given that context switches already flush the per-process
plugging lists.

Signed-off-by: Christoph Hellwig <>
Signed-off-by: Alex Elder <>
10 years agoxfs: fix xfs_debug warnings
Christoph Hellwig [Sat, 2 Apr 2011 18:13:40 +0000 (18:13 +0000)]
xfs: fix xfs_debug warnings

For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no
effect in xfs_debug:

fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles':
fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect

The reason for that is that the various new xfs message functions have a
return value which is never used, and in case of the non-debug build
xfs_debug the macro evaluates to a plain 0 which produces the above
warnings.  This can be fixed by turning xfs_debug into an inline function
instead of a macro, but in addition to that I've also changed all the
message helpers to return void as we never use their return values.
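
The macro-versus-inline difference can be shown with a small
standalone sketch.  struct mount and my_debug() are hypothetical
stand-ins for the XFS types and helpers.

```c
#include <stdarg.h>
#include <stdio.h>

struct mount { const char *name; };	/* hypothetical stand-in for xfs_mount */

#ifdef DEBUG
/* Debug build: actually print the message. */
static inline void my_debug(const struct mount *mp, const char *fmt, ...)
{
	va_list args;

	fprintf(stderr, "(%s): ", mp->name);
	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
}
#else
/* Non-debug build: an empty inline function instead of a macro that
 * expands to a bare 0.  A call to it is a proper statement, so nothing
 * is left for the compiler to flag as having no effect. */
static inline void my_debug(const struct mount *mp, const char *fmt, ...)
{
	(void)mp;
	(void)fmt;
}
#endif
```

Returning void in both variants also makes it impossible for callers
to start depending on a return value that only exists in one build
configuration.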

Signed-off-by: Christoph Hellwig <>
Reviewed-by: Dave Chinner <>
Signed-off-by: Alex Elder <>
10 years agoxfs: fix variable set but not used warnings
Christoph Hellwig [Mon, 4 Apr 2011 12:55:44 +0000 (12:55 +0000)]
xfs: fix variable set but not used warnings

GCC 4.6 now warns about variables set but not used.  Fix the trivially
fixable warnings of this sort.

Signed-off-by: Christoph Hellwig <>
Signed-off-by: Alex Elder <>
10 years agomlx4_en: Restoring RX buffer pointer in case of failure
Yevgeny Petrilin [Wed, 6 Apr 2011 23:25:45 +0000 (23:25 +0000)]
mlx4_en: Restoring RX buffer pointer in case of failure

If not done, second attempt to open the RX ring would cause memory corruption.

Signed-off-by: Yevgeny Petrilin <>
Signed-off-by: David S. Miller <>
10 years agomlx4: Sensing link type at device initialization
Yevgeny Petrilin [Wed, 6 Apr 2011 23:24:42 +0000 (23:24 +0000)]
mlx4: Sensing link type at device initialization

When bringing the port up, perform a SENSE_PORT command to check
which physical link type (IB or Ethernet) the physical port is
connected to.
If there is no valid link partner, the port will come up as its
supported default.

Signed-off-by: Yevgeny Petrilin <>
Signed-off-by: David S. Miller <>
10 years agoxfs: convert log tail checking to a warning
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: convert log tail checking to a warning

On the Power platform, the log tail debug checks fire excessively
causing the system to panic early in testing. The debug checks are
known to be racy, though on x86_64 there is no evidence that they
trigger at all.

We want to keep the checks active on debug systems to alert us to
problems with log space accounting, but we need to reduce the impact
of a racy check on testing on the Power platform.

As a result, convert the ASSERT conditions to warnings, and
allow them to fire only once per filesystem mount. This will prevent
false positives from interfering with testing, whilst still
providing us with the indication that there may be a problem with log
space accounting should that occur.
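
A minimal sketch of a once-per-mount warning, assuming a hypothetical
per-mount flag (the field and function names are invented for this
example):

```c
#include <stdio.h>

struct mount {
	const char *name;
	unsigned int warned_tail : 1;	/* hypothetical per-mount flag */
};

/* Replace a fatal assertion with a warning that fires at most once
 * per mount.  Returns 1 if the warning was printed. */
static int warn_once_tail(struct mount *mp, int cond_ok)
{
	if (cond_ok || mp->warned_tail)
		return 0;
	mp->warned_tail = 1;
	fprintf(stderr, "%s: log tail check failed\n", mp->name);
	return 1;
}
```

Keeping the flag in the mount structure rather than in a static
variable means each filesystem still reports its first failure.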

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: catch bad block numbers freeing extents.
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: catch bad block numbers freeing extents.

A fuzzed filesystem crashed a kernel when freeing an extent with a
block number beyond the end of the filesystem. Convert all the debug
asserts in xfs_free_extent() to active checks so that we catch bad
extents and return that the filesystem is corrupted rather than

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: push the AIL from memory reclaim and periodic sync
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: push the AIL from memory reclaim and periodic sync

When we are short on memory, we want to expedite the cleaning of
dirty objects.  Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible.  To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.

Further, we keep items in the AIL that are dirty that are not
tracked any other way, so we can get objects sitting in the AIL that
don't get written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.

This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
match the new xfs_ail_max_lsn() function introduced in this patch.
Similarly for xfs_trans_ail_push -> xfs_ail_push.

Signed-off-by: Dave Chinner <>
Reviewed-by: Alex Elder <>
10 years agoxfs: clean up code layout in xfs_trans_ail.c
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: clean up code layout in xfs_trans_ail.c

This patch rearranges the location of functions in xfs_trans_ail.c
to remove the need for forward declarations of those functions in
preparation for adding new functions without the need for forward

Signed-off-by: Dave Chinner <>
Reviewed-by: Alex Elder <>
10 years agoxfs: convert the xfsaild threads to a workqueue
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: convert the xfsaild threads to a workqueue

Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically by delayed
works. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.

By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: introduce background inode reclaim work
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: introduce background inode reclaim work

Background inode reclaim needs to run more frequently than the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.

Background inode reclaim is kicked by the act of marking inodes for
reclaim.  When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically until it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for

To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaims inodes efficiently, make it
kick the background inode reclaim so that when we are low on memory we
are trying to reclaim inodes as efficiently as possible. This kick should
not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.

To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.

These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have an impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= the number of CPUs.

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: convert ENOSPC inode flushing to use new syncd workqueue
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: convert ENOSPC inode flushing to use new syncd workqueue

One of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. This can result in hundreds of queued flushes, most of
which simply burn CPU scanning and do no real work. This simply slows
down allocation at ENOSPC.

We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.
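
The "queue only if not already active" idea can be sketched as below.
This is a single-threaded illustration with hypothetical names and no
locking; in the kernel the workqueue code itself refuses to queue a
work item that is already pending.

```c
#include <stdbool.h>

struct flush_state {
	bool pending;		/* a flush is queued or running */
	unsigned int queued;	/* how many flushes were actually queued */
};

/* Queue a flush only if none is pending; callers then wait for the
 * one in flight.  Returns true if this call queued new work. */
static bool request_flush(struct flush_state *fs)
{
	if (fs->pending)
		return false;
	fs->pending = true;
	fs->queued++;
	return true;
}

/* Called when the active flush completes. */
static void flush_done(struct flush_state *fs)
{
	fs->pending = false;
}
```

However many ENOSPC events arrive while a flush is in flight, only
one flush is queued; the next one is queued only after flush_done().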

On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test

This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: introduce a xfssyncd workqueue
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: introduce a xfssyncd workqueue

All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can all be
managed by a global workqueue now they manage concurrency

Introduce a new global xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.

For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.

Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.

Signed-off-by: Dave Chinner <>
Reviewed-by: Christoph Hellwig <>
Reviewed-by: Alex Elder <>
10 years agoxfs: fix extent format buffer allocation size
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: fix extent format buffer allocation size

When formatting an inode item, we have to allocate a separate buffer
to hold extents when there are delayed allocation extents on the
inode and it is in extent format. The allocation size is derived
from the in-core data fork representation, which accounts for
delayed allocation extents, while the on-disk representation does
not contain any delalloc extents.

As a result of this mismatch, the allocated buffer can be far larger
than needed to hold the real extent list which, due to the fact the
inode is in extent format, is limited to the size of the literal
area of the inode. However, we can have thousands of delalloc
extents, resulting in an allocation size orders of magnitude larger
than is needed to hold all the real extents.

Fix this by limiting the size of the buffer being allocated to the
size of the literal area of the inodes in the filesystem (i.e. the
maximum size an inode fork can grow to).
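
In essence the fix is a clamp.  A minimal sketch, with hypothetical
parameter names and example sizes chosen only for illustration:

```c
#include <stddef.h>

/* Size of the buffer for the on-disk extent list: never larger than
 * the inode's literal area, however many delalloc extents the in-core
 * fork currently holds. */
static size_t extent_buf_size(size_t incore_fork_bytes,
			      size_t literal_area_bytes)
{
	return incore_fork_bytes < literal_area_bytes ?
		incore_fork_bytes : literal_area_bytes;
}
```

For example, an in-core fork with thousands of 16-byte extent records
would otherwise ask for tens of kilobytes, while the clamped size
stays within the few hundred bytes an inode fork can actually occupy.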

Signed-off-by: Dave Chinner <>
Reviewed-by: Alex Elder <>
10 years agoipv4: Fix "Set rt->rt_iif more sanely on output routes."
OGAWA Hirofumi [Thu, 7 Apr 2011 21:04:08 +0000 (14:04 -0700)]
ipv4: Fix "Set rt->rt_iif more sanely on output routes."

Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more
sanely on output routes.")  breaks rt_is_{output,input}_route.

As a result, IP_PKTINFO returned "->ipi_ifindex == 0".

To fix it, this patch does the following:

1) Add "int rt_route_iif;" to struct rtable

2) For input routes, always set rt_route_iif to the same value as rt_iif

3) For output routes, always set rt_route_iif to zero.  Set rt_iif
   as it is done currently.

4) Change rt_is_{output,input}_route() to test rt_route_iif
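
The four steps above can be sketched as a small user-space model. Only
rt_iif, rt_route_iif and the two rt_is_*_route() names come from this
message; struct rtable_model and the init_*_route() helpers are hypothetical
and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct rtable; the real one has many more fields. */
struct rtable_model {
	int rt_iif;       /* set as before for both kinds of route */
	int rt_route_iif; /* new field: nonzero only for input routes */
};

/* Step 2: input routes mirror rt_iif into rt_route_iif. */
static void init_input_route(struct rtable_model *rt, int in_ifindex)
{
	assert(in_ifindex != 0); /* an input route has a real interface */
	rt->rt_iif = in_ifindex;
	rt->rt_route_iif = in_ifindex;
}

/* Step 3: output routes keep rt_iif but zero rt_route_iif. */
static void init_output_route(struct rtable_model *rt, int iif)
{
	rt->rt_iif = iif;
	rt->rt_route_iif = 0;
}

/* Step 4: the predicates test rt_route_iif rather than rt_iif. */
static bool rt_is_input_route(const struct rtable_model *rt)
{
	return rt->rt_route_iif != 0;
}

static bool rt_is_output_route(const struct rtable_model *rt)
{
	return rt->rt_route_iif == 0;
}
```

In this model an output route can carry a nonzero rt_iif and still be
classified as an output route, which is the property the fix restores.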

Signed-off-by: OGAWA Hirofumi <>
Signed-off-by: David S. Miller <>
10 years agoMerge git://
Linus Torvalds [Thu, 7 Apr 2011 20:34:41 +0000 (13:34 -0700)]
Merge git://git./linux/kernel/git/wim/linux-2.6-watchdog

* git://
  watchdog: mpc8xxx_wdt: fix build

10 years agowatchdog: mpc8xxx_wdt: fix build
Peter Korsgaard [Wed, 30 Mar 2011 13:48:22 +0000 (15:48 +0200)]
watchdog: mpc8xxx_wdt: fix build

Since 1c48a5c93da6313 (dt: Eliminate of_platform_{,un}register_driver)
mpc8xxx_wdt no longer builds as it tries to refer to a 'match' variable
rather than ofdev->dev.of_match that it checks just before.

Signed-off-by: Peter Korsgaard <>
Acked-by: Grant Likely <>
Signed-off-by: Wim Van Sebroeck <>
10 years agoNFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC
Bryan Schumaker [Thu, 7 Apr 2011 20:02:20 +0000 (16:02 -0400)]
NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC

When attempting an initial mount, we should only attempt other
authflavors if AUTH_UNIX receives an NFS4ERR_WRONGSEC error.
This allows other errors to be passed back to userspace programs.
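
A minimal user-space sketch of that decision follows. Everything prefixed
with demo_ or DEMO_ is hypothetical, including the flavor numbers and the
fake server; only the NFS4ERR_WRONGSEC name and the "fall back only on
WRONGSEC" rule come from this message. How the fallback loop treats errors
from the later flavors is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NFS4ERR_WRONGSEC 10016 /* NFSv4 protocol error code */

/* Illustrative flavor numbers only; not real pseudoflavor values. */
enum demo_flavor { DEMO_AUTH_UNIX = 1, DEMO_KRB5 = 2, DEMO_KRB5I = 3 };

/* Hypothetical fake server used to exercise the logic. */
struct demo_server {
	int accepted;   /* the one flavor this server accepts */
	int hard_error; /* if nonzero, every lookup fails with -hard_error */
	int calls;      /* number of lookups attempted */
};

static int demo_lookup_root(struct demo_server *srv, int flavor)
{
	srv->calls++;
	if (srv->hard_error)
		return -srv->hard_error;
	return flavor == srv->accepted ? 0 : -NFS4ERR_WRONGSEC;
}

/*
 * Try AUTH_UNIX first. Other flavors are attempted only when the server
 * answers with WRONGSEC; any other result is returned unchanged.
 */
static int demo_find_root(struct demo_server *srv, const int *others,
			  size_t n)
{
	int status;
	size_t i;

	assert(srv != NULL);
	status = demo_lookup_root(srv, DEMO_AUTH_UNIX);
	if (status != -NFS4ERR_WRONGSEC)
		return status;

	for (i = 0; i < n; i++) {
		status = demo_lookup_root(srv, others[i]);
		if (status != -NFS4ERR_WRONGSEC)
			break; /* assumption: stop on success or any other error */
	}
	return status;
}
```

When the fake server fails with an unrelated error, demo_find_root() makes a
single lookup and hands that error straight back to the caller.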

Signed-off-by: Bryan Schumaker <>
Signed-off-by: Trond Myklebust <>
10 years agoMerge branch 'fbdev-fixes-for-linus' of git://
Linus Torvalds [Thu, 7 Apr 2011 19:49:17 +0000 (12:49 -0700)]
Merge branch 'fbdev-fixes-for-linus' of git://git./linux/kernel/git/lethal/fbdev-2.6

* 'fbdev-fixes-for-linus' of git://
  efifb: Add override for 11" Macbook Air 3,1
  efifb: Support overriding fields FW tells us with the DMI data.
  fb: Reduce priority of resource conflict message
  savagefb: Remove obsolete else clause in savage_setup_i2c_bus
  savagefb: Set up I2C based on chip family instead of card id
  savagefb: Replace magic register address with define
  drivers/video/bfin-lq035q1-fb.c: introduce missing kfree
  video: s3c-fb: fix checkpatch errors and warning
  efifb: support AMD Radeon HD 6490
  s3fb: fix Virge/GX2
  fbcon: Remove unused 'display *p' variable from fb_flashcursor()
  fbdev: sh_mobile_lcdcfb: fix module lock acquisition
  fbdev: sh_mobile_lcdcfb: add blanking support
  viafb: initialize margins correct
  viafb: refresh rate bug collection
  sh: mach-ap325rxa: move backlight control code
  sh: mach-ecovec24: support for main lcd backlight

10 years agoMerge branch 'rmobile-fixes-for-linus' of git://
Linus Torvalds [Thu, 7 Apr 2011 19:49:01 +0000 (12:49 -0700)]
Merge branch 'rmobile-fixes-for-linus' of git://git./linux/kernel/git/lethal/sh-2.6

* 'rmobile-fixes-for-linus' of git://
  ARM: arch-shmobile: only run FSI init on respective boards
  ARM: arch-shmobile: only run HDMI init on respective boards
  ARM: mach-shmobile: Correctly check for CONFIG_MACH_MACKEREL

10 years agoMerge branch 'sh-fixes-for-linus' of git://
Linus Torvalds [Thu, 7 Apr 2011 19:48:45 +0000 (12:48 -0700)]
Merge branch 'sh-fixes-for-linus' of git://git./linux/kernel/git/lethal/sh-2.6

* 'sh-fixes-for-linus' of git://
  sh: select ARCH_NO_SYSDEV_OPS.
  sh: fix build error in board-sh7757lcr.c
  sh: landisk: Remove whitespace
  sh: landisk: Remove mv_nr_irqs
  sh: sh-sci: Fix double initialization by serial_console_setup
  serial: sh-sci: prevent setup of uninitialized serial console
  dma: shdma: add checking the DMAOR_AE in sh_dmae_err