Pandora Sourcecodes - pandora-kernel.git/log

md/raid5: don't do chunk aligned read on degraded array.

When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
1) optimal, as baseline;
2) degraded;
3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.

Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: allow the stripe_cache to grow and shrink.

The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: change ->inactive_blocked to a bit-flag.

This allows us to easily add more (atomic) flags.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe

Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: pass gfp_t arg to grow_one_stripe()

This is needed for future improvement to stripe cache management.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: introduce configuration option rmw_level

Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: activate raid6 rmw feature

Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6 algorithms: xor_syndrome() for SSE2

The second and (last) optimized XOR syndrome calculation. This version
supports right and left side optimization. All CPUs with architecture
older than Haswell will benefit from it.

It should be noted that SSE2 movntdq kills performance for memory areas
that are read and written simultaneously in chunks smaller than cache
line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR
functions.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6 algorithms: xor_syndrome() for generic int

Start the algorithms with the very basic one. It is left and right
optimized. That means we can avoid all calculations for unneeded pages
above the right stop offset. For pages below the left start offset we
still need the syndrome multiplication but without reading data pages.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6 algorithms: improve test program

It is always helpful to have a test tool in place if we implement
new data critical algorithms. So add some test routines to the raid6
checker that can prove if the new xor_syndrome() works as expected.

Run through all permutations of start/stop pages per algorithm and
simulate a xor_syndrome() assisted rmw run. After each rmw check if
the recovery algorithm still confirms that the stripe is fine.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6 algorithms: delta syndrome functions

v3: s-o-b comment, explanation of performance and descision for
the start/stop implementation

Implementing rmw functionality for RAID6 requires optimized syndrome
calculation. Up to now we can only generate a complete syndrome. The
target P/Q pages are always overwritten. With this patch we provide
a framework for inplace P/Q modification. In the first place simply
fill those functions with NULL values.

xor_syndrome() has two additional parameters: start & stop. These
will indicate the first and last page that are changing during a
rmw run. That makes it possible to avoid several unneccessary loops
and speed up calculation. The caller needs to implement the following
logic to make the functions work.

1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source
blocks inside P/Q between (and including) start and end.

2) modify any block with start <= block <= stop

3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of
source blocks into P/Q between (and including) start and end.

Pages between start and stop that won't be changed should be filled
with a pointer to the kernel zero page. The reasons for not taking NULL
pages are:

1) Algorithms cross the whole source data line by line. Thus avoid
additional branches.

2) Having a NULL page avoids calculating the XOR P parity but still
need calulation steps for the Q parity. Depending on the algorithm
unrolling that might be only a difference of 2 instructions per loop.

The benchmark numbers of the gen_syndrome() functions are displayed in
the kernel log. Do the same for the xor_syndrome() functions. This
will help to analyze performance problems and give an rough estimate
how well the algorithm works. The choice of the fastest algorithm will
still depend on the gen_syndrome() performance.

With the start/stop page implementation the speed can vary a lot in real
life. E.g. a change of page 0 & page 15 on a stripe will be harder to
compute than the case where page 0 & page 1 are XOR candidates. To be not
to enthusiatic about the expected speeds we will run a worse case test
that simulates a change on the upper half of the stripe. So we do:

1) calculation of P/Q for the upper pages

2) continuation of Q for the lower (empty) pages

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: handle expansion/resync case with stripe batching

expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: handle io error of batch list

If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

RAID5: batch adjacent full stripe write

stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: track overwrite disk count

Track overwrite disk count, so we can know if a stripe is a full stripe write.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: add a new flag to track if a stripe can be batched

A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: use flex_array for scribble data

Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid

The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: allow resync to go faster when there is competing IO.

When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster.  Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
- for some loads, the resync speed is unchanged.  For those loads
   increasing the minimum doesn't change the speed either.
   So this is a good result.  To increase resync speed under such
   loads we would probably need to increase the resync window
   size.

- for other loads, resync speed does increase to a reasonable
   fraction (e.g. 20%) of maximum possible, and throughput of
   the load only drops a little bit (e.g. 10%)

- for other loads, throughput of the non-sync load drops quite a bit
   more.  These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>

md: remove 'go_faster' option from ->sync_request()

This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>

md: don't require sync_min to be a multiple of chunk_size.

There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.

Signed-off-by: NeilBrown <neilb@suse.de>

Merge branch 'cluster' into for-next

md-cluster: re-add capabilities

When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: re-add a failed disk

This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md-cluster: remove capabilities

This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Export and rename find_rdev_nr_rcu

This is required by the clustering module (patches to follow) to
find the device to remove or re-add.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Export and rename kick_rdev_from_array

This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md-cluster: correct the num for comparison

Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid0: fix bug with chunksize not a power of 2.

Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.

This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.

So restore that first arg before re-using the variable.

Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>

md: fix md io stats accounting broken

Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.

Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>

Linux 4.0-rc7

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
    previously SACK'd, from Neal Cardwell.

2) Need to hold RNL mutex in ipv4 multicast code namespace cleanup,
    from Cong WANG.

3) Similarly we have to hold RNL mutex for fib_rules_unregister(), also
    from Cong WANG.

4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.

5) When we encapsulate for a tunnel device, skb->sk still points to the
    user socket.  So this leads to cases where we retraverse the
    ipv4/ipv6 output path with skb->sk being of some other address
    family (f.e. AF_PACKET).  This can cause things to crash since the
    ipv4 output path is dereferencing an AF_PACKET socket as if it were
    an ipv4 one.

    The short term fix for 'net' and -stable is to elide these socket
    checks once we've entered an encapsulation sequence by testing
    xmit_recursion.

    Longer term we have a better solution wherein we pass the tunnel's
    socket down through the output paths, but that is way too invasive
    for 'net' and -stable.

    From Hannes Frederic Sowa.

6) l2tp_init() failure path forgets to unregister per-net ops, from
    Cong WANG.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
  net: dsa: fix filling routing table from OF description
  l2tp: unregister l2tp_net_ops on failure path
  mvneta: dont call mvneta_adjust_link() manually
  ipv6: protect skb->sk accesses from recursive dereference inside the stack
  netns: don't allocate an id for dead netns
  Revert "netns: don't clear nsid too early on removal"
  ip6mr: call del_timer_sync() in ip6mr_free_table()
  net: move fib_rules_unregister() under rtnl lock
  ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
  tcp: fix FRTO undo on cumulative ACK of SACKed range
  xen-netfront: transmit fully GSO-sized packets

net/mlx4_core: Fix error message deprecation for ConnectX-2 cards

Commit 1daa4303b4ca ("net/mlx4_core: Deprecate error message at
ConnectX-2 cards startup to debug") did the deprecation only for port 1
of the card. Need to deprecate for port 2 as well.

Fixes: 1daa4303b4ca ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: fix filling routing table from OF description

According to description in 'include/net/dsa.h', in cascade switches
configurations where there are more than one interconnected devices,
'rtable' array in 'dsa_chip_data' structure is used to indicate which
port on this switch should be used to send packets to that are destined
for corresponding switch.

However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
of the _target_ switch, but not current one.

This commit removes redundant devicetree parsing and adds needed port
number as a function argument. So dsa_of_setup_routing_table() now just
looks for target switch number by parsing parent of 'link' device node.

To remove possible misunderstandings with the way of determining target
switch number, a corresponding comment was added to the source code and
to the DSA device tree bindings documentation file.

This was tested on a custom board with two Marvell 88E6095 switches with
following corresponding routing tables: { -1, 10 } and { 8, -1 }.

Signed-off-by: Pavel Nakonechny <pavel.nakonechny@skitlab.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"Updates for the input subsystem - two more tweaks for ALPS driver to
  work out kinks after splitting the touchpad, trackstick, and potential
  external PS/2 mouse into separate input devices.

  Changes to support ALPS SS4 devices (protocol V8) will be coming in
  4.1..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: alps - document stick behavior for protocol V2
  Input: alps - report V2 Dualpoint Stick events via the right evdev node
  Input: alps - report interleaved bare PS/2 packets via dev3

l2tp: unregister l2tp_net_ops on failure path

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mvneta: dont call mvneta_adjust_link() manually

mvneta_adjust_link() is a callback for of_phy_connect() and should
not be called directly. The result of calling it directly is as below:

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: protect skb->sk accesses from recursive dereference inside the stack

We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Input: alps - document stick behavior for protocol V2

Document that protocol V2 uses standard (bare) PS/2 mouse packets for the
DualPoint stick.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-By: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: alps - report V2 Dualpoint Stick events via the right evdev node

On V2 devices the DualPoint Stick reports bare packets, these should be
reported via the "AlpsPS/2 ALPS DualPoint Stick" dev2 evdev node, which also
has the INPUT_PROP_POINTING_STICK propbit set.

Note that since there is no way to distinguish these packets from an external
PS/2 mouse (insofar as these laptops have an external PS/2 port) this means
that we will be reporting PS/2 mouse events via this evdev node too, as we've
been doing in kernel 3.19 and older.

This has been tested on a Dell Latitude D620 and a Dell Latitude E6400,
which both have a V2 touchpad + a DualPoint Stick which reports bare packets.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: alps - report interleaved bare PS/2 packets via dev3

Bare packets should be reported via the same evdev device independent on
whether they are detected on the beginning of a packet or in the middle
of a packet.

This has been tested on a Dell Latitude E6400, where the DualPoint Stick
reports bare packets, which get reported via dev3 when the touchpad is
idle, and via dev2 when the touchpad and stick are used simultaneously.

This commit fixes this inconsistency by always reporting bare packets via
dev3. Note that since the come from a DualPoint Stick they really should be
reported via dev2, this gets fixed in a later commit.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'usb-4.0-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB fixes and new device ids for 4.0-rc6.  Nothing
  major, some xhci fixes for reported problems, and some usb-serial
  device ids.

  All have been in linux-next for a while"

* tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: ftdi_sio: Use jtag quirk for SNAP Connect E10
  usb: isp1760: fix spin unlock in the error path of isp1760_udc_start
  usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers
  usb: xhci: handle Config Error Change (CEC) in xhci driver
  USB: keyspan_pda: add new device id
  USB: ftdi_sio: Added custom PID for Synapse Wireless product

Merge tag 'staging-4.0-rc6' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are some staging driver fixes, well, really all just IIO driver
  fixes, for 4.0-rc6.  They fix issues that have been reported with
  these drivers.

  All of these patches have been in linux-next for a while"

* tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: imu: Use iio_trigger_get for indio_dev->trig assignment
  iio: adc: vf610: use ADC clock within specification
  iio/adc/cc10001_adc.c: Fix !HAS_IOMEM build
  iio: core: Fix double free.
  iio:inv-mpu6050: Fix inconsistency for the scale channel
  staging: iio: dummy: Fix undefined symbol build error
  iio: inv_mpu6050: Clear timestamps fifo while resetting hardware fifo
  staging: iio: hmc5843: Set iio name property in sysfs
  iio: bmc150: change sampling frequency
  iio: fix drivers that check buffer->scan_mask

Merge tag 'tty-4.0-rc6' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
"Here are 3 serial driver fixes for 4.0-rc6.  They fix some reported
  issues with the samsung and fsl_lpuart drivers.

  All have been in linux-next for a while"

* tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: serial: fsl_lpuart: clear receive flag on FIFO flush
  tty: serial: fsl_lpuart: specify transmit FIFO size
  serial: samsung: Clear operation mode on UART shutdown

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input subsystem fixes from Dmitry Torokhov:
"A fix for ALPS driver for issue introduced in the latest update and a
  tweak for yet another Lenovo box in Synaptics.

  There will be more ALPS tweaks coming.."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: define INPUT_PROP_ACCELEROMETER behavior
  Input: synaptics - fix min-max quirk value for E440
  Input: synaptics - add quirk for Thinkpad E440
  Input: ALPS - fix max coordinates for v5 and v7 protocols
  Input: add MT_TOOL_PALM

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block layer fix from Jens Axboe:
"Just one patch in this pull request, fixing a regression caused by a
'mathematically correct' change to lcm()"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix blk_stack_limits() regression due to lcm() change

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"Misc fixes: a SYSRET single-stepping fix, a dmi-scan robustization
  fix, a reboot quirk and a kgdb fixlet"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kgdb/x86: Fix reporting of 'si' in kgdb on x86_64
  x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set
  x86/reboot: Add ASRock Q1900DC-ITX mainboard reboot quirk
  MAINTAINERS: Change the x86 microcode loader maintainer
  firmware: dmi_scan: Prevent dmi_num integer overflow

Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"Two x86 Intel PMU constraint handling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix Haswell CYCLE_ACTIVITY.* counter constraints
perf/x86/intel: Filter branches for PEBS event

Merge tag 'devicetree-for-linus' of git://git./linux/kernel/git/glikely/linux

Pull devicetree fix from Grant Likely:
"Simple bugfix for bad device tree data on the PA-Semi platform"

* tag 'devicetree-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux:
drivers/of: Add empty ranges quirk for PA-Semi

Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6

Pull CIFS fixes from Steve French:
"A set of small cifs fixes fixing a memory leak, kernel oops, and
  infinite loop (and some spotted by Coverity)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Fix warning
  Fix another dereference before null check warning
  CIFS: session servername can't be null
  Fix warning on impossible comparison
  Fix coverity warning
  Fix dereference before null check warning
  Don't ignore errors on encrypting password in SMBTcon
  Fix warning on uninitialized buftype
  cifs: potential memory leaks when parsing mnt opts
  cifs: fix use-after-free bug in find_writable_file
  cifs: smb2_clone_range() - exit on unhandled error

netns: don't allocate an id for dead netns

First, let's explain the problem.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
   => a netlink message is built in the netns foo to advertise this deletion
   => this netlink message requests an nsid for bar, thus a new nsid is
      allocated for bar and never removed.

This patch adds a check in peernet2id() so that an id cannot be allocated for
a netns which is currently destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "netns: don't clear nsid too early on removal"

This reverts
commit 4217291e592d ("netns: don't clear nsid too early on removal").

This is not the right fix, it introduces races.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: call del_timer_sync() in ip6mr_free_table()

We need to wait for the flying timers, since we
are going to free the mrtable right after it.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move fib_rules_unregister() under rtnl lock

We have to hold rtnl lock for fib_rules_unregister()
otherwise the following race could happen:

fib_rules_unregister(): fib_nl_delrule():
... ...
... ops = lookup_rules_ops();
list_del_rcu(&ops->list);
list_for_each_entry(ops->rules) {
fib_rules_cleanup_ops(ops); ...
list_del_rcu(); list_del_rcu();
}

Note, net->rules_mod_lock is actually not needed at all,
either upper layer netns code or rtnl lock guarantees
we are safe.

Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup

This is the IPv4 part for commit 905a6f96a1b1
(ipv6: take rtnl_lock and mark mrt6 table as freed on namespace cleanup).

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"One drm core fix, one exynos regression fix, two sets of radeon fixes
  (Alex was a bit behind last week), and two i915 fixes.

  Nothing too serious we seem to have calmed down i915 since last week"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: fix wait in radeon_mn_invalidate_range_start
  drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr
  drm: Exynos: Respect framebuffer pitch for FIMD/Mixer
  drm/i915: Reject the colorkey ioctls for primary and cursor planes
  drm/i915: Skip allocating shadow batch for 0-length batches
  drm/radeon: programm the VCE fw BAR as well
  drm/radeon: always dump the ring content if it's available
  radeon: Do not directly dereference pointers to BIOS area.
  drm/radeon/dpm: fix 120hz handling harder
  drm/edid: set ELD for firmware and debugfs override EDIDs

Merge tag 'irqchip-fixes-4.0-2' of git://git.infradead.org/users/jcooper/linux

Pull irqchip fixes from Jason Cooper:
"This is the second round of fixes for irqchip.  It contains some fixes
  found while the arm64 guys were writing the kvm gicv3 its emulation.

  GICv3 ITS:
    - Small batch of fixes discovered while writing the kvm ITS emulation"

* tag 'irqchip-fixes-4.0-2' of git://git.infradead.org/users/jcooper/linux:
  irqchip: gicv3-its: Use non-cacheable accesses when no shareability
  irqchip: gicv3-its: Fix PROP/PEND and BASE/CBASE confusion
  irqchip: gicv3-its: Fix device ID encoding
  irqchip: gicv3-its: Fix encoding of collection's target redistributor

Merge branch 'drm-fixes-4.0' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

Just two small fixes for radeon, both destined for stable.

* 'drm-fixes-4.0' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix wait in radeon_mn_invalidate_range_start
drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr

Merge branch 'exynos-drm-fixes' of git://git./linux/kernel/git/daeinki/drm-exynos into drm-fixes

Fix display on issue to Exynos5250 based Snow(1366x768) board.

* 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos:
drm: Exynos: Respect framebuffer pitch for FIMD/Mixer

Merge tag 'drm-intel-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel into drm-fixes

one oops fixes and a 0-length allocation fix from next backported.

* tag 'drm-intel-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Reject the colorkey ioctls for primary and cursor planes
drm/i915: Skip allocating shadow batch for 0-length batches

Merge tag 'topic/drm-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel into drm-fixes

Here's a single drm core fix, cc: stable, that affects i915
users.

* tag 'topic/drm-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
drm/edid: set ELD for firmware and debugfs override EDIDs

Merge tag 'stable/for-linus-4.0-rc6-tag' of git://git./linux/kernel/git/xen/tip

Pull xen regression fixes from David Vrabel:
"Fix two regressions in the balloon driver's use of memory hotplug when
  used in a PV guest"

* tag 'stable/for-linus-4.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/balloon: before adding hotplugged memory, set frames to invalid
  x86/xen: prepare p2m list for memory hotplug

Merge tag 'powerpc-4.0-4' of git://git./linux/kernel/git/mpe/linux

Pull powerpc fix from Michael Ellerman:
"Fix memory corruption by pnv_alloc_idle_core_states"

* tag 'powerpc-4.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc: fix memory corruption by pnv_alloc_idle_core_states

tcp: fix FRTO undo on cumulative ACK of SACKed range

On processing cumulative ACKs, the FRTO code was not checking the
SACKed bit, meaning that there could be a spurious FRTO undo on a
cumulative ACK of a previously SACKed skb.

The FRTO code should only consider a cumulative ACK to indicate that
an original/unretransmitted skb is newly ACKed if the skb was not yet
SACKed.

The effect of the spurious FRTO undo would typically be to make the
connection think that all previously-sent packets were in flight when
they really weren't, leading to a stall and an RTO.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rdma-for-linus' of git://git./linux/kernel/git/roland/infiniband

Pull infiniband/rdma fix from Roland Dreier:
"Fix for exploitable integer overflow in uverbs interface"

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic

xen-netfront: transmit fully GSO-sized packets

xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
    XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
    (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
it relates to the size of the whole packet, including the header.

Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
"This time we have addition of caps for jz4740 which fixes intentional
  warning at boot.  Then we have memory leak issues in drivers using
  virt-dma by Peter on few drive"

* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: moxart-dma: Fix memory leak when stopping a running transfer
  dmaengine: bcm2835-dma: Fix memory leak when stopping a running transfer
  dmaengine: omap-dma: Fix memory leak when terminating running transfer
  dmaengine: edma: fix memory leak when terminating running transfers
  dmaengine: jz4740: Define capabilities

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix use-after-free with mac80211 RX A-MPDU reorder timer, from
    Johannes Berg.

2) iwlwifi leaks memory every module load/unload cycles, fix from Larry
    Finger.

3) Need to use for_each_netdev_safe() in rtnl_group_changelink()
    otherwise we can crash, from WANG Cong.

4) mlx4 driver does register_netdev() too early in the probe sequence,
    from Ido Shamay.

5) Don't allow router discovery hop limit to decrease the interface's
    hop limit, from D.S. Ljungmark.

6) tx_packets and tx_bytes improperly accounted for certain classes of
    USB network devices, fix from Ben Hutchings.

7) ip{6}mr_rules_init() mistakenly use plain kfree to release the ipmr
    tables in the error path, they must instead use ip{6}mr_free_table().
    Fix from WANG Cong.

8) cxgb4 doesn't properly quiesce all RX activity before unregistering
    the netdevice.  Fix from Hariprasad Shenai.

9) Fix hash corruptions in ipvlan driver, from Jiri Benc.

10) nla_memcpy(), like a real memcpy, should fully initialize the
    destination buffer, even if the source attribute is smaller.  Fix
    from Jiri Benc.

11) Fix wrong error code returned from iucv_sock_sendmsg().  We should
    use whatever sock_alloc_send_skb() put into 'err'.  From Eugene
    Crosser.

12) Fix slab object leak on module unload in TIPC, from Ying Xue.

13) Need a READ_ONCE() when reading the cached RX socket route in
    tcp_v{4,6}_early_demux().  From Michal Kubecek.

14) Still too many problems with TPC support in the ath9k driver, so
    disable it for now.  From Felix Fietkau.

15) When in AP mode the rtlwifi driver can leak DMA mappings, fix from
    Larry Finger.

16) Missing kzalloc() failure check in gs_usb CAN driver, from Colin Ian
    King.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  cxgb4: Fix to dump devlog, even if FW is crashed
  cxgb4: Firmware macro changes for fw verison 1.13.32.0
  bnx2x: Fix kdump when iommu=on
  bnx2x: Fix kdump on 4-port device
  mac80211: fix RX A-MPDU session reorder timer deletion
  MAINTAINERS: Update Intel Wired Ethernet Driver info
  tipc: fix a slab object leak
  net/usb/r8152: add device id for Lenovo TP USB 3.0 Ethernet
  af_iucv: fix AF_IUCV sendmsg() errno
  openvswitch: Return vport module ref before destruction
  netlink: pad nla_memcpy dest buffer with zeroes
  bonding: Bonding Overriding Configuration logic restored.
  ipvlan: fix check for IP addresses in control path
  ipvlan: do not use rcu operations for address list
  ipvlan: protect against concurrent link removal
  ipvlan: fix addr hash list corruption
  net: fec: setup right value for mdio hold time
  net: tcp6: fix double call of tcp_v6_fill_cb()
  cxgb4vf: Fix sparse warnings
  netns: don't clear nsid too early on removal
  ...

IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic

Properly verify that the resulting page aligned end address is larger
than both the start address and the length of the memory area requested.

Both the start and length arguments for ib_umem_get are controlled by
the user. A misbehaving user can provide values which will cause an
integer overflow when calculating the page aligned end address.

This overflow can cause also miscalculation of the number of pages
mapped, and additional logic issues.

Addresses: CVE-2014-8159
Cc: <stable@vger.kernel.org>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

perf/x86/intel: Fix Haswell CYCLE_ACTIVITY.* counter constraints

Some of the CYCLE_ACTIVITY.* events can only be scheduled on
counter 2. Due to a typo Haswell matched those with
INTEL_EVENT_CONSTRAINT, which lead to the events never
matching as the comparison does not expect anything
in the umask too. Fix the typo.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425925222-32361-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

perf/x86/intel: Filter branches for PEBS event

For supporting Intel LBR branches filtering, Intel LBR sharing logic
mechanism is introduced from commit b36817e88630 ("perf/x86: Add Intel
LBR sharing logic"). It modifies __intel_shared_reg_get_constraints() to
config lbr_sel, which is finally used to set LBR_SELECT.

However, the intel_shared_regs_constraints() function is called after
intel_pebs_constraints(). The PEBS event will return immediately after
intel_pebs_constraints(). So it's impossible to filter branches for PEBS
events.

This patch moves intel_shared_regs_constraints() ahead of
intel_pebs_constraints().

We can safely do that because the intel_shared_regs_constraints() function
only returns empty constraint if its rejecting the event, otherwise it
returns NULL such that we continue calling intel_pebs_constraints() and
x86_get_event_constraint().

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1427467105-9260-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

drm/radeon: fix wait in radeon_mn_invalidate_range_start

We need to wait for all fences, not just the exclusive one.

Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr

We somehow try to free the SG table twice.

Bugs: https://bugs.freedesktop.org/show_bug.cgi?id=89734

Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm: Exynos: Respect framebuffer pitch for FIMD/Mixer

When performing a modeset, use the framebuffer pitch value to set FIMD
IMG_SIZE and Mixer SPAN registers. These are both defined as pitch - the
distance between contiguous lines (bytes for FIMD, pixels for mixer).

Fixes display on Snow (1366x768).

Signed-off-by: Daniel Stone <daniels@collabora.com>
Tested-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
Signed-off-by: Inki Dae <inki.dae@samsung.com>

kgdb/x86: Fix reporting of 'si' in kgdb on x86_64

This patch fixes an error in kgdb for x86_64 which would report
the value of dx when asked to give the value of si.

Signed-off-by: Steffen Liebergeld <steffen.liebergeld@kernkonzept.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set

When I wrote the opportunistic SYSRET code, I missed an important difference
between SYSRET and IRET.

Both instructions are capable of setting EFLAGS.TF, but they behave differently
when doing so:

- IRET will not issue a #DB trap after execution when it sets TF.
   This is critical -- otherwise you'd never be able to make forward progress when
   returning to userspace.

- SYSRET, on the other hand, will trap with #DB immediately after
   returning to CPL3, and the next instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set.  For example, running this code with TF set
and a SIGTRAP handler loaded never gets past 'post_nop':

extern unsigned char post_nop[];
asm volatile ("pushfq\n\t"
      "popq %%r11\n\t"
      "nop\n\t"
      "post_nop:"
      : : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel manual.

Fix it by using IRET to restore TF.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2a23c6b8a9c4 ("x86_64, entry: Use sysret to return to userspace when possible")
Link: http://lkml.kernel.org/r/9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

drm/i915: Reject the colorkey ioctls for primary and cursor planes

The legcy colorkey ioctls are only implemented for sprite planes, so
reject the ioctl for primary/cursor planes. If we want to support
colorkeying with these planes (assuming we have hw support of course)
we should just move ahead with the colorkey property conversion.

Testcase: kms_legacy_colorkey
Cc: Tommi Rantala <tt.rantala@gmail.com>
Cc: stable@vger.kernel.org
Reference: http://mid.gmane.org/CA+ydwtr+bCo7LJ44JFmUkVRx144UDFgOS+aJTfK6KHtvBDVuAw@mail.gmail.com
Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>

Merge tag 'wireless-drivers-for-davem-2015-04-01' of git://git./linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
iwlwifi:

* fix a memory leak, we leaked memory each time the module
was loaded.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-net'

Hariprasad Shenai says:

====================
cxgb4 FW macro changes for new FW

Fix to dump device log even in the case of firmware crash. Also
incorporates changes for new FW.

This patch series has been created against net tree and includes patches on
cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Fix to dump devlog, even if FW is crashed

Add new Common Code routines to retrieve Firmware Device Log
parameters from PCIE_FW_PF[7]. The firmware initializes its Device Log very
early on and stores the parameters for its location/size in that register.
Using the parameters from the register allows us to access the Firmware
Device Log even when the firmware crashes very early on or we're not
attached to the firmware

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Firmware macro changes for fw verison 1.13.32.0

Adds new macro and few macro changes for fw version 1.13.32.0 also
changes version string in driver to match 1.13.32.0

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-davem-2015-04-01' of git://git./linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This contains just a single fix for a crash I happened to randomly
run into today during testing. It's clearly been around for a while,
but is pretty hard to trigger, even when I tried explicitly (and
modified the code to make it more likely) it rarely did.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'iommu-fixes-v4.0-rc6' of git://git./linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:
"This contains fixes for:

   - a VT-d issue where hardware domain-ids might be freed while still
     in use.

   - an ipmmu-vmsa issue where where the device-table was not zero
     terminated

   - unchecked register access issue in the arm-smmu driver"

* tag 'iommu-fixes-v4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Remove unused variable
  iommu: ipmmu-vmsa: Add terminating entry for ipmmu_of_ids
  iommu/vt-d: Detach domain *only* from attached iommus
  iommu/arm-smmu: fix ARM_SMMU_FEAT_TRANS_OPS condition

lguest: now needs PCI_DIRECT.

Since commit 8e7094694396 ("lguest: add a dummy PCI host bridge.")
lguest uses PCI, but it needs you to frob the ports directly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'lazytime_fix' of git://git./linux/kernel/git/tytso/ext4

Pull lazytime fixes from Ted Ts'o:
"This fixes a problem in the lazy time patches, which can cause
  frequently updated inods to never have their timestamps updated.

  These changes guarantee that no timestamp on disk will be stale by
  more than 24 hours"

* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: add dirtytime_expire_seconds sysctl
  fs: make sure the timestamps for lazytime inodes eventually get written

Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
"Two main issues:

   - We found that turning on pNFS by default (when it's configured at
     build time) was too aggressive, so we want to switch the default
     before the 4.0 release.

   - Recent client changes to increase open parallelism uncovered a
     serious bug lurking in the server's open code.

  Also fix a krb5/selinux regression.

  The rest is mainly smaller pNFS fixes"

* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
  sunrpc: make debugfs file creation failure non-fatal
  nfsd: require an explicit option to enable pNFS
  NFSD: Fix bad update of layout in nfsd4_return_file_layout
  NFSD: Take care the return value from nfsd4_encode_stateid
  NFSD: Printk blocklayout length and offset as format 0x%llx
  nfsd: return correct lockowner when there is a race on hash insert
  nfsd: return correct openowner when there is a race to put one in the hash
  NFSD: Put exports after nfsd4_layout_verify fail
  NFSD: Error out when register_shrinker() fail
  NFSD: Take care the return value from nfsd4_decode_stateid
  NFSD: Check layout type when returning client layouts
  NFSD: restore trace event lost in mismerge

Merge branch 'bnx2'

Yuval Mintz says:

====================
bnx2x: kdump related fixes

This patch series aims to fix bnx2x driver issues when loading in kdump kernel.
Both issues fixed here would be fatal to the device, requiring full reset of
the system in order to recover, preventing the device from serving its purpose
in the kdump environment.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Fix kdump when iommu=on

When IOMM-vtd is active, once main kernel crashes unfinished DMAE transactions
will be blocked, putting the HW in an error state which will cause further
transactions to timeout.

Current employed logic uses wrong macros, causing the first function to be the
only function that cleanups that error state during its probe/load.

This patch allows all the functions to successfully re-load in kdump kernel.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Fix kdump on 4-port device

When running in a kdump kernel, it's very likely that due to sync. loss with
management firmware the first PCI function to probe and reach the previous
unload flow would decide it can reset the chip and continue onward. While doing
so, it will only close its own Rx port.

On a 4-port device where 2nd port on engine is a 1g-port, the 2nd port would
allow ingress traffic after the chip is reset [assuming it was active on the
first kernel]. This would later cause a HW attention.

This changes driver flow to close both ports' 1g capabilities during the
previous driver unload flow prior to the chip reset.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mac80211: fix RX A-MPDU session reorder timer deletion

There's an issue with the way the RX A-MPDU reorder timer is
deleted that can cause a kernel crash like this:

* tid_rx is removed - call_rcu(ieee80211_free_tid_rx)
* station is destroyed
* reorder timer fires before ieee80211_free_tid_rx() runs,
accessing the station, thus potentially crashing due to
the use-after-free

The station deletion is protected by synchronize_net(), but
that isn't enough -- ieee80211_free_tid_rx() need not have
run when that returns (it deletes the timer.) We could use
rcu_barrier() instead of synchronize_net(), but that's much
more expensive.

Instead, to fix this, add a field tracking that the session
is being deleted. In this case, the only re-arming of the
timer happens with the reorder spinlock held, so make that
code not rearm it if the session is being deleted and also
delete the timer after setting that field. This ensures the
timer cannot fire after ___ieee80211_stop_rx_ba_session()
returns, which fixes the problem.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

x86/reboot: Add ASRock Q1900DC-ITX mainboard reboot quirk

The ASRock Q1900DC-ITX mainboard (Baytrail-D) hangs randomly in
both BIOS and UEFI mode while rebooting unless reboot=pci is
used. Add a quirk to reboot via the pci method.

The problem is very intermittent and hard to debug, it might succeed
rebooting just fine 40 times in a row - but fails half a dozen times
the next day. It seems to be slightly less common in BIOS CSM mode
than native UEFI (with the CSM disabled), but it does happen in either
mode. Since I've started testing this patch in late january, rebooting
has been 100% reliable.

Most of the time it already hangs during POST, but occasionally it
might even make it through the bootloader and the kernel might even
start booting, but then hangs before the mode switch. The same symptoms
occur with grub-efi, gummiboot and grub-pc, just as well as (at least)
kernel 3.16-3.19 and 4.0-rc6 (I haven't tried older kernels than 3.16).
Upgrading to the most current mainboard firmware of the ASRock
Q1900DC-ITX, version 1.20, does not improve the situation.

( Searching the web seems to suggest that other Bay Trail-D mainboards
might be affected as well. )
--
Signed-off-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: <stable@vger.kernel.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/20150330224427.0fb58e42@mir
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'iio-fixes-for-4.0d' of git://git./linux/kernel/git/jic23/iio into staging-linus

Jonathan writes:

IIO fixes for 4.0 set 4

A couple more IIO fixes.

* Fix check for HAS_IOMEM in the cc100001_adc driver to avoid build errors.
  Rather curiously it was ORed with Regulator and clock support.
* vf610 driver was trying to use an ADC clock outside the possible
  spec on some boards.  The driver assumed a fixed clock speed previously
  across all boards, but that is not true.  This fix ensures that the
  reported frequency is correct on all boards.
* The adis imu common code directly set the current trigger to the
  driver supplied one.  Unfortunately this didn't increase the use count
  leading to a double free via a particular path of changing the trigger
  then removing the driver.

Merge tag 'usb-serial-4.0-rc6' of git://git./linux/kernel/git/johan/usb-serial into usb-linus

Johan writes:

USB-serial fixes for v4.0-rc6

Here are a few new device IDs.

Signed-off-by: Johan Hovold <johan@kernel.org>

Fix warning

Coverity reports a warning due to unitialized attr structure in one
code path.

Reported by Coverity (CID 728535)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>

Fix another dereference before null check warning

null tcon is not possible in these paths so
remove confusing null check

Reported by Coverity (CID 728519)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>

CIFS: session servername can't be null

remove impossible check

Pointed out by Coverity (CID 115422)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>

Fix warning on impossible comparison

workstation_RFC1001_name is part of the struct and can't be null,
remove impossible comparison (array vs. null)

Pointed out by Coverity (CID 140095)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>

Fix coverity warning

Coverity reports a warning for referencing the beginning of the
SMB2/SMB3 frame using the ProtocolId field as an array. Although
it works the same either way, this patch should quiet the warning
and might be a little clearer.

Reported by Coverity (CID 741269)

Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>

Fix dereference before null check warning

null tcon is not likely in these paths in current
code, but obviously it does clarify the code to
check for null (if at all) before derefrencing
rather than after.

Reported by Coverity (CID 1042666)

Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>

Don't ignore errors on encrypting password in SMBTcon

Although unlikely to fail (and tree connect does not commonly send
a password since SECMODE_USER is the default for most servers)
do not ignore errors on SMBNTEncrypt in SMB Tree Connect.

Reported by Coverity (CID 1226853)

Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>