pandora-kernel.git
3 years agonetfilter: xt_TCPMSS: add more sanity tests on tcph->doff
Eric Dumazet [Mon, 3 Apr 2017 17:55:11 +0000 (10:55 -0700)]
netfilter: xt_TCPMSS: add more sanity tests on tcph->doff

commit 2638fd0f92d4397884fd991d8f4925cb3f081901 upstream.

Denys provided an awesome KASAN report pointing to an use
after free in xt_TCPMSS

I have provided three patches to fix this issue, either in xt_TCPMSS or
in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible
impact.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonetfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet
Phil Oester [Sun, 1 Sep 2013 15:32:21 +0000 (08:32 -0700)]
netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet

commit 1205e1fa615805c9efa97303b552cf445965752a upstream.

In commit b396966c4 (netfilter: xt_TCPMSS: Fix missing fragmentation handling),
I attempted to add safe fragment handling to xt_TCPMSS.  However, Andy Padavan
of Project N56U correctly points out that returning XT_CONTINUE in this
function does not work.  The callers (tcpmss_tg[46]) expect to receive a value
of 0 in order to return XT_CONTINUE.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonetfilter: xt_TCPMSS: fix handling of malformed TCP header and options
Pablo Neira Ayuso [Thu, 25 Jul 2013 08:37:49 +0000 (10:37 +0200)]
netfilter: xt_TCPMSS: fix handling of malformed TCP header and options

commit 71ffe9c77dd7a2b62207953091efa8dafec958dd upstream.

Make sure the packet has enough room for the TCP header and
that it is not malformed.

While at it, store tcph->doff*4 in a variable, as it is used
several times.

This patch also fixes a possible off by one in case of malformed
TCP options.

Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonetfilter: xt_TCPMSS: Fix missing fragmentation handling
Phil Oester [Wed, 12 Jun 2013 08:58:20 +0000 (10:58 +0200)]
netfilter: xt_TCPMSS: Fix missing fragmentation handling

commit b396966c4688522863572927cb30aa874b3ec504 upstream.

Similar to commit bc6bcb59 ("netfilter: xt_TCPOPTSTRIP: fix
possible mangling beyond packet boundary"), add safe fragment
handling to xt_TCPMSS.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[bwh: Backported to 3.2: Change parameters for tcpmss_mangle_packet() as
 done upstream in commit 70d19f805f8c "netfilter: xt_TCPMSS: Fix IPv6 default
 MSS too"]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonetfilter: xt_TCPOPTSTRIP: don't use tcp_hdr()
Pablo Neira Ayuso [Mon, 10 Jun 2013 23:51:31 +0000 (01:51 +0200)]
netfilter: xt_TCPOPTSTRIP: don't use tcp_hdr()

commit ed82c437320c48a4032492f4a55a7e2c934158b6 upstream.

In (bc6bcb5 netfilter: xt_TCPOPTSTRIP: fix possible mangling beyond
packet boundary), the use of tcp_hdr was introduced. However, we
cannot assume that skb->transport_header is set for non-local packets.

Cc: Florian Westphal <fw@strlen.de>
Reported-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonetfilter: xt_TCPOPTSTRIP: fix possible mangling beyond packet boundary
Pablo Neira Ayuso [Tue, 7 May 2013 01:22:18 +0000 (03:22 +0200)]
netfilter: xt_TCPOPTSTRIP: fix possible mangling beyond packet boundary

commit bc6bcb59dd7c184d229f9e86d08aa56059938a4c upstream.

This target assumes that tcph->doff is well-formed, that may be well
not the case. Add extra sanity checkings to avoid possible crash due
to read/write out of the real packet boundary. After this patch, the
default action on malformed TCP packets is to drop them. Moreover,
fragments are skipped.

Reported-by: Rafal Kupka <rkupka@telemetry.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/decoder: Add new TEST instruction pattern
Masami Hiramatsu [Fri, 24 Nov 2017 04:56:30 +0000 (13:56 +0900)]
x86/decoder: Add new TEST instruction pattern

commit 12a78d43de767eaf8fb272facb7a7b6f2dc6a9df upstream.

The kbuild test robot reported this build warning:

  Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c

  Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
  Warning: objdump says 3 bytes, but insn_get_length() says 2
  Warning: decoded and checked 1569014 instructions with 1 warnings

This sequence seems to be a new instruction not in the opcode map in the Intel SDM.

The instruction sequence is "F6 09 d8", means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.
Intel SDM vol2 A.4 Table A-6 said the table index in the group is "Encoding of Bits 5,4,3 of
the ModR/M Byte (bits 2,1,0 in parenthesis)"

In that table, opcodes listed by the index REG bits as:

  000         001       010 011  100        101        110         111
 TEST Ib/Iz,(undefined),NOT,NEG,MUL AL/rAX,IMUL AL/rAX,DIV AL/rAX,IDIV AL/rAX

So, it seems TEST Ib is assigned to 001.

Add the new pattern.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoALSA: hda: Add Raven PCI ID
Vijendar Mukunda [Thu, 23 Nov 2017 14:37:00 +0000 (20:07 +0530)]
ALSA: hda: Add Raven PCI ID

commit 9ceace3c9c18c67676e75141032a65a8e01f9a7a upstream.

This commit adds PCI ID for Raven platform

Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoALSA: usb-audio: Add sanity checks in v2 clock parsers
Takashi Iwai [Tue, 21 Nov 2017 16:28:06 +0000 (17:28 +0100)]
ALSA: usb-audio: Add sanity checks in v2 clock parsers

commit 0a62d6c966956d77397c32836a5bbfe3af786fc1 upstream.

The helper functions to parse and look for the clock source, selector
and multiplier unit may return the descriptor with a too short length
than required, while there is no sanity check in the caller side.
Add some sanity checks in the parsers, at least, to guarantee the
given descriptor size, for avoiding the potential crashes.

Fixes: 79f920fbff56 ("ALSA: usb-audio: parse clock topology of UAC2 devices")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoALSA: usb-audio: Fix potential out-of-bound access at parsing SU
Takashi Iwai [Tue, 21 Nov 2017 16:00:32 +0000 (17:00 +0100)]
ALSA: usb-audio: Fix potential out-of-bound access at parsing SU

commit f658f17b5e0e339935dca23e77e0f3cad591926b upstream.

The usb-audio driver may trigger an out-of-bound access at parsing a
malformed selector unit, as it checks the header length only after
evaluating bNrInPins field, which can be already above the given
length.  Fix it by adding the length check beforehand.

Fixes: 99fc86450c43 ("ALSA: usb-mixer: parse descriptors with structs")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoALSA: usb-audio: Add sanity checks to FE parser
Takashi Iwai [Tue, 21 Nov 2017 15:55:51 +0000 (16:55 +0100)]
ALSA: usb-audio: Add sanity checks to FE parser

commit d937cd6790a2bef2d07b500487646bd794c039bb upstream.

When the usb-audio descriptor contains the malformed feature unit
description with a too short length, the driver may access
out-of-bounds.  Add a sanity check of the header size at the beginning
of parse_audio_feature_unit().

Fixes: 23caaf19b11e ("ALSA: usb-mixer: Add support for Audio Class v2.0")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.2: use snd_printk() for logging]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoALSA: timer: Remove kernel warning at compat ioctl error paths
Takashi Iwai [Tue, 21 Nov 2017 15:36:11 +0000 (16:36 +0100)]
ALSA: timer: Remove kernel warning at compat ioctl error paths

commit 3d4e8303f2c747c8540a0a0126d0151514f6468b upstream.

Some timer compat ioctls have NULL checks of timer instance with
snd_BUG_ON() that bring up WARN_ON() when the debug option is set.
Actually the condition can be met in the normal situation and it's
confusing and bad to spew kernel warnings with stack trace there.
Let's remove snd_BUG_ON() invocation and replace with the simple
checks.  Also, correct the error code to EBADFD to follow the native
ioctl error handling.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonilfs2: fix race condition that causes file system corruption
Andreas Rohner [Fri, 17 Nov 2017 23:29:35 +0000 (15:29 -0800)]
nilfs2: fix race condition that causes file system corruption

commit 31ccb1f7ba3cfe29631587d451cf5bb8ab593550 upstream.

There is a race condition between nilfs_dirty_inode() and
nilfs_set_file_dirty().

When a file is opened, nilfs_dirty_inode() is called to update the
access timestamp in the inode.  It calls __nilfs_mark_inode_dirty() in a
separate transaction.  __nilfs_mark_inode_dirty() caches the ifile
buffer_head in the i_bh field of the inode info structure and marks it
as dirty.

After some data was written to the file in another transaction, the
function nilfs_set_file_dirty() is called, which adds the inode to the
ns_dirty_files list.

Then the segment construction calls nilfs_segctor_collect_dirty_files(),
which goes through the ns_dirty_files list and checks the i_bh field.
If there is a cached buffer_head in i_bh it is not marked as dirty
again.

Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
transactions, it is possible that a segment construction that writes out
the ifile occurs in-between the two.  If this happens the inode is not
on the ns_dirty_files list, but its ifile block is still marked as dirty
and written out.

In the next segment construction, the data for the file is written out
and nilfs_bmap_propagate() updates the b-tree.  Eventually the bmap root
is written into the i_bh block, which is not dirty, because it was
written out in another segment construction.

As a result the bmap update can be lost, which leads to file system
corruption.  Either the virtual block address points to an unallocated
DAT block, or the DAT entry will be reused for something different.

The error can remain undetected for a long time.  A typical error
message would be one of the "bad btree" errors or a warning that a DAT
entry could not be found.

This bug can be reproduced reliably by a simple benchmark that creates
and overwrites millions of 4k files.

Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoautofs: fix careless error in recent commit
NeilBrown [Thu, 14 Dec 2017 23:32:38 +0000 (15:32 -0800)]
autofs: fix careless error in recent commit

commit 302ec300ef8a545a7fc7f667e5fd743b091c2eeb upstream.

Commit ecc0c469f277 ("autofs: don't fail mount for transient error") was
meant to replace an 'if' with a 'switch', but instead added the 'switch'
leaving the case in place.

Link: http://lkml.kernel.org/r/87zi6wstmw.fsf@notabene.neil.brown.name
Fixes: ecc0c469f277 ("autofs: don't fail mount for transient error")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: autofs4_write() doesn't take an autofs_sb_info
 pointer]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoautofs: don't fail mount for transient error
NeilBrown [Fri, 17 Nov 2017 23:29:13 +0000 (15:29 -0800)]
autofs: don't fail mount for transient error

commit ecc0c469f27765ed1e2b967be0aa17cee1a60b76 upstream.

Currently if the autofs kernel module gets an error when writing to the
pipe which links to the daemon, then it marks the whole moutpoint as
catatonic, and it will stop working.

It is possible that the error is transient.  This can happen if the
daemon is slow and more than 16 requests queue up.  If a subsequent
process tries to queue a request, and is then signalled, the write to
the pipe will return -ERESTARTSYS and autofs will take that as total
failure.

So change the code to assess -ERESTARTSYS and -ENOMEM as transient
failures which only abort the current request, not the whole mountpoint.

It isn't a crash or a data corruption, but having autofs mountpoints
suddenly stop working is rather inconvenient.

Ian said:

: And given the problems with a half dozen (or so) user space applications
: consuming large amounts of CPU under heavy mount and umount activity this
: could happen more easily than we expect.

Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: autofs4_write() doesn't take an autofs_sb_info
 pointer]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoautofs4: catatonic_mode vs. notify_daemon race
Al Viro [Wed, 11 Jan 2012 03:24:48 +0000 (22:24 -0500)]
autofs4: catatonic_mode vs. notify_daemon race

commit 8753333266be67ff3a984ac1f6566d31c260bee4 upstream.

we need to hold ->wq_mutex while we are forming the packet to send,
lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
just as autofs4_notify_daemon() decides to memcpy() from it...

We do have check for catatonic mode immediately after that (under
->wq_mutex, as it ought to be) and packet won't be actually sent,
but it'll be too late for us if we oops on that memcpy() from NULL...

Fix is obvious - just extend the area covered by ->wq_mutex over
that switch and check whether it's catatonic *before* doing anything
else.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoautofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
Al Viro [Wed, 11 Jan 2012 03:20:12 +0000 (22:20 -0500)]
autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race

commit 4041bcdc7bef06a2fb29c57394c713a74bd13b08 upstream.

We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
for good, or we might end up with wq inserted into queue after
autofs4_catatonic_mode() had done its thing.  It will stick there
forever, since there won't be anything to clear its ->name.name.

A bit of a complication: validate_request() drops and regains ->wq_mutex.
It actually ends up the most convenient place to stick the check into...

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonfs: Fix ugly referral attributes
Chuck Lever [Sun, 5 Nov 2017 20:45:22 +0000 (15:45 -0500)]
nfs: Fix ugly referral attributes

commit c05cefcc72416a37eba5a2b35f0704ed758a9145 upstream.

Before traversing a referral and performing a mount, the mounted-on
directory looks strange:

dr-xr-xr-x. 2 4294967294 4294967294 0 Dec 31  1969 dir.0

nfs4_get_referral is wiping out any cached attributes with what was
returned via GETATTR(fs_locations), but the bit mask for that
operation does not request any file attributes.

Retrieve owner and timestamp information so that the memcpy in
nfs4_get_referral fills in more attributes.

Changes since v1:
- Don't request attributes that the client unconditionally replaces
- Request only MOUNTED_ON_FILEID or FILEID attribute, not both
- encode_fs_locations() doesn't use the third bitmask word

Fixes: 6b97fd3da1ea ("NFSv4: Follow a referral")
Suggested-by: Pradeep Thomas <pradeepthomas@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoKVM: SVM: obey guest PAT
Paolo Bonzini [Thu, 26 Oct 2017 07:13:27 +0000 (09:13 +0200)]
KVM: SVM: obey guest PAT

commit 15038e14724799b8c205beb5f20f9e54896013c3 upstream.

For many years some users of assigned devices have reported worse
performance on AMD processors with NPT than on AMD without NPT,
Intel or bare metal.

The reason turned out to be that SVM is discarding the guest PAT
setting and uses the default (PA0=PA4=WB, PA1=PA5=WT, PA2=PA6=UC-,
PA3=UC).  The guest might be using a different setting, and
especially might want write combining but isn't getting it
(instead getting slow UC or UC- accesses).

Thanks a lot to geoff@hostfission.com for noticing the relation
to the g_pat setting.  The patch has been tested also by a bunch
of people on VFIO users forums.

Fixes: 709ddebf81cb40e3c36c6109a7892e8b93a09464
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196409
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Nick Sarnie <commendsarnex@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoKVM: vmx: Inject #GP on invalid PAT CR
Nadav Amit [Thu, 18 Sep 2014 19:39:44 +0000 (22:39 +0300)]
KVM: vmx: Inject #GP on invalid PAT CR

commit 4566654bb9be9e8864df417bb72ceee5136b6a6a upstream.

Guest which sets the PAT CR to invalid value should get a #GP.  Currently, if
vmx supports loading PAT CR during entry, then the value is not checked.  This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agodm bufio: fix integer overflow when limiting maximum cache size
Eric Biggers [Thu, 16 Nov 2017 00:38:09 +0000 (16:38 -0800)]
dm bufio: fix integer overflow when limiting maximum cache size

commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.

The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression

    (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100

overflows, causing the wrong result to be computed.  For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem).  This causes
severe performance problems for dm-verity users on affected systems.

Fix this by using mult_frac() to correctly multiply by a percentage.  Do
this for all places in dm-bufio that multiply by a percentage.  Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.

Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2: keep open-coded VMALLOC_TOTAL]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agodm: discard support requires all targets in a table support discards
Mike Snitzer [Tue, 14 Nov 2017 20:40:52 +0000 (15:40 -0500)]
dm: discard support requires all targets in a table support discards

commit 8a74d29d541cd86569139c6f3f44b2d210458071 upstream.

A DM device with a mix of discard capabilities (due to some underlying
devices not having discard support) _should_ just return -EOPNOTSUPP for
the region of the device that doesn't support discards (even if only by
way of the underlying driver formally not supporting discards).  BUT,
that does ask the underlying driver to handle something that it never
advertised support for.  In doing so we're exposing users to the
potential for a underlying disk driver hanging if/when a discard is
issued a the device that is incapable and never claimed to support
discards.

Fix this by requiring that each DM target in a DM table provide discard
support as a prereq for a DM device to advertise support for discards.

This may cause some configurations that were happily supporting discards
(even in the face of a mix of discard support) to stop supporting
discards -- but the risk of users hitting driver hangs, and forced
reboots, outweighs supporting those fringe mixed discard
configurations.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonet/sctp: Always set scope_id in sctp_inet6_skb_msgname
Eric W. Biederman [Thu, 16 Nov 2017 04:17:48 +0000 (22:17 -0600)]
net/sctp: Always set scope_id in sctp_inet6_skb_msgname

commit 7c8a61d9ee1df0fb4747879fa67a99614eb62fec upstream.

Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.

Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.

With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.

That is almost reasonable as scope_id's are meaniningful only for link
local addresses.  Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.

There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.

Fixes: 372f525b495c ("SCTP:  Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2:
 - Adjust context
 - Add braces]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agosctp: fully initialize the IPv6 address in sctp_v6_to_addr()
Alexander Potapenko [Wed, 16 Aug 2017 18:16:40 +0000 (20:16 +0200)]
sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

commit 15339e441ec46fbc3bf3486bb1ae4845b0f1bb8d upstream.

KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
Make sure all fields of an IPv6 address are initialized, which
guarantees that the IPv4 fields are also initialized.

==================================================================
 BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
 net/sctp/ipv6.c:517
 CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
 01/01/2011
 Call Trace:
  dump_stack+0x172/0x1c0 lib/dump_stack.c:42
  is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
  kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
  native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
  arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
  arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
  __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
  sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
  sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
  sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
  sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
  inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
  sock_sendmsg_nosec net/socket.c:633 [inline]
  sock_sendmsg net/socket.c:643 [inline]
  SYSC_sendto+0x608/0x710 net/socket.c:1696
  SyS_sendto+0x8a/0xb0 net/socket.c:1664
  entry_SYSCALL_64_fastpath+0x13/0x94
 RIP: 0033:0x44b479
 RSP: 002b:00007f6213f21c08 EFLAGS: 00000286 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 000000000044b479
 RDX: 0000000000000041 RSI: 0000000020edd000 RDI: 0000000000000006
 RBP: 00000000007080a8 R08: 0000000020b85fe4 R09: 000000000000001c
 R10: 0000000000040005 R11: 0000000000000286 R12: 00000000ffffffff
 R13: 0000000000003760 R14: 00000000006e5820 R15: 0000000000ff8000
 origin description: ----dst_saddr@sctp_v6_get_dst
 local variable created at:
  sk_fullsock include/net/sock.h:2321 [inline]
  inet6_sk include/linux/ipv6.h:309 [inline]
  sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==================================================================
 BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
 net/sctp/ipv6.c:517
 CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
 01/01/2011
 Call Trace:
  dump_stack+0x172/0x1c0 lib/dump_stack.c:42
  is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
  kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
  native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
  arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
  arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
  __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
  sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
  sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
  sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
  sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
  inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
  sock_sendmsg_nosec net/socket.c:633 [inline]
  sock_sendmsg net/socket.c:643 [inline]
  SYSC_sendto+0x608/0x710 net/socket.c:1696
  SyS_sendto+0x8a/0xb0 net/socket.c:1664
  entry_SYSCALL_64_fastpath+0x13/0x94
 RIP: 0033:0x44b479
 RSP: 002b:00007f6213f21c08 EFLAGS: 00000286 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 000000000044b479
 RDX: 0000000000000041 RSI: 0000000020edd000 RDI: 0000000000000006
 RBP: 00000000007080a8 R08: 0000000020b85fe4 R09: 000000000000001c
 R10: 0000000000040005 R11: 0000000000000286 R12: 00000000ffffffff
 R13: 0000000000003760 R14: 00000000006e5820 R15: 0000000000ff8000
 origin description: ----dst_saddr@sctp_v6_get_dst
 local variable created at:
  sk_fullsock include/net/sock.h:2321 [inline]
  inet6_sk include/linux/ipv6.h:309 [inline]
  sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==================================================================

Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agosctp: Fixup v4mapped behaviour to comply with Sock API
Jason Gunthorpe [Wed, 30 Jul 2014 18:40:53 +0000 (12:40 -0600)]
sctp: Fixup v4mapped behaviour to comply with Sock API

commit 299ee123e19889d511092347f5fc14db0f10e3a6 upstream.

The SCTP socket extensions API document describes the v4mapping option as
follows:

8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)

   This socket option is a Boolean flag which turns on or off the
   mapping of IPv4 addresses.  If this option is turned on, then IPv4
   addresses will be mapped to V6 representation.  If this option is
   turned off, then no mapping will be done of V4 addresses and a user
   will receive both PF_INET6 and PF_INET type addresses on the socket.
   See [RFC3542] for more details on mapped V6 addresses.

This description isn't really in line with what the code does though.

Introduce addr_to_user (renamed addr_v4map), which should be called
before any sockaddr is passed back to user space. The new function
places the sockaddr into the correct format depending on the
SCTP_I_WANT_MAPPED_V4_ADDR option.

Audit all places that touched v4mapped and either sanely construct
a v4 or v6 address then call addr_to_user, or drop the
unnecessary v4mapped check entirely.

Audit all places that call addr_to_user and verify they are on a sycall
return path.

Add a custom getname that formats the address properly.

Several bugs are addressed:
 - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
   addresses to user space
 - The addr_len returned from recvmsg was not correct when
   returning AF_INET on a v6 socket
 - flowlabel and scope_id were not zerod when promoting
   a v4 to v6
 - Some syscalls like bind and connect behaved differently
   depending on v4mapped

Tested bind, getpeername, getsockname, connect, and recvmsg for proper
behaviour in v4mapped = 1 and 0 cases.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agos390/disassembler: increase show_code buffer size
Vasily Gorbik [Wed, 15 Nov 2017 13:15:36 +0000 (14:15 +0100)]
s390/disassembler: increase show_code buffer size

commit b192571d1ae375e0bbe0aa3ccfa1a3c3704454b9 upstream.

Current buffer size of 64 is too small. objdump shows that there are
instructions which would require up to 75 bytes buffer (with current
formating). 128 bytes "ought to be enough for anybody".

Also replaces 8 spaces with a single tab to reduce the memory footprint.

Fixes the following KASAN finding:

BUG: KASAN: stack-out-of-bounds in number+0x3fe/0x538
Write of size 1 at addr 000000005a4a75a0 by task bash/1282

CPU: 1 PID: 1282 Comm: bash Not tainted 4.14.0+ #215
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
Call Trace:
([<000000000011eeb6>] show_stack+0x56/0x88)
 [<0000000000e1ce1a>] dump_stack+0x15a/0x1b0
 [<00000000004e2994>] print_address_description+0xf4/0x288
 [<00000000004e2cf2>] kasan_report+0x13a/0x230
 [<0000000000e38ae6>] number+0x3fe/0x538
 [<0000000000e3dfe4>] vsnprintf+0x194/0x948
 [<0000000000e3ea42>] sprintf+0xa2/0xb8
 [<00000000001198dc>] print_insn+0x374/0x500
 [<0000000000119346>] show_code+0x4ee/0x538
 [<000000000011f234>] show_registers+0x34c/0x388
 [<000000000011f2ae>] show_regs+0x3e/0xa8
 [<000000000011f502>] die+0x1ea/0x2e8
 [<0000000000138f0e>] do_no_context+0x106/0x168
 [<0000000000139a1a>] do_protection_exception+0x4da/0x7d0
 [<0000000000e55914>] pgm_check_handler+0x16c/0x1c0
 [<000000000090639e>] sysrq_handle_crash+0x46/0x58
([<0000000000000007>] 0x7)
 [<00000000009073fa>] __handle_sysrq+0x102/0x218
 [<0000000000907c06>] write_sysrq_trigger+0xd6/0x100
 [<000000000061d67a>] proc_reg_write+0xb2/0x128
 [<0000000000520be6>] __vfs_write+0xee/0x368
 [<0000000000521222>] vfs_write+0x21a/0x278
 [<000000000052156a>] SyS_write+0xda/0x178
 [<0000000000e555cc>] system_call+0xc4/0x270

The buggy address belongs to the page:
page:000003d1016929c0 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
raw: 0000000000000000 0000000000000000 0000000000000000 ffffffff00000000
raw: 0000000000000100 0000000000000200 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 000000005a4a7480: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
 000000005a4a7500: 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 00 00 00
>000000005a4a7580: 00 00 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00
                               ^
 000000005a4a7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 f8 f8
 000000005a4a7680: f2 f2 f2 f2 f2 f2 f8 f8 f2 f2 f3 f3 f3 f3 00 00
==================================================================

Signed-off-by: Vasily Gorbik <gor@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size
Younger Liu [Mon, 10 Feb 2014 22:25:51 +0000 (14:25 -0800)]
ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size

commit d62e74be1270c89fbaf7aada8218bfdf62d00a58 upstream.

The issue scenario is as following:

- Create a small file and fallocate a large disk space for a file with
  FALLOC_FL_KEEP_SIZE option.

- ftruncate the file back to the original size again.  but the disk free
  space is not changed back.  This is a real bug that be fixed in this
  patch.

In order to solve the issue above, we modified ocfs2_setattr(), if
attr->ia_size != i_size_read(inode), It calls ocfs2_truncate_file(), and
truncate disk space to attr->ia_size.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Tested-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jensen <shencanquan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoIB/mlx4: Increase maximal message size under UD QP
Mark Bloch [Thu, 2 Nov 2017 13:22:26 +0000 (15:22 +0200)]
IB/mlx4: Increase maximal message size under UD QP

commit 5f22a1d87c5315a98981ecf93cd8de226cffe6ca upstream.

Maximal message should be used as a limit to the max message payload allowed,
without the headers. The ConnectX-3 check is done against this value includes
the headers. When the payload is 4K this will cause the NIC to drop packets.

Increase maximal message to 8K as workaround, this shouldn't change current
behaviour because we continue to set the MTU to 4k.

To reproduce;
set MTU to 4296 on the corresponding interface, for example:
ifconfig eth0 mtu 4296 (both server and client)

On server:
ib_send_bw -c UD -d mlx4_0 -s 4096 -n 1000000 -i1 -m 4096

On client:
ib_send_bw -d mlx4_0 -c UD <server_ip> -s 4096 -n 1000000 -i 1 -m 4096

Fixes: 6e0d733d9215 ("IB/mlx4: Allow 4K messages for UD QPs")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoblktrace: fix unlocked access to init/start-stop/teardown
Jens Axboe [Sun, 5 Nov 2017 16:13:48 +0000 (09:13 -0700)]
blktrace: fix unlocked access to init/start-stop/teardown

commit 1f2cac107c591c24b60b115d6050adc213d10fc0 upstream.

sg.c calls into the blktrace functions without holding the proper queue
mutex for doing setup, start/stop, or teardown.

Add internal unlocked variants, and export the ones that do the proper
locking.

Fixes: 6da127ad0918 ("blktrace: Add blktrace ioctls to SCSI generic devices")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoblktrace: Fix potential deadlock between delete & sysfs ops
Waiman Long [Wed, 20 Sep 2017 19:12:20 +0000 (13:12 -0600)]
blktrace: Fix potential deadlock between delete & sysfs ops

commit 5acb3cc2c2e9d3020a4fee43763c6463767f1572 upstream.

The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix typo in patch subject line, and prune a comment detailing how
the code used to work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agodm: fix race between dm_get_from_kobject() and __dm_destroy()
Hou Tao [Wed, 1 Nov 2017 07:42:36 +0000 (15:42 +0800)]
dm: fix race between dm_get_from_kobject() and __dm_destroy()

commit b9a41d21dceadf8104812626ef85dc56ee8a60ed upstream.

The following BUG_ON was hit when testing repeat creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
     [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
     [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
     [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
     [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
     [<ffffffff81199118>] seq_read+0x16f/0x325
     [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
     [<ffffffff8117b625>] __vfs_read+0x26/0x9d
     [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
     [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
     [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
     [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
     [<ffffffff8117c686>] SyS_read+0x4b/0x76
     [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agort2x00usb: mark device removed when get ENOENT usb error
Stanislaw Gruszka [Thu, 9 Nov 2017 10:59:24 +0000 (11:59 +0100)]
rt2x00usb: mark device removed when get ENOENT usb error

commit bfa62a52cad93686bb8d8171ea5288813248a7c6 upstream.

ENOENT usb error mean "specified interface or endpoint does not exist or
is not enabled". Mark device not present when we encounter this error
similar like we do with ENODEV error.

Otherwise we can have infinite loop in rt2x00usb_work_rxdone(), because
we remove and put again RX entries to the queue infinitely.

We can have similar situation when submit urb will fail all the time
with other error, so we need consider to limit number of entries
processed by rxdone work. But for now, since the patch fixes
reproducible soft lockup issue on single processor systems
and taken ENOENT error meaning, let apply this fix.

Patch adds additional ENOENT check not only in rx kick routine, but
also on other places where we check for ENODEV error.

Reported-by: Richard Genoud <richard.genoud@gmail.com>
Debugged-by: Richard Genoud <richard.genoud@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Tested-by: Richard Genoud <richard.genoud@gmail.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[bwh: Backported to 3.2: adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agovideo: udlfb: Fix read EDID timeout
Ladislav Michl [Thu, 9 Nov 2017 17:09:30 +0000 (18:09 +0100)]
video: udlfb: Fix read EDID timeout

commit c98769475575c8a585f5b3952f4b5f90266f699b upstream.

While usb_control_msg function expects timeout in miliseconds, a value
of HZ is used. Replace it with USB_CTRL_GET_TIMEOUT and also fix error
message which looks like:
udlfb: Read EDID byte 78 failed err ffffff92
as error is either negative errno or number of bytes transferred use %d
format specifier.

Returned EDID is in second byte, so return error when less than two bytes
are received.

Fixes: 18dffdf8913a ("staging: udlfb: enhance EDID and mode handling support")
Signed-off-by: Ladislav Michl <ladis@linux-mips.org>
Cc: Bernie Thompson <bernie@plugable.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoUSB: usbfs: compute urb->actual_length for isochronous
Alan Stern [Wed, 8 Nov 2017 17:23:17 +0000 (12:23 -0500)]
USB: usbfs: compute urb->actual_length for isochronous

commit 2ef47001b3ee3ded579b7532ebdcf8680e4d8c54 upstream.

The USB kerneldoc says that the actual_length field "is read in
non-iso completion functions", but the usbfs driver uses it for all
URB types in processcompl().  Since not all of the host controller
drivers set actual_length for isochronous URBs, programs using usbfs
with some host controllers don't work properly.  For example, Minas
reports that a USB camera controlled by libusb doesn't work properly
with a dwc2 controller.

It doesn't seem worthwhile to change the HCDs and the documentation,
since the in-kernel USB class drivers evidently don't rely on
actual_length for isochronous transfers.  The easiest solution is for
usbfs to calculate the actual_length value for itself, by adding up
the lengths of the individual packets in an isochronous transfer.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Minas Harutyunyan <Minas.Harutyunyan@synopsys.com>
Reported-and-tested-by: wlf <wulf@rock-chips.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules
Zhou Chengming [Thu, 2 Nov 2017 01:18:21 +0000 (09:18 +0800)]
kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules

commit e846d13958066828a9483d862cc8370a72fadbb6 upstream.

We use alternatives_text_reserved() to check if the address is in
the fixed pieces of alternative reserved, but the problem is that
we don't hold the smp_alt mutex when call this function. So the list
traversal may encounter a deleted list_head if another path is doing
alternatives_smp_module_del().

One solution is that we can hold smp_alt mutex before call this
function, but the difficult point is that the callers of this
functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(),
are called inside the text_mutex. So we must hold smp_alt mutex
before we go into these arch dependent code. But we can't now,
the smp_alt mutex is the arch dependent part, only x86 has it.
Maybe we can export another arch dependent callback to solve this.

But there is a simpler way to handle this problem. We can reuse the
text_mutex to protect smp_alt_modules instead of using another mutex.
And all the arch dependent checks of kprobes are inside the text_mutex,
so it's safe now.

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions"
Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/smp: Don't ever patch back to UP if we unplug cpus
Rusty Russell [Mon, 6 Aug 2012 07:59:49 +0000 (17:29 +0930)]
x86/smp: Don't ever patch back to UP if we unplug cpus

commit 816afe4ff98ee10b1d30fd66361be132a0a5cee6 upstream.

We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time.  In particular, not if we
unplug CPUs to return to a single cpu.

Paul McKenney points out:

 mean offline overhead is 6251/48=130.2 milliseconds.

 If I remove the alternatives_smp_switch() from the offline
 path [...] the mean offline overhead is 550/42=13.1 milliseconds

Basically, we're never going to get those 120ms back, and the
code is pretty messy.

We get rid of:

 1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
    documentation is wrong. It's now the default.

 2) The skip_smp_alternatives flag used by suspend.

 3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
    which were only used to set this one flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agomedia: Don't do DMA on stack for firmware upload in the AS102 driver
Michele Baldessari [Mon, 6 Nov 2017 13:50:22 +0000 (08:50 -0500)]
media: Don't do DMA on stack for firmware upload in the AS102 driver

commit b3120d2cc447ee77b9d69bf4ad7b452c9adb4d39 upstream.

Firmware load on AS102 is using the stack which is not allowed any
longer. We currently fail with:

kernel: transfer buffer not dma capable
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 598 at drivers/usb/core/hcd.c:1595 usb_hcd_map_urb_for_dma+0x41d/0x620
kernel: Modules linked in: amd64_edac_mod(-) edac_mce_amd as102_fe dvb_as102(+) kvm_amd kvm snd_hda_codec_realtek dvb_core snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec irqbypass crct10dif_pclmul crc32_pclmul snd_hda_core snd_hwdep snd_seq ghash_clmulni_intel sp5100_tco fam15h_power wmi k10temp i2c_piix4 snd_seq_device snd_pcm snd_timer parport_pc parport tpm_infineon snd tpm_tis soundcore tpm_tis_core tpm shpchp acpi_cpufreq xfs libcrc32c amdgpu amdkfd amd_iommu_v2 radeon hid_logitech_hidpp i2c_algo_bit drm_kms_helper crc32c_intel ttm drm r8169 mii hid_logitech_dj
kernel: CPU: 0 PID: 598 Comm: systemd-udevd Not tainted 4.13.10-200.fc26.x86_64 #1
kernel: Hardware name: ASUS All Series/AM1I-A, BIOS 0505 03/13/2014
kernel: task: ffff979933b24c80 task.stack: ffffaf83413a4000
kernel: RIP: 0010:usb_hcd_map_urb_for_dma+0x41d/0x620
systemd-fsck[659]: /dev/sda2: clean, 49/128016 files, 268609/512000 blocks
kernel: RSP: 0018:ffffaf83413a7728 EFLAGS: 00010282
systemd-udevd[604]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
kernel: RAX: 000000000000001f RBX: ffff979930bce780 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: ffff97993ec0e118 RDI: ffff97993ec0e118
kernel: RBP: ffffaf83413a7768 R08: 000000000000039a R09: 0000000000000000
kernel: R10: 0000000000000001 R11: 00000000ffffffff R12: 00000000fffffff5
kernel: R13: 0000000001400000 R14: 0000000000000001 R15: ffff979930806800
kernel: FS:  00007effaca5c8c0(0000) GS:ffff97993ec00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007effa9fca962 CR3: 0000000233089000 CR4: 00000000000406f0
kernel: Call Trace:
kernel:  usb_hcd_submit_urb+0x493/0xb40
kernel:  ? page_cache_tree_insert+0x100/0x100
kernel:  ? xfs_iunlock+0xd5/0x100 [xfs]
kernel:  ? xfs_file_buffered_aio_read+0x57/0xc0 [xfs]
kernel:  usb_submit_urb+0x22d/0x560
kernel:  usb_start_wait_urb+0x6e/0x180
kernel:  usb_bulk_msg+0xb8/0x160
kernel:  as102_send_ep1+0x49/0xe0 [dvb_as102]
kernel:  ? devres_add+0x3f/0x50
kernel:  as102_firmware_upload.isra.0+0x1dc/0x210 [dvb_as102]
kernel:  as102_fw_upload+0xb6/0x1f0 [dvb_as102]
kernel:  as102_dvb_register+0x2af/0x2d0 [dvb_as102]
kernel:  as102_usb_probe+0x1f3/0x260 [dvb_as102]
kernel:  usb_probe_interface+0x124/0x300
kernel:  driver_probe_device+0x2ff/0x450
kernel:  __driver_attach+0xa4/0xe0
kernel:  ? driver_probe_device+0x450/0x450
kernel:  bus_for_each_dev+0x6e/0xb0
kernel:  driver_attach+0x1e/0x20
kernel:  bus_add_driver+0x1c7/0x270
kernel:  driver_register+0x60/0xe0
kernel:  usb_register_driver+0x81/0x150
kernel:  ? 0xffffffffc0807000
kernel:  as102_usb_driver_init+0x1e/0x1000 [dvb_as102]
kernel:  do_one_initcall+0x50/0x190
kernel:  ? __vunmap+0x81/0xb0
kernel:  ? kfree+0x154/0x170
kernel:  ? kmem_cache_alloc_trace+0x15f/0x1c0
kernel:  ? do_init_module+0x27/0x1e9
kernel:  do_init_module+0x5f/0x1e9
kernel:  load_module+0x2602/0x2c30
kernel:  SYSC_init_module+0x170/0x1a0
kernel:  ? SYSC_init_module+0x170/0x1a0
kernel:  SyS_init_module+0xe/0x10
kernel:  do_syscall_64+0x67/0x140
kernel:  entry_SYSCALL64_slow_path+0x25/0x25
kernel: RIP: 0033:0x7effab6cf3ea
kernel: RSP: 002b:00007fff5cfcbbc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
kernel: RAX: ffffffffffffffda RBX: 00005569e0b83760 RCX: 00007effab6cf3ea
kernel: RDX: 00007effac2099c5 RSI: 0000000000009a13 RDI: 00005569e0b98c50
kernel: RBP: 00007effac2099c5 R08: 00005569e0b83ed0 R09: 0000000000001d80
kernel: R10: 00007effab98db00 R11: 0000000000000246 R12: 00005569e0b98c50
kernel: R13: 00005569e0b81c60 R14: 0000000000020000 R15: 00005569dfadfdf7
kernel: Code: 48 39 c8 73 30 80 3d 59 60 9d 00 00 41 bc f5 ff ff ff 0f 85 26 ff ff ff 48 c7 c7 b8 6b d0 92 c6 05 3f 60 9d 00 01 e8 24 3d ad ff <0f> ff 8b 53 64 e9 09 ff ff ff 65 48 8b 0c 25 00 d3 00 00 48 8b
kernel: ---[ end trace c4cae366180e70ec ]---
kernel: as10x_usb: error during firmware upload part1

Let's allocate the the structure dynamically so we can get the firmware
loaded correctly:
[   14.243057] as10x_usb: firmware: as102_data1_st.hex loaded with success
[   14.500777] as10x_usb: firmware: as102_data2_st.hex loaded with success

Signed-off-by: Michele Baldessari <michele@acksyn.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
[bwh: Backported to 3.2: adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoeCryptfs: use after free in ecryptfs_release_messaging()
Dan Carpenter [Tue, 22 Aug 2017 20:41:28 +0000 (23:41 +0300)]
eCryptfs: use after free in ecryptfs_release_messaging()

commit db86be3a12d0b6e5c5b51c2ab2a48f06329cb590 upstream.

We're freeing the list iterator so we should be using the _safe()
version of hlist_for_each_entry().

Fixes: 88b4a07e6610 ("[PATCH] eCryptfs: Public key transport mechanism")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agocoda: fix 'kernel memory exposure attempt' in fsync
Jan Harkes [Wed, 27 Sep 2017 19:52:12 +0000 (15:52 -0400)]
coda: fix 'kernel memory exposure attempt' in fsync

commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.

When an application called fsync on a file in Coda a small request with
just the file identifier was allocated, but the declared length was set
to the size of union of all possible upcall requests.

This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.

The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoUSB: Add delay-init quirk for Corsair K70 LUX keyboards
Bernhard Rosenkraenzer [Fri, 3 Nov 2017 15:46:02 +0000 (16:46 +0100)]
USB: Add delay-init quirk for Corsair K70 LUX keyboards

commit a0fea6027f19c62727315aba1a7fae75a9caa842 upstream.

Without this patch, K70 LUX keyboards don't work, saying
usb 3-3: unable to read config index 0 descriptor/all
usb 3-3: can't read configurations, error -110
usb usb3-port3: unable to enumerate USB device

Signed-off-by: Bernhard Rosenkraenzer <Bernhard.Rosenkranzer@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoisofs: fix timestamps beyond 2027
Arnd Bergmann [Thu, 19 Oct 2017 14:47:48 +0000 (16:47 +0200)]
isofs: fix timestamps beyond 2027

commit 34be4dbf87fc3e474a842305394534216d428f5d upstream.

isofs uses a 'char' variable to load the number of years since
1900 for an inode timestamp. On architectures that use a signed
char type by default, this results in an invalid date for
anything beyond 2027.

This changes the function argument to a 'u8' array, which
is defined the same way on all architectures, and unambiguously
lets us use years until 2155.

This should be backported to all kernels that might still be
in use by that date.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agomtd: nand: Fix writing mtdoops to nand flash.
Brent Taylor [Tue, 31 Oct 2017 03:32:45 +0000 (22:32 -0500)]
mtd: nand: Fix writing mtdoops to nand flash.

commit 30863e38ebeb500a31cecee8096fb5002677dd9b upstream.

When mtdoops calls mtd_panic_write(), it eventually calls
panic_nand_write() in nand_base.c. In order to properly wait for the
nand chip to be ready in panic_nand_wait(), the chip must first be
selected.

When using the atmel nand flash controller, a panic would occur due to
a NULL pointer exception.

Fixes: 2af7c6539931 ("mtd: Add panic_write for NAND flashes")
Signed-off-by: Brent Taylor <motobud@gmail.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agomedia: omap_vout: Fix a possible null pointer dereference in omap_vout_open()
Markus Elfring [Sun, 24 Sep 2017 09:00:57 +0000 (05:00 -0400)]
media: omap_vout: Fix a possible null pointer dereference in omap_vout_open()

commit bfba2b3e21b9426c0f9aca00f3cad8631b2da170 upstream.

Move a debug message so that a null pointer access can not happen
for the variable "vout" in this function.

Fixes: 5c7ab6348e7b3fcca2b8ee548306c774472971e2 ("V4L/DVB: V4L2: Add support for OMAP2/3 V4L2 display driver on top of DSS2")

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: initialise PPP sessions before registering them
Guillaume Nault [Fri, 27 Oct 2017 14:51:52 +0000 (16:51 +0200)]
l2tp: initialise PPP sessions before registering them

commit f98be6c6359e7e4a61aaefb9964c1db31cb9ec0c upstream.

pppol2tp_connect() initialises L2TP sessions after they've been exposed
to the rest of the system by l2tp_session_register(). This puts
sessions into transient states that are the source of several races, in
particular with session's deletion path.

This patch centralises the initialisation code into
pppol2tp_session_init(), which is called before the registration phase.
The only field that can't be set before session registration is the
pppol2tp socket pointer, which has already been converted to RCU. So
pppol2tp_connect() should now be race-free.

The session's .session_close() callback is now set before registration.
Therefore, it's always called when l2tp_core deletes the session, even
if it was created by pppol2tp_session_create() and hasn't been plugged
to a pppol2tp socket yet. That'd prevent session free because the extra
reference taken by pppol2tp_session_close() wouldn't be dropped by the
socket's ->sk_destruct() callback (pppol2tp_session_destruct()).
We could set .session_close() only while connecting a session to its
pppol2tp socket, or teach pppol2tp_session_close() to avoid grabbing a
reference when the session isn't connected, but that'd require adding
some form of synchronisation to be race free.

Instead of that, we can just let the pppol2tp socket hold a reference
on the session as soon as it starts depending on it (that is, in
pppol2tp_connect()). Then we don't need to utilise
pppol2tp_session_close() to hold a reference at the last moment to
prevent l2tp_core from dropping it.

When releasing the socket, pppol2tp_release() now deletes the session
using the standard l2tp_session_delete() function, instead of merely
removing it from hash tables. l2tp_session_delete() drops the reference
the sessions holds on itself, but also makes sure it doesn't remove a
session twice. So it can safely be called, even if l2tp_core already
tried, or is concurrently trying, to remove the session.
Finally, pppol2tp_session_destruct() drops the reference held by the
socket.

Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: protect sock pointer of struct pppol2tp_session with RCU
Guillaume Nault [Fri, 27 Oct 2017 14:51:52 +0000 (16:51 +0200)]
l2tp: protect sock pointer of struct pppol2tp_session with RCU

commit ee40fb2e1eb5bc0ddd3f2f83c6e39a454ef5a741 upstream.

pppol2tp_session_create() registers sessions that can't have their
corresponding socket initialised. This socket has to be created by
userspace, then connected to the session by pppol2tp_connect().
Therefore, we need to protect the pppol2tp socket pointer of L2TP
sessions, so that it can safely be updated when userspace is connecting
or closing the socket. This will eventually allow pppol2tp_connect()
to avoid generating transient states while initialising its parts of the
session.

To this end, this patch protects the pppol2tp socket pointer using RCU.

The pppol2tp socket pointer is still set in pppol2tp_connect(), but
only once we know the function isn't going to fail. It's eventually
reset by pppol2tp_release(), which now has to wait for a grace period
to elapse before it can drop the last reference on the socket. This
ensures that pppol2tp_session_get_sock() can safely grab a reference
on the socket, even after ps->sk is reset to NULL but before this
operation actually gets visible from pppol2tp_session_get_sock().

The rest is standard RCU conversion: pppol2tp_recv(), which already
runs in atomic context, is simply enclosed by rcu_read_lock() and
rcu_read_unlock(), while other functions are converted to use
pppol2tp_session_get_sock() followed by sock_put().
pppol2tp_session_setsockopt() is a special case. It used to retrieve
the pppol2tp socket from the L2TP session, which itself was retrieved
from the pppol2tp socket. Therefore we can just avoid dereferencing
ps->sk and directly use the original socket pointer instead.

With all users of ps->sk now handling NULL and concurrent updates, the
L2TP ->ref() and ->deref() callbacks aren't needed anymore. Therefore,
rather than converting pppol2tp_session_sock_hold() and
pppol2tp_session_sock_put(), we can just drop them.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: initialise l2tp_eth sessions before registering them
Guillaume Nault [Fri, 27 Oct 2017 14:51:51 +0000 (16:51 +0200)]
l2tp: initialise l2tp_eth sessions before registering them

commit ee28de6bbd78c2e18111a0aef43ea746f28d2073 upstream.

Sessions must be initialised before being made externally visible by
l2tp_session_register(). Otherwise the session may be concurrently
deleted before being initialised, which can confuse the deletion path
and eventually lead to kernel oops.

Therefore, we need to move l2tp_session_register() down in
l2tp_eth_create(), but also handle the intermediate step where only the
session or the netdevice has been registered.

We can't just call l2tp_session_register() in ->ndo_init() because
we'd have no way to properly undo this operation in ->ndo_uninit().
Instead, let's register the session and the netdevice in two different
steps and protect the session's device pointer with RCU.

And now that we allow the session's .dev field to be NULL, we don't
need to prevent the netdevice from being removed anymore. So we can
drop the dev_hold() and dev_put() calls in l2tp_eth_create() and
l2tp_eth_dev_uninit().

Fixes: d9e31d17ceba ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2:
 - Update another 'goto out' in l2tp_eth_create()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: don't register sessions in l2tp_session_create()
Guillaume Nault [Fri, 27 Oct 2017 14:51:50 +0000 (16:51 +0200)]
l2tp: don't register sessions in l2tp_session_create()

commit 3953ae7b218df4d1e544b98a393666f9ae58a78c upstream.

Sessions created by l2tp_session_create() aren't fully initialised:
some pseudo-wire specific operations need to be done before making the
session usable. Therefore the PPP and Ethernet pseudo-wires continue
working on the returned l2tp session while it's already been exposed to
the rest of the system.
This can lead to various issues. In particular, the session may enter
the deletion process before having been fully initialised, which will
confuse the session removal code.

This patch moves session registration out of l2tp_session_create(), so
that callers can control when the session is exposed to the rest of the
system. This is done by the new l2tp_session_register() function.

Only pppol2tp_session_create() can be easily converted to avoid
modifying its session after registration (the debug message is dropped
in order to avoid the need for holding a reference on the session).

For pppol2tp_connect() and l2tp_eth_create()), more work is needed.
That'll be done in followup patches. For now, let's just register the
session right after its creation, like it was done before. The only
difference is that we can easily take a reference on the session before
registering it, so, at least, we're sure it's not going to be freed
while we're working on it.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: ensure sessions are freed after their PPPOL2TP socket
Guillaume Nault [Fri, 22 Sep 2017 13:39:23 +0000 (15:39 +0200)]
l2tp: ensure sessions are freed after their PPPOL2TP socket

commit cdd10c9627496ad25c87ce6394e29752253c69d3 upstream.

If l2tp_tunnel_delete() or l2tp_tunnel_closeall() deletes a session
right after pppol2tp_release() orphaned its socket, then the 'sock'
variable of the pppol2tp_session_close() callback is NULL. Yet the
session is still used by pppol2tp_release().

Therefore we need to take an extra reference in any case, to prevent
l2tp_tunnel_delete() or l2tp_tunnel_closeall() from freeing the session.

Since the pppol2tp_session_close() callback is only set if the session
is associated to a PPPOL2TP socket and that both l2tp_tunnel_delete()
and l2tp_tunnel_closeall() hold the PPPOL2TP socket before calling
pppol2tp_session_close(), we're sure that pppol2tp_session_close() and
pppol2tp_session_destruct() are paired and called in the right order.
So the reference taken by the former will be released by the later.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: push all ppp pseudowire shutdown through .release handler
Tom Parkin [Tue, 19 Mar 2013 06:11:21 +0000 (06:11 +0000)]
l2tp: push all ppp pseudowire shutdown through .release handler

commit cf2f5c886a209377daefd5d2ba0bcd49c3887813 upstream.

If userspace deletes a ppp pseudowire using the netlink API, either by
directly deleting the session or by deleting the tunnel that contains the
session, we need to tear down the corresponding pppox channel.

Rather than trying to manage two pppox unbind codepaths, switch the netlink
and l2tp_core session_close handlers to close via. the l2tp_ppp socket
.release handler.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: purge session reorder queue on delete
Tom Parkin [Tue, 19 Mar 2013 06:11:20 +0000 (06:11 +0000)]
l2tp: purge session reorder queue on delete

commit 4c6e2fd35460208596fa099ee0750a4b0438aa5c upstream.

Add calls to l2tp_session_queue_purge as a part of l2tp_tunnel_closeall
and l2tp_session_delete.  Pseudowire implementations which are deleted only
via. l2tp_core l2tp_session_delete calls can dispense with their own code for
flushing the reorder queue.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agol2tp: add session reorder queue purge function to core
Tom Parkin [Tue, 19 Mar 2013 06:11:19 +0000 (06:11 +0000)]
l2tp: add session reorder queue purge function to core

commit 48f72f92b31431c40279b0fba6c5588e07e67d95 upstream.

If an l2tp session is deleted, it is necessary to delete skbs in-flight
on the session's reorder queue before taking it down.

Rather than having each pseudowire implementation reaching into the
l2tp_session struct to handle this itself, provide a function in l2tp_core to
purge the session queue.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: use non-atomic increment on rx_errors]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agonet/9p: Switch to wait_event_killable()
Tuomas Tynkkynen [Wed, 6 Sep 2017 14:59:08 +0000 (17:59 +0300)]
net/9p: Switch to wait_event_killable()

commit 9523feac272ccad2ad8186ba4fcc89103754de52 upstream.

Because userspace gets Very Unhappy when calls like stat() and execve()
return -EINTR on 9p filesystem mounts. For instance, when bash is
looking in PATH for things to execute and some SIGCHLD interrupts
stat(), bash can throw a spurious 'command not found' since it doesn't
retry the stat().

In practice, hitting the problem is rare and needs a really
slow/bogged down 9p server.

Signed-off-by: Tuomas Tynkkynen <tuomas@tuxera.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[bwh: Backported to 3.2:
 - Drop changes in trans_xen.c
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agofs/9p: Compare qid.path in v9fs_test_inode
Tuomas Tynkkynen [Wed, 6 Sep 2017 14:59:07 +0000 (17:59 +0300)]
fs/9p: Compare qid.path in v9fs_test_inode

commit 8ee031631546cf2f7859cc69593bd60bbdd70b46 upstream.

Commit fd2421f54423 ("fs/9p: When doing inode lookup compare qid details
and inode mode bits.") transformed v9fs_qid_iget() to use iget5_locked()
instead of iget_locked(). However, the test() callback is not checking
fid.path at all, which means that a lookup in the inode cache can now
accidentally locate a completely wrong inode from the same inode hash
bucket if the other fields (qid.type and qid.version) match.

Fixes: fd2421f54423 ("fs/9p: When doing inode lookup compare qid details and inode mode bits.")
Reviewed-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Tuomas Tynkkynen <tuomas@tuxera.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agotpm-dev-common: Reject too short writes
Alexander Steffen [Fri, 8 Sep 2017 15:21:32 +0000 (17:21 +0200)]
tpm-dev-common: Reject too short writes

commit ee70bc1e7b63ac8023c9ff9475d8741e397316e7 upstream.

tpm_transmit() does not offer an explicit interface to indicate the number
of valid bytes in the communication buffer. Instead, it relies on the
commandSize field in the TPM header that is encoded within the buffer.
Therefore, ensure that a) enough data has been written to the buffer, so
that the commandSize field is present and b) the commandSize field does not
announce more data than has been written to the buffer.

This should have been fixed with CVE-2011-1161 long ago, but apparently
a correct version of that patch never made it into the kernel.

Signed-off-by: Alexander Steffen <Alexander.Steffen@infineon.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
[bwh: Backported to 3.2:
 - s/priv/chip/
 - Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoIB/srp: Avoid that a cable pull can trigger a kernel crash
Bart Van Assche [Wed, 11 Oct 2017 17:27:26 +0000 (10:27 -0700)]
IB/srp: Avoid that a cable pull can trigger a kernel crash

commit 8a0d18c62121d3c554a83eb96e2752861d84d937 upstream.

This patch fixes the following kernel crash:

general protection fault: 0000 [#1] PREEMPT SMP
Workqueue: ib_mad2 timeout_sends [ib_core]
Call Trace:
 ib_sa_path_rec_callback+0x1c4/0x1d0 [ib_core]
 send_handler+0xb2/0xd0 [ib_core]
 timeout_sends+0x14d/0x220 [ib_core]
 process_one_work+0x200/0x630
 worker_thread+0x4e/0x3b0
 kthread+0x113/0x150

Fixes: commit aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoscsi: bfa: integer overflow in debugfs
Dan Carpenter [Wed, 4 Oct 2017 07:50:37 +0000 (10:50 +0300)]
scsi: bfa: integer overflow in debugfs

commit 3e351275655d3c84dc28abf170def9786db5176d upstream.

We could allocate less memory than intended because we do:

bfad->regdata = kzalloc(len << 2, GFP_KERNEL);

The shift can overflow leading to a crash.  This is debugfs code so the
impact is very small.  I fixed the network version of this in March with
commit 13e2d5187f6b ("bna: integer overflow bug in debugfs").

Fixes: ab2a9ba189e8 ("[SCSI] bfa: add debugfs support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoKVM: nVMX: set IDTR and GDTR limits when loading L1 host state
Ladi Prosek [Wed, 11 Oct 2017 14:54:42 +0000 (16:54 +0200)]
KVM: nVMX: set IDTR and GDTR limits when loading L1 host state

commit 21f2d551183847bc7fbe8d866151d00cdad18752 upstream.

Intel SDM 27.5.2 Loading Host Segment and Descriptor-Table Registers:

"The GDTR and IDTR limits are each set to FFFFH."

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agomedia: rc: check for integer overflow
Sean Young [Sun, 8 Oct 2017 18:18:52 +0000 (14:18 -0400)]
media: rc: check for integer overflow

commit 3e45067f94bbd61dec0619b1c32744eb0de480c8 upstream.

The ioctl LIRC_SET_REC_TIMEOUT would set a timeout of 704ns if called
with a timeout of 4294968us.

Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
[bwh: Backported to 3.2: open-code U32_MAX]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoUSB: serial: garmin_gps: fix memory leak on probe errors
Johan Hovold [Wed, 11 Oct 2017 12:02:58 +0000 (14:02 +0200)]
USB: serial: garmin_gps: fix memory leak on probe errors

commit 74d471b598444b7f2d964930f7234779c80960a0 upstream.

Make sure to free the port private data before returning after a failed
probe attempt.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoUSB: serial: garmin_gps: fix I/O after failed probe and remove
Johan Hovold [Wed, 11 Oct 2017 12:02:57 +0000 (14:02 +0200)]
USB: serial: garmin_gps: fix I/O after failed probe and remove

commit 19a565d9af6e0d828bd0d521d3bafd5017f4ce52 upstream.

Make sure to stop any submitted interrupt and bulk-out URBs before
returning after failed probe and when the port is being unbound to avoid
later NULL-pointer dereferences in the completion callbacks.

Also fix up the related and broken I/O cancellation on failed open and
on close. (Note that port->write_urb was never submitted.)

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoPCI/AER: Report non-fatal errors only to the affected endpoint
Gabriele Paoloni [Thu, 28 Sep 2017 14:33:05 +0000 (15:33 +0100)]
PCI/AER: Report non-fatal errors only to the affected endpoint

commit 86acc790717fb60fb51ea3095084e331d8711c74 upstream.

Previously, if an non-fatal error was reported by an endpoint, we
called report_error_detected() for the endpoint, every sibling on the
bus, and their descendents.  If any of them did not implement the
.error_detected() method, do_recovery() failed, leaving all these
devices unrecovered.

For example, the system described in the bugzilla below has two devices:

  0000:74:02.0 [19e5:a230] SAS controller, driver has .error_detected()
  0000:74:03.0 [19e5:a235] SATA controller, driver lacks .error_detected()

When a device such as 74:02.0 reported a non-fatal error, do_recovery()
failed because 74:03.0 lacked an .error_detected() method.  But per PCIe
r3.1, sec 6.2.2.2.2, such an error does not compromise the Link and
does not affect 74:03.0:

  Non-fatal errors are uncorrectable errors which cause a particular
  transaction to be unreliable but the Link is otherwise fully functional.
  Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
  in a device or system management software the opportunity to recover from
  the error without resetting the components on the Link and disturbing
  other transactions in progress.  Devices not associated with the
  transaction in error are not impacted by the error.

Report non-fatal errors only to the endpoint that reported them.  We really
want to check for AER_NONFATAL here, but the current code structure doesn't
allow that.  Looking for pci_channel_io_normal is the best we can do now.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agortc: set the alarm to the next expiring timer
Alexandre Belloni [Thu, 28 Sep 2017 11:53:27 +0000 (13:53 +0200)]
rtc: set the alarm to the next expiring timer

commit 74717b28cb32e1ad3c1042cafd76b264c8c0f68d upstream.

If there is any non expired timer in the queue, the RTC alarm is never set.
This is an issue when adding a timer that expires before the next non
expired timer.

Ensure the RTC alarm is set in that case.

Fixes: 2b2f5ff00f63 ("rtc: interface: ignore expired timers when enqueuing new timers")
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
[bwh: Backported to 3.2: open-code ktime_before()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agortc: interface: ignore expired timers when enqueuing new timers
Colin Ian King [Mon, 16 May 2016 16:22:54 +0000 (17:22 +0100)]
rtc: interface: ignore expired timers when enqueuing new timers

commit 2b2f5ff00f63847d95adad6289bd8b05f5983dd5 upstream.

This patch fixes a RTC wakealarm issue, namely, the event fires during
hibernate and is not cleared from the list, causing hwclock to block.

The current enqueuing does not trigger an alarm if any expired timers
already exist on the timerqueue. This can occur when a RTC wake alarm
is used to wake a machine out of hibernate and the resumed state has
old expired timers that have not been removed from the timer queue.
This fix skips over any expired timers and triggers an alarm if there
are no pending timers on the timerqueue. Note that the skipped expired
timer will get reaped later on, so there is no need to clean it up
immediately.

The issue can be reproduced by putting a machine into hibernate and
waking it with the RTC wakealarm.  Running the example RTC test program
from tools/testing/selftests/timers/rtctest.c after the hibernate will
block indefinitely.  With the fix, it no longer blocks after the
hibernate resume.

BugLink: http://bugs.launchpad.net/bugs/1333569
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoInput: adxl34x - do not treat FIFO_MODE() as boolean
Arnd Bergmann [Wed, 20 Sep 2017 19:04:04 +0000 (12:04 -0700)]
Input: adxl34x - do not treat FIFO_MODE() as boolean

commit 1dbc080c9ef6bcfba652ef0d6ae919b8c7c85a1d upstream.

FIFO_MODE() is a macro expression with a '<<' operator, which gcc points
out could be misread as a '<':

drivers/input/misc/adxl34x.c: In function 'adxl34x_probe':
drivers/input/misc/adxl34x.c:799:36: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]

While utility of this warning is being disputed (Chief Penguin: "This
warning is clearly pure garbage.") FIFO_MODE() extracts range of values,
with 0 being FIFO_BYPASS, and not something that is logically boolean.

This converts the test to an explicit comparison with FIFO_BYPASS,
making it clearer to gcc and the reader what is intended.

Fixes: e27c729219ad ("Input: add driver for ADXL345/346 Digital Accelerometers")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoLinux 3.2.98
Ben Hutchings [Sun, 7 Jan 2018 01:46:55 +0000 (01:46 +0000)]
Linux 3.2.98

3 years agoKPTI: Report when enabled
Kees Cook [Wed, 3 Jan 2018 18:18:01 +0000 (10:18 -0800)]
KPTI: Report when enabled

Make sure dmesg reports when KPTI is enabled.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoKPTI: Rename to PAGE_TABLE_ISOLATION
Kees Cook [Thu, 4 Jan 2018 01:14:24 +0000 (01:14 +0000)]
KPTI: Rename to PAGE_TABLE_ISOLATION

This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/kaiser: Move feature detection up
Borislav Petkov [Mon, 25 Dec 2017 12:57:16 +0000 (13:57 +0100)]
x86/kaiser: Move feature detection up

... before the first use of kaiser_enabled as otherwise funky
things happen:

  about to get started...
  (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000]
  (XEN) Pagetable walk from ffff88022a449090:
  (XEN)  L4[0x110] = 0000000229e0e067 0000000000001e0e
  (XEN)  L3[0x008] = 0000000000000000 ffffffffffffffff
  (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08
  entry.o#create_bounce_frame+0x135/0x14d
  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
  (XEN) ----[ Xen-4.9.1_02-3.21  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e033:[<ffffffff81007460>]
  (XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: disabled on Xen PV
Jiri Kosina [Tue, 2 Jan 2018 13:19:49 +0000 (14:19 +0100)]
kaiser: disabled on Xen PV

Kaiser cannot be used on paravirtualized MMUs (namely reading and writing CR3).
This does not work with KAISER as the CR3 switch from and to user space PGD
would require to map the whole XEN_PV machinery into both.

More importantly, enabling KAISER on Xen PV doesn't make too much sense, as PV
guests use distinct %cr3 values for kernel and user already.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: use xen_pv_domain()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/kaiser: Reenable PARAVIRT
Borislav Petkov [Tue, 2 Jan 2018 13:19:49 +0000 (14:19 +0100)]
x86/kaiser: Reenable PARAVIRT

Now that the required bits have been addressed, reenable
PARAVIRT.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/paravirt: Dont patch flush_tlb_single
Thomas Gleixner [Mon, 4 Dec 2017 14:07:30 +0000 (15:07 +0100)]
x86/paravirt: Dont patch flush_tlb_single

commit a035795499ca1c2bd1928808d1a156eda1420383 upstream.

native_flush_tlb_single() will be changed with the upcoming
PAGE_TABLE_ISOLATION feature. This requires to have more code in
there than INVLPG.

Remove the paravirt patching for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: kaiser_flush_tlb_on_return_to_user() check PCID
Hugh Dickins [Sun, 5 Nov 2017 01:43:06 +0000 (18:43 -0700)]
kaiser: kaiser_flush_tlb_on_return_to_user() check PCID

Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case, it's clearer this way, and better
suits later changes.  Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.

(cherry picked from Change-Id: I9b528ed9d7c1ae4a3b4738c2894ee1740b6fb0b9)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: asm/tlbflush.h handle noPGE at lower level
Hugh Dickins [Sun, 5 Nov 2017 01:23:24 +0000 (18:23 -0700)]
kaiser: asm/tlbflush.h handle noPGE at lower level

I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.

(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)

Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.

(cherry picked from Change-Id: I0da819a797ff46bca6590040b6480178dff6ba1e)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Hugh Dickins [Wed, 4 Oct 2017 03:49:04 +0000 (20:49 -0700)]
kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush

Now that we're playing the ALTERNATIVE game, use that more efficient
method: instead of user-mapping an extra page, and reading an extra
cacheline each time for x86_cr3_pcid_noflush.

Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in looks clearer, so let's stick with that.

Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE?  I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.

(cherry picked from Change-Id: I46d06167615aa8d628eed9972125ab2faca93f05)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/kaiser: Check boottime cmdline params
Borislav Petkov [Tue, 2 Jan 2018 13:19:48 +0000 (14:19 +0100)]
x86/kaiser: Check boottime cmdline params

AMD (and possibly other vendors) are not affected by the leak
KAISER is protecting against.

Keep the "nopti" for traditional reasons and add pti=<on|off|auto>
like upstream.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/kaiser: Rename and simplify X86_FEATURE_KAISER handling
Borislav Petkov [Tue, 2 Jan 2018 13:19:48 +0000 (14:19 +0100)]
x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling

Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/boot: Add early cmdline parsing for options with arguments
Tom Lendacky [Mon, 17 Jul 2017 21:10:33 +0000 (16:10 -0500)]
x86/boot: Add early cmdline parsing for options with arguments

commit e505371dd83963caae1a37ead9524e8d997341be upstream.

Add a cmdline_find_option() function to look for cmdline options that
take arguments. The argument is returned in a supplied buffer and the
argument length (regardless of whether it fits in the supplied buffer)
is returned, with -1 indicating not found.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/boot: Pass in size to early cmdline parsing
Dave Hansen [Tue, 22 Dec 2015 22:52:43 +0000 (14:52 -0800)]
x86/boot: Pass in size to early cmdline parsing

commit 8c0517759a1a100a8b83134cf3c7f254774aaeba upstream.

We will use this in a few patches to implement tests for early parsing.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Aligned args properly. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225243.5CC47EB6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/boot: Simplify early command line parsing
Dave Hansen [Tue, 22 Dec 2015 22:52:41 +0000 (14:52 -0800)]
x86/boot: Simplify early command line parsing

commit 4de07ea481361b08fe13735004dafae862482d38 upstream.

__cmdline_find_option_bool() tries to account for both NULL-terminated
and non-NULL-terminated strings. It keeps 'pos' to look for the end of
the buffer and also looks for '!c' in a bunch of places to look for NULL
termination.

But, it also calls strlen(). You can't call strlen on a
non-NULL-terminated string.

If !strlen(cmdline), then cmdline[0]=='\0'. In that case, we will go in
to the while() loop, set c='\0', hit st_wordstart, notice !c, and will
immediately return 0.

So, remove the strlen().  It is unnecessary and unsafe.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225241.15365E43@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/boot: Fix early command-line parsing when partial word matches
Dave Hansen [Tue, 22 Dec 2015 22:52:39 +0000 (14:52 -0800)]
x86/boot: Fix early command-line parsing when partial word matches

commit abcdc1c694fa4055323cbec1cde4c2cb6b68398c upstream.

cmdline_find_option_bool() keeps track of position in two strings:

 1. the command-line
 2. the option we are searchign for in the command-line

We plow through each character in the command-line one at a time, always
moving forward. We move forward in the option ('opptr') when we match
characters in 'cmdline'. We reset the 'opptr' only when we go in to the
'st_wordstart' state.

But, if we fail to match an option because we see a space
(state=st_wordcmp, *opptr='\0',c=' '), we set state='st_wordskip' and
'break', moving to the next character. But, that move to the next
character is the one *after* the ' '. This means that we will miss a
'st_wordstart' state.

For instance, if we have

  cmdline = "foo fool";

and are searching for "fool", we have:

  "fool"
  opptr = ----^

           "foo fool"
   c = --------^

We see that 'l' != ' ', set state=st_wordskip, break, and then move 'c', so:

          "foo fool"
  c = ---------^

and are still in state=st_wordskip. We will stay in wordskip until we
have skipped "fool", thus missing the option we were looking for. This
*only* happens when you have a partially- matching word followed by a
matching one.

To fix this, we always fall *into* the 'st_wordskip' state when we set
it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225239.8E1DCA58@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/boot: Fix early command-line parsing when matching at end
Dave Hansen [Tue, 22 Dec 2015 22:52:38 +0000 (14:52 -0800)]
x86/boot: Fix early command-line parsing when matching at end

commit 02afeaae9843733a39cd9b11053748b2d1dc5ae7 upstream.

The x86 early command line parsing in cmdline_find_option_bool() is
buggy. If it matches a specified 'option' all the way to the end of the
command-line, it will consider it a match.

For instance,

  cmdline = "foo";
  cmdline_find_option_bool(cmdline, "fool");

will return 1. This is particularly annoying since we have actual FPU
options like "noxsave" and "noxsaves" So, command-line "foo bar noxsave"
will match *BOTH* a "noxsave" and "noxsaves". (This turns out not to be
an actual problem because "noxsave" implies "noxsaves", but it's still
confusing.)

To fix this, we simplify the code and stop tracking 'len'. 'len'
was trying to indicate either the NULL terminator *OR* the end of a
non-NULL-terminated command line at 'COMMAND_LINE_SIZE'. But, each of the
three states is *already* checking 'cmdline' for a NULL terminator.

We _only_ need to check if we have overrun 'COMMAND_LINE_SIZE', and that
we can do without keeping 'len' around.

Also add some commends to clarify what is going on.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225238.9AEB560C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86, boot: Carve out early cmdline parsing function
Borislav Petkov [Mon, 19 May 2014 18:59:16 +0000 (20:59 +0200)]
x86, boot: Carve out early cmdline parsing function

commit 1b1ded57a4f2f4420b4de7c395d1b841d8b3c41a upstream.

Carve out early cmdline parsing function into .../lib/cmdline.c so it
can be used by early code in the kernel proper as well.

Adapted from arch/x86/boot/cmdline.c.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: add "nokaiser" boot option, using ALTERNATIVE
Hugh Dickins [Sun, 24 Sep 2017 23:59:49 +0000 (16:59 -0700)]
kaiser: add "nokaiser" boot option, using ALTERNATIVE

Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime.  That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER fabricated
("" in its comment so "kaiser" not magicked into /proc/cpuinfo).

Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or something add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.

It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.

Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in init_memory_mapping().

(cherry picked from Change-Id: I8e5bec716944444359cbd19f6729311eff943e9a)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/alternatives: Use optimized NOPs for padding
Borislav Petkov [Sat, 10 Jan 2015 19:34:07 +0000 (20:34 +0100)]
x86/alternatives: Use optimized NOPs for padding

commit 4fd4b6e5537cec5b56db0b22546dd439ebb26830 upstream.

Alternatives allow now for an empty old instruction. In this case we go
and pad the space with NOPs at assembly time. However, there are the
optimal, longer NOPs which should be used. Do that at patching time by
adding alt_instr.padlen-sized NOPs at the old instruction address.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/alternatives: Make JMPs more robust
Borislav Petkov [Mon, 5 Jan 2015 12:48:41 +0000 (13:48 +0100)]
x86/alternatives: Make JMPs more robust

commit 48c7a2509f9e237d8465399d9cdfe487d3212a23 upstream.

Up until now we had to pay attention to relative JMPs in alternatives
about how their relative offset gets computed so that the jump target
is still correct. Or, as it is the case for near CALLs (opcode e8), we
still have to go and readjust the offset at patching time.

What is more, the static_cpu_has_safe() facility had to forcefully
generate 5-byte JMPs since we couldn't rely on the compiler to generate
properly sized ones so we had to force the longest ones. Worse than
that, sometimes it would generate a replacement JMP which is longer than
the original one, thus overwriting the beginning of the next instruction
at patching time.

So, in order to alleviate all that and make using JMPs more
straight-forward we go and pad the original instruction in an
alternative block with NOPs at build time, should the replacement(s) be
longer. This way, alternatives users shouldn't pay special attention
so that original and replacement instruction sizes are fine but the
assembler would simply add padding where needed and not do anything
otherwise.

As a second aspect, we go and recompute JMPs at patching time so that we
can try to make 5-byte JMPs into two-byte ones if possible. If not, we
still have to recompute the offsets as the replacement JMP gets put far
away in the .altinstr_replacement section leading to a wrong offset if
copied verbatim.

For example, on a locally generated kernel image

  old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
  repl insn: size: 5
  ffffffff81d0b23c:       e9 b1 62 2f ff          jmpq ffffffff810014f2

gets corrected to a 2-byte JMP:

  apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
  alt_insn: e9 b1 62 2f ff
  recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
  converted to: eb 33 90 90 90

and a 5-byte JMP:

  old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff81001516:      eb 30                   jmp ffffffff81001548
  repl insn: size: 5
   ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556

gets shortened into a two-byte one:

  apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
  alt_insn: e9 10 63 2f ff
  recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
  converted to: eb 3e 90 90 90

... and so on.

This leads to a net win of around

40ish replacements * 3 bytes savings =~ 120 bytes of I$

on an AMD guest which means some savings of precious instruction cache
bandwidth. The padding to the shorter 2-byte JMPs are single-byte NOPs
which on smart microarchitectures means discarding NOPs at decode time
and thus freeing up execution bandwidth.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/alternatives: Add instruction padding
Borislav Petkov [Sat, 27 Dec 2014 09:41:52 +0000 (10:41 +0100)]
x86/alternatives: Add instruction padding

commit 4332195c5615bf748624094ce4ff6797e475024d upstream.

Up until now we have always paid attention to make sure the length of
the new instruction replacing the old one is at least less or equal to
the length of the old instruction. If the new instruction is longer, at
the time it replaces the old instruction it will overwrite the beginning
of the next instruction in the kernel image and cause your pants to
catch fire.

So instead of having to pay attention, teach the alternatives framework
to pad shorter old instructions with NOPs at buildtime - but only in the
case when

  len(old instruction(s)) < len(new instruction(s))

and add nothing in the >= case. (In that case we do add_nops() when
patching).

This way the alternatives user shouldn't have to care about instruction
sizes and simply use the macros.

Add asm ALTERNATIVE* flavor macros too, while at it.

Also, we need to save the pad length in a separate struct alt_instr
member for NOP optimization and the way to do that reliably is to carry
the pad length instead of trying to detect whether we're looking at
single-byte NOPs or at pathological instruction offsets like e9 90 90 90
90, for example, which is a valid instruction.

Thanks to Michael Matz for the great help with toolchain questions.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/alternatives: Cleanup DPRINTK macro
Borislav Petkov [Tue, 30 Dec 2014 19:27:09 +0000 (20:27 +0100)]
x86/alternatives: Cleanup DPRINTK macro

commit db477a3386dee183130916d6bbf21f5828b0b2e2 upstream.

Make it pass __func__ implicitly. Also, dump info about each replacing
we're doing. Fixup comments and style while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Update one more use of DPRINTK() that was removed upstream]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: alloc_ldt_struct() use get_zeroed_page()
Hugh Dickins [Mon, 18 Dec 2017 03:53:01 +0000 (19:53 -0800)]
kaiser: alloc_ldt_struct() use get_zeroed_page()

Change the 3.2.96 and 3.18.72 alloc_ldt_struct() to allocate its entries
with get_zeroed_page(), as 4.3 onwards does since f454b4788613 ("x86/ldt:
Fix small LDT allocation for Xen").  This then matches the free_page()
I had misported in __free_ldt_struct(), and fixes the
"BUG: Bad page state in process ldt_gdt_32 ... flags: 0x80(slab)"
reported by Kees Cook and Jiri Kosina, and analysed by Jiri.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agokaiser: user_map __kprobes_text too
Hugh Dickins [Mon, 18 Dec 2017 03:29:01 +0000 (19:29 -0800)]
kaiser: user_map __kprobes_text too

In 3.2 (and earlier, and up to 3.15) Kaiser needs to user_map the
__kprobes_text as well as the __entry_text: entry_64.S places some
vital functions there, so without this you very soon triple-fault.
Many thanks to Jiri Kosina for pointing me in this direction.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm/kaiser: re-enable vsyscalls
Andrea Arcangeli [Tue, 5 Dec 2017 20:15:07 +0000 (21:15 +0100)]
x86/mm/kaiser: re-enable vsyscalls

To avoid breaking the kernel ABI.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[Hugh Dickins: Backported to 3.2:
 - Leave out the PVCLOCK_FIXMAP user mapping, which does not apply to
   this tree
 - For safety added vsyscall_pgprot, and a BUG_ON if _PAGE_USER
   outside of FIXMAP.]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agoKAISER: Kernel Address Isolation
Hugh Dickins [Tue, 12 Dec 2017 01:59:50 +0000 (17:59 -0800)]
KAISER: Kernel Address Isolation

This patch introduces our implementation of KAISER (Kernel Address
Isolation to have Side-channels Efficiently Removed), a kernel isolation
technique to close hardware side channels on kernel address information.

More information about the original patch can be found at:
https://github.com/IAIK/KAISER
http://marc.info/?l=linux-kernel&m=149390087310405&w=2

Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Richard Fellner <richard.fellner@student.tugraz.at>
Michael Schwarz <michael.schwarz@iaik.tugraz.at>
<clementine.maurice@iaik.tugraz.at>
<moritz.lipp@iaik.tugraz.at>

That original was then developed further by
Dave Hansen <dave.hansen@intel.com>
Hugh Dickins <hughd@google.com>
then others after this snapshot.

This combined patch for 3.2.96 was derived from hughd's patches below
for 3.18.72, in 2017-12-04's kaiser-3.18.72.tar; except for the last,
which was sent in 2017-12-09's nokaiser-3.18.72.tar.  They have been
combined in order to minimize the effort of rebasing: most of the
patches in the 3.18.72 series were small fixes and cleanups and
enhancements to three large patches.  About the only new work in this
backport is a simple reimplementation of kaiser_remove_mapping():
since mm/pageattr.c changed a lot between 3.2 and 3.18, and the
mods there for Kaiser never seemed necessary.

KAISER: Kernel Address Isolation
kaiser: merged update
kaiser: do not set _PAGE_NX on pgd_none
kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
kaiser: fix build and FIXME in alloc_ldt_struct()
kaiser: KAISER depends on SMP
kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
kaiser: fix perf crashes
kaiser: ENOMEM if kaiser_pagetable_walk() NULL
kaiser: tidied up asm/kaiser.h somewhat
kaiser: tidied up kaiser_add/remove_mapping slightly
kaiser: kaiser_remove_mapping() move along the pgd
kaiser: align addition to x86/mm/Makefile
kaiser: cleanups while trying for gold link
kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
kaiser: delete KAISER_REAL_SWITCH option
kaiser: vmstat show NR_KAISERTABLE as nr_overhead
kaiser: enhanced by kernel and user PCIDs
kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
kaiser: PCID 0 for kernel and 128 for user
kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
kaiser: paranoid_entry pass cr3 need to paranoid_exit
kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
kaiser: fix unlikely error in alloc_ldt_struct()
kaiser: drop is_atomic arg to kaiser_pagetable_walk()

Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh:
 - Fixed the #undef in arch/x86/boot/compressed/misc.h
 - Add missing #include in arch/x86/mm/kaiser.c]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm/64: Fix reboot interaction with CR4.PCIDE
Andy Lutomirski [Mon, 9 Oct 2017 04:53:05 +0000 (21:53 -0700)]
x86/mm/64: Fix reboot interaction with CR4.PCIDE

commit 924c6b900cfdf376b07bccfd80e62b21914f8a5a upstream.

Trying to reboot via real mode fails with PCID on: long mode cannot
be exited while CR4.PCIDE is set.  (No, I have no idea why, but the
SDM and actual CPUs are in agreement here.)  The result is a GPF and
a hang instead of a reboot.

I didn't catch this in testing because neither my computer nor my VM
reboots this way.  I can trigger it with reboot=bios, though.

Fixes: 660da7c9228f ("x86/mm: Enable CR4.PCIDE on supported systems")
Reported-and-tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.1507524746.git.luto@kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm: Enable CR4.PCIDE on supported systems
Andy Lutomirski [Thu, 29 Jun 2017 15:53:21 +0000 (08:53 -0700)]
x86/mm: Enable CR4.PCIDE on supported systems

commit 660da7c9228f685b2ebe664f9fd69aaddcc420b5 upstream.

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect. A followup patch will start using PCID.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6327ecd907b32f79d5aa0d466f04503bbec5df88.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.2:
 - arch/x86/xen/enlighten_pv.c (not in this tree)
 - arch/x86/xen/enlighten.c (patched instead of that)]
Signed-off-by: Hugh Dickins <hughd@google.com>
[Borislav Petkov: Fix bad backport to disable PCID on Xen]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm: Add the 'nopcid' boot option to turn off PCID
Andy Lutomirski [Thu, 29 Jun 2017 15:53:20 +0000 (08:53 -0700)]
x86/mm: Add the 'nopcid' boot option to turn off PCID

commit 0790c9aad84901ca1bdc14746175549c8b5da215 upstream.

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/8bbb2e65bcd249a5f18bfb8128b4689f08ac2b60.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.2:
 - Documentation/admin-guide/kernel-parameters.txt (not in this tree)
 - Documentation/kernel-parameters.txt (patched instead of that)]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm: Disable PCID on 32-bit kernels
Andy Lutomirski [Thu, 29 Jun 2017 15:53:19 +0000 (08:53 -0700)]
x86/mm: Disable PCID on 32-bit kernels

commit cba4671af7550e008f7a7835f06df0763825bf3e upstream.

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/2e391769192a4d31b808410c383c6bf0734bc6ea.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
Andy Lutomirski [Sun, 28 May 2017 17:00:14 +0000 (10:00 -0700)]
x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code

commit ce4a4e565f5264909a18c733b864c3f74467f69e upstream.

The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:

 - flush_tlb_mm_range() didn't flush individual pages if the range
   was small.

 - The lazy TLB code was much weaker.  This usually wouldn't matter,
   but, if a kernel thread flushed its lazy "active_mm" more than
   once (due to reclaim or similar), it wouldn't be unlazied and
   would instead pointlessly flush repeatedly.

 - Tracepoints were missing.

Aside from that, simply having the UP code around was a maintanence
burden, since it means that any change to the TLB flush code had to
make sure not to break it.

Simplify everything by deleting the UP code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.2]
Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Fix allnoconfig build failure due to direct use of 'apic' in
 flush_tlb_others_ipi()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agosched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
Andy Lutomirski [Fri, 9 Jun 2017 18:49:15 +0000 (11:49 -0700)]
sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()

commit 252d2a4117bc181b287eeddf848863788da733ae upstream.

idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm, sched/core: Turn off IRQs in switch_mm()
Andy Lutomirski [Tue, 26 Apr 2016 16:39:09 +0000 (09:39 -0700)]
x86/mm, sched/core: Turn off IRQs in switch_mm()

commit 078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a upstream.

Potential races between switch_mm() and TLB-flush or LDT-flush IPIs
could be very messy.  AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm() with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm, sched/core: Uninline switch_mm()
Andy Lutomirski [Tue, 26 Apr 2016 16:39:08 +0000 (09:39 -0700)]
x86/mm, sched/core: Uninline switch_mm()

commit 69c0319aabba45bcf33178916a2f06967b4adede upstream.

It's fairly large and it has quite a few callers.  This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.2]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
3 years agox86/mm: Build arch/x86/mm/tlb.c even on !SMP
Andy Lutomirski [Tue, 26 Apr 2016 16:39:07 +0000 (09:39 -0700)]
x86/mm: Build arch/x86/mm/tlb.c even on !SMP

commit e1074888c326038340a1ada9129d679e661f2ea6 upstream.

Currently all of the functions that live in tlb.c are inlined on
!SMP builds.  One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>