  1. Jun 01, 2013
  2. May 31, 2013
  3. May 29, 2013
  4. May 28, 2013
    • arp: flush arp cache on IFF_NOARP change · 6c8b4e3f
      Timo Teräs authored
      
      IFF_NOARP affects what kind of neighbor entries are created
      (nud NOARP or nud INCOMPLETE). If the flag changes, flush the arp
      cache to refresh all entries.
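      
      A minimal sketch of the check this description implies, assuming the
      handler can see the old and the new interface flags. The helper name
      and SKETCH_IFF_NOARP are hypothetical (the value mirrors Linux's
      IFF_NOARP, 0x80); the actual patch works inside the kernel's
      netdevice notifier path and is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for IFF_NOARP; 0x80 is the value Linux uses. */
#define SKETCH_IFF_NOARP 0x80u

/* Returns true when the NOARP bit differs between the old and new flags,
 * i.e. when cached neighbor entries may have been created under the
 * other setting and should be flushed. */
bool noarp_changed(unsigned int old_flags, unsigned int new_flags)
{
    return ((old_flags ^ new_flags) & SKETCH_IFF_NOARP) != 0;
}
```

      Only a transition of that one bit triggers a flush in this sketch;
      changes to unrelated flags leave the cache alone.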
      
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      
      v2->v3: shortened notifier_info struct name
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pass info struct via netdevice notifier · 351638e7
      Jiri Pirko authored
      
      So far, only a net_device * could be passed along with a netdevice
      notifier event. This patch makes it possible to pass a custom structure
      carrying the info that the event listener needs.
      
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      
      v2->v3: fix typo on simeth
      	shortened dev_getter
      	shortened notifier_info struct name
      v1->v2: fix notifier_call parameter in call_netdevice_notifier()
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MPLS: Add limited GSO support · 0d89d203
      Simon Horman authored
      
      In the case where a non-MPLS packet is received and an MPLS stack is
      added it may well be the case that the original skb is GSO but the
      NIC used for transmit does not support GSO of MPLS packets.
      
      The aim of this code is to provide GSO in software for MPLS packets
      whose skbs are GSO.
      
      SKB Usage:
      
      When an implementation adds an MPLS stack to a non-MPLS packet it should do
      the following to skb metadata:
      
      * Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
        skb->inner_protocol is added by this patch.
      
      * Set skb->protocol to the new MPLS ethertype of the packet.
      
      * Set skb->network_header to correspond to the
        end of the L3 header, including the MPLS label stack.
      
      I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
      kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
      That patch sets the above requirements in datapath/actions.c:push_mpls()
      and was used to exercise this code.  The datapath patch is against the Open
      vSwitch tree but it is intended that it be added to the Open vSwitch code
      present in the mainline Linux kernel at some point.
      
      Features:
      
      I believe that the approach that I have taken is at least partially
      consistent with the handling of other protocols.  Jesse, I understand that
      you have some ideas here.  I am more than happy to change my implementation.
      
      This patch adds dev->mpls_features which may be used by devices
      to advertise features supported for MPLS packets.
      
      A new NETIF_F_MPLS_GSO feature is added for devices which support
      hardware MPLS GSO offload.  Currently no devices support this
      and MPLS GSO always falls back to software.
      
      Alternate Implementation:
      
      One possible alternate implementation is to teach netif_skb_features()
      and skb_network_protocol() about MPLS, in a similar way to their
      understanding of VLANs. I believe this would avoid the need
      for net/mpls/mpls_gso.c and in particular the calls to
      __skb_push() and __skb_push() in mpls_gso_segment().
      
      I have decided on the implementation in this patch as it should
      not introduce any overhead in the case where mpls_gso is not compiled
      into the kernel or inserted as a module.
      
      MPLS GSO suggested by Jesse Gross.
      Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
      by Pravin B Shelar.
      
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. May 26, 2013
  6. May 24, 2013
    • tcp: xps: fix reordering issues · 547669d4
      Eric Dumazet authored
      
      commit 3853b584 ("xps: Improvements in TX queue selection")
      introduced ooo_okay flag, but the condition to set it is slightly wrong.
      
      In our traces, we have seen ACK packets being received out of order,
      and RST packets sent in response.
      
      We should test if we have any packets still in host queue.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. May 23, 2013
    • tcp: bug fix in proportional rate reduction. · 35f079eb
      Nandita Dukkipati authored
      
      This patch is a fix for a bug triggering newly_acked_sacked < 0
      in tcp_ack(.).
      
      The bug is triggered by sacked_out decreasing relative to prior_sacked,
      but packets_out remaining the same as prior_packets. This is because the
      snapshot of prior_packets is taken after tcp_sacktag_write_queue() while
      prior_sacked is captured before tcp_sacktag_write_queue(). The problem
      is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment)
      adjusts the pcount for packets_out and sacked_out (MSS change or other
      reason). As a result, this delta in pcount is reflected in
      (prior_sacked - sacked_out) but not in (prior_packets - packets_out).
      
      This patch does the following:
      1) initializes prior_packets at the start of tcp_ack() so as to
      capture the delta in packets_out created by tcp_fragment.
      2) introduces a new "previous_packets_out" variable that snapshots
      packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
      correctly computed as before.
      3) Computes pkts_acked using previous_packets_out, and computes
      newly_acked_sacked using prior_packets.
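      
      The arithmetic behind the bug can be illustrated with a small sketch.
      It assumes newly_acked_sacked is formed from the two differences named
      above, (prior_packets - packets_out) and (prior_sacked - sacked_out);
      the function and the numbers are hypothetical and are not taken from
      tcp_ack() itself.

```c
/* Combines the two deltas named in the commit message. The exact
 * expression used in tcp_ack() is an assumption of this sketch. */
int newly_acked_sacked(int prior_packets, int prior_sacked,
                       int packets_out, int sacked_out)
{
    return (prior_packets - packets_out) - (prior_sacked - sacked_out);
}

/* Hypothetical scenario: before SACK processing packets_out = 10 and
 * sacked_out = 4. A pcount adjustment during fragmentation lowers both
 * by one (9 and 3), and nothing is newly acked or sacked.
 *
 * Snapshot taken after the adjustment (old behaviour):
 *   newly_acked_sacked(9, 4, 9, 3)  yields -1
 * Snapshot taken at the start (fixed behaviour):
 *   newly_acked_sacked(10, 4, 9, 3) yields 0
 */
```

      With both snapshots taken before the adjustment, the pcount delta
      appears in both differences and cancels out.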
      
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. May 20, 2013
    • tcp: md5: remove spinlock usage in fast path · 71cea17e
      Eric Dumazet authored
      
      TCP md5 code uses per cpu variables but protects access to them with
      a shared spinlock, which is a contention point.
      
      [ tcp_md5sig_pool_lock is locked twice per incoming packet ]
      
      Makes things much simpler, by allocating crypto structures once, first
      time a socket needs md5 keys, and not deallocating them as they are
      really small.
      
      Next step would be to allow crypto allocations being done in a NUMA
      aware way.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_gre: fix a possible crash in ipgre_err() · 96f5a846
      Eric Dumazet authored
      
      Another fix needed in ipgre_err(), as parse_gre_header() might change
      skb->head.
      
      Bug added in commit c5441932 (GRE: Refactor GRE tunneling code.)
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove bad timeout logic in fast recovery · 3e59cb0d
      Yuchung Cheng authored
      
      tcp_timeout_skb() was intended to trigger fast recovery on timeout,
      unfortunately in reality it often causes spurious retransmission
      storms during fast recovery. The particular sign is a fast retransmit
      over the highest sacked sequence (SND.FACK).
      
      Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
      to avoid spurious timeout: when SND.UNA advances the sender re-arms
      RTO and extends the timeout by icsk_rto. The sender does not offset
      the time elapsed since the packet at SND.UNA was sent.
      
      But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
      tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
      sent before the icsk_rto interval lost, including one that's above the
      highest sacked sequence. Most likely a large part of the scoreboard will
      be marked.
      
      If most packets are not lost then the subsequent DUPACKs with new SACK
      blocks will cause the sender to continue to retransmit packets beyond
      SND.FACK spuriously. Even if only one packet is lost the sender may
      falsely retransmit almost the entire window.
      
      The situation becomes common in the world of bufferbloat: the RTT
      continues to grow as the queue builds up but RTTVAR remains small and
      close to the minimum 200ms. If a data packet is lost and the DUPACK
      triggered by the next data packet is slightly delayed, then a spurious
      retransmission storm forms.
      
      As the original comment on tcp_timeout_skb() suggests: the usefulness
      of this feature is questionable. It also wastes cycles walking the
      sack scoreboard and is actually harmful because of false recovery.
      
      It's time to remove this.
      
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. May 17, 2013
    • tcp: speedup tcp_fixup_rcvbuf() · d2cf4367
      Eric Dumazet authored
      
      tcp_fixup_rcvbuf() contains a loop to estimate initial socket
      rcv space needed for a given mss. With large MTU (like 64K on lo),
      we can loop ~500 times and consume a lot of cpu cycles.
      
      perf top of 200 concurrent netperf -t TCP_CRR
      
      5.62%  netperf  [kernel.kallsyms]  [k] tcp_init_buffer_space
      1.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock
      1.55%  netperf  [kernel.kallsyms]  [k] kmem_cache_free
      1.51%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb
      1.50%  netperf  [kernel.kallsyms]  [k] tcp_ack
      
      Let's use a 100% factor, and remove the loop.
      
      100% is needed anyway for tcp_adv_win_scale=1
      default value, and is also the maximum factor.
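      
      A rough model of why the loop is expensive and why doubling is enough.
      It assumes the window is computed as space - (space >> 1) for
      tcp_adv_win_scale=1 and that the old loop grew the buffer in 128-byte
      steps; the function names, the step size and the per-packet overhead
      are assumptions of this sketch, not values quoted from the patch.

```c
/* Usable window for a given buffer space with tcp_adv_win_scale=1:
 * half of the space, rounded up. */
unsigned int win_from_space(unsigned int space)
{
    return space - (space >> 1);
}

/* Old approach (as modelled here): start from mss plus a per-packet
 * overhead and add 128 bytes until the window covers one mss.
 * Returns how many times the loop body runs. */
unsigned int loop_iterations(unsigned int mss, unsigned int overhead)
{
    unsigned int rcvmem = mss + overhead;
    unsigned int n = 0;

    while (win_from_space(rcvmem) < mss) {
        rcvmem += 128;
        n++;
    }
    return n;
}

/* New approach: a fixed 100% factor, no loop. */
unsigned int rcvmem_no_loop(unsigned int mss, unsigned int overhead)
{
    return 2 * (mss + overhead);
}
```

      For a hypothetical mss of 65483 and an overhead of 576 bytes,
      loop_iterations() runs about five hundred times, while
      rcvmem_no_loop() gives a window of mss + overhead, which always
      covers one mss.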
      
      Refs: commit b49960a0
            ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. May 16, 2013
    • tcp: gso: do not generate out of order packets · 6ff50cd5
      Eric Dumazet authored
      
      The GSO TCP handler has the following issues:
      
      1) ooo_okay from original GSO packet is duplicated to all segments
      2) segments (but the last one) are orphaned, so transmit path can not
      get transmit queue number from the socket. This happens if GSO
      segmentation is done before stacked device for example.
      
      Result is we can send packets from a given TCP flow to different TX
      queues (if using multiqueue NICs). This generates OOO problems and
      spurious SACK & retransmits.
      
      Fix this by keeping socket pointer set for all segments.
      
      This means that every segment must also have a destructor, and the
      original gso skb truesize must be split on all segments, to keep
      precise sk->sk_wmem_alloc accounting.
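      
      One way to keep the accounting exact is sketched below: charge mss
      bytes of truesize to every segment except the last, and give the
      last segment whatever remains. This split rule, the helper names and
      the choice to keep ooo_okay only on the first segment are assumptions
      made for illustration; the commit message only requires that the
      per-segment truesize values add up to the original.

```c
#include <stdbool.h>

/* Truesize charged to segment `index` (0-based) out of `nsegs`.
 * Assumes nsegs >= 1 and gso_truesize >= mss * (nsegs - 1). */
unsigned int segment_truesize(unsigned int gso_truesize, unsigned int mss,
                              unsigned int nsegs, unsigned int index)
{
    if (index + 1 < nsegs)
        return mss;                          /* every segment but the last */
    return gso_truesize - mss * (nsegs - 1); /* last one takes the remainder */
}

/* Sum over all segments; equals gso_truesize by construction. */
unsigned int total_truesize(unsigned int gso_truesize, unsigned int mss,
                            unsigned int nsegs)
{
    unsigned int sum = 0;

    for (unsigned int i = 0; i < nsegs; i++)
        sum += segment_truesize(gso_truesize, mss, nsegs, i);
    return sum;
}

/* Only the first segment may inherit ooo_okay from the GSO packet. */
bool segment_ooo_okay(bool gso_ooo_okay, unsigned int index)
{
    return index == 0 && gso_ooo_okay;
}
```

      Because the shares sum to the original truesize, releasing the
      segments one by one returns exactly what was charged to the socket.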
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. May 15, 2013
  12. May 14, 2013
    • tcp: fix tcp_md5_hash_skb_data() · 54d27fcb
      Eric Dumazet authored
      
      TCP md5 communications fail [1] for some devices, because the sg/crypto
      code assumes page offsets are below PAGE_SIZE.
      
      This was discovered using mlx4 driver [2], but I suspect loopback
      might trigger the same bug now we use order-3 pages in tcp_sendmsg()
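      
      The commit message does not spell out the fix. One way to satisfy
      code that expects offsets below the page size is to fold whole pages
      of the offset into the page index, as sketched here. The helper names
      and the 4096-byte SKETCH_PAGE_SIZE are assumptions; pages are modelled
      as plain indices within one physically contiguous higher-order
      allocation.

```c
#include <stddef.h>

/* Hypothetical page size for the sketch; real kernels use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Page index that actually holds the byte at `offset`, counted from the
 * first page of a contiguous multi-page allocation. */
size_t normalized_page(size_t first_page, size_t offset)
{
    return first_page + offset / SKETCH_PAGE_SIZE;
}

/* Offset of that byte within its own page; always below the page size. */
size_t normalized_offset(size_t offset)
{
    return offset % SKETCH_PAGE_SIZE;
}
```

      For example, an offset of 9000 into an order-2 allocation lands on
      the third page (index 2) at offset 808.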
      
      [1] The failure produces the following messages:
      
      huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100,
      exited with 00000101?
      
      [2] mlx4 driver uses order-2 pages to allocate RX frags
      
      Reported-by: Matt Schnall <mischnal@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Bernhard Beck <bbeck@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. May 12, 2013
  14. May 08, 2013
  15. May 06, 2013
    • fib_trie: no need to delay vfree() · 00203563
      Al Viro authored
      
      Now that vfree() can be called from interrupt contexts, there's no
      need to play games with schedule_work() to escape calling vfree()
      from RCU callbacks.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: frag, fix race conditions in LRU list maintenance · b56141ab
      Konstantin Khlebnikov authored
      
      This patch fixes a race between inet_frag_lru_move() and inet_frag_lru_add()
      which was introduced in commit 3ef0eb0d
      ("net: frag, move LRU list maintenance outside of rwlock")
      
      One CPU has already added a new fragment queue to the hash but not yet
      to the LRU. Another CPU finds it in the hash and tries to move it to the
      end of the LRU. This leads to a NULL pointer dereference inside
      list_move_tail().
      
      Another possible race condition is between inet_frag_lru_move() and
      inet_frag_lru_del(): the move can happen after the deletion.
      
      This patch initializes the LRU list head before adding the fragment to
      the hash, and inet_frag_lru_move() doesn't touch it if it's empty.
      
      I saw this kernel oops two times in a couple of days.
      
      [119482.128853] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
      [119482.140221] Oops: 0000 [#1] SMP
      [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
      [119482.152692] CPU 3
      [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
      [119482.161478] RIP: 0010:[<ffffffff812ede89>]  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.166004] RSP: 0018:ffff880216d5db58  EFLAGS: 00010207
      [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
      [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
      [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
      [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
      [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
      [119482.194140] FS:  00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      [119482.198928] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
      [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
      [119482.223113] Stack:
      [119482.228004]  ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
      [119482.233038]  ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
      [119482.238083]  00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
      [119482.243090] Call Trace:
      [119482.248009]  [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10
      [119482.252921]  [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0
      [119482.257803]  [<ffffffff8154485b>] nf_iterate+0x8b/0xa0
      [119482.262658]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.267527]  [<ffffffff815448e4>] nf_hook_slow+0x74/0x130
      [119482.272412]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.277302]  [<ffffffff8155d068>] ip_rcv+0x268/0x320
      [119482.282147]  [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0
      [119482.286998]  [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60
      [119482.291826]  [<ffffffff8151a650>] process_backlog+0xa0/0x160
      [119482.296648]  [<ffffffff81519f29>] net_rx_action+0x139/0x220
      [119482.301403]  [<ffffffff81053707>] __do_softirq+0xe7/0x220
      [119482.306103]  [<ffffffff81053868>] run_ksoftirqd+0x28/0x40
      [119482.310809]  [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0
      [119482.315515]  [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40
      [119482.320219]  [<ffffffff8106d870>] kthread+0xc0/0xd0
      [119482.324858]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.329460]  [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0
      [119482.334057]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
      [119482.343787] RIP  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.348675]  RSP <ffff880216d5db58>
      [119482.353493] CR2: 0000000000000000
      
      Oops happened on this path:
      ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()
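      
      The fix described in this message (initialize the list head first,
      then skip the move when the entry is not on the LRU) can be sketched
      in userspace as follows. The list type and helpers are simplified
      stand-ins written for this sketch, not the kernel's list.h, and the
      LRU lock that the real code holds is left out.

```c
#include <stdbool.h>

struct list_node {
    struct list_node *next, *prev;
};

/* A self-linked node is "empty": it is not on any list. */
void list_init(struct list_node *n) { n->next = n; n->prev = n; }

bool list_is_empty(const struct list_node *n) { return n->next == n; }

void list_append(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

void list_unlink(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n); /* leave the node self-linked, so it reads as empty */
}

/* Move `entry` to the tail of `lru`, but only if it is on a list.
 * An initialized, not-yet-added entry is left untouched. */
void lru_move(struct list_node *entry, struct list_node *lru)
{
    if (!list_is_empty(entry)) {
        list_unlink(entry);
        list_append(entry, lru);
    }
}

/* Self-check: returns 1 when the guarded move behaves as described. */
int lru_demo(void)
{
    struct list_node lru, a, b;

    list_init(&lru);
    list_init(&a);
    list_init(&b);

    lru_move(&a, &lru);              /* a not on the LRU yet: no-op */
    if (!list_is_empty(&lru))
        return 0;

    list_append(&a, &lru);
    list_append(&b, &lru);
    lru_move(&a, &lru);              /* a moves behind b */
    return lru.next == &b && lru.prev == &a;
}
```

      Because an unlinked node is re-initialized in this sketch, a move
      that arrives after a deletion is also a no-op, which matches the
      second race mentioned above.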
      
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. May 05, 2013
  17. May 03, 2013
  18. May 01, 2013
  19. Apr 29, 2013
  20. Apr 25, 2013
  21. Apr 19, 2013