Skip to content
Snippets Groups Projects
  1. Apr 12, 2014
    • Nicolas Dichtel's avatar
      vti: don't allow to add the same tunnel twice · 8d89dcdf
      Nicolas Dichtel authored
      
      Before the patch, it was possible to add two times the same tunnel:
      ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit b9959fd3 ("vti: switch to new ip tunnel code").
      
      CC: Cong Wang <amwang@redhat.com>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d89dcdf
    • Nicolas Dichtel's avatar
      gre: don't allow to add the same tunnel twice · 5a455275
      Nicolas Dichtel authored
      
      Before the patch, it was possible to add two times the same tunnel:
      ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
      ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit c5441932 ("GRE: Refactor GRE tunneling code.").
      
      CC: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a455275
  2. Apr 11, 2014
    • David S. Miller's avatar
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller authored
      
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      676d2369
  3. Apr 08, 2014
    • Christoph Lameter's avatar
      net: replace __this_cpu_inc in route.c with raw_cpu_inc · 3ed66e91
      Christoph Lameter authored
      
      The RT_CACHE_STAT_INC macro triggers the new preemption checks
      for __this_cpu ops.
      
      I do not see any other synchronization that would allow the use of a
      __this_cpu operation here however in commit dbd2915c ("[IPV4]:
      RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
      raw_smp_processor_id() here because "we do not care" about races.  In
      the past we agreed that the price of disabling interrupts here to get
      consistent counters would be too high.  These counters may be inaccurate
      due to race conditions.
      
      The use of __this_cpu op improves the situation already from what commit
      dbd2915c did since the single instruction emitted on x86 does not
      allow the race to occur anymore.  However, non x86 platforms could still
      experience a race here.
      
      Trace:
      
        __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
        caller is __this_cpu_preempt_check+0x38/0x60
        CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF            3.12.0-rc4+ #187
        Call Trace:
          check_preemption_disabled+0xec/0x110
          __this_cpu_preempt_check+0x38/0x60
          __ip_route_output_key+0x575/0x8c0
          ip_route_output_flow+0x27/0x70
          udp_sendmsg+0x825/0xa20
          inet_sendmsg+0x85/0xc0
          sock_sendmsg+0x9c/0xd0
          ___sys_sendmsg+0x37c/0x390
          __sys_sendmsg+0x49/0x90
          SyS_sendmsg+0x12/0x20
          tracesys+0xe1/0xe6
      
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ed66e91
  4. Apr 05, 2014
    • Thomas Graf's avatar
      netfilter: Can't fail and free after table replacement · c58dd2dd
      Thomas Graf authored
      
      All xtables variants suffer from the defect that the copy_to_user()
      to copy the counters to user memory may fail after the table has
      already been exchanged and thus exposed. Return an error at this
      point will result in freeing the already exposed table. Any
      subsequent packet processing will result in a kernel panic.
      
      We can't copy the counters before exposing the new tables as we
      want provide the counter state after the old table has been
      unhooked. Therefore convert this into a silent error.
      
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      c58dd2dd
  5. Mar 28, 2014
  6. Mar 27, 2014
  7. Mar 26, 2014
  8. Mar 24, 2014
  9. Mar 20, 2014
  10. Mar 19, 2014
    • Tejun Heo's avatar
      cgroup: drop const from @buffer of cftype->write_string() · 4d3bb511
      Tejun Heo authored
      
      cftype->write_string() just passes on the writeable buffer from kernfs
      and there's no reason to add const restriction on the buffer.  The
      only thing const achieves is unnecessarily complicating parsing of the
      buffer.  Drop const from @buffer.
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>                                           
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      4d3bb511
  11. Mar 15, 2014
  12. Mar 14, 2014
  13. Mar 11, 2014
    • Eric Dumazet's avatar
      tcp: tcp_release_cb() should release socket ownership · c3f9b018
      Eric Dumazet authored
      
      Lars Persson reported following deadlock :
      
      -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
      -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
      -002 |ip_local_deliver_finish(skb = 0x8BD527A0)
      -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
      -004 |netif_receive_skb(skb = 0x8BD527A0)
      -005 |elk_poll(napi = 0x8C770500, budget = 64)
      -006 |net_rx_action(?)
      -007 |__do_softirq()
      -008 |do_softirq()
      -009 |local_bh_enable()
      -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
      -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
      -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
      -013 |tcp_release_cb(sk = 0x8BE6B2A0)
      -014 |release_sock(sk = 0x8BE6B2A0)
      -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
      -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
      -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
      -018 |smb_send_kvec()
      -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
      -020 |cifs_call_async()
      -021 |cifs_async_writev(wdata = 0x87FD6580)
      -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
      -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
      -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
      -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
      -028 |bdi_writeback_workfn(work = 0x87E4A9CC)
      -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
      -030 |worker_thread(__worker = 0x8B045880)
      -031 |kthread(_create = 0x87CADD90)
      -032 |ret_from_kernel_thread(asm)
      
      Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
      it is running from softirq context.
      
      Lars trace involved a NIC without RX checksum support but other points
      are problematic as well, like the prequeue stuff.
      
      Problem is triggered by a timer, that found socket being owned by user.
      
      tcp_release_cb() should call tcp_write_timer_handler() or
      tcp_delack_timer_handler() in the appropriate context :
      
      BH disabled and socket lock held, but 'owned' field cleared,
      as if they were running from timer handlers.
      
      Fixes: 6f458dfb ("tcp: improve latencies of timer triggered events")
      Reported-by: default avatarLars Persson <lars.persson@axis.com>
      Tested-by: default avatarLars Persson <lars.persson@axis.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3f9b018
  14. Mar 10, 2014
  15. Mar 07, 2014
  16. Mar 06, 2014
    • Florian Westphal's avatar
      inet: frag: make sure forced eviction removes all frags · e588e2f2
      Florian Westphal authored
      
      Quoting Alexander Aring:
        While fragmentation and unloading of 6lowpan module I got this kernel Oops
        after few seconds:
      
        BUG: unable to handle kernel paging request at f88bbc30
        [..]
        Modules linked in: ipv6 [last unloaded: 6lowpan]
        Call Trace:
         [<c012af4c>] ? call_timer_fn+0x54/0xb3
         [<c012aef8>] ? process_timeout+0xa/0xa
         [<c012b66b>] run_timer_softirq+0x140/0x15f
      
      Problem is that incomplete frags are still around after unload; when
      their frag expire timer fires, we get crash.
      
      When a netns is removed (also done when unloading module), inet_frag
      calls the evictor with 'force' argument to purge remaining frags.
      
      The evictor loop terminates when accounted memory ('work') drops to 0
      or the lru-list becomes empty.  However, the mem accounting is done
      via percpu counters and may not be accurate, i.e. loop may terminate
      prematurely.
      
      Alter evictor to only stop once the lru list is empty when force is
      requested.
      
      Reported-by: default avatarPhoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
      Reported-by: default avatarAlexander Aring <alex.aring@gmail.com>
      Tested-by: default avatarAlexander Aring <alex.aring@gmail.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e588e2f2
    • David S. Miller's avatar
      tcp: Use NET_ADD_STATS instead of NET_ADD_STATS_BH in tcp_event_new_data_sent() · f7324acd
      David S. Miller authored
      
      Can be invoked from non-BH context.
      
      Based upon a patch by Eric Dumazet.
      
      Fixes: f19c29e3 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts")
      Reported-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f7324acd
    • Nikolay Aleksandrov's avatar
      net: fix for a race condition in the inet frag code · 24b9bf43
      Nikolay Aleksandrov authored
      
      I stumbled upon this very serious bug while hunting for another one,
      it's a very subtle race condition between inet_frag_evictor,
      inet_frag_intern and the IPv4/6 frag_queue and expire functions
      (basically the users of inet_frag_kill/inet_frag_put).
      
      What happens is that after a fragment has been added to the hash chain
      but before it's been added to the lru_list (inet_frag_lru_add) in
      inet_frag_intern, it may get deleted (either by an expired timer if
      the system load is high or the timer sufficiently low, or by the
      fraq_queue function for different reasons) before it's added to the
      lru_list, then after it gets added it's a matter of time for the
      evictor to get to a piece of memory which has been freed leading to a
      number of different bugs depending on what's left there.
      
      I've been able to trigger this on both IPv4 and IPv6 (which is normal
      as the frag code is the same), but it's been much more difficult to
      trigger on IPv4 due to the protocol differences about how fragments
      are treated.
      
      The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
      in a RR bond, so the same flow can be seen on multiple cards at the
      same time. Then I used multiple instances of ping/ping6 to generate
      fragmented packets and flood the machines with them while running
      other processes to load the attacked machine.
      
      *It is very important to have the _same flow_ coming in on multiple CPUs
      concurrently. Usually the attacked machine would die in less than 30
      minutes, if configured properly to have many evictor calls and timeouts
      it could happen in 10 minutes or so.
      
      An important point to make is that any caller (frag_queue or timer) of
      inet_frag_kill will remove both the timer refcount and the
      original/guarding refcount thus removing everything that's keeping the
      frag from being freed at the next inet_frag_put.  All of this could
      happen before the frag was ever added to the LRU list, then it gets
      added and the evictor uses a freed fragment.
      
      An example for IPv6 would be if a fragment is being added and is at
      the stage of being inserted in the hash after the hash lock is
      released, but before inet_frag_lru_add executes (or is able to obtain
      the lru lock) another overlapping fragment for the same flow arrives
      at a different CPU which finds it in the hash, but since it's
      overlapping it drops it invoking inet_frag_kill and thus removing all
      guarding refcounts, and afterwards freeing it by invoking
      inet_frag_put which removes the last refcount added previously by
      inet_frag_find, then inet_frag_lru_add gets executed by
      inet_frag_intern and we have a freed fragment in the lru_list.
      
      The fix is simple, just move the lru_add under the hash chain locked
      region so when a removing function is called it'll have to wait for
      the fragment to be added to the lru_list, and then it'll remove it (it
      works because the hash chain removal is done before the lru_list one
      and there's no window between the two list adds when the frag can get
      dropped). With this fix applied I couldn't kill the same machine in 24
      hours with the same setup.
      
      Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of
      rwlock")
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Jesper Dangaard Brouer <brouer@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@redhat.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      24b9bf43
  17. Mar 03, 2014
  18. Feb 26, 2014
    • Eric Dumazet's avatar
      tcp: switch rtt estimations to usec resolution · 740b0f18
      Eric Dumazet authored
      
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also require this change for finer control
      and removal of bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility :
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
      
      In this example ss command displays a srtt of 32 usecs (10Gbit link)
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer
      Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959
      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
      cwnd:10 send
      3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
      
      Updated iproute2 ip command displays :
      
      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
      10.246.11.51
      
      Old binary displays :
      
      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
      10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      740b0f18
    • Hannes Frederic Sowa's avatar
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa authored
      
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b346576
    • Hannes Frederic Sowa's avatar
      ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment · 69647ce4
      Hannes Frederic Sowa authored
      
      ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
      is attached to the skb (in case of forwarding) or determines the mtu like
      we do in ip_finish_output, which actually checks if we should branch to
      ip_fragment. Thus use the same function to determine the mtu here, too.
      
      This is important for the introduction of IP_PMTUDISC_OMIT, where we
      want the packets getting cut in pieces of the size of the outgoing
      interface mtu. IPv6 already does this correctly.
      
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      69647ce4
    • Florian Westphal's avatar
      net: tcp: add mib counters to track zero window transitions · 8e165e20
      Florian Westphal authored
      
      Three counters are added:
      - one to track when we went from non-zero to zero window
      - one to track the reverse
      - one counter incremented when we want to announce zero window,
        but can't because we would shrink current window.
      
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8e165e20
    • Eric Dumazet's avatar
      net: tcp: use NET_INC_STATS() · 9a9bfd03
      Eric Dumazet authored
      
      While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
      in tcp_transmit_skb() from softirq (incoming message or timer
      activation), it is better to use NET_INC_STATS() instead of
      NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
      context.
      
      This will avoid copy/paste confusion when/if we want to add
      other SNMP counters in tcp_transmit_skb()
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9a9bfd03
    • Hannes Frederic Sowa's avatar
      ipv4: ipv6: better estimate tunnel header cut for correct ufo handling · 91a48a2e
      Hannes Frederic Sowa authored
      
      Currently the UFO fragmentation process does not correctly handle inner
      UDP frames.
      
      (The following tcpdumps are captured on the parent interface with ufo
      disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
      both sit device):
      
      IPv6:
      16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
      16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
      
      We can see that fragmentation header offset is not correctly updated.
      (fragmentation id handling is corrected by 916e4cf4 ("ipv6: reuse
      ip6_frag_id from ip6_ufo_append_data")).
      
      IPv4:
      16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
          192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
      16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
          192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
      
      In this case fragmentation id is incremented and offset is not updated.
      
      First, I aligned inet_gso_segment and ipv6_gso_segment:
      * align naming of flags
      * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
        always ensure that the state of this flag is left untouched when
        returning from upper gso segmenation function
      * ipv6_gso_segment: move skb_reset_inner_headers below updating the
        fragmentation header data, we don't care for updating fragmentation
        header data
      * remove currently unneeded comment indicating skb->encapsulation might
        get changed by upper gso_segment callback (gre and udp-tunnel reset
        encapsulation after segmentation on each fragment)
      
      If we encounter an IPIP or SIT gso skb we now check for the protocol ==
      IPPROTO_UDP and that we at least have already traversed another ip(6)
      protocol header.
      
      The reason why we have to special case GSO_IPIP and GSO_SIT is that
      we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
      protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
      
      Reported-by: default avatarWolfgang Walter <linux@stwm.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91a48a2e
  19. Feb 25, 2014
Loading