Skip to content
Snippets Groups Projects
  1. Mar 14, 2014
  2. Mar 03, 2014
  3. Feb 26, 2014
    • Eric Dumazet's avatar
      tcp: switch rtt estimations to usec resolution · 740b0f18
      Eric Dumazet authored
      
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also require this change for finer control
      and removal of bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility :
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
      
      In this example ss command displays a srtt of 32 usecs (10Gbit link)
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer
      Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959
      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
      cwnd:10 send
      3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
      
      Updated iproute2 ip command displays :
      
      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
      10.246.11.51
      
      Old binary displays :
      
      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
      10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      740b0f18
    • Hannes Frederic Sowa's avatar
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa authored
      
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b346576
    • Hannes Frederic Sowa's avatar
      ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment · 69647ce4
      Hannes Frederic Sowa authored
      
      ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
      is attached to the skb (in case of forwarding) or determines the mtu like
      we do in ip_finish_output, which actually checks if we should branch to
      ip_fragment. Thus use the same function to determine the mtu here, too.
      
      This is important for the introduction of IP_PMTUDISC_OMIT, where we
      want the packets getting cut in pieces of the size of the outgoing
      interface mtu. IPv6 already does this correctly.
      
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      69647ce4
    • Florian Westphal's avatar
      net: tcp: add mib counters to track zero window transitions · 8e165e20
      Florian Westphal authored
      
      Three counters are added:
      - one to track when we went from non-zero to zero window
      - one to track the reverse
      - one counter incremented when we want to announce zero window,
        but can't because we would shrink current window.
      
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8e165e20
  4. Feb 25, 2014
  5. Feb 19, 2014
    • Florian Westphal's avatar
      tcp: use zero-window when free_space is low · 86c1a045
      Florian Westphal authored
      
      Currently the kernel tries to announce a zero window when free_space
      is below the current receiver mss estimate.
      
      When a sender is transmitting small packets and reader consumes data
      slowly (or not at all), receiver might be unable to shrink the receive
      win because
      
      a) we cannot withdraw already-commited receive window, and,
      b) we have to round the current rwin up to a multiple of the wscale
         factor, else we would shrink the current window.
      
      This causes the receive buffer to fill up until the rmem limit is hit.
      When this happens, we start dropping packets.
      
      Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
      even if socket is not being read from.
      
      As we cannot avoid the "current_win is rounded up to multiple of mss"
      issue [we would violate a) above] at least try to prevent the receive buf
      growth towards tcp_rmem[2] limit by attempting to move to zero-window
      announcement when free_space becomes less than 1/16 of the current
      allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
      increase our chances to get a zero-window announcement out in time.
      
      Reproducer:
      On server:
      $ nc -l -p 12345
      <suspend it: CTRL-Z>
      
      Client:
      #!/usr/bin/env python
      import socket
      import time
      
      sock = socket.socket()
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      sock.connect(("192.168.4.1", 12345));
      while True:
         sock.send('A' * 23)
         time.sleep(0.005)
      
      socket buffer on server-side will grow until tcp_rmem[2] is hit,
      at which point the client rexmits data until -EDTIMEOUT:
      
      tcp_data_queue invokes tcp_try_rmem_schedule which will call
      tcp_prune_queue which calls tcp_clamp_window().  And that function will
      grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
      
      Thanks to Eric Dumazet for running regression tests.
      
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Tested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      86c1a045
    • Hannes Frederic Sowa's avatar
      ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg · c8e6ad08
      Hannes Frederic Sowa authored
      
      In case we decide in udp6_sendmsg to send the packet down the ipv4
      udp_sendmsg path because the destination is either of family AF_INET or
      the destination is an ipv4 mapped ipv6 address, we don't honor the
      maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.
      
      We simply can check for this option in ip_cmsg_send because no calls to
      ipv6 module functions are needed to do so.
      
      Reported-by: default avatarGert Doering <gert@space.net>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8e6ad08
  6. Feb 17, 2014
  7. Feb 14, 2014
  8. Feb 13, 2014
    • Florian Westphal's avatar
      net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal authored
      
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
      
      Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
      sofware segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that when we would shrink gso_size
      skb_segment would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses sofware segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  Its not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also its not like this could not be improved later
      once the dust settles.
      
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Reported-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe6cc55f
  9. Feb 12, 2014
  10. Feb 11, 2014
    • John Ogness's avatar
      tcp: tsq: fix nonagle handling · bf06200e
      John Ogness authored
      
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If TCP session is throttled because of tsq, we should consult
      tp->nonagle when TX completion is done and allow us to send additional
      segment, especially if this segment is not a full MSS.
      Otherwise this segment is sent after an RTO.
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values or NIC drivers delaying TX completion by hundred of usec,
      and very low rtt.
      
      Thomas Glanzmann for example reported an iscsi regression, caused
      by tcp auto corking making this bug quite visible.
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: default avatarJohn Ogness <john.ogness@linutronix.de>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarThomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf06200e
  11. Feb 10, 2014
  12. Feb 07, 2014
    • Eric Dumazet's avatar
      tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Eric Dumazet authored
      
      TCP pacing depends on an accurate srtt estimation.
      
      Current srtt estimation is using jiffie resolution,
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in DC world,
      where typical rtt is below 1 ms.
      
      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.
      
      All we need is to have tp->srtt minimal value of 1 to differentiate
      the case of srtt being initialized or not, not 8.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port
      Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984
      10.244.129.2:12865
      	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
      463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247
      10.244.129.2:50204
      	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
      send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      
      Reported-by: default avatarVytautas Valancius <valas@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a5ab4e2
    • Geert Uytterhoeven's avatar
      ipv4: Fix runtime WARNING in rtmsg_ifa() · 63b5f152
      Geert Uytterhoeven authored
      
      On m68k/ARAnyM:
      
      WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
      Modules linked in:
      CPU: 0 PID: 407 Comm: ifconfig Not tainted
      3.13.0-atari-09263-g0c71d68014d1 #1378
      Stack from 10c4fdf0:
              10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
              00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
              c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
              00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
              008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
              eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
      Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
       [<00024416>] warn_slowpath_null+0x14/0x1a
       [<002501b4>] rtmsg_ifa+0xdc/0xf0
       [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
       [<0024f9b4>] inet_abc_len+0x0/0x42
       [<00250b52>] inet_insert_ifa+0xc/0x12
       [<00252390>] devinet_ioctl+0x2ae/0x5d6
      
      Adding some debugging code reveals that net_fill_ifaddr() fails in
      
          put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                                    preferred, valid))
      
      nla_put complains:
      
          lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20
      
      Apparently commit 5c766d64 ("ipv4:
      introduce address lifetime") forgot to take into account the addition of
      struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already
      done for ipv6.
      
      Suggested-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63b5f152
  13. Feb 06, 2014
  14. Feb 05, 2014
    • Alexey Dobriyan's avatar
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315
      Alexey Dobriyan authored
      
      Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
      incorrect handling of EBUSY for RTCP expectation").
      
      BUG: unable to handle kernel paging request at 00100104
      IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
      ...
      Call Trace:
        [<c0244bd8>] ? del_timer+0x48/0x70
        [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
        [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
        [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c024442d>] call_timer_fn+0x1d/0x80
        [<c024461e>] run_timer_softirq+0x18e/0x1a0
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c023e6f3>] __do_softirq+0xa3/0x170
        [<c023e650>] ? __local_bh_enable+0x70/0x70
        <IRQ>
        [<c023e587>] ? irq_exit+0x67/0xa0
        [<c0202af6>] ? do_IRQ+0x46/0xb0
        [<c027ad05>] ? clockevents_notify+0x35/0x110
        [<c066ac6c>] ? common_interrupt+0x2c/0x40
        [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
        [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
        [<c02085f8>] ? arch_cpu_idle+0x8/0x30
        [<c027314b>] ? cpu_idle_loop+0x4b/0x140
        [<c0273258>] ? cpu_startup_entry+0x18/0x20
        [<c066056d>] ? rest_init+0x5d/0x70
        [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
        [<c081364f>] ? repair_env_string+0x5b/0x5b
        [<c0813269>] ? i386_start_kernel+0x33/0x35
      
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      829d9315
    • Shlomo Pongratz's avatar
      net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c · a664a4f7
      Shlomo Pongratz authored
      
      RCU writer side should use rcu_dereference_protected() and not
      rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
      warning seen when running with CONFIG_PROVE_RCU.
      
      Also, don't use rcu_assign_pointer/rcu_dereference for pointers
      which are invisible beyond the udp offload code.
      
      Fixes: b582ef09 ('net: Add GRO support for UDP encapsulating protocols')
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarShlomo Pongratz <shlomop@mellanox.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a664a4f7
Loading