Skip to content
Snippets Groups Projects
  1. Feb 25, 2014
  2. Feb 19, 2014
    • Florian Westphal's avatar
      tcp: use zero-window when free_space is low · 86c1a045
      Florian Westphal authored
      
      Currently the kernel tries to announce a zero window when free_space
      is below the current receiver mss estimate.
      
      When a sender is transmitting small packets and reader consumes data
      slowly (or not at all), receiver might be unable to shrink the receive
      win because
      
      a) we cannot withdraw already-commited receive window, and,
      b) we have to round the current rwin up to a multiple of the wscale
         factor, else we would shrink the current window.
      
      This causes the receive buffer to fill up until the rmem limit is hit.
      When this happens, we start dropping packets.
      
      Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
      even if socket is not being read from.
      
      As we cannot avoid the "current_win is rounded up to multiple of mss"
      issue [we would violate a) above] at least try to prevent the receive buf
      growth towards tcp_rmem[2] limit by attempting to move to zero-window
      announcement when free_space becomes less than 1/16 of the current
      allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
      increase our chances to get a zero-window announcement out in time.
      
      Reproducer:
      On server:
      $ nc -l -p 12345
      <suspend it: CTRL-Z>
      
      Client:
      #!/usr/bin/env python
      import socket
      import time
      
      sock = socket.socket()
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      sock.connect(("192.168.4.1", 12345));
      while True:
         sock.send('A' * 23)
         time.sleep(0.005)
      
      socket buffer on server-side will grow until tcp_rmem[2] is hit,
      at which point the client rexmits data until -EDTIMEOUT:
      
      tcp_data_queue invokes tcp_try_rmem_schedule which will call
      tcp_prune_queue which calls tcp_clamp_window().  And that function will
      grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
      
      Thanks to Eric Dumazet for running regression tests.
      
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Tested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      86c1a045
    • Hannes Frederic Sowa's avatar
      ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg · c8e6ad08
      Hannes Frederic Sowa authored
      
      In case we decide in udp6_sendmsg to send the packet down the ipv4
      udp_sendmsg path because the destination is either of family AF_INET or
      the destination is an ipv4 mapped ipv6 address, we don't honor the
      maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.
      
      We simply can check for this option in ip_cmsg_send because no calls to
      ipv6 module functions are needed to do so.
      
      Reported-by: default avatarGert Doering <gert@space.net>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8e6ad08
  3. Feb 17, 2014
  4. Feb 14, 2014
  5. Feb 13, 2014
    • Florian Westphal's avatar
      net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal authored
      
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
      
      Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
      sofware segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that when we would shrink gso_size
      skb_segment would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses sofware segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  Its not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also its not like this could not be improved later
      once the dust settles.
      
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Reported-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe6cc55f
  6. Feb 12, 2014
  7. Feb 11, 2014
    • John Ogness's avatar
      tcp: tsq: fix nonagle handling · bf06200e
      John Ogness authored
      
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If TCP session is throttled because of tsq, we should consult
      tp->nonagle when TX completion is done and allow us to send additional
      segment, especially if this segment is not a full MSS.
      Otherwise this segment is sent after an RTO.
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values or NIC drivers delaying TX completion by hundred of usec,
      and very low rtt.
      
      Thomas Glanzmann for example reported an iscsi regression, caused
      by tcp auto corking making this bug quite visible.
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: default avatarJohn Ogness <john.ogness@linutronix.de>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarThomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf06200e
  8. Feb 10, 2014
  9. Feb 07, 2014
    • Eric Dumazet's avatar
      tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Eric Dumazet authored
      
      TCP pacing depends on an accurate srtt estimation.
      
      Current srtt estimation is using jiffie resolution,
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in DC world,
      where typical rtt is below 1 ms.
      
      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.
      
      All we need is to have tp->srtt minimal value of 1 to differentiate
      the case of srtt being initialized or not, not 8.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port
      Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984
      10.244.129.2:12865
      	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
      463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247
      10.244.129.2:50204
      	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
      send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      
      Reported-by: default avatarVytautas Valancius <valas@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a5ab4e2
    • Geert Uytterhoeven's avatar
      ipv4: Fix runtime WARNING in rtmsg_ifa() · 63b5f152
      Geert Uytterhoeven authored
      
      On m68k/ARAnyM:
      
      WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
      Modules linked in:
      CPU: 0 PID: 407 Comm: ifconfig Not tainted
      3.13.0-atari-09263-g0c71d68014d1 #1378
      Stack from 10c4fdf0:
              10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
              00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
              c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
              00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
              008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
              eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
      Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
       [<00024416>] warn_slowpath_null+0x14/0x1a
       [<002501b4>] rtmsg_ifa+0xdc/0xf0
       [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
       [<0024f9b4>] inet_abc_len+0x0/0x42
       [<00250b52>] inet_insert_ifa+0xc/0x12
       [<00252390>] devinet_ioctl+0x2ae/0x5d6
      
      Adding some debugging code reveals that net_fill_ifaddr() fails in
      
          put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                                    preferred, valid))
      
      nla_put complains:
      
          lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20
      
      Apparently commit 5c766d64 ("ipv4:
      introduce address lifetime") forgot to take into account the addition of
      struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already
      done for ipv6.
      
      Suggested-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63b5f152
  10. Feb 06, 2014
  11. Feb 05, 2014
    • Alexey Dobriyan's avatar
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315
      Alexey Dobriyan authored
      
      Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
      incorrect handling of EBUSY for RTCP expectation").
      
      BUG: unable to handle kernel paging request at 00100104
      IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
      ...
      Call Trace:
        [<c0244bd8>] ? del_timer+0x48/0x70
        [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
        [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
        [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c024442d>] call_timer_fn+0x1d/0x80
        [<c024461e>] run_timer_softirq+0x18e/0x1a0
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c023e6f3>] __do_softirq+0xa3/0x170
        [<c023e650>] ? __local_bh_enable+0x70/0x70
        <IRQ>
        [<c023e587>] ? irq_exit+0x67/0xa0
        [<c0202af6>] ? do_IRQ+0x46/0xb0
        [<c027ad05>] ? clockevents_notify+0x35/0x110
        [<c066ac6c>] ? common_interrupt+0x2c/0x40
        [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
        [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
        [<c02085f8>] ? arch_cpu_idle+0x8/0x30
        [<c027314b>] ? cpu_idle_loop+0x4b/0x140
        [<c0273258>] ? cpu_startup_entry+0x18/0x20
        [<c066056d>] ? rest_init+0x5d/0x70
        [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
        [<c081364f>] ? repair_env_string+0x5b/0x5b
        [<c0813269>] ? i386_start_kernel+0x33/0x35
      
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      829d9315
    • Shlomo Pongratz's avatar
      net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c · a664a4f7
      Shlomo Pongratz authored
      
      RCU writer side should use rcu_dereference_protected() and not
      rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
      warning seen when running with CONFIG_PROVE_RCU.
      
      Also, don't use rcu_assign_pointer/rcu_dereference for pointers
      which are invisible beyond the udp offload code.
      
      Fixes: b582ef09 ('net: Add GRO support for UDP encapsulating protocols')
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarShlomo Pongratz <shlomop@mellanox.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a664a4f7
  12. Feb 03, 2014
  13. Jan 31, 2014
  14. Jan 28, 2014
    • Duan Jiong's avatar
      net: gre: use icmp_hdr() to get inner ip header · c0c0c50f
      Duan Jiong authored
      
      When dealing with icmp messages, the skb->data points the
      ip header that triggered the sending of the icmp message.
      
      In gre_cisco_err(), the parse_gre_header() is called, and the
      iptunnel_pull_header() is called to pull the skb at the end of
      the parse_gre_header(), so the skb->data doesn't point the
      inner ip header.
      
      Unfortunately, the ipgre_err still needs those ip addresses in
      inner ip header to look up tunnel by ip_tunnel_lookup().
      
      So just use icmp_hdr() to get inner ip header instead of skb->data.
      
      Signed-off-by: default avatarDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c0c0c50f
    • Holger Eitzenberger's avatar
      net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Holger Eitzenberger authored
      
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      
      But there are many more, resulting in the machine going OOM after some
      days.
      
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
      
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
      
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
      
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
      
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      has already a socket assigned in the TPROXY target.  This then results
      in the leak I see.
      
      The very same issue seems to be with IPv6, but haven't tested.
      
      Reviewed-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarHolger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a452ce34
  15. Jan 27, 2014
  16. Jan 25, 2014
  17. Jan 24, 2014
  18. Jan 23, 2014
Loading