  1. Oct 24, 2009
  2. Oct 21, 2009
  3. Oct 20, 2009
  4. Oct 19, 2009
  5. Oct 15, 2009
  6. Oct 13, 2009
    • tcp: replace ehash_size by ehash_mask · f373b53b
      Eric Dumazet authored
      
      Storing the mask (size - 1) instead of the size allows fast path to be
      a bit faster.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f373b53b
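The effect of storing the mask can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: `toy_table`, `toy_table_init`, `bucket_by_size`, and `bucket_by_mask` are hypothetical names, and the sketch assumes the table size is a power of two, which is what makes `hash & mask` pick the same bucket as `hash % size`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a hash table header; not the kernel's struct. */
struct toy_table {
    uint32_t size; /* number of buckets, assumed to be a power of two */
    uint32_t mask; /* size - 1, precomputed once at setup time */
};

/* Set up the table, rejecting sizes that are not a power of two. */
static void toy_table_init(struct toy_table *t, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    t->size = size;
    t->mask = size - 1;
}

/* Bucket selection when only the size is stored: subtract, then AND. */
static uint32_t bucket_by_size(const struct toy_table *t, uint32_t hash)
{
    return hash & (t->size - 1);
}

/* Bucket selection when the mask is stored: a single AND. */
static uint32_t bucket_by_mask(const struct toy_table *t, uint32_t hash)
{
    return hash & t->mask;
}
```

Both helpers return the same bucket; the second simply avoids recomputing `size - 1` on every lookup.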
    • udp: Fix udp_poll() and ioctl() · 85584672
      Eric Dumazet authored
      
      udp_poll() can in some circumstances drop frames with incorrect checksums.
      
      Problem is we now have to lock the socket while dropping frames, or risk
      sk_forward corruption.
      
      This bug is present since commit 95766fff
      ([UDP]: Add memory accounting.)
      
      While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85584672
    • tcp: fix tcp_defer_accept to consider the timeout · 6d01a026
      Willy Tarreau authored
      
      I was trying to use TCP_DEFER_ACCEPT and noticed that if the
      client does not talk, the connection is never accepted and
      remains in SYN_RECV state until the retransmits expire, where
      it finally is deleted. This is bad when some firewall such as
      netfilter sits between the client and the server because the
      firewall sees the connection in ESTABLISHED state while the
      server will finally silently drop it without sending an RST.
      
      This behaviour contradicts the man page which says it should
      wait only for some time :
      
             TCP_DEFER_ACCEPT (since Linux 2.4)
                Allows a listener to be awakened only when data arrives
                on the socket.  Takes an integer value  (seconds), this
                can  bound  the  maximum  number  of attempts TCP will
                make to complete the connection. This option should not
                be used in code intended to be portable.
      
      Also, looking at ipv4/tcp.c, a retransmit counter is correctly
      computed :
      
              case TCP_DEFER_ACCEPT:
                      icsk->icsk_accept_queue.rskq_defer_accept = 0;
                      if (val > 0) {
                              /* Translate value in seconds to number of
                               * retransmits */
                              while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
                                     val > ((TCP_TIMEOUT_INIT / HZ) <<
                                             icsk->icsk_accept_queue.rskq_defer_accept))
                                      icsk->icsk_accept_queue.rskq_defer_accept++;
                              icsk->icsk_accept_queue.rskq_defer_accept++;
                      }
                      break;
      
      ==> rskq_defer_accept is used as a counter of retransmits.
      
      But in tcp_minisocks.c, this counter is only checked. And in
      fact, I have found no location which updates it. So I think
      that what was intended was to decrease it in tcp_minisocks
      whenever it is checked, which the trivial patch below does.
      
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d01a026
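The seconds-to-retransmits translation quoted in this message can be reproduced in userspace to see which counter value a given option value yields. This is a sketch modelled on the quoted loop, not the kernel function itself: `defer_accept_retransmits` is a hypothetical name, and `init_timeout_secs` stands in for `TCP_TIMEOUT_INIT / HZ`, which is assumed to be 3 seconds in the usage note.

```c
#include <assert.h>

/*
 * Translate a TCP_DEFER_ACCEPT value in seconds into a retransmit count,
 * following the shape of the loop quoted in the commit message.
 * init_timeout_secs is an assumed stand-in for TCP_TIMEOUT_INIT / HZ.
 */
static int defer_accept_retransmits(int val_secs, int init_timeout_secs)
{
    int n = 0;

    assert(init_timeout_secs > 0);
    if (val_secs > 0) {
        /* Find the first backoff step whose timeout covers val_secs. */
        while (n < 32 &&
               (long long)val_secs > ((long long)init_timeout_secs << n))
            n++;
        /* The quoted code adds one more after the loop. */
        n++;
    }
    return n;
}
```

With an assumed initial timeout of 3 seconds, a value of 1 or 3 seconds maps to 1, a value of 4 seconds maps to 2, and a value of 10 seconds maps to 3. The message's point is that this count was computed but never decremented afterwards.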
  7. Oct 12, 2009
    • net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787
      Neil Horman authored
      
      Create a new socket level option to report number of queue overflows
      
      Recently I augmented the AF_PACKET protocol to report the number of frames lost
      on the socket receive queue between any two enqueued frames.  This value was
      exported via a SOL_PACKET level cmsg.  After I completed that work it was
      requested that this feature be generalized so that any datagram oriented socket
      could make use of this option.  As such I've created this patch. It creates a
      new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
      SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
      overflowed between any two given frames.  It also augments the AF_PACKET
      protocol to take advantage of this new feature (as it previously did not touch
      sk->sk_drops, which this patch uses to record the overflow count).  Tested
      successfully by me.
      
      Notes:
      
      1) Unlike my previous patch, this patch simply records the sk_drops value, which
      is not a number of drops between packets, but rather a total number of drops.
      Deltas must be computed in user space.
      
      2) While this patch currently works with datagram oriented protocols, it will
      also be accepted by non-datagram oriented protocols. I'm not sure if that's
      agreeable to everyone, but my argument in favor of doing so is that, for those
      protocols which aren't applicable to this option, sk_drops will always be zero,
      and reporting no drops on a receive queue that isn't used for those
      non-participating protocols seems reasonable to me.  This also saves us having
      to code in a per-protocol opt in mechanism.
      
      3) This applies cleanly to net-next assuming that commit
      97775007 (my af packet cmsg patch) is reverted
      
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b885787
  8. Oct 08, 2009
    • udp: dynamically size hash tables at boot time · f86dcc5a
      Eric Dumazet authored
      
      UDP_HTABLE_SIZE was initially defined to 128, which is a bit small for
      several setups.
      
      4000 active UDP sockets -> 32 sockets per chain on average. An
      incoming frame has to look up all sockets to find the best match, so long
      chains hurt latency.
      
      Instead of a fixed size hash table that can't be perfect for every
      need, let the UDP stack choose its table size at boot time like tcp/ip
      route, using the alloc_large_system_hash() helper.
      
      Add an optional boot parameter, uhash_entries=x so that an admin can
      force a size between 256 and 65536 if needed, like thash_entries and
      rhash_entries.
      
      dmesg logs two new lines :
      [    0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
      [    0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
      
      Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
      debugging spinlocks.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f86dcc5a
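The sizing figures in this message can be checked with a little arithmetic. The sketch below uses hypothetical helpers `avg_chain_length` and `table_bytes`; the per-slot sizes of 8 and 16 bytes in the usage note are inferred from the numbers quoted above (4096 bytes for 512 entries, 1 MByte for 65536 slots), not taken from the kernel's structure definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Average number of sockets per hash chain for a given table size. */
static double avg_chain_length(unsigned long sockets, unsigned long slots)
{
    assert(slots > 0);
    return (double)sockets / (double)slots;
}

/* Memory footprint of a table with the given slot count and slot size. */
static size_t table_bytes(size_t slots, size_t bytes_per_slot)
{
    return slots * bytes_per_slot;
}
```

For example, `avg_chain_length(4000, 128)` gives 31.25, the roughly 32 sockets per chain mentioned above, while a 512-slot table brings the same load under 8 per chain. `table_bytes(512, 8)` gives the 4096 bytes seen in the dmesg lines, and `table_bytes(65536, 16)` gives 1048576 bytes, the 1 MByte maximum.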
  9. Oct 07, 2009
  10. Oct 05, 2009
  11. Oct 02, 2009
  12. Oct 01, 2009
  13. Sep 25, 2009
  14. Sep 24, 2009
  15. Sep 22, 2009
  16. Sep 16, 2009
    • tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG() · 657e9649
      Robert Varga authored
      
      I recently came across a preemption imbalance detected by:
      
      <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
      <0>------------[ cut here ]------------
      <2>kernel BUG at /usr/src/linux/kernel/timer.c:664!
      <0>invalid opcode: 0000 [1] PREEMPT SMP
      
      with ffffffff80644630 being inet_twdr_hangman().
      
      This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
      bit, so I looked at what might have caused it.
      
      One thing that struck me as strange is tcp_twsk_destructor(), as it
      calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
      detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
      as far as I can tell.
      
      Signed-off-by: Robert Varga <nite@hq.alert.sk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      657e9649
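The kind of imbalance reported above can be modelled with a toy counter in userspace. Nothing here is kernel code: `toy_get_cpu`, `toy_put_cpu`, `run_timer_checked`, and the two callbacks are hypothetical names, and the model only assumes that a "get" raises a preemption count, a "put" lowers it, and the timer code compares the count before and after running a callback.

```c
#include <assert.h>

/* Toy stand-in for the per-task preemption counter. */
static int toy_preempt_count;

static void toy_get_cpu(void) { toy_preempt_count++; }
static void toy_put_cpu(void) { toy_preempt_count--; }

typedef void (*toy_timer_fn)(void);

/*
 * Run a timer callback and report how far the counter moved.
 * A result of 0 means the callback was balanced.
 */
static int run_timer_checked(toy_timer_fn fn)
{
    int before = toy_preempt_count;

    assert(fn != 0);
    fn();
    return toy_preempt_count - before;
}

/* Balanced callback: every get is paired with a put. */
static void balanced_cb(void)
{
    toy_get_cpu();
    toy_put_cpu();
}

/* Unbalanced callback: a put with no matching get, as in the report. */
static void unbalanced_cb(void)
{
    toy_put_cpu();
}
```

Starting from a count of 0x102, running `unbalanced_cb` through `run_timer_checked` leaves the count at 0x101, mirroring the "entered with 00000102, exited with 00000101" line in the log.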
  17. Sep 15, 2009
  18. Sep 09, 2009
  19. Sep 03, 2009
    • tcp: replace hard coded GFP_KERNEL with sk_allocation · aa133076
      Wu Fengguang authored
      
      This fixed a lockdep warning which appeared when doing stress
      memory tests over NFS:
      
      	inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      
      	page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
      
      	mount_root => nfs_root_data => tcp_close => lock sk_lock =>
      			tcp_send_fin => alloc_skb_fclone => page reclaim
      
      David raised a concern that if the allocation fails in tcp_send_fin(), and it's
      GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
      for the allocation to succeed.
      
      But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
      weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could
      loop endlessly under memory pressure.
      
      CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      CC: David S. Miller <davem@davemloft.net>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa133076
    • ip: Report qdisc packet drops · 6ce9e7b5
      Eric Dumazet authored
      
      Christoph Lameter pointed out that packet drops at qdisc level were not
      accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
      reported to the user (-ENOBUFS errors) and SNMP counters updated.
      
      IP_RECVERR is used to enable extended reliable error message passing,
      but these are not needed to update system wide SNMP stats.
      
      This patch changes things a bit to allow SNMP counters to be updated,
      regardless of IP_RECVERR being set or not on the socket.
      
      Example after a UDP tx flood
      # netstat -s 
      ...
      IP:
          1487048 outgoing packets dropped
      ...
      Udp:
      ...
          SndbufErrors: 1487048
      
      
      send() syscalls do, however, still return an OK status, to not
      break applications.
      
      Note : send() manual page explicitly says for -ENOBUFS error :
      
       "The output queue for a network interface was full.
        This generally indicates that the interface has stopped sending,
        but may be caused by transient congestion.
        (Normally, this does not occur in Linux. Packets are just silently
        dropped when a device queue overflows.) "
      
      This is not true for IP_RECVERR enabled sockets : a send() syscall
      that hit a qdisc drop returns an ENOBUFS error.
      
      Many thanks to Christoph, David, and last but not least, Alexey !
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ce9e7b5
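An application that wants to see qdisc drops as send() errors, as described for IP_RECVERR sockets above, might be set up roughly like this. It is a Linux-only sketch with hypothetical helper names (`enable_recverr`, `send_hit_enobufs`); how an application reacts to the error is left open.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Request extended error reporting on an IPv4 socket (Linux-specific). */
static int enable_recverr(int fd)
{
    int one = 1;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &one, sizeof(one));
}

/*
 * Classify a send() result: returns 1 when the call failed with ENOBUFS,
 * which on an IP_RECVERR socket can indicate a drop at the qdisc.
 * err is the errno value captured right after the send() call.
 */
static int send_hit_enobufs(ssize_t ret, int err)
{
    return ret < 0 && err == ENOBUFS;
}
```

Sockets that never call `enable_recverr()` keep getting an OK status from send(); with this patch their drops still show up in the system-wide counters printed by `netstat -s`.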
  20. Sep 02, 2009
  21. Sep 01, 2009
    • Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value. · 6fa12c85
      Damian Lukowski authored
      RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
      which may represent a number of allowed retransmissions or a timeout value.
      Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
      in number of allowed retransmissions.
      
      For any desired threshold R2 (by means of time) one can specify tcp_retries2
      (by means of number of retransmissions) such that TCP will not time out
      earlier than R2. This is the case, because the RTO schedule follows a fixed
      pattern, namely exponential backoff.
      
      However, the RTO behaviour is not predictable any more if RTO backoffs can be
      reverted, as is the case in the draft
      "Make TCP more Robust to Long Connectivity Disruptions"
      (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
      
      In the worst case TCP would time out a connection after 3.2 seconds, if the
      initial RTO equaled MIN_RTO and each backoff has been reverted.
      
      This patch introduces a function retransmits_timed_out(N),
      which calculates the timeout of a TCP connection, assuming an initial
      RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
      
      Whenever timeout decisions are made by comparing the retransmission counter
      to some value N, this function can be used, instead.
      
      The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
      can occur than the value indicates. However, it yields a timeout which is
      similar to the one of an unpatched, exponentially backing off TCP in the same
      scenario. As no application could rely on an RTO greater than MIN_RTO, there
      should be no risk of a regression.
      
      Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fa12c85
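The two timeouts contrasted in this message can be worked out numerically. The sketch below is not the patch's `retransmits_timed_out()`: `backed_off_timeout_ms` and `reverted_worst_case_ms` are hypothetical helpers, the cap at a maximum RTO is an assumption, and the values in the usage note (200 ms minimum RTO, 120 s maximum RTO, 15 retries) are assumed Linux defaults rather than figures from the patch text.

```c
#include <assert.h>

/*
 * Time in ms covered by the first timeout plus n retransmissions when
 * every RTO doubles from min_rto_ms. The cap at max_rto_ms is an
 * assumption of this sketch.
 */
static long long backed_off_timeout_ms(unsigned int n, long long min_rto_ms,
                                       long long max_rto_ms)
{
    long long total = 0;
    long long rto = min_rto_ms;
    unsigned int i;

    assert(min_rto_ms > 0 && max_rto_ms >= min_rto_ms);
    for (i = 0; i <= n; i++) {
        total += rto;
        rto = (rto * 2 > max_rto_ms) ? max_rto_ms : rto * 2;
    }
    return total;
}

/*
 * Worst case when every backoff is reverted: each of the n + 1 timeouts
 * fires after only min_rto_ms.
 */
static long long reverted_worst_case_ms(unsigned int n, long long min_rto_ms)
{
    return ((long long)n + 1) * min_rto_ms;
}
```

With the assumed defaults, `reverted_worst_case_ms(15, 200)` is 3200 ms, the 3.2 seconds quoted above, while `backed_off_timeout_ms(15, 200, 120000)` is 924600 ms. Comparing elapsed time against the second figure, instead of counting retransmissions, is what keeps the close threshold stable when backoffs are reverted.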