  1. Aug 29, 2009
    • tcp: Remove redundant copy of MD5 authentication key · 9a7030b7
      John Dykstra authored
      
      Remove the copy of the MD5 authentication key from tcp_check_req().
      This key has already been copied by tcp_v4_syn_recv_sock() or
      tcp_v6_syn_recv_sock().
      
      Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix premature termination of FIN_WAIT2 time-wait sockets · 80a1096b
      Octavian Purdila authored
      
      There is a race condition in the time-wait sockets code that can lead
      to premature termination of FIN_WAIT2 and, subsequently, to RST
      generation when the FIN,ACK from the peer finally arrives:
      
      Time     TCP header
      0.000000 30755 > http [SYN] Seq=0 Win=2920 Len=0 MSS=1460 TSV=282912 TSER=0
      0.000008 http > 30755 [SYN, ACK] Seq=0 Ack=1 Win=2896 Len=0 MSS=1460 TSV=...
      0.136899 HEAD /1b.html?n1Lg=v1 HTTP/1.0 [Packet size limited during capture]
      0.136934 HTTP/1.0 200 OK [Packet size limited during capture]
      0.136945 http > 30755 [FIN, ACK] Seq=187 Ack=207 Win=2690 Len=0 TSV=270521...
      0.136974 30755 > http [ACK] Seq=207 Ack=187 Win=2734 Len=0 TSV=283049 TSER=...
      0.177983 30755 > http [ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283089 TSER=...
      0.238618 30755 > http [FIN, ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283151...
      0.238625 http > 30755 [RST] Seq=188 Win=0 Len=0
      
      Say twdr->slot = 1 and we are running inet_twdr_hangman and in this
      instance inet_twdr_do_twkill_work returns 1. At that point we will
      mark slot 1 and schedule inet_twdr_twkill_work. We will also make
      twdr->slot = 2.
      
      Next, a connection is closed and tcp_time_wait(TCP_FIN_WAIT2, timeo)
      is called which will create a new FIN_WAIT2 time-wait socket and will
      place it in the last to be reached slot, i.e. twdr->slot = 1.
      
      At this point say inet_twdr_twkill_work will run which will start
      destroying the time-wait sockets in slot 1, including the just added
      TCP_FIN_WAIT2 one.
      
      To avoid this issue we increment the slot only if all entries in the
      slot have been purged.
      
      This change may delay slot cleanup by up to one time-wait death row
      period, but only if the worker thread did not have time to run/purge
      the current slot within that period (6 seconds with default sysctl
      settings). However, on a system that busy we would probably see such
      delays even without this change...
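      The fixed slot-advance rule can be sketched in plain userspace C.
      All names, the slot count, and the kill budget below are
      illustrative stand-ins, not the kernel's inet_timewait structures:

```c
#include <assert.h>
#include <stdbool.h>

#define TWDR_SLOTS 8   /* illustrative slot count */

struct twdr_sim {
	int slot;                 /* next slot the hangman processes */
	int pending[TWDR_SLOTS];  /* time-wait sockets queued per slot */
};

/* One inet_twdr_hangman-style tick with a per-tick kill budget.
 * Returns true when work had to be deferred to the worker.  The fix:
 * only advance twdr->slot once the current slot is fully purged, so a
 * FIN_WAIT2 socket added meanwhile to the last-to-be-reached slot
 * cannot be destroyed prematurely. */
static bool twdr_hangman_tick(struct twdr_sim *twdr, int budget)
{
	int *p = &twdr->pending[twdr->slot];
	int killed = *p < budget ? *p : budget;

	*p -= killed;
	if (*p > 0)
		return true;   /* slot not purged: do NOT advance it */
	twdr->slot = (twdr->slot + 1) % TWDR_SLOTS;
	return false;
}
```

      With this rule a socket queued into the still-busy slot simply
      waits for the next pass instead of being killed early.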
      
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: resize rework · 80b71b80
      Jens Låås authored
      
      Here is a rework and cleanup of the resize function.
      
      This fixes some bugs we had: we were using ->parent where we should
      have used node_parent(), and the inflate loop used ->parent, which
      is not assigned by inflate.
      
      Also a fix to set the thresholds to powers of 2, to fit the halve
      and double strategy.
      
      max_resize is renamed to max_work, which better indicates
      its function.
      
      Reaching max_work is not an error, so the warning is removed.
      max_work only limits the amount of work done per resize
      (bounding CPU usage, outstanding memory, etc.).
      
      The clean-up makes it relatively easy to add fixed sized 
      root-nodes if we would like to decrease the memory pressure
      on routers with large routing tables and dynamic routing.
      If we'll need that...
      
      It has been tested with 280k routes.
      
      Work done together with Robert Olsson.
      
      Signed-off-by: Jens Låås <jens.laas@its.uu.se>
      Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip_rt_send_redirect() optimization · 30038fc6
      Eric Dumazet authored
      
      While doing some forwarding benchmarks, I noticed
      ip_rt_send_redirect() is rather expensive, even if send_redirects is
      false for the device.
      
      The fix is to avoid two atomic ops; we don't really need to take a
      reference on in_dev.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: keepalive cleanups · df19a626
      Eric Dumazet authored
      
      Introduce a keepalive_probes(tp) helper and use it, like
      keepalive_time_when(tp) and keepalive_intvl_when(tp).
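      The helper pattern can be sketched in userspace C. The struct below
      is a stand-in, not the kernel's tcp_sock; 9 is the kernel's default
      number of keepalive probes:

```c
#include <assert.h>

/* Sketch of the keepalive_probes(tp) helper pattern: use the
 * per-socket value when set, else fall back to the system-wide
 * sysctl default. */

#define SYSCTL_TCP_KEEPALIVE_PROBES 9

struct tcp_sock_sim {
	int keepalive_probes;   /* 0 means "not set on this socket" */
};

static int keepalive_probes(const struct tcp_sock_sim *tp)
{
	return tp->keepalive_probes ? tp->keepalive_probes
				    : SYSCTL_TCP_KEEPALIVE_PROBES;
}
```

      Centralizing the fallback in one helper keeps every call site from
      re-implementing the "socket value or sysctl default" choice.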
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: af_inet.c cleanups · 3d1427f8
      Eric Dumazet authored
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. Aug 05, 2009
  3. Jul 31, 2009
    • xfrm: select sane defaults for xfrm[4|6] gc_thresh · a33bc5c1
      Neil Horman authored
      
      Choose saner defaults for xfrm[4|6] gc_thresh values on init
      
      Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values
      (set to 1024).  Given that the ipv4 and ipv6 routing caches are sized
      dynamically at boot time, the static selections can be nonsensical.
      This patch dynamically selects an appropriate gc threshold based on
      the corresponding main routing table size, using the assumption that
      we should in the worst case be able to handle as many connections as
      the routing table can.
      
      For ipv4, the maximum route cache size is 16 * the number of hash
      buckets in the route cache.  Given that xfrm4 starts garbage
      collection at the gc_thresh and prevents new allocations at 2 *
      gc_thresh, we set gc_thresh to half the maximum route cache size.
      
      For ipv6, it's a bit trickier. There is no maximum route cache size,
      but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems
      sane to select a similar gc_thresh for the xfrm6 code that is half
      the number of hash buckets in the v6 route cache times 16 (like the
      v4 code does).
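      The sizing arithmetic above can be written out directly; this is
      illustrative userspace code, not the kernel implementation:

```c
#include <assert.h>

/* Rule from the description: the v4 route cache holds at most
 * 16 entries per hash bucket, and xfrm refuses new allocations at
 * 2 * gc_thresh, so gc_thresh is set to half the maximum cache size. */
static unsigned long xfrm_gc_thresh_for(unsigned long hash_buckets)
{
	unsigned long max_cache = 16UL * hash_buckets;

	return max_cache / 2;   /* hard stop at 2 * gc_thresh == max */
}
```

      So a route cache with 1024 hash buckets yields gc_thresh = 8192,
      and allocation stops exactly at the 16384-entry cache maximum.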
      
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Jul 30, 2009
  5. Jul 27, 2009
    • xfrm: export xfrm garbage collector thresholds via sysctl · a44a4a00
      Neil Horman authored
      
      Export garbage collector thresholds for xfrm[4|6]_dst_ops
      
      I recently had a problem reported to me in which a system with a
      high volume of ipsec connections eventually began reporting ENOBUFS
      for new connections.
      
      It seemed that after about 2000 connections we started being unable to
      create more.  A quick look revealed that the xfrm code used a dst_ops
      structure that limited the gc_thresh value to 1024, and always
      dropped route cache entries after 2x the gc_thresh.
      
      It seems the most direct solution is to export the gc_thresh values in
      the xfrm[4|6] dst_ops as sysctls, like the main routing table does, so
      that higher volumes of connections can be supported.  This patch has
      been tested and allows the reporter to increase their ipsec connection
      volume successfully.
      
      Reported-by: Joe Nall <joe@nall.com>
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      
      ipv4/xfrm4_policy.c |   18 ++++++++++++++++++
      ipv6/xfrm6_policy.c |   18 ++++++++++++++++++
      2 files changed, 36 insertions(+)
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Jul 24, 2009
  7. Jul 20, 2009
  8. Jul 17, 2009
  9. Jul 14, 2009
  10. Jul 12, 2009
  11. Jul 10, 2009
    • net: adding memory barrier to the poll and receive callbacks · a57de0b4
      Jiri Olsa authored
      
      Add a memory barrier after the poll_wait function, paired with
      barriers in the receive callbacks. Add the functions sock_poll_wait
      and sk_has_sleeper to wrap the memory barrier.
      
      Without the memory barrier, the following race can happen. The race
      fires when the code paths below meet and the tp->rcv_nxt and
      __add_wait_queue updates stay in CPU caches:
      
      CPU1                         CPU2
      
      sys_select                   receive packet
        ...                        ...
        __add_wait_queue           update tp->rcv_nxt
        ...                        ...
        tp->rcv_nxt check          sock_def_readable
        ...                        {
        schedule                      ...
                                      if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                              wake_up_interruptible(sk->sk_sleep)
                                      ...
                                   }
      
      If there were no caches, the code would work correctly, since the
      wait_queue and rcv_nxt accesses happen in opposite order on the two
      CPUs.
      
      Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either
      already passed the tp->rcv_nxt check and sleeps, or will see the new
      value of tp->rcv_nxt and return with the new data mask.
      In both cases the process (CPU1) is added to the wait queue, so the
      waitqueue_active (CPU2) call cannot miss it and will wake up CPU1.
      
      The bad case is when the __add_wait_queue changes done by CPU1 stay
      in its cache, and so does the tp->rcv_nxt update on the CPU2 side.
      CPU1 will then end up calling schedule and sleeping forever if no
      more data arrive on the socket.
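      The barrier pairing can be sketched with C11 atomics in userspace.
      The struct and helpers below are simplified stand-ins for
      sock_poll_wait()/sk_has_sleeper(), not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The poll side publishes its wait queue entry and then issues a full
 * barrier before checking rcv_nxt; the wakeup side updates rcv_nxt and
 * issues a matching barrier before testing for sleepers.  With both
 * fences in place, at least one side must observe the other's write. */

struct sock_sim {
	atomic_int rcv_nxt;
	atomic_bool sleeper;   /* stands in for a non-empty sk_sleep queue */
};

static void sim_sock_poll_wait(struct sock_sim *sk)
{
	atomic_store_explicit(&sk->sleeper, true, memory_order_relaxed);
	/* like the smp_mb() added after poll_wait() */
	atomic_thread_fence(memory_order_seq_cst);
}

static bool sim_sk_has_sleeper(struct sock_sim *sk)
{
	/* like the smp_mb() before waitqueue_active() */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&sk->sleeper, memory_order_relaxed);
}
```

      Without the two fences, the sleeper store and the rcv_nxt update
      could each remain invisible to the other CPU, which is exactly the
      lost-wakeup scenario described above.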
      
      Calls to poll_wait in the following modules were omitted:
      	net/bluetooth/af_bluetooth.c
      	net/irda/af_irda.c
      	net/irda/irnet/irnet_ppp.c
      	net/mac80211/rc80211_pid_debugfs.c
      	net/phonet/socket.c
      	net/rds/af_rds.c
      	net/rfkill/core.c
      	net/sunrpc/cache.c
      	net/sunrpc/rpc_pipe.c
      	net/tipc/socket.c
      
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. Jul 08, 2009
    • ipv4: Fix fib_trie rebalancing, part 4 (root thresholds) · 345aa031
      Jarek Poplawski authored
      Pawel Staszewski wrote:
      <blockquote>
      Some time ago I reported this:
      http://bugzilla.kernel.org/show_bug.cgi?id=6648
      
      
      
      and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it is back.
      dmesg output:
      oprofile: using NMI interrupt.
      Fix inflate_threshold_root. Now=15 size=11 bits
      ...
      Fix inflate_threshold_root. Now=15 size=11 bits
      
      cat /proc/net/fib_triestat
      Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
      Main:
              Aver depth:     2.28
              Max depth:      6
              Leaves:         276539
              Prefixes:       289922
              Internal nodes: 66762
                1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5
      9: 1  18: 1
              Pointers: 691228
      Null ptrs: 347928
      Total size: 35709  kB
      </blockquote>
      
      It seems the current threshold for root resizing is too aggressive;
      it causes misleading warnings during big updates, but it might also
      be responsible for memory problems, especially with non-preempt
      configs, when RCU freeing is delayed long after call_rcu.
      
      It should also be mentioned that, because of non-atomic changes
      during resizing/rebalancing, the current lookup algorithm can miss
      valid leaves, so that is an additional argument for shortening these
      activities even at the cost of minimally longer searches.
      
      This patch restores values before the patch "[IPV4]: fib_trie root
      node settings", commit: 965ffea4 from
      v2.6.22.
      
      Pawel's report:
      <blockquote>
      I don't see any big change (in cpu load, or faster/slower
      routing/propagating of routes from bgpd, or anything else) - on
      average there is from 2% to 3% more CPU load, I don't know why, but
      there is - I changed from "preempt" to "no preempt" 3 times and
      checked this with my "mpstat -P ALL 1 30";
      the average cpu load was always from 2 to 3% more compared to "no preempt"
      [...]
      cat /proc/net/fib_triestat
      Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
      Main:
              Aver depth:     2.44
              Max depth:      6
              Leaves:         277814
              Prefixes:       291306
              Internal nodes: 66420
                1: 32737  2: 14850  3: 10332  4: 4871  5: 2313  6: 942  7: 371  8: 3  17: 1
              Pointers: 599098
      Null ptrs: 254865
      Total size: 18067  kB
      </blockquote>
      
      According to this and other similar reports average depth is slightly
      increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
      there is no visible performance decrease. So, until memory handling is
      improved or added parameters for changing this individually, this
      patch resets to safer defaults.
      
      Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
      Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. Jul 06, 2009
  14. Jul 04, 2009
  15. Jul 01, 2009
  16. Jun 30, 2009
  17. Jun 29, 2009
  18. Jun 27, 2009
  19. Jun 26, 2009
    • tcp: missing check ACK flag of received segment in FIN-WAIT-2 state · 1ac530b3
      Wei Yongjun authored
      
      RFC 793 specifies that in the FIN-WAIT-2 state, if the ACK bit is
      off, the segment is dropped and control returns [Page 72]. But this
      check is missing from tcp_timewait_state_process(). This causes a
      segment with the FIN flag but no ACK to be handled in two different
      ways:
      
      Case 1:
          Node A                      Node B
                    <-------------    FIN,ACK
                                      (enter FIN-WAIT-1)
          ACK       ------------->
                                      (enter FIN-WAIT-2)
          FIN       ------------->    discard
                                      (move sk to tw list)
      
      Case 2:
          Node A                      Node B
                    <-------------    FIN,ACK
                                      (enter FIN-WAIT-1)
          ACK       ------------->
                                      (enter FIN-WAIT-2)
                                      (move sk to tw list)
          FIN       ------------->
      
                    <-------------    ACK
      
      This patch fixes the problem.
      
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. Jun 24, 2009
    • ipv4 routing: Ensure that route cache entries are usable and reclaimable when caching is off · b6280b47
      Neil Horman authored
      
      When route caching is disabled (rt_caching returns false), we still use route
      cache entries that are created and passed into rt_intern_hash once.  These
      routes need to be made usable for the one call path that holds a reference to
      them, and they need to be reclaimed when that use is finished.  To be made
      usable, they need to be associated with a neighbor table entry (which they
      currently are not); otherwise iproute_finish2 just discards the packet, since
      we don't know which L2 peer to send the packet to.  To do this binding, we
      need to follow the path a bit higher up in rt_intern_hash, which calls
      arp_bind_neighbour, but not assign the route entry to the hash table.
      Currently, if caching is off, we simply assign the route to the rp pointer
      and return success.  This patch associates us with a neighbor entry first.
      
      Secondly, we need to make sure that any single use routes like this are known to
      the garbage collector when caching is off.  If caching is off, and we try to
      hash in a route, it will leak when its refcount reaches zero.  To avoid this,
      this patch calls rt_free on the route cache entry passed into rt_intern_hash.
      This places us on the gc list for the route cache garbage collector, so that
      when its refcount reaches zero, it will be reclaimed (Thanks to Alexey for this
      suggestion).
      
      I've tested this on a local system here, and with these patches in place, I'm
      able to maintain routed connectivity to remote systems, even if I set
      /proc/sys/net/ipv4/rt_cache_rebuild_count to -1, which forces rt_caching to
      return false.
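      The two-part fix can be sketched with stand-in types; none of the
      names below are the kernel's real structures or exact call sequence:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* When caching is off: (1) bind the route to a neighbour entry so the
 * output path knows its L2 peer, and (2) rt_free() it so the garbage
 * collector reclaims it at refcount zero instead of leaking. */

struct neigh_sim { int id; };

struct rt_sim {
	int refcnt;
	struct neigh_sim *neighbour;
	bool on_gc_list;
};

static void sim_arp_bind_neighbour(struct rt_sim *rt, struct neigh_sim *n)
{
	rt->neighbour = n;       /* step 1: route is usable for output */
}

static void sim_rt_free(struct rt_sim *rt)
{
	rt->on_gc_list = true;   /* step 2: reclaimable once refcnt hits 0 */
}

static void sim_intern_hash_nocache(struct rt_sim *rt, struct neigh_sim *n,
				    struct rt_sim **rp)
{
	sim_arp_bind_neighbour(rt, n);
	sim_rt_free(rt);
	*rp = rt;                /* caller keeps its single reference */
}
```

      The single-use route is thus both deliverable (it has an L2 peer)
      and accounted for by the gc, without ever entering the hash table.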
      
      Signed-off-by: Neil Horman <nhorman@redhat.com>
      Reported-by: Jarek Poplawski <jarkao2@gmail.com>
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. Jun 20, 2009
    • ipv4: fix NULL pointer + success return in route lookup path · 73e42897
      Neil Horman authored
      
      Don't drop the route if we're not caching.
      
      I recently got a report of an oops on a route lookup.  Maxime was
      testing what would happen if route caching was turned off (by making
      rt_caching always return 0), and found that it triggered an oops.  I
      looked at it and found that the problem stemmed from the fact that the route
      lookup routines were returning success from their lookup paths (which is good),
      but never set the **rp pointer to anything (which is bad).  This happens because
      in rt_intern_hash, if rt_caching returns false, we call rt_drop and return 0.
      This almost emulates silent success.  What we should be doing is assigning
      *rp = rt and _not_ dropping the route.  This way, during slow path lookups,
      when we create a new route cache entry, we don't immediately discard it;
      rather, we just don't add it into the cache hash table, but we let this one
      lookup use it for the purpose of this route request.  Maxime has tested and
      reports it prevents the oops.  There is still a subsequent routing issue that
      I'm looking into further, but I'm confident that, even if it's related to
      this same path, this patch makes sense to take.
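      The bug and fix can be sketched with stand-in types (not the
      kernel's rtable or the real rt_intern_hash signature):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The old non-caching path returned 0 (success) while leaving *rp
 * NULL; the fix hands the freshly built route back to the caller
 * instead of rt_drop()ing it. */

struct rtable_sim { int refcnt; };

static int sim_intern_hash(struct rtable_sim *rt, struct rtable_sim **rp,
			   bool caching)
{
	if (!caching) {
		*rp = rt;   /* let this one lookup use the route... */
		return 0;   /* ...rather than dropping it and still
			     * reporting success */
	}
	/* the caching path would insert rt into the hash table here */
	*rp = rt;
	return 0;
}
```

      The caller then always gets a usable route alongside the success
      return, which is what the oops-triggering path was missing.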
      
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. Jun 18, 2009
  23. Jun 15, 2009
  24. Jun 14, 2009
    • PIM-SM: namespace changes · 403dbb97
      Tom Goff authored
      
      IPv4:
        - make PIM register vifs netns local
        - set the netns when a PIM register vif is created
        - make PIM available in all network namespaces (if CONFIG_IP_PIMSM_V2)
          by adding the protocol handler when multicast routing is initialized
      
      IPv6:
        - make PIM register vifs netns local
        - make PIM available in all network namespaces (if CONFIG_IPV6_PIMSM_V2)
          by adding the protocol handler when multicast routing is initialized
      
      Signed-off-by: Tom Goff <thomas.goff@boeing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: update ARPD help text · e61a4b63
      Timo Teräs authored
      
      Removed the statements about ARP cache size, as this config option
      does not affect it. The cache size is controlled by the neigh_table
      gc thresholds.
      
      Also removed the experimental and obsolete markings, as the API
      originally intended for ARP caching is useful for implementing
      ARP-like protocols (e.g. NHRP) in user space and has been there for
      a long enough time.
      
      Signed-off-by: Timo Teras <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>