  1. Jan 15, 2014
  2. Oct 10, 2013
    • inet: rename ir_loc_port to ir_num · b44084c2
      Eric Dumazet authored
      
      In commit 634fb979 ("inet: includes a sock_common in request_sock")
      I forgot that the two ports in sock_common do not have the same byte order:
      
      skc_dport is __be16 (network order), but skc_num is __u16 (host order)
      
      So sparse complains because ir_loc_port (mapped into skc_num) is
      considered as __u16 while it should be __be16
      
      Let's rename ir_loc_port to ireq->ir_num (by analogy with inet->inet_num),
      and perform the appropriate htons/ntohs conversions.
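The byte-order mismatch described above can be illustrated with a small userspace sketch. The struct and function names here are hypothetical stand-ins, not kernel code; only the htons()/ntohs() calls and the idea of one network-order and one host-order port field come from the commit message.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two port fields of struct sock_common. */
struct port_pair_sketch {
	uint16_t dport_be;	/* remote port, network byte order (like skc_dport) */
	uint16_t num;		/* local port, host byte order (like skc_num) */
};

/* Fill the pair from the two ports as they appear on the wire. */
static void port_pair_from_wire(struct port_pair_sketch *p,
				uint16_t src_be, uint16_t dst_be)
{
	p->dport_be = src_be;		/* already network order: store as-is */
	p->num = ntohs(dst_be);		/* host-order field: convert on the way in */
}

/* Local port in wire format, e.g. for building a reply header. */
static uint16_t port_pair_local_wire(const struct port_pair_sketch *p)
{
	return htons(p->num);		/* convert on the way out */
}

/* Usage: a segment arrives from remote port 51000 for local port 80. */
void port_pair_demo(void)
{
	struct port_pair_sketch p;

	port_pair_from_wire(&p, htons(51000), htons(80));
	assert(p.num == 80);				/* plain host-order integer */
	assert(p.dport_be == htons(51000));		/* untouched wire value */
	assert(port_pair_local_wire(&p) == htons(80));	/* round-trips */
}
```

Calling port_pair_demo() from a main() exercises both conversions. In the kernel, sparse catches a missing conversion because __be16 is a distinct annotated type; plain uint16_t in this sketch gives no such checking.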
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet authored
      
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock.
      
      This means there is no longer an IPv6-specific inet6_request_sock
      structure.
      
      The following inet_request_sock fields were renamed because they became
      macros referencing fields of struct sock_common.
      The ir_ prefix was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
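A minimal userspace sketch of the layout idea follows: a common structure placed first in the request structure, with the old field names turned into macros. The struct names and the trimmed field set are hypothetical simplifications; the real struct sock_common has many more members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily trimmed stand-in for struct sock_common. */
struct common_sketch {
	uint32_t daddr;		/* remote address */
	uint32_t rcv_saddr;	/* local address */
	uint16_t dport;		/* remote port */
	uint16_t num;		/* local port */
	int      bound_dev_if;	/* interface index */
};

/* Request socket with the common part placed first. */
struct req_sketch {
	struct common_sketch common;
	uint32_t rcv_wnd;	/* some request-specific field */
};

/* The renamed fields are macros expanding to members of the common part. */
#define ir_loc_addr	common.rcv_saddr
#define ir_rmt_addr	common.daddr
#define ir_rmt_port	common.dport
#define ir_loc_port	common.num
#define ir_iif		common.bound_dev_if

void req_sketch_demo(void)
{
	struct req_sketch req = { .rcv_wnd = 65535 };
	/* A pointer to a struct may be converted to a pointer to its first member. */
	struct common_sketch *generic = (struct common_sketch *)&req;

	req.ir_rmt_addr = 0x0a010102;
	req.ir_iif = 2;

	/* Generic lookup code sees the same values through the common view. */
	assert(offsetof(struct req_sketch, common) == 0);
	assert(generic->daddr == 0x0a010102);
	assert(generic->bound_dev_if == 2);
}
```

Because the common part sits at offset 0, code that only knows about the common structure can inspect a request socket without knowing its concrete type, which is what a shared hash table lookup needs.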
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Oct 03, 2013
    • inet: consolidate INET_TW_MATCH · 50805466
      Eric Dumazet authored
      
      TCP listener refactoring, part 2 :
      
      We can use a generic lookup, with sockets being in whatever state, if
      we are sure all relevant fields are at the same place in all socket
      types (ESTABLISHED, TIME_WAIT, SYN_RECV)
      
      This patch removes these macros :
      
       inet_addrpair, inet_portpair, tw_addrpair, tw_portpair
      
      And adds :
      
       sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr
      
      Then, INET_TW_MATCH() is really the same as INET_MATCH()
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Oct 01, 2013
  5. Mar 17, 2013
    • tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch authored
      
      TCPCT uses option-number 253, which is reserved for experimental use,
      and should not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Mar 07, 2013
  7. Feb 28, 2013
    • hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones, which look like this:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
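To see why the extra cursor is unnecessary, here is a minimal userspace sketch of a three-parameter iterator. All names ending in _sketch (and entry_or_null) are hypothetical; the node type is a simplified singly linked stand-in rather than the kernel's hlist_node, and __typeof__ is a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the kernel's hlist_node also has a pprev back-pointer. */
struct hnode_sketch { struct hnode_sketch *next; };
struct hhead_sketch { struct hnode_sketch *first; };

/* Map a node pointer back to its containing object, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
	((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-parameter iterator: the entry pointer itself is the only cursor. */
#define for_each_entry_sketch(pos, head, member)				\
	for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);	\
	     pos;								\
	     pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item_sketch {
	int value;
	struct hnode_sketch link;
};

static int sum_items(struct hhead_sketch *head)
{
	struct item_sketch *it;
	int sum = 0;

	for_each_entry_sketch(it, head, link)
		sum += it->value;
	return sum;
}

void hlist_sketch_demo(void)
{
	struct item_sketch a = { .value = 1 }, b = { .value = 2 };
	struct hhead_sketch head = { .first = &a.link };

	a.link.next = &b.link;	/* b.link.next stays NULL from its initializer */
	assert(sum_items(&head) == 3);
}
```

The next node is always reachable through the current entry's embedded member, so the separate node pointer of the old four-parameter form carries no extra information.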
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
         were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
         properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue,
      hlist_for_each_entry_from, hlist_for_each_entry_rcu,
      hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh,
      for_each_busy_worker, ax25_uid_for_each, ax25_for_each,
      inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each,
      sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound,
      hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu,
      nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each,
      nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp,
      for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. Jan 27, 2013
  9. Jan 23, 2013
    • soreuseport: TCP/IPv4 implementation · da5e3630
      Tom Herbert authored
      
      Allow multiple listener sockets to bind to the same port.
      
      Motivation for soreuseport would be something like a web server
      binding to port 80 running with multiple threads, where each thread
      might have its own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers. 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener thread
      can easily become the bottleneck with high connection turn-over rate.
      In case #2, the proportion of connections accepted per thread tends
      to be uneven under high connection load (assuming simple event loop:
      while (1) { accept(); process() }), because wakeup does not promote
      fairness among the sockets.  We have seen the disproportion to be as
      high as a 3:1 ratio between the thread accepting the most connections
      and the one accepting the fewest.  With SO_REUSEPORT the distribution
      is uniform.
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. Dec 14, 2012
    • inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock · e337e24d
      Christoph Paasch authored
      
      If in either of the above functions inet_csk_route_child_sock() or
      __inet_inherit_port() fails, the newsk will not be freed:
      
      unreferenced object 0xffff88022e8a92c0 (size 1592):
        comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
        hex dump (first 32 bytes):
          0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
          02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
          [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
          [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
          [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
          [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
          [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
          [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
          [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
          [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
          [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
          [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
          [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
          [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
          [<ffffffff814cee68>] ip_rcv+0x217/0x267
          [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
          [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
      
      This happens because sk_clone_lock initializes sk_refcnt to 2, and thus
      a single sock_put() is not enough to free the memory. Additionally, things
      like xfrm, memcg, cookie_values, ... may have been initialized.
      We have to free them properly.
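The reference-count arithmetic behind the leak can be shown with a toy, non-atomic sketch. The types and helpers below are hypothetical and only mimic the "clone starts at 2, put drops 1" behaviour described above; they are not the kernel's sock API.

```c
#include <assert.h>
#include <stdlib.h>

static int live_objects;	/* allocations that have not been freed yet */

struct obj_sketch {
	int refcnt;
};

/* Like the clone described above: the new object starts with two references. */
static struct obj_sketch *obj_clone_sketch(void)
{
	struct obj_sketch *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcnt = 2;
	live_objects++;
	return o;
}

/* Drop one reference; free the object when the last one goes away. */
static void obj_put_sketch(struct obj_sketch *o)
{
	if (--o->refcnt == 0) {
		free(o);
		live_objects--;
	}
}

void refcnt_leak_demo(void)
{
	struct obj_sketch *o = obj_clone_sketch();

	assert(o);
	obj_put_sketch(o);		/* error path with a single put ... */
	assert(live_objects == 1);	/* ... leaves the object allocated */
	obj_put_sketch(o);		/* the second put finally releases it */
	assert(live_objects == 0);
}
```

The sketch only covers the counting; it does not model the additional per-socket state (xfrm, memcg and so on) that also has to be torn down.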
      
      This is fixed by forcing a call to tcp_done(), ending up in
      inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
      because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
      ...
      
      Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
      force it to enter inet_csk_destroy_sock. To avoid the warning in
      inet_csk_destroy_sock, inet_num has to be set to 0.
      As inet_csk_destroy_sock decrements orphan_count, we first have to
      increase it.
      
      Calling tcp_done() allows us to remove the calls to
      tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
      
      A similar approach is taken for dccp by calling dccp_done().
      
      This bug has been in the kernel since 093d2823 (tproxy: fix hash locking
      issue when using port redirection in __inet_inherit_port()), i.e. since
      version 2.6.37.
      
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. Nov 03, 2012
    • tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet authored
      
      For passive TCP connections using the TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time the timeout triggers,
      even when no SYNACK is sent.
      
      SYNACKs are not sent for TCP_DEFER_ACCEPT requests that were established
      (for which we received the ACK from the client). Only the last SYNACK is
      sent so that we can receive an ACK from the client again, to move the req
      into the accept queue. We plan to change this later to avoid the useless
      retransmit (and a potential problem, as this SYNACK could be lost).
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmits
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
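The split between the two counters can be sketched as follows. Every name ending in _sketch, the callback type, and the resend flag are hypothetical; the sketch only illustrates the rule that num_timeout advances on every timer expiry while num_retrans advances only after a successful send.

```c
#include <assert.h>

struct synack_req_sketch {
	unsigned int num_retrans;	/* SYNACKs actually retransmitted */
	unsigned int num_timeout;	/* timer expirations */
};

/* Hypothetical transmit callback: returns 0 on success, non-zero on failure. */
typedef int (*rtx_fn_sketch)(struct synack_req_sketch *req);

/* Count a retransmit only when the send succeeded. */
static int rtx_syn_ack_sketch(struct synack_req_sketch *req, rtx_fn_sketch send)
{
	int err = send(req);

	if (!err)
		req->num_retrans++;
	return err;
}

/* Timer handler: always counts the timeout, sends only when asked to. */
static void on_timeout_sketch(struct synack_req_sketch *req, int resend,
			      rtx_fn_sketch send)
{
	if (resend)
		rtx_syn_ack_sketch(req, send);
	req->num_timeout++;
}

static int send_ok(struct synack_req_sketch *req)   { (void)req; return 0; }
static int send_fail(struct synack_req_sketch *req) { (void)req; return -1; }

void retrans_demo(void)
{
	struct synack_req_sketch req = { 0, 0 };

	on_timeout_sketch(&req, 0, send_ok);	/* deferred: nothing is sent */
	on_timeout_sketch(&req, 1, send_fail);	/* send attempted but failed */
	on_timeout_sketch(&req, 1, send_ok);	/* send succeeded */

	assert(req.num_timeout == 3);	/* drives the exponential backoff */
	assert(req.num_retrans == 1);	/* what should be reported as retransmits */
}
```

Keeping the timeout count separate means the backoff schedule is unaffected, while statistics reported to the user reflect only packets that were really sent.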
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      
      Reported-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. Oct 08, 2012
    • ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov authored
      
      Add a new flag to remember when a route is via a gateway.
      We will use it to allow rt_gateway to contain the address of a
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches a per-destination route
      without the DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. This way we force the neighbour
      resolution to work with the routed destination, but we
      can use a different address in the packet, a feature needed
      for IPVS-DR, where the original packet for the virtual IP is routed
      via the route to the real IP.
      
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. Sep 06, 2012
    • tcp: fix TFO regression · 7ab4551f
      Eric Dumazet authored
      
      Fengguang Wu reported various panics and bisected to commit
      8336886f (tcp: TCP Fast Open Server - support TFO listeners)
      
      Fix this by making sure socket is a TCP socket before accessing TFO data
      structures.
      
      [  233.046014] kfree_debugcheck: out of range ptr ea6000000bb8h.
      [  233.047399] ------------[ cut here ]------------
      [  233.048393] kernel BUG at /c/kernel-tests/src/stable/mm/slab.c:3074!
      [  233.048393] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [  233.048393] Modules linked in:
      [  233.048393] CPU 0
      [  233.048393] Pid: 3929, comm: trinity-watchdo Not tainted 3.6.0-rc3+ #4192 Bochs Bochs
      [  233.048393] RIP: 0010:[<ffffffff81169653>]  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
      [  233.048393] RSP: 0018:ffff88000facbca8  EFLAGS: 00010092
      [  233.048393] RAX: 0000000000000031 RBX: 0000ea6000000bb8 RCX: 00000000a189a188
      [  233.048393] RDX: 000000000000a189 RSI: ffffffff8108ad32 RDI: ffffffff810d30f9
      [  233.048393] RBP: ffff88000facbcb8 R08: 0000000000000002 R09: ffffffff843846f0
      [  233.048393] R10: ffffffff810ae37c R11: 0000000000000908 R12: 0000000000000202
      [  233.048393] R13: ffffffff823dbd5a R14: ffff88000ec5bea8 R15: ffffffff8363c780
      [  233.048393] FS:  00007faa6899c700(0000) GS:ffff88001f200000(0000) knlGS:0000000000000000
      [  233.048393] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  233.048393] CR2: 00007faa6841019c CR3: 0000000012c82000 CR4: 00000000000006f0
      [  233.048393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  233.048393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  233.048393] Process trinity-watchdo (pid: 3929, threadinfo ffff88000faca000, task ffff88000faec600)
      [  233.048393] Stack:
      [  233.048393]  0000000000000000 0000ea6000000bb8 ffff88000facbce8 ffffffff8116ad81
      [  233.048393]  ffff88000ff588a0 ffff88000ff58850 ffff88000ff588a0 0000000000000000
      [  233.048393]  ffff88000facbd08 ffffffff823dbd5a ffffffff823dbcb0 ffff88000ff58850
      [  233.048393] Call Trace:
      [  233.048393]  [<ffffffff8116ad81>] kfree+0x5f/0xca
      [  233.048393]  [<ffffffff823dbd5a>] inet_sock_destruct+0xaa/0x13c
      [  233.048393]  [<ffffffff823dbcb0>] ? inet_sk_rebuild_header+0x319/0x319
      [  233.048393]  [<ffffffff8231c307>] __sk_free+0x21/0x14b
      [  233.048393]  [<ffffffff8231c4bd>] sk_free+0x26/0x2a
      [  233.048393]  [<ffffffff825372db>] sctp_close+0x215/0x224
      [  233.048393]  [<ffffffff810d6835>] ? lock_release+0x16f/0x1b9
      [  233.048393]  [<ffffffff823daf12>] inet_release+0x7e/0x85
      [  233.048393]  [<ffffffff82317d15>] sock_release+0x1f/0x77
      [  233.048393]  [<ffffffff82317d94>] sock_close+0x27/0x2b
      [  233.048393]  [<ffffffff81173bbe>] __fput+0x101/0x20a
      [  233.048393]  [<ffffffff81173cd5>] ____fput+0xe/0x10
      [  233.048393]  [<ffffffff810a3794>] task_work_run+0x5d/0x75
      [  233.048393]  [<ffffffff8108da70>] do_exit+0x290/0x7f5
      [  233.048393]  [<ffffffff82707415>] ? retint_swapgs+0x13/0x1b
      [  233.048393]  [<ffffffff8108e23f>] do_group_exit+0x7b/0xba
      [  233.048393]  [<ffffffff8108e295>] sys_exit_group+0x17/0x17
      [  233.048393]  [<ffffffff8270de10>] tracesys+0xdd/0xe2
      [  233.048393] Code: 59 01 5d c3 55 48 89 e5 53 41 50 0f 1f 44 00 00 48
      89 fb e8 d4 b0 f0 ff 84 c0 75 11 48 89 de 48 c7 c7 fc fa f7 82 e8 0d 0f
      57 01 <0f> 0b 5f 5b 5d c3 55 48 89 e5 0f 1f 44 00 00 48 63 87 d8 00 00
      [  233.048393] RIP  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
      [  233.048393]  RSP <ffff88000facbca8>
      
      Reported-by: Fengguang Wu <wfg@linux.intel.com>
      Tested-by: Fengguang Wu <wfg@linux.intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "H.K. Jerry Chu" <hkchu@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. Sep 01, 2012
    • tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu authored
      
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() so that it can check a TFO socket as well
      as a request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. Aug 21, 2012
  16. Jul 20, 2012
  17. Jul 17, 2012
  18. Jul 16, 2012
    • ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f
      David S. Miller authored
      
      This abstracts away the call to dst_ops->update_pmtu() so that we can
      transparently handle the fact that, in the future, the dst itself can
      be invalidated by the PMTU update (when we have non-host routes cached
      in sockets).
      
      So we try to rebuild the socket cached route after the method
      invocation if necessary.
      
      This isn't used by SCTP because it needs to cache dsts per-transport,
      and thus will need its own local version of this helper.
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. Jul 11, 2012
  20. Jun 23, 2012
  21. Jun 01, 2012
    • tcp: do not create inetpeer on SYNACK message · 7433819a
      Eric Dumazet authored
      
      Another problem during a SYNFLOOD/DDoS attack is the inetpeer cache
      getting larger and larger, using lots of memory and CPU time.
      
      tcp_v4_send_synack()
      ->inet_csk_route_req()
       ->ip_route_output_flow()
        ->rt_set_nexthop()
         ->rt_init_metrics()
          ->inet_getpeer( create = true)
      
      This is a side effect of commit a4daad6b (net: Pre-COW metrics for
      TCP) added in 2.6.39
      
      Possible solution :
      
      Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS
      
      Before patch :
      
      # grep peer /proc/slabinfo
      inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0
      
      Samples: 41K of event 'cycles', Event count (approx.): 30716565122
      +  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
      +   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
      +   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
      +   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
      +   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
      +   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
      +   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
      +   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
      +   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
      +   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
      +   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
      +   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
      +   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
      +   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
      +   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
      +   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
      +   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. Apr 21, 2012
  23. Apr 15, 2012
  24. Apr 14, 2012
  25. Jan 26, 2012
  26. Dec 12, 2011
  27. Nov 08, 2011
  28. May 24, 2011
  29. May 19, 2011
  30. May 08, 2011
  31. Apr 29, 2011
  32. Apr 28, 2011
    • inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet authored
      
      We lack proper synchronization to manipulate inet->opt ip_options.
      
      The problem is that ip_make_skb() calls ip_setup_cork(), and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options)
      without any protection against another thread manipulating inet->opt.
      
      Another thread can change the inet->opt pointer and free the old one
      under us.
      
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      
      We can't insert an rcu_head in struct ip_options since it's included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. Apr 13, 2011
  34. Mar 31, 2011