     * "Disorder"   In all the respects it is "Open",
     *		but requires a bit more attention. It is entered when
     *		we see some SACKs or dupacks. It is split off from "Open"
     *		mainly to move some processing from the fast path to the slow one.
     * "CWR"	CWND was reduced due to some Congestion Notification event.
     *		It can be ECN, ICMP source quench, local device congestion.
     * "Recovery"	CWND was reduced, we are fast-retransmitting.
     * "Loss"	CWND was reduced due to RTO timeout or SACK reneging.
     *
     * tcp_fastretrans_alert() is entered:
     * - on each incoming ACK, if the state is not "Open"
     * - when the arriving ACK is unusual, namely:
     *	* SACK
     *	* Duplicate ACK.
     *	* ECN ECE.
     *
     * Counting packets in flight is pretty simple.
     *
     *	in_flight = packets_out - left_out + retrans_out
     *
     *	packets_out is SND.NXT-SND.UNA counted in packets.
     *
     *	retrans_out is number of retransmitted segments.
     *
     *	left_out is the number of segments that have left the network but are not ACKed yet.
     *
     *		left_out = sacked_out + lost_out
     *
     *     sacked_out: Packets that arrived at the receiver out of order
     *		   and hence are not ACKed. With SACKs this number is simply
     *		   the amount of SACKed data. Even without SACKs
     *		   it is easy to give a fairly reliable estimate of this number
     *		   by counting duplicate ACKs.
     *
     *       lost_out: Packets lost by the network. TCP has no explicit
     *		   "loss notification" feedback from the network (for now).
     *		   This means that the number can only be _guessed_.
     *		   In fact, it is the heuristic used to predict loss that
     *		   distinguishes the different algorithms.
     *
     *	For example, after an RTO, when the whole queue is considered lost,
     *	lost_out = packets_out and in_flight = retrans_out.
     *
     *		Essentially, we now have two algorithms for counting
     *		lost packets.
     *
     *		FACK: It is the simplest heuristic. As soon as we decide
     *		that something is lost, we decide that _all_ not SACKed
     *		packets up to the most forward SACK are lost. I.e.
     *		lost_out = fackets_out - sacked_out and left_out = fackets_out.
     *		It is an exact estimate if the network does not reorder
     *		packets, and it loses any connection to reality when reordering
     *		takes place. We use FACK by default until reordering
     *		is suspected on the path to this destination.
     *
     *		NewReno: when Recovery is entered, we assume that one segment
     *		is lost (classic Reno). While we are in Recovery and
     *		a partial ACK arrives, we assume that one more packet
     *		is lost (NewReno). These heuristics are the same in NewReno
     *		and SACK.
     *
     *  Imagine, that's all! Forget about all this shamanism about CWND inflation,
     *  deflation, etc. CWND is the real congestion window, never inflated, and
     *  changes only according to the classic VJ rules.
     *
     * The really tricky part of the algorithm (which requires careful tuning)
     * is hidden in the functions tcp_time_to_recover() and
     * tcp_xmit_retransmit_queue(). The first determines the moment _when_
     * we should reduce CWND and, hence, slow down forward transmission.
     * In fact, it determines the moment when we decide that a hole is caused
     * by loss rather than by reordering.
     *
     * tcp_xmit_retransmit_queue() decides _what_ we should retransmit to fill
     * the holes caused by lost packets.
     *
     * The most logically complicated part of the algorithm is the undo
     * heuristics. We detect false retransmits due to both too-early
     * fast retransmit (reordering) and an underestimated RTO by analyzing
     * timestamps and D-SACKs. When we detect that some segments were
     * retransmitted by mistake and the CWND reduction was wrong, we undo
     * the window reduction and abort the recovery phase. This logic is hidden
     * inside several functions named tcp_try_undo_<something>.
     */
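
/* Illustration only (not part of the kernel source): a minimal user-space
 * sketch of the packet accounting described in the comment above. The names
 * flight_counters, sketch_left_out() and sketch_in_flight() are hypothetical;
 * only the two formulas are taken from the comment.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the counters named in the comment above. */
struct flight_counters {
	uint32_t packets_out;	/* SND.NXT - SND.UNA, counted in packets */
	uint32_t sacked_out;	/* SACKed (or dupack-estimated) packets */
	uint32_t lost_out;	/* packets currently guessed to be lost */
	uint32_t retrans_out;	/* retransmitted segments */
};

/* left_out = sacked_out + lost_out */
static uint32_t sketch_left_out(const struct flight_counters *c)
{
	return c->sacked_out + c->lost_out;
}

/* in_flight = packets_out - left_out + retrans_out */
static uint32_t sketch_in_flight(const struct flight_counters *c)
{
	/* Assumption of this sketch: left_out never exceeds packets_out. */
	assert(sketch_left_out(c) <= c->packets_out);
	return c->packets_out - sketch_left_out(c) + c->retrans_out;
}
```

/* For instance, 10 packets out with 3 SACKed, 2 marked lost and 1
 * retransmitted leave 6 in flight. After an RTO marks the whole queue lost,
 * in_flight collapses to retrans_out, as the comment states.
 */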
    
    /* This function decides when we should leave the Disorder state
     * and enter the Recovery phase, reducing the congestion window.
     *
     * Main question: may we continue forward transmission
     * with the same cwnd?
     */
    
    static bool tcp_time_to_recover(struct sock *sk, int flag)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	__u32 packets_out;
    
    	/* Trick#1: The loss is proven. */
    	if (tp->lost_out)
    
    		return true;
    
    
    	/* Not-A-Trick#2 : Classic rule... */
    
    	if (tcp_dupack_heuristics(tp) > tp->reordering)
    
    		return true;
    
    
    	/* Trick#4: It is still not OK... But will it be useful to delay
    	 * recovery more?
    	 */
    	packets_out = tp->packets_out;
    	if (packets_out <= tp->reordering &&
    	    tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
    	    !tcp_may_send_now(sk)) {
    		/* We have nothing to send. This connection is limited
    		 * either by receiver window or by application.
    		 */
    		return true;
    	}
    
    	/* If a thin stream is detected, retransmit after first
    	 * received dupack. Employ only if SACK is supported in order
    	 * to avoid possible corner-case series of spurious retransmissions.
    	 * Use only if there are no unsent data.
    	 */
    	if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
    	    tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
    	    tcp_is_sack(tp) && !tcp_send_head(sk))
    
    		return true;
    
    	/* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
    	 * retransmissions due to small network reorderings, we implement
    	 * Mitigation A.3 in the RFC and delay the retransmission for a short
    	 * interval if appropriate.
    	 */
    	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
    
    	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
    
    	    !tcp_may_send_now(sk))
    
    		return !tcp_pause_early_retransmit(sk, flag);
    
    	return false;
    }
    
    /* Detect loss in event "A" above by marking head of queue up as lost.
     * For FACK or non-SACK(Reno) senders, the first "packets" number of segments
     * are considered lost. For RFC3517 SACK, a segment is considered lost if it
     * has at least tp->reordering SACKed segments above it; "packets" refers to
     * the maximum SACKed segments to pass before reaching this limit.
     */
    
    static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	struct sk_buff *skb;
    
    	int cnt, oldcnt;
    	int err;
    	unsigned int mss;
    
    	/* Use SACK to deduce losses of new sequences sent during recovery */
    	const u32 loss_high = tcp_is_sack(tp) ?  tp->snd_nxt : tp->high_seq;
    
    
    
    	WARN_ON(packets > tp->packets_out);
    
    	if (tp->lost_skb_hint) {
    		skb = tp->lost_skb_hint;
    		cnt = tp->lost_cnt_hint;
    		/* Head already handled? */
    		if (mark_head && skb != tcp_write_queue_head(sk))
    			return;
    	} else {
    		skb = tcp_write_queue_head(sk);
    		cnt = 0;
    	}
    
    
    	tcp_for_write_queue_from(skb, sk) {
    		if (skb == tcp_send_head(sk))
    			break;
    
    		/* TODO: do this better */
    		/* this is not the most efficient way to do this... */
    		tp->lost_skb_hint = skb;
    		tp->lost_cnt_hint = cnt;
    
    		if (after(TCP_SKB_CB(skb)->end_seq, loss_high))
    			break;

    		oldcnt = cnt;
    		if (tcp_is_fack(tp) || tcp_is_reno(tp) ||
    		    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
    			cnt += tcp_skb_pcount(skb);

    		if (cnt > packets) {
    			if ((tcp_is_sack(tp) && !tcp_is_fack(tp)) ||
    			    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
    			    (oldcnt >= packets))
    				break;
    
    			mss = skb_shinfo(skb)->gso_size;
    			err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);
    			if (err < 0)
    				break;
    			cnt = packets;
    		}
    
    
    		tcp_skb_mark_lost(tp, skb);
    
    
    		if (mark_head)
    			break;
    
    	}
    
    }
    
    /* Account newly detected lost packet(s) */
    
    
    static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);

    	if (tcp_is_reno(tp)) {
    		tcp_mark_head_lost(sk, 1, 1);
    
    	} else if (tcp_is_fack(tp)) {
    
    		int lost = tp->fackets_out - tp->reordering;
    		if (lost <= 0)
    			lost = 1;
    
    		tcp_mark_head_lost(sk, lost, 0);
    
    	} else {
    
    		int sacked_upto = tp->sacked_out - tp->reordering;
    
    		if (sacked_upto >= 0)
    			tcp_mark_head_lost(sk, sacked_upto, 0);
    		else if (fast_rexmit)
    			tcp_mark_head_lost(sk, 1, 1);
    
    	}
    }
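
/* Illustration only (not part of the kernel source): a user-space sketch of
 * how the two SACK-based branches of tcp_update_scoreboard() derive the
 * "packets" argument for tcp_mark_head_lost(). sketch_lost_budget() is a
 * hypothetical name, and returning -1 for "mark nothing" is a convention of
 * this sketch only.
 */

```c
#include <stdbool.h>

/* Returns the "packets" count that would be handed to tcp_mark_head_lost(),
 * or -1 when nothing would be marked.
 */
static int sketch_lost_budget(bool fack, int fackets_out, int sacked_out,
			      int reordering, bool fast_rexmit)
{
	if (fack) {
		/* FACK: everything below the forward-most SACK, minus the
		 * reordering allowance, but always at least one segment.
		 */
		int lost = fackets_out - reordering;

		return lost <= 0 ? 1 : lost;
	}

	/* Non-FACK SACK: count SACKed segments beyond the allowance. */
	if (sacked_out - reordering >= 0)
		return sacked_out - reordering;

	/* On a fast retransmit only the head is marked (mark_head = 1). */
	return fast_rexmit ? 1 : -1;
}
```

/* With reordering = 3, FACK with fackets_out = 8 yields 5, while plain SACK
 * with sacked_out = 5 yields 2.
 */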
    
    /* CWND moderation, preventing bursts due to too big ACKs
     * in dubious situations.
     */
    static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
    {
    	tp->snd_cwnd = min(tp->snd_cwnd,
    
    			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
    
    	tp->snd_cwnd_stamp = tcp_time_stamp;
    }
    
    /* Nothing was retransmitted or returned timestamp is less
     * than timestamp of the first retransmission.
     */
    
    static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
    
    {
    	return !tp->retrans_stamp ||
    		(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
    
    		 before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
    
    }
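
/* Illustration only (not part of the kernel source): a user-space sketch of
 * the check in tcp_packet_delayed(). It assumes that before() compares 32-bit
 * values in wrapping (modular) order; sketch_before() and
 * sketch_packet_delayed() are hypothetical stand-ins, not kernel API.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a precedes b in 32-bit wrapping order. The cast of the unsigned
 * difference to int32_t relies on the usual two's-complement conversion.
 */
static bool sketch_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* The ACK refers to the original transmission if nothing was retransmitted,
 * or if the echoed timestamp is older than the first retransmission.
 */
static bool sketch_packet_delayed(uint32_t retrans_stamp, bool saw_tstamp,
				  uint32_t rcv_tsecr)
{
	return !retrans_stamp ||
	       (saw_tstamp && rcv_tsecr &&
		sketch_before(rcv_tsecr, retrans_stamp));
}
```

/* The wrapping comparison keeps working across a timestamp rollover: an echo
 * of 0xFFFFFFF0 still counts as "before" a retrans_stamp of 5.
 */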
    
    /* Undo procedures. */
    
    #if FASTRETRANS_DEBUG > 1
    
    static void DBGUNDO(struct sock *sk, const char *msg)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	struct inet_sock *inet = inet_sk(sk);
    
    	if (sk->sk_family == AF_INET) {
    
    		pr_debug("Undo %s %pI4/%u c%u l%u ss%u/%u p%u\n",
    			 msg,
    			 &inet->inet_daddr, ntohs(inet->inet_dport),
    			 tp->snd_cwnd, tcp_left_out(tp),
    			 tp->snd_ssthresh, tp->prior_ssthresh,
    			 tp->packets_out);
    	}
    #if IS_ENABLED(CONFIG_IPV6)
    
    	else if (sk->sk_family == AF_INET6) {
    		struct ipv6_pinfo *np = inet6_sk(sk);
    
    		pr_debug("Undo %s %pI6/%u c%u l%u ss%u/%u p%u\n",
    			 msg,
    			 &np->daddr, ntohs(inet->inet_dport),
    			 tp->snd_cwnd, tcp_left_out(tp),
    			 tp->snd_ssthresh, tp->prior_ssthresh,
    			 tp->packets_out);
    	}
    #endif
    }
    #else
    #define DBGUNDO(x...) do { } while (0)
    #endif
    
    
    static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);

    	if (unmark_loss) {
    		struct sk_buff *skb;
    
    		tcp_for_write_queue(skb, sk) {
    			if (skb == tcp_send_head(sk))
    				break;
    			TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
    		}
    		tp->lost_out = 0;
    		tcp_clear_all_retrans_hints(tp);
    	}
    
    
    	if (tp->prior_ssthresh) {
    
    		const struct inet_connection_sock *icsk = inet_csk(sk);
    
    		if (icsk->icsk_ca_ops->undo_cwnd)
    			tp->snd_cwnd = icsk->icsk_ca_ops->undo_cwnd(sk);
    
    		else
    
    			tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);
    
    
    
    		if (tp->prior_ssthresh > tp->snd_ssthresh) {
    
    			tp->snd_ssthresh = tp->prior_ssthresh;
    			TCP_ECN_withdraw_cwr(tp);
    		}
    	} else {
    		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh);
    	}
    	tp->snd_cwnd_stamp = tcp_time_stamp;
    
    	tp->undo_marker = 0;
    }

    static inline bool tcp_may_undo(const struct tcp_sock *tp)
    
    {
    
    	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
    
    }
    
    /* People celebrate: "We love our President!" */
    
    static bool tcp_try_undo_recovery(struct sock *sk)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    
    	if (tcp_may_undo(tp)) {
    		int mib_idx;

    		/* Happy end! We did not retransmit anything
    		 * or our original transmission succeeded.
    		 */
    
    		DBGUNDO(sk, inet_csk(sk)->icsk_ca_state == TCP_CA_Loss ? "loss" : "retrans");
    
    		tcp_undo_cwnd_reduction(sk, false);
    
    		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss)
    
    			mib_idx = LINUX_MIB_TCPLOSSUNDO;
    
    		else
    
    			mib_idx = LINUX_MIB_TCPFULLUNDO;
    
    
    		NET_INC_STATS_BH(sock_net(sk), mib_idx);
    
    	}
    
    	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
    
    		/* Hold old state until something *above* high_seq
    		 * is ACKed. For Reno this is a MUST to prevent false
    		 * fast retransmits (RFC2582). SACK TCP is safe. */
    		tcp_moderate_cwnd(tp);
    
    		return true;
    
    	}
    
    	tcp_set_ca_state(sk, TCP_CA_Open);
    
    	return false;
    
    }
    
    /* Try to undo cwnd reduction, because D-SACKs acked all retransmitted data */
    
    static bool tcp_try_undo_dsack(struct sock *sk)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    
    	if (tp->undo_marker && !tp->undo_retrans) {
    
    		tcp_undo_cwnd_reduction(sk, false);
    
    		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKUNDO);
    
    		return true;
    
    	}
    
    	return false;
    }
    
    /* We can clear retrans_stamp when there are no retransmissions in the
     * window. It would seem that it is trivially available for us in
     * tp->retrans_out, however, that kind of assumption doesn't consider
     * what will happen if errors occur when sending a retransmission for the
     * second time. ...It could be that such a segment has only
     * TCPCB_EVER_RETRANS set at the present time. It seems that checking
     * the head skb is enough except for some reneging corner cases that
     * are not worth the effort.
     *
     * Main reason for all this complexity is the fact that connection dying
     * time now depends on the validity of the retrans_stamp, in particular,
     * that successive retransmissions of a segment must not advance
     * retrans_stamp under any conditions.
     */
    
    static bool tcp_any_retrans_done(const struct sock *sk)
    {
    	const struct tcp_sock *tp = tcp_sk(sk);
    
    	struct sk_buff *skb;
    
    	if (tp->retrans_out)
    
    		return true;
    
    
    	skb = tcp_write_queue_head(sk);
    	if (unlikely(skb && TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS))
    
    		return true;
    
    	return false;
    }
    
    /* Undo during loss recovery after partial ACK or using F-RTO. */
    static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    
    	if (frto_undo || tcp_may_undo(tp)) {
    
    		tcp_undo_cwnd_reduction(sk, true);
    
    		DBGUNDO(sk, "partial loss");
    
    		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
    
    		if (frto_undo)
    			NET_INC_STATS_BH(sock_net(sk),
    					 LINUX_MIB_TCPSPURIOUSRTOS);
    
    		inet_csk(sk)->icsk_retransmits = 0;
    
    		if (frto_undo || tcp_is_sack(tp))
    
    			tcp_set_ca_state(sk, TCP_CA_Open);
    
    		return true;
    
    	}
    
    	return false;
    }
    
    /* The cwnd reduction in CWR and Recovery uses the PRR algorithm
     * https://datatracker.ietf.org/doc/draft-ietf-tcpm-proportional-rate-reduction/
     * It computes the number of packets to send (sndcnt) based on packets newly
     * delivered:
     *   1) If the packets in flight is larger than ssthresh, PRR spreads the
     *	cwnd reductions across a full RTT.
     *   2) If packets in flight is lower than ssthresh (such as due to excess
     *	losses and/or application stalls), do not perform any further cwnd
     *	reductions, but instead slow start up to ssthresh.
     */
    
    static void tcp_init_cwnd_reduction(struct sock *sk, const bool set_ssthresh)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	tp->high_seq = tp->snd_nxt;
    
    	tp->tlp_high_seq = 0;
    
    	tp->snd_cwnd_cnt = 0;
    	tp->prior_cwnd = tp->snd_cwnd;
    	tp->prr_delivered = 0;
    	tp->prr_out = 0;
    	if (set_ssthresh)
    		tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
    	TCP_ECN_queue_cwr(tp);
    }
    
    
    static void tcp_cwnd_reduction(struct sock *sk, const int prior_unsacked,
    
    			       int fast_rexmit)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    	int sndcnt = 0;
    	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
    
    	int newly_acked_sacked = prior_unsacked -
    				 (tp->packets_out - tp->sacked_out);
    
    	tp->prr_delivered += newly_acked_sacked;
    
    	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
    		u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
    			       tp->prior_cwnd - 1;
    		sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
    	} else {
    		sndcnt = min_t(int, delta,
    			       max_t(int, tp->prr_delivered - tp->prr_out,
    				     newly_acked_sacked) + 1);
    	}
    
    	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
    	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
    }
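
/* Illustration only (not part of the kernel source): a user-space sketch of
 * the sndcnt arithmetic in tcp_cwnd_reduction(). struct prr_state and
 * sketch_prr_sndcnt() are hypothetical names. The sketch assumes prior_cwnd
 * is non-zero and that the caller advances prr_out when it actually sends.
 */

```c
#include <stdint.h>

struct prr_state {
	uint32_t ssthresh;	/* target window after the reduction */
	uint32_t prior_cwnd;	/* cwnd when the reduction started */
	uint32_t prr_delivered;	/* packets delivered since then */
	uint32_t prr_out;	/* packets sent since then */
};

static int sketch_prr_sndcnt(struct prr_state *s, int in_flight,
			     int newly_acked_sacked, int fast_rexmit)
{
	int delta = (int)s->ssthresh - in_flight;
	int floor = fast_rexmit ? 1 : 0;
	int sndcnt;

	s->prr_delivered += newly_acked_sacked;

	if (in_flight > (int)s->ssthresh) {
		/* Proportional part: ceil(ssthresh * delivered / prior_cwnd). */
		uint64_t dividend = (uint64_t)s->ssthresh * s->prr_delivered +
				    s->prior_cwnd - 1;

		sndcnt = (int)(dividend / s->prior_cwnd) - (int)s->prr_out;
	} else {
		/* Below ssthresh: slow start back up, bounded by delta. */
		int catchup = (int)s->prr_delivered - (int)s->prr_out;

		if (catchup < newly_acked_sacked)
			catchup = newly_acked_sacked;
		sndcnt = catchup + 1;
		if (sndcnt > delta)
			sndcnt = delta;
	}

	return sndcnt > floor ? sndcnt : floor;
}
```

/* With prior_cwnd = 10 and ssthresh = 5, roughly one packet is released for
 * every two delivered while in_flight stays above ssthresh, which is how the
 * reduction gets spread across a full RTT.
 */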
    
    
    static inline void tcp_end_cwnd_reduction(struct sock *sk)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	/* Reset cwnd to ssthresh in CWR or Recovery (unless it's undone) */
    	if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
    	    (tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
    		tp->snd_cwnd = tp->snd_ssthresh;
    		tp->snd_cwnd_stamp = tcp_time_stamp;
    	}
    	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
    }
    
    /* Enter CWR state. Disable cwnd undo since congestion is proven with ECN */
    
    void tcp_enter_cwr(struct sock *sk, const int set_ssthresh)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	tp->prior_ssthresh = 0;
    
    	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
    
    		tp->undo_marker = 0;
    
    		tcp_init_cwnd_reduction(sk, set_ssthresh);
    
    		tcp_set_ca_state(sk, TCP_CA_CWR);
    	}
    }
    
    
    static void tcp_try_keep_open(struct sock *sk)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    	int state = TCP_CA_Open;
    
    
    	if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
    
    		state = TCP_CA_Disorder;
    
    	if (inet_csk(sk)->icsk_ca_state != state) {
    		tcp_set_ca_state(sk, state);
    		tp->high_seq = tp->snd_nxt;
    	}
    }
    
    
    static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
    
    {
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    
    	if (!tcp_any_retrans_done(sk))
    
    		tp->retrans_stamp = 0;
    
    
    	if (flag & FLAG_ECE)
    
    		tcp_enter_cwr(sk, 1);
    
    
    
    	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
    
    	} else {
    
    		tcp_cwnd_reduction(sk, prior_unsacked, 0);
    	}
    }

    static void tcp_mtup_probe_failed(struct sock *sk)
    {
    	struct inet_connection_sock *icsk = inet_csk(sk);
    
    	icsk->icsk_mtup.search_high = icsk->icsk_mtup.probe_size - 1;
    	icsk->icsk_mtup.probe_size = 0;
    }
    
    
    static void tcp_mtup_probe_success(struct sock *sk)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    	struct inet_connection_sock *icsk = inet_csk(sk);
    
    	/* FIXME: breaks with very large cwnd */
    	tp->prior_ssthresh = tcp_current_ssthresh(sk);
    	tp->snd_cwnd = tp->snd_cwnd *
    		       tcp_mss_to_mtu(sk, tp->mss_cache) /
    		       icsk->icsk_mtup.probe_size;
    	tp->snd_cwnd_cnt = 0;
    	tp->snd_cwnd_stamp = tcp_time_stamp;
    
    	tp->snd_ssthresh = tcp_current_ssthresh(sk);
    
    
    	icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
    	icsk->icsk_mtup.probe_size = 0;
    	tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
    }
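
/* Illustration only (not part of the kernel source, and not a proposed
 * patch): a user-space sketch of the cwnd rescaling done in
 * tcp_mtup_probe_success(). The FIXME there presumably refers to the 32-bit
 * product cwnd * mtu overflowing; this sketch assumes that reading and
 * widens the intermediate to 64 bits. sketch_rescale_cwnd() is a
 * hypothetical name, and probe_size is assumed to be non-zero.
 */

```c
#include <stdint.h>

/* Scale a window counted in old-MTU packets into probe_size packets. */
static uint32_t sketch_rescale_cwnd(uint32_t cwnd, uint32_t old_mtu,
				    uint32_t probe_size)
{
	/* The 64-bit product cannot wrap for any pair of 32-bit inputs. */
	uint64_t bytes = (uint64_t)cwnd * old_mtu;

	return (uint32_t)(bytes / probe_size);
}
```

/* A cwnd of 100 packets at MTU 1500 becomes 50 packets at MTU 3000. A cwnd
 * of 4000000 at MTU 1500 has a product above 2^32, so a 32-bit
 * multiplication would wrap before the division.
 */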
    
    
    /* Do a simple retransmit without using the backoff mechanisms in
     * tcp_timer. This is used for path mtu discovery.
     * The socket is already locked here.
     */
    void tcp_simple_retransmit(struct sock *sk)
    {
    	const struct inet_connection_sock *icsk = inet_csk(sk);
    	struct tcp_sock *tp = tcp_sk(sk);
    	struct sk_buff *skb;
    
    	unsigned int mss = tcp_current_mss(sk);
    
    	u32 prior_lost = tp->lost_out;
    
    	tcp_for_write_queue(skb, sk) {
    		if (skb == tcp_send_head(sk))
    			break;
    
    		if (tcp_skb_seglen(skb) > mss &&
    
    		    !(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
    			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) {
    				TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
    				tp->retrans_out -= tcp_skb_pcount(skb);
    			}
    			tcp_skb_mark_lost_uncond_verify(tp, skb);
    		}
    	}
    
    	tcp_clear_retrans_hints_partial(tp);
    
    	if (prior_lost == tp->lost_out)
    		return;
    
    	if (tcp_is_reno(tp))
    		tcp_limit_reno_sacked(tp);
    
    	tcp_verify_left_out(tp);
    
    	/* Don't muck with the congestion window here.
    	 * Reason is that we do not increase amount of _data_
    	 * in network, but units changed and effective
    	 * cwnd/ssthresh really reduced now.
    	 */
    	if (icsk->icsk_ca_state != TCP_CA_Loss) {
    		tp->high_seq = tp->snd_nxt;
    		tp->snd_ssthresh = tcp_current_ssthresh(sk);
    		tp->prior_ssthresh = 0;
    		tp->undo_marker = 0;
    		tcp_set_ca_state(sk, TCP_CA_Loss);
    	}
    	tcp_xmit_retransmit_queue(sk);
    }
    
    EXPORT_SYMBOL(tcp_simple_retransmit);
    
    static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    	int mib_idx;
    
    	if (tcp_is_reno(tp))
    		mib_idx = LINUX_MIB_TCPRENORECOVERY;
    	else
    		mib_idx = LINUX_MIB_TCPSACKRECOVERY;
    
    	NET_INC_STATS_BH(sock_net(sk), mib_idx);
    
    	tp->prior_ssthresh = 0;
    	tp->undo_marker = tp->snd_una;
    	tp->undo_retrans = tp->retrans_out;
    
    	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
    		if (!ece_ack)
    			tp->prior_ssthresh = tcp_current_ssthresh(sk);
    
    		tcp_init_cwnd_reduction(sk, true);
    
    	}
    	tcp_set_ca_state(sk, TCP_CA_Recovery);
    }
    
    
    /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
     * recovered or spurious. Otherwise retransmits more on partial ACKs.
     */
    
    static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
    
    {
    	struct inet_connection_sock *icsk = inet_csk(sk);
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	bool recovered = !before(tp->snd_una, tp->high_seq);
    
    	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
    		if (flag & FLAG_ORIG_SACK_ACKED) {
    			/* Step 3.b. A timeout is spurious if not all data are
    			 * lost, i.e., never-retransmitted data are (s)acked.
    			 */
    			tcp_try_undo_loss(sk, true);
    			return;
    		}
    		if (after(tp->snd_nxt, tp->high_seq) &&
    		    (flag & FLAG_DATA_SACKED || is_dupack)) {
    			tp->frto = 0; /* Loss was real: 2nd part of step 3.a */
    		} else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) {
    			tp->high_seq = tp->snd_nxt;
    			__tcp_push_pending_frames(sk, tcp_current_mss(sk),
    						  TCP_NAGLE_OFF);
    			if (after(tp->snd_nxt, tp->high_seq))
    				return; /* Step 2.b */
    			tp->frto = 0;
    		}
    	}
    
    	if (recovered) {
    		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
    
    		icsk->icsk_retransmits = 0;
    		tcp_try_undo_recovery(sk);
    		return;
    	}
    	if (flag & FLAG_DATA_ACKED)
    		icsk->icsk_retransmits = 0;
    
    	if (tcp_is_reno(tp)) {
    		/* A Reno DUPACK means new data in F-RTO step 2.b above are
    		 * delivered. Lower inflight to clock out (re)transmissions.
    		 */
    		if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
    			tcp_add_reno_sack(sk);
    		else if (flag & FLAG_SND_UNA_ADVANCED)
    			tcp_reset_reno_sack(tp);
    	}
    	if (tcp_try_undo_loss(sk, false))
    
    		return;
    	tcp_xmit_retransmit_queue(sk);
    }
    
    
    /* Undo during fast recovery after partial ACK. */
    
    static bool tcp_try_undo_partial(struct sock *sk, const int acked,
    				 const int prior_unsacked)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    
    	if (tp->undo_marker && tcp_packet_delayed(tp)) {
    
    		/* Plain luck! Hole is filled with a delayed
    		 * packet, rather than with a retransmit.
    		 */
    
    		tcp_update_reordering(sk, tcp_fackets_out(tp) + acked, 1);
    
    		/* We are getting evidence that the reordering degree is higher
    		 * than we realized. If there are no retransmits out then we
    		 * can undo. Otherwise we clock out new packets but do not
    		 * mark more packets lost or retransmit more.
    		 */
    		if (tp->retrans_out) {
    			tcp_cwnd_reduction(sk, prior_unsacked, 0);
    			return true;
    		}
    
    
    		if (!tcp_any_retrans_done(sk))
    			tp->retrans_stamp = 0;
    
    
    		DBGUNDO(sk, "partial recovery");
    		tcp_undo_cwnd_reduction(sk, true);
    
    		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPPARTIALUNDO);
    
    		tcp_try_keep_open(sk);
    		return true;
    	}
    	return false;
    }

    /* Process an event that can update packets-in-flight in a non-trivial way.
     * The main goal of this function is to calculate a new estimate for left_out,
     * taking into account both packets sitting in the receiver's buffer and
     * packets lost by the network.
     *
     * Besides that, it reduces CWND when packet loss is detected
     * and changes the state of the machine.
     *
     * It does _not_ decide what to send; that is done in
     * tcp_xmit_retransmit_queue().
     */
    
    static void tcp_fastretrans_alert(struct sock *sk, const int acked,
    				  const int prior_unsacked,
    
    				  bool is_dupack, int flag)
    
    {
    
    	struct inet_connection_sock *icsk = inet_csk(sk);
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	bool do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
    
    				    (tcp_fackets_out(tp) > tp->reordering));
    
    	int fast_rexmit = 0;
    
    
    
    	if (WARN_ON(!tp->packets_out && tp->sacked_out))
    
    		tp->sacked_out = 0;
    
    	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
    
    		tp->fackets_out = 0;
    
    
    	/* Now state machine starts.
    	 * A. ECE, hence prohibit cwnd undoing, the reduction is required. */
    
    	if (flag & FLAG_ECE)
    
    		tp->prior_ssthresh = 0;
    
    	/* B. In all the states check for reneging SACKs. */
    
    	if (tcp_check_sack_reneging(sk, flag))
    
    		return;
    
    
    	/* C. Check consistency of the current state. */
    
    
    
    	/* D. Check state exit conditions. State can be terminated
    	 *    when high_seq is ACKed. */
    
    	if (icsk->icsk_ca_state == TCP_CA_Open) {
    
    		WARN_ON(tp->retrans_out != 0);
    
    		tp->retrans_stamp = 0;
    	} else if (!before(tp->snd_una, tp->high_seq)) {
    
    		switch (icsk->icsk_ca_state) {
    
    		case TCP_CA_CWR:
    			/* CWR is to be held until something *above* high_seq
    			 * is ACKed, so that the CWR bit reaches the receiver. */
    			if (tp->snd_una != tp->high_seq) {
    
    				tcp_end_cwnd_reduction(sk);
    
    				tcp_set_ca_state(sk, TCP_CA_Open);
    
    			}
    			break;
    
    		case TCP_CA_Recovery:
    
    Linus Torvalds's avatar
    Linus Torvalds committed
    				tcp_reset_reno_sack(tp);
    
    			if (tcp_try_undo_recovery(sk))
    
    				return;
    
    			tcp_end_cwnd_reduction(sk);
    			break;
    		}
    	}
    
    	/* E. Process state. */
    
    	switch (icsk->icsk_ca_state) {
    
    	case TCP_CA_Recovery:
    
    		if (!(flag & FLAG_SND_UNA_ADVANCED)) {
    
    			if (tcp_is_reno(tp) && is_dupack)
    
    		} else {
    			if (tcp_try_undo_partial(sk, acked, prior_unsacked))
    				return;
    			/* Partial ACK arrived. Force fast retransmit. */
    			do_lost = tcp_is_reno(tp) ||
    				  tcp_fackets_out(tp) > tp->reordering;
    		}
    
    		if (tcp_try_undo_dsack(sk)) {
    			tcp_try_keep_open(sk);
    			return;
    		}
    
    		break;
    	case TCP_CA_Loss:
    
    		tcp_process_loss(sk, flag, is_dupack);
    
    		if (icsk->icsk_ca_state != TCP_CA_Open)
    
    			return;
    
    		/* Fall through to processing in Open state. */
    
    	default:
    
    			if (flag & FLAG_SND_UNA_ADVANCED)
    
    				tcp_reset_reno_sack(tp);
    			if (is_dupack)
    
    		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
    
    		if (!tcp_time_to_recover(sk, flag)) {
    
    			tcp_try_to_open(sk, flag, prior_unsacked);
    
    			return;
    		}
    
    
    		/* MTU probe failure: don't reduce cwnd */
    		if (icsk->icsk_ca_state < TCP_CA_CWR &&
    		    icsk->icsk_mtup.probe_size &&
    
    		    tp->snd_una == tp->mtu_probe.probe_seq_start) {
    
    			tcp_mtup_probe_failed(sk);
    			/* Restores the reduction we did in tcp_mtup_probe() */
    			tp->snd_cwnd++;
    			tcp_simple_retransmit(sk);
    			return;
    		}
    
    
    		/* Otherwise enter Recovery state */
    
    		tcp_enter_recovery(sk, (flag & FLAG_ECE));
    
    		tcp_update_scoreboard(sk, fast_rexmit);
    
    	tcp_cwnd_reduction(sk, prior_unsacked, fast_rexmit);
    
    	tcp_xmit_retransmit_queue(sk);
    }
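
The exit test in step D, `!before(tp->snd_una, tp->high_seq)`, compares 32-bit sequence numbers that wrap around. The following is a minimal sketch of such a comparison, assuming the common signed-difference technique. The names `seq_before`, `seq_after` and `reached_high_seq` are made up for this illustration and are not the helpers used by the file itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 precedes seq2 in 32-bit sequence space.
 * The unsigned subtraction wraps modulo 2^32; reading the result as a
 * signed value makes it negative when seq1 is behind seq2 by less than
 * 2^31. (The cast assumes two's-complement conversion, which holds on
 * common platforms.)
 */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* True if seq1 follows seq2 in 32-bit sequence space. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return seq_before(seq2, seq1);
}

/* Same shape as the step D test: has the ACK point reached high_seq? */
static bool reached_high_seq(uint32_t snd_una, uint32_t high_seq)
{
	return !seq_before(snd_una, high_seq);
}
```

With these definitions, `reached_high_seq(5, 0xFFFFFFFB)` is true even though 5 is numerically smaller, because the sequence space wrapped between the two values.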
    
    
    static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
    				      s32 seq_rtt, s32 sack_rtt)
    {
    	const struct tcp_sock *tp = tcp_sk(sk);
    
    	/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
    	 * broken middle-boxes or peers may corrupt TS-ECR fields. But
    	 * Karn's algorithm forbids taking RTT if some retransmitted data
    	 * is acked (RFC6298).
    	 */
    	if (flag & FLAG_RETRANS_DATA_ACKED)
    		seq_rtt = -1;
    
    	if (seq_rtt < 0)
    		seq_rtt = sack_rtt;
    
    
    	/* RTTM Rule: A TSecr value received in a segment is used to
    	 * update the averaged RTT measurement only if the segment
    	 * acknowledges some new data, i.e., only if it advances the
    	 * left edge of the send window.
    	 * See draft-ietf-tcplw-high-performance-00, section 3.3.
    	 */
    
    	if (seq_rtt < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
    		seq_rtt = tcp_time_stamp - tp->rx_opt.rcv_tsecr;
    
    	if (seq_rtt < 0)
    
    		return false;
    
    	tcp_rtt_estimator(sk, seq_rtt);
    	tcp_set_rto(sk);
    
    	/* RFC6298: only reset backoff on valid RTT measurement. */
    	inet_csk(sk)->icsk_backoff = 0;
    
    	return true;
    }
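
The order in which tcp_ack_update_rtt() picks an RTT sample (ACK timing first, then SACK timing, then the echoed timestamp) can be sketched in isolation. This is a simplified stand-alone model, not kernel code: `struct rtt_inputs` and `pick_rtt_sample` are hypothetical names, and the fields only stand in for the flag and socket state read above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the values the selection logic looks at. */
struct rtt_inputs {
	bool retrans_data_acked; /* stands in for FLAG_RETRANS_DATA_ACKED */
	int32_t seq_rtt;         /* RTT from ACK timing, negative if none */
	int32_t sack_rtt;        /* RTT from SACK timing, negative if none */
	bool saw_tstamp;         /* timestamp option seen on this ACK */
	uint32_t rcv_tsecr;      /* echoed timestamp value, 0 if absent */
	uint32_t now;            /* current value of the timestamp clock */
};

/* Returns the chosen RTT sample, or a negative value if none is usable. */
static int32_t pick_rtt_sample(const struct rtt_inputs *in)
{
	int32_t rtt = in->seq_rtt;

	/* Karn's algorithm: ignore ACK timing if retransmitted data was acked. */
	if (in->retrans_data_acked)
		rtt = -1;

	/* Fall back to the SACK-based measurement. */
	if (rtt < 0)
		rtt = in->sack_rtt;

	/* Last resort: the echoed timestamp, if one was received. */
	if (rtt < 0 && in->saw_tstamp && in->rcv_tsecr)
		rtt = (int32_t)(in->now - in->rcv_tsecr);

	return rtt;
}
```

A caller would treat a negative result the way the function above does: skip the estimator update and leave the backoff untouched.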
    
    /* Compute time elapsed between (last) SYNACK and the ACK completing 3WHS. */
    static void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    	s32 seq_rtt = -1;
    
    	if (tp->lsndtime && !tp->total_retrans)
    		seq_rtt = tcp_time_stamp - tp->lsndtime;
    
    	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, seq_rtt, -1);
    }
    
    static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
    
    {
    
    	const struct inet_connection_sock *icsk = inet_csk(sk);
    
    	icsk->icsk_ca_ops->cong_avoid(sk, ack, in_flight);
    
    	tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
    
    }
    
    /* Restart timer after forward progress on connection.
     * RFC2988 recommends restarting the timer to now+rto.
     */
    
    void tcp_rearm_rto(struct sock *sk)
    
    {
    
    	const struct inet_connection_sock *icsk = inet_csk(sk);
    
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	/* If the retrans timer is currently being used by Fast Open
    	 * for SYN-ACK retrans purpose, stay put.
    	 */
    	if (tp->fastopen_rsk)
    		return;
    
    
    	if (!tp->packets_out) {
    
    		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
    
    	} else {
    
    		u32 rto = inet_csk(sk)->icsk_rto;
    		/* Offset the time elapsed after installing regular RTO */
    
    		if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
    		    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
    
    			struct sk_buff *skb = tcp_write_queue_head(sk);
    			const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
    			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
    			/* delta may not be positive if the socket is locked
    			 * when the retrans timer fires and is rescheduled.
    			 */
    			if (delta > 0)
    				rto = delta;
    		}
    		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
    					  TCP_RTO_MAX);
    
    	}
    
    }
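
The timeout adjustment in tcp_rearm_rto() can be modelled on its own: when an early-retransmit or loss-probe timer is pending, the RTO is shortened by the time already elapsed since the head segment was sent, unless that deadline has already passed. A minimal sketch under those assumptions follows; `rearm_timeout` and its parameters are illustrative names, not part of the file.

```c
#include <stdint.h>

/* sent_at: timestamp of the head-of-queue segment (its "when" value)
 * rto:     current retransmission timeout, in the same clock units
 * now:     current timestamp
 *
 * Returns the timeout to program. Unsigned arithmetic wraps modulo
 * 2^32, so the signed difference stays correct across clock wraparound.
 */
static uint32_t rearm_timeout(uint32_t sent_at, uint32_t rto, uint32_t now)
{
	int32_t delta = (int32_t)(sent_at + rto - now);

	/* A non-positive delta means the deadline already passed;
	 * fall back to the full RTO, as the code above does.
	 */
	return delta > 0 ? (uint32_t)delta : rto;
}
```

For example, a segment sent at tick 1000 with an RTO of 200 and the clock at 1050 gives a remaining timeout of 150 ticks.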
    
    /* This function is called when the delayed ER timer fires. TCP enters
     * fast recovery and performs fast-retransmit.
     */
    void tcp_resume_early_retransmit(struct sock *sk)
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	tcp_rearm_rto(sk);
    
    	/* Stop if ER is disabled after the delayed ER timer is scheduled */
    	if (!tp->do_early_retrans)
    		return;
    
    	tcp_enter_recovery(sk, false);
    	tcp_update_scoreboard(sk, 1);
    	tcp_xmit_retransmit_queue(sk);
    }
    
    /* If we get here, the whole TSO packet has not been acked. */
    
    static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
    
    	packets_acked = tcp_skb_pcount(skb);
    
    	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
    
    		return 0;
    	packets_acked -= tcp_skb_pcount(skb);
    
    	if (packets_acked) {
    		BUG_ON(tcp_skb_pcount(skb) == 0);
    
    		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
    
    /* Remove acknowledged frames from the retransmission queue. If our packet
     * is before the ack sequence we can discard it as it's confirmed to have
     * arrived at the other end.
     */
    
    static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
    
    			       u32 prior_snd_una, s32 sack_rtt)
    
    {
    	struct tcp_sock *tp = tcp_sk(sk);
    
    	const struct inet_connection_sock *icsk = inet_csk(sk);
    
    	struct sk_buff *skb;
    
    	u32 now = tcp_time_stamp;
    
    	int fully_acked = true;
    
    	u32 reord = tp->packets_out;
    
    	u32 prior_sacked = tp->sacked_out;
    
    	ktime_t last_ackt = net_invalid_timestamp();
    
    	while ((skb = tcp_write_queue_head(sk)) && skb != tcp_send_head(sk)) {
    
    		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
    
    		u8 sacked = scb->sacked;
    
    		/* Determine how many packets and what bytes were acked, tso and else */
    
    		if (after(scb->end_seq, tp->snd_una)) {
    
    			if (tcp_skb_pcount(skb) == 1 ||
    			    !after(tp->snd_una, scb->seq))
    				break;
    
    
    			acked_pcount = tcp_tso_acked(sk, skb);
    			if (!acked_pcount)
    
    			fully_acked = false;
    
    			acked_pcount = tcp_skb_pcount(skb);
    
    		}