* A. Scoreboard estimator decided the packet is lost.
* A'. Reno "three dupacks" marks the head of the queue lost.
* A''. Its FACK modification: the head up to snd.fack is lost.
* B. SACK arrives sacking data transmitted after a never-retransmitted
* hole was sent out.
* C. SACK arrives sacking SND.NXT at the moment when the
* segment was retransmitted.
* 4. D-SACK added a new rule: D-SACK changes any tag to S.
*
* It is pleasant to note that the state diagram turns out to be commutative,
* so we need not worry about the order of our actions when multiple
* events arrive simultaneously (see the function below).
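*
* For illustration (the numbers here are assumed, not taken from the code):
* suppose the hole 1000:1500 was never retransmitted and a SACK arrives
* covering 3000:3500, i.e. data sent after the hole; event B then marks
* 1000:1500 with L. If 1000:1500 had instead been retransmitted and a later
* SACK covers what SND.NXT was at the moment of that retransmission, event C
* applies and the retransmission itself is considered lost.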
*
* Reordering detection.
* --------------------
* The reordering metric is the maximal distance by which a packet can be
* displaced in the packet stream. With SACKs we can estimate it:
*
* 1. A SACK fills an old hole and the corresponding segment was never
* retransmitted -> reordering. Alas, we cannot use it
* when the segment was retransmitted.
* 2. The last flaw is solved with D-SACK. A D-SACK arrives
* for a retransmitted and already SACKed segment -> reordering.
* Neither heuristic is used in the Loss state, when we cannot
* account for retransmits accurately.
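*
* For illustration (assumed numbers): segments 5..8 are already SACKed
* while segment 4 is not; a SACK then arrives for segment 4, which was
* never retransmitted. The hole was filled late, so segment 4 was
* displaced by several packets in the stream and the reordering metric
* is raised to at least that distance.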
*
* SACK block validation.
* ----------------------
*
* SACK block range validation checks that the received SACK block fits
* the expected sequence limits, i.e., it is between SND.UNA and SND.NXT.
* Note that SND.UNA is not included in the range even though it would be
* valid, because it means the receiver is rather inconsistent with itself:
* it reports SACK reneging when it should advance SND.UNA. Such a SACK
* block is nevertheless perfectly valid in light of RFC2018, which
* explicitly states that "SACK block MUST reflect the newest segment.
* Even if the newest segment is going to be discarded ...", not that it
* looks very clever in case of the head skb. Due to potential
* receiver-driven attacks, we choose not to walk the write queue
* immediately on reneging and instead defer the head skb's loss recovery
* to the standard loss recovery procedure that will eventually trigger
* (nothing forbids us doing this).
*
* Also implements blockage of start_seq wrap-around. The problem lies in
* the fact that although start_seq (s) is before end_seq (i.e., not
* reversed), there is no guarantee that it will be before snd_nxt (n).
* The problem occurs when start_seq resides between the end_seq wrap
* (e_w) and the snd_nxt wrap (n_w):
*
*         <- outs wnd -><- wrapzone ->
*         u     e      n                         u_w   e_w  s n_w
*         |     |      |                          |     |   |  |
* |<------------+------+----- TCP seqno space --------------+---------->|
* ...-- <2^31 ->|                                           |<--------...
* ...---- >2^31 ------>|                                    |<--------...
*
* The current code would not be vulnerable, but it is still better to
* discard such crazy SACK blocks. Doing this check for start_seq alone
* also closes the somewhat similar case (end_seq after the snd_nxt wrap),
* since the earlier reversed check in the snd_nxt wrap -> snd_una region
* then becomes "well defined", i.e., equal to the ideal case (an infinite
* seqno space without wrap-caused issues).
*
* With D-SACK the lower bound is extended to cover the sequence space below
* SND.UNA down to undo_marker, which is the last point of interest. Yet
* again, a D-SACK block must not cross snd_una (for the same reason as
* for normal SACK blocks, explained above). But there all simplicity
* ends: TCP might receive valid D-SACKs below that. As long as they reside
* fully below undo_marker they do not affect behavior in any way and can
* therefore be safely ignored. In rare (more or less theoretical) cases,
* the D-SACK will nicely cross that boundary due to skb fragmentation and
* packet reordering past the skb's retransmission. To handle them
* correctly, the acceptable range must be extended even further, though
* the exact amount is rather hard to quantify. However, tp->max_window can
* be used as an exaggerated estimate.
*/
static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
u32 start_seq, u32 end_seq)
{
/* Too far in future, or reversed (interpretation is ambiguous) */
if (after(end_seq, tp->snd_nxt) || !before(start_seq, end_seq))
return 0;
/* Nasty start_seq wrap-around check (see comments above) */
if (!before(start_seq, tp->snd_nxt))
return 0;
/* In outstanding window? ...This is valid exit for D-SACKs too.
* start_seq == snd_una is non-sensical (see comments above)
*/
if (after(start_seq, tp->snd_una))
return 1;
if (!is_dsack || !tp->undo_marker)
return 0;
/* ...Then it's D-SACK, and must reside below snd_una completely */
if (after(end_seq, tp->snd_una))
return 0;
if (!before(start_seq, tp->undo_marker))
return 1;
/* Too old */
if (!after(end_seq, tp->undo_marker))
return 0;
/* Undo_marker boundary crossing (overestimates a lot). Known already:
* start_seq < undo_marker and end_seq >= undo_marker.
*/
return !before(start_seq, end_seq - tp->max_window);
}
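/* Illustrative aside (not part of the original code): the before()/after()
 * comparisons used above work on the 32-bit difference interpreted as a
 * signed value, which is exactly why the wrap-zone discussion in the big
 * comment is needed. A minimal standalone sketch, using an assumed helper
 * name so it does not clash with the real before() from net/tcp.h:
 */
static inline int example_seq_before(u32 a, u32 b)
{
	/* True when a precedes b in circular 32-bit sequence space,
	 * i.e. b is ahead of a by no more than 2^31.
	 */
	return (s32)(a - b) < 0;
}
/* E.g. example_seq_before(0xfffffff0, 0x10) is true: 0x10 is only 0x20
 * ahead of 0xfffffff0 despite being numerically smaller. Once two values
 * are about 2^31 apart the comparison flips, which is the "wrapzone"
 * case in the diagram above.
 */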
/* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
* Event "C". Later note: FACK people cheated me again 8), we have to account
* for reordering! Ugly, but should help.
*
* Search retransmitted skbs from the write_queue that were sent when snd_nxt
* was less than what is now known to be received by the other end (derived
* from the highest SACK block). Also calculate the lowest snd_nxt among the
* remaining retransmitted skbs to avoid some costly processing per ACK.
*/
static int tcp_mark_lost_retrans(struct sock *sk)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int cnt = 0;
int flag = 0;
u32 new_low_seq = tp->snd_nxt;
u32 received_upto = TCP_SKB_CB(tp->highest_sack)->end_seq;
if (!tcp_is_fack(tp) || !tp->retrans_out ||
!after(received_upto, tp->lost_retrans_low) ||
icsk->icsk_ca_state != TCP_CA_Recovery)
return flag;
tcp_for_write_queue(skb, sk) {
u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
if (skb == tcp_send_head(sk))
break;
if (cnt == tp->retrans_out)
break;
if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
continue;
if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS))
continue;
if (after(received_upto, ack_seq) &&
(tcp_is_fack(tp) ||
!before(received_upto,
ack_seq + tp->reordering * tp->mss_cache))) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
tp->retrans_out -= tcp_skb_pcount(skb);
/* clear lost hint */
tp->retransmit_skb_hint = NULL;
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tp->lost_out += tcp_skb_pcount(skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
flag |= FLAG_DATA_SACKED;
NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
}
} else {
if (before(ack_seq, new_low_seq))
new_low_seq = ack_seq;
cnt += tcp_skb_pcount(skb);
}
}
if (tp->retrans_out)
tp->lost_retrans_low = new_low_seq;
return flag;
}
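/* Worked example for the check above (assumed numbers): with FACK enabled,
* consider a retransmitted skb whose ack_seq (which for a retransmitted skb
* records roughly where snd_nxt stood when it was retransmitted) is 5000.
* As soon as the highest SACKed end_seq (received_upto) moves past 5000,
* the receiver has acknowledged data sent after the retransmission, so the
* retransmission itself must have been lost and the skb is re-marked LOST.
*/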
static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
struct tcp_sack_block_wire *sp, int num_sacks,
u32 prior_snd_una)
{
u32 start_seq_0 = ntohl(get_unaligned(&sp[0].start_seq));
u32 end_seq_0 = ntohl(get_unaligned(&sp[0].end_seq));
int dup_sack = 0;
if (before(start_seq_0, TCP_SKB_CB(ack_skb)->ack_seq)) {
dup_sack = 1;
tcp_dsack_seen(tp);
NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV);
} else if (num_sacks > 1) {
u32 end_seq_1 = ntohl(get_unaligned(&sp[1].end_seq));
u32 start_seq_1 = ntohl(get_unaligned(&sp[1].start_seq));
if (!after(end_seq_0, end_seq_1) &&
!before(start_seq_0, start_seq_1)) {
dup_sack = 1;
tcp_dsack_seen(tp);
NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV);
}
}
/* D-SACK for already forgotten data... Do dumb counting. */
if (dup_sack &&
!after(end_seq_0, prior_snd_una) &&
after(end_seq_0, tp->undo_marker))
tp->undo_retrans--;
return dup_sack;
}
/* Check if skb is fully within the SACK block. In the presence of GSO skbs,
* the incoming SACK may not exactly match, but we can find a smaller MSS
* aligned portion of it that matches. Therefore we might need to fragment,
* which may fail and create some hassle (the caller must handle the error
* case returns).
*/
static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
u32 start_seq, u32 end_seq)
{
int in_sack, err;
unsigned int pkt_len;
in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
!before(end_seq, TCP_SKB_CB(skb)->end_seq);
if (tcp_skb_pcount(skb) > 1 && !in_sack &&
after(TCP_SKB_CB(skb)->end_seq, start_seq)) {
in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq);
if (!in_sack)
pkt_len = start_seq - TCP_SKB_CB(skb)->seq;
else
pkt_len = end_seq - TCP_SKB_CB(skb)->seq;
err = tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)->gso_size);
if (err < 0)
return err;
}
return in_sack;
}
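/* Worked example (assumed numbers): a GSO skb covers 1000:4000 and the SACK
* block is 1000:2500. The skb is neither fully covered nor disjoint, and
* start_seq does not lie past the skb's seq, so in_sack becomes 1 and
* pkt_len = 2500 - 1000 = 1500; tcp_fragment() then splits the skb so that
* the covered head can be tagged on its own. Had the SACK block instead been
* 2500:4000, in_sack would stay 0 and pkt_len = 2500 - 1000 = 1500 would
* split off the uncovered head.
*/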
static int tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp,
int *reord, int dup_sack, int fack_count)
{
u8 sacked = TCP_SKB_CB(skb)->sacked;
int flag = 0;
/* Account D-SACK for retransmitted packet. */
if (dup_sack && (sacked & TCPCB_RETRANS)) {
if (after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
tp->undo_retrans--;
if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una) &&
(sacked & TCPCB_SACKED_ACKED))
*reord = min(fack_count, *reord);
}
/* Nothing to do; acked frame is about to be dropped (was ACKed). */
if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
return flag;
if (!(sacked & TCPCB_SACKED_ACKED)) {
if (sacked & TCPCB_SACKED_RETRANS) {
/* If the segment is not tagged as lost,
* we do not clear RETRANS, believing
* that retransmission is still in flight.
*/
if (sacked & TCPCB_LOST) {
TCP_SKB_CB(skb)->sacked &=
~(TCPCB_LOST|TCPCB_SACKED_RETRANS);
tp->lost_out -= tcp_skb_pcount(skb);
tp->retrans_out -= tcp_skb_pcount(skb);
/* clear lost hint */
tp->retransmit_skb_hint = NULL;
}
} else {
if (!(sacked & TCPCB_RETRANS)) {
/* New SACK for a frame that was never retransmitted
* and was sitting in a hole: this is reordering.
*/
if (before(TCP_SKB_CB(skb)->seq,
tcp_highest_sack_seq(tp)))
*reord = min(fack_count, *reord);
/* SACK enhanced F-RTO (RFC4138; Appendix B) */
if (!after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark))
flag |= FLAG_ONLY_ORIG_SACKED;
}
if (sacked & TCPCB_LOST) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
tp->lost_out -= tcp_skb_pcount(skb);
/* clear lost hint */
tp->retransmit_skb_hint = NULL;
}
}
TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
flag |= FLAG_DATA_SACKED;
tp->sacked_out += tcp_skb_pcount(skb);
fack_count += tcp_skb_pcount(skb);
/* Lost marker hint past SACKed? Tweak RFC3517 cnt */
if (!tcp_is_fack(tp) && (tp->lost_skb_hint != NULL) &&
before(TCP_SKB_CB(skb)->seq,
TCP_SKB_CB(tp->lost_skb_hint)->seq))
tp->lost_cnt_hint += tcp_skb_pcount(skb);
if (fack_count > tp->fackets_out)
tp->fackets_out = fack_count;
if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
tp->highest_sack = skb;
} else {
if (dup_sack && (sacked & TCPCB_RETRANS))
*reord = min(fack_count, *reord);
}
/* D-SACK. We can detect redundant retransmission in S|R and plain R
* frames and clear it. undo_retrans is decreased above, L|R frames
* are accounted above as well.
*/
if (dup_sack && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
tp->retrans_out -= tcp_skb_pcount(skb);
tp->retransmit_skb_hint = NULL;
}
return flag;
}
static int
tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
unsigned char *ptr = (skb_transport_header(ack_skb) +
TCP_SKB_CB(ack_skb)->sacked);
struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
struct tcp_sack_block sp[4];
struct sk_buff *cached_skb;
struct sk_buff *skb;
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE) >> 3;
int used_sacks;
int reord = tp->packets_out;
int flag = 0;
int found_dup_sack = 0;
int cached_fack_count;
int fack_count;
int i;
int first_sack_index;
int force_one_sack;
if (!tp->sacked_out) {
if (WARN_ON(tp->fackets_out))
tp->fackets_out = 0;
tp->highest_sack = tcp_write_queue_head(sk);
}
found_dup_sack = tcp_check_dsack(tp, ack_skb, sp_wire,
num_sacks, prior_snd_una);
if (found_dup_sack)
flag |= FLAG_DSACKING_ACK;
/* Eliminate too old ACKs, but take into
* account more or less fresh ones, they can
* contain valid SACK info.
*/
if (before(TCP_SKB_CB(ack_skb)->ack_seq, prior_snd_una - tp->max_window))
return 0;
if (!tp->packets_out)
goto out;
used_sacks = 0;
first_sack_index = 0;
for (i = 0; i < num_sacks; i++) {
int dup_sack = !i && found_dup_sack;
sp[used_sacks].start_seq = ntohl(get_unaligned(&sp_wire[i].start_seq));
sp[used_sacks].end_seq = ntohl(get_unaligned(&sp_wire[i].end_seq));
if (!tcp_is_sackblock_valid(tp, dup_sack,
sp[used_sacks].start_seq,
sp[used_sacks].end_seq)) {
if (dup_sack) {
if (!tp->undo_marker)
NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
else
NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
} else {
/* Don't count olds caused by ACK reordering */
if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) &&
!after(sp[used_sacks].end_seq, tp->snd_una))
continue;
NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
}
if (i == 0)
first_sack_index = -1;
continue;
}
/* Ignore very old stuff early */
if (!after(sp[used_sacks].end_seq, prior_snd_una))
continue;
used_sacks++;
}
/* SACK fastpath:
* if the only SACK change is the increase of the end_seq of
* the first block then only apply that SACK block
* and use retrans queue hinting otherwise slowpath */
force_one_sack = 1;
for (i = 0; i < used_sacks; i++) {
u32 start_seq = sp[i].start_seq;
u32 end_seq = sp[i].end_seq;
if (i == 0) {
if (tp->recv_sack_cache[i].start_seq != start_seq)
force_one_sack = 0;
} else {
if ((tp->recv_sack_cache[i].start_seq != start_seq) ||
(tp->recv_sack_cache[i].end_seq != end_seq))
force_one_sack = 0;
}
tp->recv_sack_cache[i].start_seq = start_seq;
tp->recv_sack_cache[i].end_seq = end_seq;
}
/* Clear the rest of the cache sack blocks so they won't match mistakenly. */
for (; i < ARRAY_SIZE(tp->recv_sack_cache); i++) {
tp->recv_sack_cache[i].start_seq = 0;
tp->recv_sack_cache[i].end_seq = 0;
}
if (force_one_sack)
used_sacks = 1;
else {
int j;
tp->fastpath_skb_hint = NULL;
/* order SACK blocks to allow in order walk of the retrans queue */
for (i = used_sacks - 1; i > 0; i--) {
for (j = 0; j < i; j++) {
if (after(sp[j].start_seq, sp[j+1].start_seq)) {
struct tcp_sack_block tmp;
tmp = sp[j];
sp[j] = sp[j+1];
sp[j+1] = tmp;
/* Track where the first SACK block goes to */
if (j == first_sack_index)
first_sack_index = j+1;
}
}
}
}
/* Use SACK fastpath hint if valid */
cached_skb = tp->fastpath_skb_hint;
cached_fack_count = tp->fastpath_cnt_hint;
if (!cached_skb) {
cached_skb = tcp_write_queue_head(sk);
cached_fack_count = 0;
}
for (i = 0; i < used_sacks; i++) {
u32 start_seq = sp[i].start_seq;
u32 end_seq = sp[i].end_seq;
int dup_sack = (found_dup_sack && (i == first_sack_index));
int next_dup = (found_dup_sack && (i+1 == first_sack_index));
skb = cached_skb;
fack_count = cached_fack_count;
/* Event "B" in the comment above. */
if (after(end_seq, tp->high_seq))
flag |= FLAG_DATA_LOST;
tcp_for_write_queue_from(skb, sk) {
int in_sack = 0;
if (skb == tcp_send_head(sk))
break;
cached_skb = skb;
cached_fack_count = fack_count;
if (i == first_sack_index) {
tp->fastpath_skb_hint = skb;
tp->fastpath_cnt_hint = fack_count;
}
/* The retransmission queue is always in order, so
* we can short-circuit the walk early.
*/
if (!before(TCP_SKB_CB(skb)->seq, end_seq))
break;
dup_sack = (found_dup_sack && (i == first_sack_index));
/* Due to sorting DSACK may reside within this SACK block! */
if (next_dup) {
u32 dup_start = sp[i+1].start_seq;
u32 dup_end = sp[i+1].end_seq;
if (before(TCP_SKB_CB(skb)->seq, dup_end)) {
in_sack = tcp_match_skb_to_sack(sk, skb, dup_start, dup_end);
if (in_sack > 0)
dup_sack = 1;
}
}
/* DSACK info lost if out-of-mem, try SACK still */
if (in_sack <= 0)
in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq);
if (unlikely(in_sack < 0))
break;
if (in_sack)
flag |= tcp_sacktag_one(skb, tp, &reord, dup_sack, fack_count);
fack_count += tcp_skb_pcount(skb);
}
/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
* due to in-order walk
*/
if (after(end_seq, tp->frto_highmark))
flag &= ~FLAG_ONLY_ORIG_SACKED;
}
flag |= tcp_mark_lost_retrans(sk);
tcp_verify_left_out(tp);
if ((reord < tp->fackets_out) &&
((icsk->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker) &&
(!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark)))
tcp_update_reordering(sk, tp->fackets_out - reord, 0);
out:
#if FASTRETRANS_DEBUG > 0
BUG_TRAP((int)tp->sacked_out >= 0);
BUG_TRAP((int)tp->lost_out >= 0);
BUG_TRAP((int)tp->retrans_out >= 0);
BUG_TRAP((int)tcp_packets_in_flight(tp) >= 0);
#endif
return flag;
}
/* If we receive more dupacks than we expected while counting segments
* under the assumption of no reordering, interpret this as reordering.
* The only other possible reason is a bug in the receiver TCP.
*/
static void tcp_check_reno_reordering(struct sock *sk, const int addend)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 holes;
holes = max(tp->lost_out, 1U);
holes = min(holes, tp->packets_out);
if ((tp->sacked_out + holes) > tp->packets_out) {
tp->sacked_out = tp->packets_out - holes;
tcp_update_reordering(sk, tp->packets_out + addend, 0);
}
}
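/* Worked example (assumed numbers): packets_out = 10 and lost_out = 0, so
* holes = max(0, 1) = 1. If bogus dupacks push sacked_out up to 10, then
* sacked_out + holes = 11 > packets_out, so sacked_out is clamped to 9 and
* the reordering metric is raised to packets_out + addend.
*/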
/* Emulate SACKs for SACKless connection: account for a new dupack. */
static void tcp_add_reno_sack(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
tp->sacked_out++;
tcp_check_reno_reordering(sk, 0);
tcp_verify_left_out(tp);
}
/* Account for ACK, ACKing some data in Reno Recovery phase. */
static void tcp_remove_reno_sacks(struct sock *sk, int acked)
{
struct tcp_sock *tp = tcp_sk(sk);
if (acked > 0) {
/* One ACK acked hole. The rest eat duplicate ACKs. */
if (acked-1 >= tp->sacked_out)
tp->sacked_out = 0;
else
tp->sacked_out -= acked-1;
}
tcp_check_reno_reordering(sk, acked);
tcp_verify_left_out(tp);
}
static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
{
tp->sacked_out = 0;
}
/* F-RTO can only be used if TCP has never retransmitted anything other than
* head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
*/
int tcp_use_frto(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
if (!sysctl_tcp_frto)
return 0;
/* Avoid expensive walking of rexmit queue if possible */
if (tp->retrans_out > 1)
return 0;
skb = tcp_write_queue_head(sk);
skb = tcp_write_queue_next(sk, skb); /* Skips head */
tcp_for_write_queue_from(skb, sk) {
if (skb == tcp_send_head(sk))
break;
if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS)
return 0;
/* Short-circuit when first non-SACKed skb has been checked */
if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED))
break;
}
return 1;
}
/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO
* recovery a bit and use heuristics in tcp_process_frto() to detect if
* the RTO was spurious. Only clear SACKED_RETRANS of the head here to
* keep retrans_out counting accurate (with SACK F-RTO, other than head
* may still have that bit set); TCPCB_LOST and remaining SACKED_RETRANS
* bits are handled if the Loss state is really to be entered (in
* tcp_enter_frto_loss).
*
* Do like tcp_enter_loss() would; when RTO expires the second time it
* does:
* "Reduce ssthresh if it has not yet been made inside this window."
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) ||
((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) &&
!icsk->icsk_retransmits)) {
tp->prior_ssthresh = tcp_current_ssthresh(sk);
/* Our state is too optimistic in ssthresh() call because cwnd
* is not reduced until tcp_enter_frto_loss() when previous F-RTO
* recovery has not yet completed. Pattern would be this: RTO,
* Cumulative ACK, RTO (2xRTO for the same segment does not end
* up here twice).
* RFC4138 should be more specific on what to do, even though
* RTO is quite unlikely to occur after the first Cumulative ACK
* due to back-off and complexity of triggering events ...
*/
if (tp->frto_counter) {
u32 stored_cwnd;
stored_cwnd = tp->snd_cwnd;
tp->snd_cwnd = 2;
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
tp->snd_cwnd = stored_cwnd;
} else {
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
}
/* ... in theory, a congestion control module could do "any tricks" in
* ssthresh(), which means that ca_state, the lost bits and the lost_out
* counter would have to be faked before the call occurs. We
* consider that too expensive, unlikely and hacky, so modules
* using these in ssthresh() must deal with these incompatibility
* issues themselves if they receive CA_EVENT_FRTO and frto_counter != 0.
*/
tcp_ca_event(sk, CA_EVENT_FRTO);
}
tp->undo_marker = tp->snd_una;
tp->undo_retrans = 0;
skb = tcp_write_queue_head(sk);
if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
tp->undo_marker = 0;
if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
tp->retrans_out -= tcp_skb_pcount(skb);
}
tcp_verify_left_out(tp);
/* Too bad if TCP was application limited */
tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1);
/* Earlier loss recovery underway (see RFC4138; Appendix B).
* The last condition is necessary at least in tp->frto_counter case.
*/
if (IsSackFrto() && (tp->frto_counter ||
((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) &&
after(tp->high_seq, tp->snd_una)) {
tp->frto_highmark = tp->high_seq;
} else {
tp->frto_highmark = tp->snd_nxt;
}
tcp_set_ca_state(sk, TCP_CA_Disorder);
tp->high_seq = tp->snd_nxt;
tp->frto_counter = 1;
}
/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO,
* which indicates that we should follow the traditional RTO recovery,
* i.e. mark everything lost and do go-back-N retransmission.
*/
static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
tp->lost_out = 0;
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
tcp_for_write_queue(skb, sk) {
if (skb == tcp_send_head(sk))
break;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
/*
* Count the retransmission made on RTO correctly (only when
* waiting for the first ACK and did not get it)...
*/
if ((tp->frto_counter == 1) && !(flag&FLAG_DATA_ACKED)) {
/* For some reason this R-bit might get cleared? */
if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)
tp->retrans_out += tcp_skb_pcount(skb);
/* ...enter this if branch just for the first segment */
flag |= FLAG_DATA_ACKED;
} else {
if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
tp->undo_marker = 0;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
}
/* Don't mark as lost skbs that were forward-transmitted after the RTO */
if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED) &&
!after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
}
}
tcp_verify_left_out(tp);
tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments;
tp->snd_cwnd_cnt = 0;
tp->snd_cwnd_stamp = tcp_time_stamp;
tp->frto_counter = 0;
tp->reordering = min_t(unsigned int, tp->reordering,
sysctl_tcp_reordering);
tcp_set_ca_state(sk, TCP_CA_Loss);
tp->high_seq = tp->frto_highmark;
TCP_ECN_queue_cwr(tp);
tcp_clear_retrans_hints_partial(tp);
}
static void tcp_clear_retrans_partial(struct tcp_sock *tp)
{
tp->retrans_out = 0;
tp->lost_out = 0;
tp->undo_marker = 0;
tp->undo_retrans = 0;
}
void tcp_clear_retrans(struct tcp_sock *tp)
{
tcp_clear_retrans_partial(tp);
tp->fackets_out = 0;
tp->sacked_out = 0;
}
/* Enter Loss state. If "how" is not zero, forget all SACK information
* and reset tags completely, otherwise preserve SACKs. If receiver
* dropped its ofo queue, we will know this due to reneging detection.
*/
void tcp_enter_loss(struct sock *sk, int how)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq ||
(icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
tp->prior_ssthresh = tcp_current_ssthresh(sk);
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
tcp_ca_event(sk, CA_EVENT_LOSS);
}
tp->snd_cwnd = 1;
tp->snd_cwnd_cnt = 0;
tp->snd_cwnd_stamp = tcp_time_stamp;
tcp_clear_retrans_partial(tp);
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
if (!how) {
/* Push undo marker, if it was plain RTO and nothing
* was retransmitted. */
tp->undo_marker = tp->snd_una;
tcp_clear_retrans_hints_partial(tp);
} else {
tp->sacked_out = 0;
tp->fackets_out = 0;
tcp_clear_all_retrans_hints(tp);
}
tcp_for_write_queue(skb, sk) {
if (skb == tcp_send_head(sk))
break;
if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS)
tp->undo_marker = 0;
TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED) || how) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
}
}
tcp_verify_left_out(tp);
tp->reordering = min_t(unsigned int, tp->reordering,
sysctl_tcp_reordering);
tcp_set_ca_state(sk, TCP_CA_Loss);
tp->high_seq = tp->snd_nxt;
TCP_ECN_queue_cwr(tp);
/* Abort F-RTO algorithm if one is in progress */
tp->frto_counter = 0;
}
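/* Usage note (partly an assumption about callers outside this excerpt): the
* RTO timer path calls tcp_enter_loss(sk, 0) so that remembered SACK
* information is preserved, while the SACK reneging path below calls
* tcp_enter_loss(sk, 1) to discard it, since the receiver evidently no
* longer holds the data it once SACKed.
*/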
static int tcp_check_sack_reneging(struct sock *sk)
{
struct sk_buff *skb;
/* If ACK arrived pointing to a remembered SACK,
* it means that our remembered SACKs do not reflect
* real state of receiver i.e.
* receiver _host_ is heavily congested (or buggy).
* Do processing similar to RTO timeout.
*/
if ((skb = tcp_write_queue_head(sk)) != NULL &&
(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
struct inet_connection_sock *icsk = inet_csk(sk);
NET_INC_STATS_BH(LINUX_MIB_TCPSACKRENEGING);
tcp_enter_loss(sk, 1);
icsk->icsk_retransmits++;
tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
icsk->icsk_rto, TCP_RTO_MAX);
return 1;
}
return 0;
}
static inline int tcp_fackets_out(struct tcp_sock *tp)
{
return tcp_is_reno(tp) ? tp->sacked_out+1 : tp->fackets_out;
}
/* Heuristics to calculate the number of duplicate ACKs. There's no dupACK
* counter when SACK is enabled (without SACK, sacked_out is used for
* that purpose).
*
* Instead, with FACK TCP uses fackets_out, which includes both the SACKed
* segments up to the highest received SACK block so far and the holes in
* between them.
*
* With reordering, holes may still be in flight, so RFC3517 recovery
* uses pure sacked_out (the total number of SACKed segments) even though
* it violates the RFC, which counts duplicate ACKs. The two are often
* equal, but when e.g. out-of-window ACKs or packet duplication occur,
* they differ. Since neither occurs due to loss, TCP should really
* ignore them.
*/
static inline int tcp_dupack_heurestics(struct tcp_sock *tp)
{
return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
}
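/* Worked example (assumed numbers): segments 1..10 are outstanding and
* segments 4, 7 and 9 have been SACKed. Then sacked_out = 3 while
* fackets_out = 9 (everything up to and including the highest SACKed
* segment). With FACK the heuristic returns 9, so the dupACK threshold
* trips much earlier; RFC3517-style accounting returns sacked_out + 1 = 4.
*/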
static inline int tcp_skb_timedout(struct sock *sk, struct sk_buff *skb)
{
return (tcp_time_stamp - TCP_SKB_CB(skb)->when > inet_csk(sk)->icsk_rto);
}
static inline int tcp_head_timedout(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
return tp->packets_out &&
tcp_skb_timedout(sk, tcp_write_queue_head(sk));
}
/* Linux NewReno/SACK/FACK/ECN state machine.
* --------------------------------------
*
* "Open" Normal state, no dubious events, fast path.
* "Disorder" In all the respects it is "Open",
* but requires a bit more attention. It is entered when
* we see some SACKs or dupacks. It is split of "Open"
* mainly to move some processing from fast path to slow one.
* "CWR" CWND was reduced due to some Congestion Notification event.
* It can be ECN, ICMP source quench, local device congestion.
* "Recovery" CWND was reduced, we are fast-retransmitting.
* "Loss" CWND was reduced due to RTO timeout or SACK reneging.
*
* tcp_fastretrans_alert() is entered:
* - each incoming ACK, if state is not "Open"
* - when the arriving ACK is unusual, namely:
* * SACK
* * Duplicate ACK.
* * ECN ECE.
*
* Counting packets in flight is pretty simple.
*
* in_flight = packets_out - left_out + retrans_out
*
* packets_out is SND.NXT-SND.UNA counted in packets.
*
* retrans_out is number of retransmitted segments.
*
* left_out is the number of segments that have left the network but are not yet ACKed.
*
* left_out = sacked_out + lost_out
*
* sacked_out: Packets which arrived at the receiver out of order
* and hence were not cumulatively ACKed. With SACKs this number
* is simply the amount of SACKed data. Even without SACKs
* it is easy to give a pretty reliable estimate of this number
* by counting duplicate ACKs.
*
* lost_out: Packets lost by the network. TCP has no explicit
* "loss notification" feedback from the network (for now).
* This means that this number can only be _guessed_.
* Actually, it is the heuristic used to predict loss that
* distinguishes the different algorithms.
*
* F.e. after RTO, when all the queue is considered as lost,
* lost_out = packets_out and in_flight = retrans_out.
*
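* For illustration (assumed numbers): packets_out = 100, sacked_out = 10,
* lost_out = 5 and retrans_out = 5 give left_out = 15 and
* in_flight = 100 - 15 + 5 = 90.
*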
* Essentially, we have now two algorithms counting
* lost packets.
*
* FACK: It is the simplest heuristics. As soon as we decided
* that something is lost, we decide that _all_ not SACKed
* packets until the most forward SACK are lost. I.e.
* lost_out = fackets_out - sacked_out and left_out = fackets_out.
* This is an absolutely correct estimate if the network does not reorder
* packets. But it loses any connection to reality when reordering
* takes place. We use FACK by default until reordering
* is suspected on the path to this destination.
*
* NewReno: when Recovery is entered, we assume that one segment
* is lost (classic Reno). While we are in Recovery and
* a partial ACK arrives, we assume that one more packet
* is lost (NewReno). These heuristics are the same in NewReno
* and SACK.
*
* Imagine, that's all! Forget about all this shamanism about CWND inflation
* deflation etc. CWND is real congestion window, never inflated, changes
* only according to classic VJ rules.
*
* The really tricky (and carefully tuned) part of the algorithm
* is hidden in the functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
* The first determines the moment _when_ we should reduce CWND and,
* hence, slow down forward transmission. In fact, it determines the moment
* when we decide that a hole is caused by loss, rather than by reordering.
*
* tcp_xmit_retransmit_queue() decides _what_ we should retransmit to fill
* holes caused by lost packets.
*
* And the most logically complicated part of algorithm is undo