One of the major constraints on actually running this thing is the
amount of memory used by tcpconn structures (ten million of them, at
100 bytes each, is about 1GB).

At some point in the future, I'd like to move from mallocing each one
individually to having a big array with the whole lot in.  Since we're
limited to about 10mil conns, this allows us to replace all of the
pointers with 24 bit offsets into the array.  Each of the eight
pointers shrinks from 4 bytes to 3, instantly saving us 8 bytes per
structure.  We also lose the malloc overhead, saving a further 8
bytes, and padding, which would be another 9 bytes, for a total saving
of 25 bytes per structure, or 250MB of memory.  Likewise, the buffer
areas could easily be identified by 24 bit indexes; 16 bits might be
enough in many cases.
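
A minimal sketch of the array-plus-offset idea, assuming a 3 byte
index type and a reserved "nil" index.  All of the names here (idx24,
conn_pool, CONN_NIL and so on) are made up for illustration and don't
exist in the current code, and the conn record is cut down to two
links.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; not taken from the existing code. */
#define CONN_NIL 0xFFFFFFu      /* reserved index meaning "no connection" */

typedef struct { uint8_t b[3]; } idx24;   /* 24 bit index in 3 bytes */

static idx24 idx24_pack(uint32_t i)
{
    idx24 r;
    assert(i <= CONN_NIL);
    r.b[0] = (uint8_t)(i & 0xFF);
    r.b[1] = (uint8_t)((i >> 8) & 0xFF);
    r.b[2] = (uint8_t)((i >> 16) & 0xFF);
    return r;
}

static uint32_t idx24_unpack(idx24 v)
{
    return (uint32_t)v.b[0] | ((uint32_t)v.b[1] << 8) |
           ((uint32_t)v.b[2] << 16);
}

/* Cut-down connection record: only two of the list links are shown. */
struct conn {
    idx24 hash_next;
    idx24 lru_next;
};

struct conn_pool {
    struct conn *slots;   /* one big allocation, not one malloc per conn */
    uint32_t nslots;
};

static int pool_init(struct conn_pool *p, uint32_t nslots)
{
    assert(nslots <= CONN_NIL);   /* CONN_NIL is never a valid slot */
    p->slots = calloc(nslots, sizeof *p->slots);
    p->nslots = nslots;
    return p->slots != NULL ? 0 : -1;
}

/* Turn an index back into a pointer; CONN_NIL maps to NULL. */
static struct conn *pool_get(struct conn_pool *p, idx24 v)
{
    uint32_t i = idx24_unpack(v);
    if (i == CONN_NIL)
        return NULL;
    assert(i < p->nslots);
    return &p->slots[i];
}
```

Because idx24 is a struct of bytes it has no alignment requirement, so
a run of them packs without padding, at the cost of a shift-and-or on
every access.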

Moving the flow flags from the flow structure to the tcpconn allows us
to eliminate some padding, saving another byte.

It might also be worthwhile keeping structures for different protocols
separate.  UDP flows waste eight bytes in stream ids, and portless
flows waste a further 4 bytes in port numbers.  Keeping separate pools
would also allow us to save a byte in the sid structure.  Non-TCPs
also lack a concept of comatose or aborted, which would allow us to
save a couple of bits, although it's not obvious that this is actually
useful.

So, the new, revised structure would, for TCP, look like this:

unsigned long saddr;		      /* 0 */
unsigned long daddr;		      /* 4 */
unsigned short sport;		      /* 8 */
unsigned short dport;		      /* 10 */
unsigned long s_fin_seq;	      /* 12 */
unsigned long d_fin_seq;	      /* 16 */
unsigned char s_sent_fin:1,	      /* 20 */
	      d_sent_fin:1,
	      s_acked_fin:1,
	      d_acked_fin:1,
	      s_has_packets:1,
	      d_has_packets:1,
	      s_is_client:1,
	      comatose:1;
unsigned aborted:1,                   /* 21 */
	 buffer_used:15;
unsigned fd:10,			      /* 23 */
	 buffer:22;
unsigned start_sec:24;		      /* 27 */
unsigned start_usec:24;               /* 30 */
unsigned last_dgram;		      /* 33 */

unsigned hash_prev:24;                /* 37 */
unsigned hash_next:24;                /* 40 */
unsigned lru_prev:24;		      /* 43 */
unsigned lru_next:24;		      /* 46 */
unsigned fd_prev:24;		      /* 49 */
unsigned fd_next:24;		      /* 52 */
unsigned buffer_prev:24;	      /* 55 */
unsigned buffer_next:24;	      /* 58 */

For a grand total of 61 bytes, and possibly the most horrific data
structure I've ever had the misfortune to use.

It's annoying that, at any given time, most connections won't have fds
or buffers, wasting 12 bytes in the fd and buffer linked lists.
buffer_used and buffer also take up 37 bits between them, plus 10 bits
for fd, all of which is usually wasted.

Replacing buffer and fd with indexes into another table which just
contains buffers and fds could save us a little bit:


unsigned long saddr;		      /* 0 */
unsigned long daddr;		      /* 4 */
unsigned short sport;		      /* 8 */
unsigned short dport;		      /* 10 */
unsigned long s_fin_seq;	      /* 12 */
unsigned long d_fin_seq;	      /* 16 */
unsigned char s_sent_fin:1,	      /* 20 */
	      d_sent_fin:1,
	      s_acked_fin:1,
	      d_acked_fin:1,
	      s_has_packets:1,
	      d_has_packets:1,
	      s_is_client:1,
	      comatose:1;
unsigned aborted:1,                   /* 21 */
	 fd:10,
	 buffer:21;
unsigned start_sec:24;		      /* 25 */
unsigned start_usec:24;               /* 28 */
unsigned last_dgram;		      /* 31 */

unsigned hash_prev:24;                /* 35 */
unsigned hash_next:24;                /* 38 */
unsigned lru_prev:24;		      /* 41 */
unsigned lru_next:24;		      /* 44 */

and 47 bytes per connection.  We also need to keep 8 bytes per buffer,
and 3 bytes per fd.  We're limited to 1024 fds and about 2 million
buffers, so that shouldn't be too much of a problem.

(In practice, the buffer count is unlikely to exceed a few thousand,
making it even less of a problem.)

Summary:

12 bytes for lists
12 bytes for sid
8 bytes for fin sequence numbers
10 bytes for timing info
21 bits for buffer index (could shave a few here; not obviously worth it)
10 bits for fd
9 bits of flags

There's still some redundancy here, since e.g. s_sent_fin implies
s_has_packets.  The s_* flags only actually have four available
states, despite being encoded onto three bits; likewise the d_* states.

State	Encoding (has, sent, acked)
0	0 0 0
1	1 1 0
2	1 0 1
3	1 1 1

Representing this state rather than the underlying three-tuple saves
one bit in each direction, or two bits in total.  comatose can clearly
be derived from this (it's just
(s_state == 2 || s_state == 3) && (d_state == 2 || d_state == 3)),
bringing the total saving to three bits.
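
As a sketch, the state table and the comatose test could be written
like this.  The rows are copied straight from the table above; the
names (dir_flags, decode_state, is_comatose) are my own invention.

```c
#include <assert.h>
#include <stdbool.h>

/* Flags recovered from a 2 bit per-direction state. */
struct dir_flags {
    bool has_packets;
    bool sent_fin;
    bool acked_fin;
};

/* Rows follow the state table in the text: (has, sent, acked). */
static const struct dir_flags state_table[4] = {
    { false, false, false },   /* state 0 */
    { true,  true,  false },   /* state 1 */
    { true,  false, true  },   /* state 2 */
    { true,  true,  true  },   /* state 3 */
};

static struct dir_flags decode_state(unsigned state)
{
    assert(state < 4);
    return state_table[state];
}

/* comatose is no longer stored; it falls out of the two states. */
static bool is_comatose(unsigned s_state, unsigned d_state)
{
    return (s_state == 2 || s_state == 3) &&
           (d_state == 2 || d_state == 3);
}
```

With this encoding is_comatose() is the same thing as "acked is set in
both directions", since states 2 and 3 are exactly the rows with the
acked bit set.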

The s_is_client bit is also slightly redundant: if s_state is 0, we
know that s_is_client will also be zero.  It isn't obvious that we can
do anything with this, however.  Likewise the observation that
(s_state == 0 && d_state == 0) -> aborted == 0.

We can save 10 bits by making last_dgram second granularity rather
than millisecond; it's only used to calculate the idle threshold, so
this is probably good enough.  That leaves last_dgram at 22 bits,
which covers about 48 days.  We can trim 2 bits from start_sec to
match by limiting trace lengths to 48 days.  start_usec only needs to
be 20 bits, for a further 4 bit saving.
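
The field widths here are easy to get wrong, so a quick sanity check
is worth having.  This is only a sketch with made-up helper names
(bits_needed, days_covered); it works out how many bits a given
maximum value needs and how many whole days a seconds counter of a
given width can span.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of bits that can hold every value in 0..max_value. */
static unsigned bits_needed(uint64_t max_value)
{
    unsigned bits = 0;
    while (max_value != 0) {
        bits++;
        max_value >>= 1;
    }
    return bits;
}

/* Whole days that fit in a seconds counter of the given width. */
static uint64_t days_covered(unsigned bits)
{
    assert(bits < 64);
    return (UINT64_C(1) << bits) / 86400;
}
```

bits_needed(999999) comes out at 20, which is the start_usec width,
and days_covered(22) comes out at 48, which is where the 48 day limit
comes from.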


This gives us:

unsigned long saddr;		      /* 0 */
unsigned long daddr;		      /* 4 */
unsigned short sport;		      /* 8 */
unsigned short dport;		      /* 10 */
unsigned long s_fin_seq;	      /* 12 */
unsigned long d_fin_seq;	      /* 16 */
unsigned hash_prev:24;                /* 20 */
unsigned hash_next:24;                /* 23 */
unsigned lru_prev:24;		      /* 26 */
unsigned lru_next:24;		      /* 29 */
unsigned s_state:2,		      /* 32 */
	 d_state:2,
	 aborted:1,
	 s_is_client:1,
	 fd:10,
	 buffer:16,
	 start_sec:22,
	 start_usec:20,
	 last_dgram:22;

for a grand total of 44 bytes.  Note that this design would limit us
to a maximum trace length of 48 days, and a maximum of 65536 buffers.
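
C doesn't promise any particular layout for bit-fields, so one way to
actually get the last nine fields into 12 bytes is to pack them by
hand.  A rough sketch follows; the field order and widths come from
the listing above, but the bit offsets, the names and the bit-at-a-time
loops are all just assumptions for illustration (a real version would
want something faster).

```c
#include <assert.h>
#include <stdint.h>

#define TAIL_BYTES 12   /* the 96 bits following lru_next */

/* Hypothetical field ids, in declaration order. */
enum {
    F_S_STATE, F_D_STATE, F_ABORTED, F_S_IS_CLIENT, F_FD,
    F_BUFFER, F_START_SEC, F_START_USEC, F_LAST_DGRAM, F_COUNT
};

/* Bit offset and width of each field within the 12 byte tail. */
static const struct { unsigned off, width; } tail_fields[F_COUNT] = {
    [F_S_STATE]     = {  0,  2 },
    [F_D_STATE]     = {  2,  2 },
    [F_ABORTED]     = {  4,  1 },
    [F_S_IS_CLIENT] = {  5,  1 },
    [F_FD]          = {  6, 10 },
    [F_BUFFER]      = { 16, 16 },
    [F_START_SEC]   = { 32, 22 },
    [F_START_USEC]  = { 54, 20 },
    [F_LAST_DGRAM]  = { 74, 22 },
};

/* Read one field, least significant bit first. */
static uint32_t tail_get(const uint8_t tail[TAIL_BYTES], int field)
{
    uint32_t v = 0;
    assert(field >= 0 && field < F_COUNT);
    for (unsigned i = 0; i < tail_fields[field].width; i++) {
        unsigned bit = tail_fields[field].off + i;
        if (tail[bit / 8] & (1u << (bit % 8)))
            v |= (uint32_t)1 << i;
    }
    return v;
}

/* Write one field without disturbing its neighbours. */
static void tail_set(uint8_t tail[TAIL_BYTES], int field, uint32_t v)
{
    assert(field >= 0 && field < F_COUNT);
    assert(v < ((uint32_t)1 << tail_fields[field].width));
    for (unsigned i = 0; i < tail_fields[field].width; i++) {
        unsigned bit = tail_fields[field].off + i;
        if (v & ((uint32_t)1 << i))
            tail[bit / 8] |= (uint8_t)(1u << (bit % 8));
        else
            tail[bit / 8] &= (uint8_t)~(1u << (bit % 8));
    }
}
```

The last field ends at bit 96, so the tail is exactly 12 bytes and the
whole record is 32 + 12 = 44 bytes.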

The next thing we notice is that the fin sequences are just 0 for
most of a connection's life.  As such, they can be farmed out to
another data structure referenced with one of our magic 24 bit
pointers.  Swapping 8 bytes of sequence numbers for a 3 byte pointer
saves us 5 bytes per non-finned connection, bringing us down to just
39 bytes per tcpconn.

(This other table would have to have one entry for every tcpconn
structure if we're going to avoid potentially running out; the point
is that at any given time, most of it will be swapped out and can
therefore be ignored.  Except that we could easily end up with used
entries interleaved with unused ones, and hence die.  Hmm.)

Data structures needed:

-- File descriptor table.  1024 entries, each consisting of a
   (prev,next,user) triple forming an LRU list.

	       -- Total 8k

-- Buffer management table.  About 16k entries
   (prev:16,next:16,used:32,user:32) -> 12 bytes -> 192k

-- Buffer table.  16k entries, 1024 bytes each, 16M.

-- tcpconn fin data.  8 bytes each, 16mil entries -> 128M, but mostly
   swapped out: expect to use the first 30M or so.

-- tcpconn structures.  39 bytes, 16 mil entries -> about 650M.

So the tcp flow representation is about 700M of actual memory, plus
about 100M of stuff which is mostly swapped out.  The buffer and fd
management stuff can be shared with the UDP and Other protocol stuff.
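
The sums above are simple enough to check mechanically.  A sketch,
taking 16k as 16384 and 16 mil as 2^24 (both my reading of the
figures), with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct table_budget {
    const char *name;
    uint64_t entries;
    uint64_t bytes_each;
};

/* Entry counts and sizes are the figures from the list above. */
static const struct table_budget budget[] = {
    { "fd table",           1024,              8    },
    { "buffer management",  16384,             12   },
    { "buffer table",       16384,             1024 },
    { "tcpconn fin data",   UINT64_C(1) << 24, 8    },
    { "tcpconn structures", UINT64_C(1) << 24, 39   },
};

#define BUDGET_ROWS (sizeof budget / sizeof budget[0])

/* Size in bytes of one table. */
static uint64_t table_bytes(size_t row)
{
    assert(row < BUDGET_ROWS);
    return budget[row].entries * budget[row].bytes_each;
}

/* Size in bytes of everything, resident or not. */
static uint64_t total_bytes(void)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < BUDGET_ROWS; i++)
        sum += table_bytes(i);
    return sum;
}
```

total_bytes() comes to 805,511,168, which is in line with the 700M
plus 100M estimate.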

UDPconn structures look like this:

unsigned long saddr;		      /* 0 */
unsigned long daddr;		      /* 4 */
unsigned short sport;		      /* 8 */
unsigned short dport;		      /* 10 */
unsigned hash_prev:19;                /* 12 */
unsigned hash_next:19;
unsigned lru_prev:19;
unsigned lru_next:19;
unsigned s_has_packets:1,
	 d_has_packets:1,
	 fd:10,
	 buffer:16,
	 start_sec:22,
	 start_usec:20,
	 last_dgram:22;

Note that we can probably assume we'll have fewer UDP flows than TCP
flows live at any given time, so the magic pointers become just 19
bits (for 512k connections).  This brings the whole structure in at
just 33 bytes; with 512k connections, we spend about 17M on UDP conn
heads.  I'd imagine the other protocols are a similar sort of weight,
so our total memory usage stays under 1G while the number of live
tcpconns rises to about 16 million.
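
The UDPconn size can be checked by adding up the field widths.  A
small sketch, assuming the fields pack with no padding at all; the
helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a run of bit-fields packed with no padding. */
static unsigned packed_bytes(const unsigned *widths, size_t n)
{
    unsigned bits = 0;
    for (size_t i = 0; i < n; i++)
        bits += widths[i];
    return (bits + 7) / 8;
}

/* Widths of the UDPconn fields, in listing order. */
static const unsigned udp_widths[] = {
    32, 32, 16, 16,     /* saddr, daddr, sport, dport */
    19, 19, 19, 19,     /* hash_prev, hash_next, lru_prev, lru_next */
    1, 1, 10, 16,       /* s_has_packets, d_has_packets, fd, buffer */
    22, 20, 22          /* start_sec, start_usec, last_dgram */
};

/* Total bytes for a pool of nconns UDPconn heads. */
static unsigned long udp_pool_bytes(unsigned long nconns)
{
    return nconns * packed_bytes(udp_widths,
                                 sizeof udp_widths / sizeof udp_widths[0]);
}
```

The widths add up to 264 bits, i.e. 33 bytes, and
udp_pool_bytes(512 * 1024) is 17,301,504.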

This might actually be worth implementing.
