Some remedial computer science, for the PCI auditors in my audience.
I hand you an array of random integers. How can you tell if the number
three is in it?
Well, there’s the obvious way: check the numbers sequentially until
you find the “3” or exhaust the array. Linear search. Given 10
numbers, you have to assume it could take 10 steps; N numbers, N
steps.
Picture 1.png
Linear search is bad. It’s hard to do worse than linear. Let’s improve
on it. Sort the array.
Picture 2.png
A sorted array suggests a different strategy: jump the middle of the
array and see if the value you’re looking for is less than (to the
left) or greater than (to the right). Repeat, cutting the array in
half each time, until you find the value.
Binary search. Given 10 numbers, it will take as many as 3 steps —-
log2 of 10 —- to find one of them in a sorted array. O(log n) search
is awesome. If you have 65,000 elements, it’s only going to take 16
steps to find one of them. Double the elements, and it’s 17 steps.
But sorted arrays suck; for one thing, sorting is more expensive than
linear search. So we don’t use binary search much; instead, we use
binary trees.
Picture 3.png
To search a binary tree, you start at the top, and ask yourself “is my
key less than (left) or greater than (right) the current node”, and
repeat until ok, ok, ok, you know this stuff already. But that tree is
pretty, isn’t it?
Search with a (balanced) binary tree is O(log n), like binary search,
varying with the number of elements in the tree. Binary trees are
awesome: you get quick lookup and sorted traversal, something you
don’t get out of a hash table. Binary trees are a better default table
implementation than hash tables.
2.
But binary trees aren’t the only tree-structured lookup mechanism.
Binary radix tries, also called a PATRICIA trees, work like binary
trees with one fundamental difference. Instead of comparing
greater-than/less-than at each node, you check to see if a bit is set,
branching right if it’s set and left if it isn’t.
Picture 4.png
I’m leaving out a lot about how binary radix tries work. This is a
shame, because radix tries are notoriously underdocumented —-
Sedgewick infamously screwed them up in “Algorithms”, and the
Wikipedia page for them sucks. People still argue about what to call
them! In lieu of an explanation of backlinks and bit-position-labelled
edges, here’s a tiny Ruby implementation.
Here’s why radix tries are cool:
Search performance varies with the key size, not the number of elements in the tree. With 16 bit keys, you’re guaranteed 16 steps
regardless of the number of the elements in the tree, without
balancing.
More importantly, radix tries give you lexicographic matching, which is a puffed-up way of saying “search with trailing wildcard”, or
“command-line-completion-style search”. In a radix tree, you can
quickly search for “ro*” and get “rome” and “romulous” and “roswell”.
3.
I’ve lost you.
Let’s put this in context. Tries are a crucial data structure for
Internet routing. The routing problem goes like this:
You have a routing table with entries for “10.0.1.20/32 -> a” and “10.0.0.0/16 -> b”.
You need packets for 10.0.1.20 to go to “a”
You need packets for 10.0.1.21 to to to “b”
This is a hard problem to solve with a basic binary tree, but with a
radix trie, you’re just asking for
“1010.0000.0000.0000.0000.0001.0100” (for 10.0.1.20) and “1010.” (for
10.0.0.0). Lexicographic search gives you “best match” for routing. You can try it in the Ruby code above; add *”10.0.0.0”.to_ip to the
trie, and search for “10.0.0.1”.to_ip.
The correspondance between routing and radix tries is so strong that
the most popular general-purpose radix trie library (the one from
CPAN) is actually stolen out of GateD. It’s a mess, by the way, and
don’t use it.
If you understand how a trie works, you also understand how regular
expressions work. Tries are a special case of deterministic finite
automata (DFAs), where branches are based exclusively on bit
comparisons and always branch forwards. A good regex engine is just
handling DFAs with more “features”. If my pictures make sense to you,
the pictures in this excellent article on Thompson’s NFA-DFA reduction
algorithm will too, and that article will make you smarter.
4.
You’re a network operator at a backbone ISP. Your world largely
consists of “prefixes” —- IP network/netmask pairs. The netmasks in
those prefixes are hugely important to you. For instance, 121/8
belongs to Korea; 121.128/10 belongs to Korea Telecom, 121.128.10/24
belongs to a KT customer, and 121.128.10.53 is one computer inside
that customer. If you’re tracking down a botnet or a spamming
operation or worm propagation, that netmask number is pretty important
to you.
Unfortunately, important though they are, nowhere on an IP packet is
there stamped a “netmask” —- netmasks are entirely a configuration
detail. So, when you’re watching traffic, you essentially have this
data to work with:
ips.png
Surprisingly, given enough packets to look at, this is enough
information to guess netmasks with. While working at Sony, Kenjiro Cho
came up with a really elegant way to do that, based on tries. Here’s
how:
Take a basic binary radix trie, just like the ones used by software
routers. But bound the number of nodes in the tree, say to 10,000. On
a backbone link, recording addresses out of IP headers, you’ll exhaust
10,000 nodes in moments.
Store the list of nodes in a list, sorted in LRU order. In other
words, when you match an IP address with a node, “touch” the node,
sticking it at the top of the list. Gradually, frequently seen
addresses bubble up to the top, and infrequently seen nodes sink to
the bottom.
Picture 6.png
Now the trick. When you run out of nodes and need a new one, reclaim
from the bottom of the list. But when you do, roll the data from the
node up into its parent, like so:
Picture 5.png
10.0.1.2 and 10.0.1.3 are sibling /32s, the two halves of 10.0.1.2/31. To reclaim them, merge them into 10.0.1.2/31. If you need to reclaim
10.0.1.2/31, you can merge it with 10.0.1.0/31 to form 10.0.1.0/30.
Do this over, say, a minute, and the standout sources will defend
their position in the tree by staying at the top of the LRU list,
while ambient /32 noise bubbles up to /0. For the raw list of IP’s
above, with a 100 node tree, you get this.
Cho calls this heuristic Aguri.
5.
Aguri is BSD licensed. You can go download it, and a driver program
that watches packets via pcap, from Cho’s old home page.
6.
I’m going somewhere with this, but I’m 1300 words into this post now,
and if you’re an algorithms person you’re tired of me by now, and if
you’re not, you’re tired of me by now. So, let Aguri sink in, and I’ll
give you something cool and useless to do with it later this week.
Ci sono numerosi link sparsi lì dentro. Sfortunatamente, Archive.org non conserva le immagini, solo il testo, quindi molte di esse sono andate perse. Ecco quelli che ha archiviato: