Bitsquatting PCAP Analysis Part 1: Analyzing PCAPs using Unix command line tools

This blog post was originally going to be about domain name distribution in the bitsquatting PCAPs, but I found a problem with my first analysis. I turned the problem into an opportunity: this post now covers domain name distribution in the bitsquatting PCAPs and doubles as a tutorial on how to determine the distribution yourself!

This blog post/tutorial will follow the process I used to answer the following questions:

How many unique domains appear in queries directed at the bitsquatting nameserver? Answer: 4271.
What is the frequency distribution of queried domains? Answer: long tail; percentages here.

Prerequisites

A basic familiarity with Unix is assumed throughout this tutorial. While all the commands listed were run on Mac OS X, any sufficiently Unix-y environment should work.

To do the analysis we are going to install some extra software. We will need the following:

wireshark (specifically, the tshark and mergecap utilities) to dissect packet captures
the GNU version of coreutils (for ls -v)
7zip to decompress the compressed DNS packet captures
and wget because I hate remembering to use "curl -O" to download via the command line.

All the prerequisites should be easily available with your favorite package manager. Since all command examples in this tutorial were run on Mac OS X, I installed the prerequisites via homebrew:

$ brew install coreutils
$ brew install wireshark
$ brew install p7zip
$ brew install wget

Downloading and Extracting the Data

The first step of analysis is to get the data. Let's download and extract the Bitsquatting PCAPs:

$ wget http://dinaburg.org/data/dnslogs.tar.7z
$ 7z x dnslogs.tar.7z
$ tar xvf dnslogs.tar
$ rm dnslogs.tar dnslogs.tar.7z

Numerous files named dnslog, dnslog1, dnslog2, etc. should now be in your working directory. These files contain packet captures (PCAPs) of DNS traffic.

The tcpdump utility is the most basic way to analyze PCAP contents. Let's take a look to see what the logs contain:

$ tcpdump -n -v -r dnslog

All of the output should be details about DNS queries. The output format is described in detail on the tcpdump man page. This tutorial is not about tcpdump; I included this step because it is a very good idea to investigate any unknown PCAPs with tcpdump and look for oddities before opening them in more complex tools. Opening files from unknown sources in wireshark can be dangerous. Even though it won't be referenced again in this blog post, the tcpdump utility is extremely handy; I highly recommend reading some tcpdump tutorials for background knowledge.

Combining PCAPs

The PCAPs are cumbersome to work with since they are split into several files. To make analysis easier, let's reassemble all the disparate PCAPs into a single file. Wireshark comes with a tool called mergecap that is made exactly for this purpose.

$ mergecap -a -w completelog.pcap `gls -1v`

The above command runs mergecap in append mode (-a) and saves the result to completelog.pcap. Append mode instructs mergecap to simply concatenate the files, in the order given, with correct headers; without it, mergecap uses packet timestamps to order the combined file. The files to merge are given by "gls -1v", which lists every file in the working directory, so make sure the directory contains only the dnslog files. Note: gls is GNU ls; it is used because the default ls on Mac OS X does not have a numeric sort option. If you are using Linux, just use "ls -1v" in your command line.
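
If GNU ls is not available, the same ordering can be approximated with standard tools. The following is a minimal sketch, assuming the capture files are named dnslog followed by an optional decimal number; list_dnslogs is a hypothetical helper name, not part of any tool:

```shell
# Print the capture files in numeric order: dnslog, dnslog1, dnslog2, ... dnslog10.
# Assumption: file names are "dnslog" plus an optional decimal suffix.
list_dnslogs() {
    for f in dnslog*; do
        [ -e "$f" ] || continue               # the glob matched nothing
        n=${f#dnslog}                         # numeric suffix, empty for "dnslog"
        case $n in *[!0-9]*) continue ;; esac # skip names such as dnslog.tar
        printf '%s\t%s\n' "${n:--1}" "$f"     # sort key (-1 for "dnslog"), then name
    done | sort -n | cut -f 2
}
```

With this helper defined, the merge could be written as mergecap -a -w completelog.pcap $(list_dnslogs), which also avoids picking up unrelated files in the working directory.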

Initial Analysis

Now that we have a merged PCAP, let's do some analysis. To review, the two questions we will answer in this blog post are:

How many unique domains appear in queries directed at the bitsquatting nameserver?
What is the frequency distribution of queried domains?

The answers to both of these questions depend on extracting the query name from every incoming DNS query. Luckily we will not need to write any PCAP reading code; there are many great projects specifically meant for dissecting PCAPs. In this post, we will be using tshark, the text-only part of the wireshark network traffic analyzer, as our PCAP dissector.

The following tshark command will display the query name field of every DNS query in completelog.pcap:

$ tshark -n -r completelog.pcap  -o column.format:'"QNAME", "%Cus:dns.qry.name"'

The above command instructs tshark to not attempt DNS resolution (-n), to read from completelog.pcap as the packet source (-r), and to override the default output format to be the dns.qry.name field of the packet (-o).

The tshark utility supports many output formats. The column.c file in the wireshark source specifies allowed formats. It is my understanding that the custom format specifier (%Cus) can accept any protocol filter field. The wireshark Display Filter DNS Protocol reference specifies all filter fields for the DNS protocol.
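
As an alternative to overriding the column format, tshark also has a field output mode (-T fields) in which each -e option names one filter field to print. The sketch below wraps that invocation in a hypothetical helper called extract_qnames; it assumes tshark is on your PATH, and I have not compared its output line-for-line against the column.format approach used in this post:

```shell
# Print the DNS query name of every packet in a capture file, one per line.
# Assumption: tshark is installed; -T fields and -e are standard tshark options.
# Packets without a dns.qry.name field would be printed as empty lines.
extract_qnames() {
    tshark -n -r "$1" -T fields -e dns.qry.name
}
```

Usage would look like extract_qnames completelog.pcap | head -n 1000.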

If you ran the command, you will have noticed that it takes a long time to finish. It's best to pick a small subset of the data first and ensure there are no problems before working with the full set. Let's verify that it is possible to count domain frequency in the first 1000 queries:

$ tshark -n -r completelog.pcap  -o column.format:'"QNAME", "%Cus:dns.qry.name"' | head -n 1000 | sort | uniq -c | sort -rn
 183 ns1.0mdn.net
 171 ns2.0mdn.net
 129 eicrosoft.com
 108 micro3oft.com
  85 mic2osoft.com
  65 static.ak.fbgdn.net
  44 www.micro3oft.com
  38 iicrosoft.com
  32 gmaml.com
  24 www.mic2osoft.com
  20 www.miarosoft.com
  19 www.amazgn.com
  14 profile.ak.fjcdn.net
  13 forum.micro3oft.com
  12 mscrl.eicrosoft.com
  12 0mdn.net
  11 amazgn.com
   8 profile.ak.fbgdn.net
   8 aeazon.com
   2 www.gmaml.com
   2 www.eicrosoft.com

It turns out there are casing issues with the query name field: micro3oft.com and MICRO3OFT.COM are counted as different domains although they are semantically the same. Domain resolution is case-insensitive, but query name case can matter. For instance, 0x20 encoding uses query name casing to increase DNS query entropy. Increasing query entropy makes DNS forgery attacks more difficult. More details can be found in the 0x20 encoding paper.

Since we are interested in only the semantic meaning of domains, they should all be converted to lower case before frequency counting. This can be done by piping the names through tr. The new command line should only output lowercased domains:

$ tshark -n -r completelog.pcap  -o column.format:'"QNAME", "%Cus:dns.qry.name"' | head -n 1000 | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
 183 ns1.0mdn.net
 171 ns2.0mdn.net
 129 eicrosoft.com
 108 micro3oft.com
  85 mic2osoft.com
  65 static.ak.fbgdn.net
  44 www.micro3oft.com
  38 iicrosoft.com
  32 gmaml.com
  24 www.mic2osoft.com
  20 www.miarosoft.com
  19 www.amazgn.com
  14 profile.ak.fjcdn.net
  13 forum.micro3oft.com
  12 mscrl.eicrosoft.com
  12 0mdn.net
  11 amazgn.com
   8 profile.ak.fbgdn.net
   8 aeazon.com
   2 www.gmaml.com
   2 www.eicrosoft.com
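
To see the effect of case folding in isolation, the counting stage can be tried on a few hand-written names. This is a small sketch; count_domains is a hypothetical helper name that wraps the same tr, sort, and uniq pipeline used above:

```shell
# Read query names on stdin, fold them to lower case, and print
# "count name" lines sorted from most to least frequent.
count_domains() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}

# Two spellings of the same domain collapse into a single row with count 2.
printf 'micro3oft.com\nMICRO3OFT.COM\ngmaml.com\n' | count_domains
```

Without the tr stage, the two spellings of micro3oft.com would be reported as separate rows with a count of 1 each.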

Our problem is solved. Let's create a directory for our analysis outputs and count the domain frequency in the full data set:

$ mkdir analysis
$ tshark -n -r completelog.pcap  -o column.format:'"QNAME", "%Cus:dns.qry.name"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > analysis/requested_domains.txt

The full frequency count can be viewed here: (requested_domains.txt, 117KB, text).

Now we can finally answer our first question: How many unique domains appear in queries directed at the bitsquatting name server? Since the resulting file has one domain per line, the line count will be the number of unique domains:

$ wc -l analysis/requested_domains.txt
  4271 analysis/requested_domains.txt

There were a total of 4271 unique requested domains.

Removing Outliers

Let's try to get a feel for the distribution of domains in the query name field. First, let's look at the most frequently requested domains:

$ head -n 10 analysis/requested_domains.txt 
1124949 ns1.0mdn.net
1101708 ns2.0mdn.net
184405 gmaml.com
142283 support.doublechick.net
80865 static.ak.dbcdn.net
78839 miarosoft.com
70174 mail.gmaml.com
59500 g.mic2osoft.com
57514 microsmft.com
54125 www.amazgn.com

The left column is the number of times the domain appeared in the query name field, and the right column is the domain.

The domains ns1.0mdn.net and ns2.0mdn.net are outliers: they are by far the most frequently requested. These domains were the authoritative name servers for my bitsquatting domains. The high frequency of queries for these domains has nothing to do with their popularity and everything to do with their authoritative name server status. Including them in the top 10 count would be improper. The new top ten most frequently queried domains, with the authoritative name servers excluded, are:

184405 gmaml.com
142283 support.doublechick.net
80865 static.ak.dbcdn.net
78839 miarosoft.com
70174 mail.gmaml.com
59500 g.mic2osoft.com
57514 microsmft.com
54125 www.amazgn.com
45866 amazgn.com
32021 mssupport.micrgsoft.com
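
The name servers can also be excluded by name instead of by their position in the file, which keeps working even if the row order changes. A minimal sketch, assuming the two-column "count domain" layout of requested_domains.txt (exclude_nameservers is a hypothetical helper name):

```shell
# Drop rows whose domain column (column 2) is one of the authoritative
# name servers; every other row is passed through unchanged.
exclude_nameservers() {
    awk '$2 != "ns1.0mdn.net" && $2 != "ns2.0mdn.net"'
}
```

For example, exclude_nameservers < analysis/requested_domains.txt | head -n 10 should print a top ten list like the one above.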

Calculating Percentages

Raw query counts are interesting, but the percentage of total queries is a better measure of query name frequency. To calculate the percentages we first need the total number of queries, excluding queries for the authoritative name servers. The following awk script adds all values in the query count column (the first column) of requested_domains.txt, excluding the first two rows (the query counts for the authoritative name servers):

$ awk 'NR > 2 {sum+=$1} END {print sum}' < analysis/requested_domains.txt
1451284

Using the total number of queries, we can write another awk script to convert query frequencies into percentages. Let's look at the percentage of queries represented by each of the top 10 most frequently queried domains:

$ awk 'NR > 2 {printf "%2.2f %s\n", $1/1451284*100, $2}' < analysis/requested_domains.txt | head -n 10
12.71 gmaml.com
9.80 support.doublechick.net
5.57 static.ak.dbcdn.net
5.43 miarosoft.com
4.84 mail.gmaml.com
4.10 g.mic2osoft.com
3.96 microsmft.com
3.73 www.amazgn.com
3.16 amazgn.com
2.21 mssupport.micrgsoft.com

The full percentages can be downloaded here: (requested_domain_percentage.txt, 117KB, text).
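
The total does not have to be copied into the second script by hand. One possible variation, sketched here under the same assumption that the first two rows are the authoritative name servers, has awk read the file twice: once to compute the total and once to print the percentages (percentages is a hypothetical helper name):

```shell
# Print "percent domain" for every row after the first two.
# The file is passed twice: NR == FNR is true only during the first pass,
# which sums the counts; the second pass prints each row's share.
percentages() {
    awk 'NR == FNR { if (FNR > 2) sum += $1; next }
         FNR > 2   { printf "%.2f %s\n", $1 / sum * 100, $2 }' "$1" "$1"
}
```

Usage would be percentages analysis/requested_domains.txt | head -n 10.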

We have now answered the second question: What is the frequency distribution of queried domains? The domain frequency distribution is a superb example of a long tail.
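
One way to put a number on the long tail is to ask how many of the top domains are needed to cover a given share of all queries. The following is a rough sketch, again assuming the first two rows are the authoritative name servers and that the file is sorted by descending count; domains_for_share is a hypothetical helper name:

```shell
# usage: domains_for_share FILE PERCENT
# Print how many of the most frequent domains together account for at
# least PERCENT percent of all queries (name server rows excluded).
domains_for_share() {
    awk -v target="$2" '
        NR == FNR { if (FNR > 2) total += $1; next }   # first pass: total
        FNR > 2 {
            running += $1; n++
            # compare as products to avoid floating point division
            if (running * 100 >= target * total) { print n; exit }
        }' "$1" "$1"
}
```

Running domains_for_share analysis/requested_domains.txt 50 would then report how many domains make up half of the traffic.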

Update:
Part 2 is now up, Bitsquatting PCAP Analysis Part 2: Query Types, IPv6.

Artem Dinaburg's Blog

Monday, November 5, 2012