Tuesday, April 30, 2013

JavaScript Frustrations and Solutions

Since there's no better way to learn than by doing, I've been teaching myself JavaScript by writing a structured binary data fuzzer. The fuzzer currently generates Windows ICO files, and will soon be released. In the meantime, I wanted to describe some frustrating experiences learning JavaScript and include solutions to them.

Object Orientation in JS is Confusing 


Some of this may be because I am used to class-ical inheritance, but considering the number of JavaScript OOP libraries (e.g. oolib, dejavu, Klass, selfish), I'm not alone.

The first confusing thing is that object types are defined as functions declared via the function keyword and instantiated via the new operator. The overloaded use of function doesn't let you know right away whether the code you are reading is an object or a traditional function. The use of the new operator gives a false impression of class-ical inheritance and has other deficiencies. For instance, until the introduction of Object.create, setting up inheritance meant calling the parent constructor without its real arguments, so that constructor could not reliably validate them. The deficiency is shown in the following example.

In this motivating example, we want to create an object to encapsulate integers and validate certain properties in the object's constructor. The initial code could look something like this:

function Int(arg) {
    console.log("Int constructor");
    this.name = arg['name'];
    if(this.name === undefined)
    {
        alert('a name is required!');
    }
    this.size = arg['size'];
};
Int.prototype.getName = function() {
    console.log("Int: " + this.name);
};
var i = new Int({'name': 'generic int'});
i.getName();

Running this code would print:

Int constructor
Int: generic int

But now let's say I want to write something to deal specifically with 4-byte integers. The initial code to inherit from the Int object would look similar to the following:

function Int4(arg) {
    arg['size'] = 4;
    Int.call(this, arg);
    console.log("Int4 constructor");
};
Int4.prototype = new Int({});
Int4.prototype.constructor = Int4;
Int4.prototype.getName = function() {
    console.log("Int4: " + this.name);
};
var i4 = new Int4({'name': '4-byte int'});
i4.getName();

This code will alert with 'a name is required!'. To set up Int4's prototype chain we need to create a new Int object, but the real constructor arguments are not yet known when new Int({}) is called, so the validation in Int fails. Luckily, this has been fixed by the use of Object.create:

function Int4(arg) {
    arg['size'] = 4;
    Int.call(this, arg);
    console.log("Int4 constructor");
};
Int4.prototype = Object.create(Int.prototype);
Int4.prototype.constructor = Int4;
Int4.prototype.getName = function() {
    console.log("Int4: " + this.name);
};
var i4 = new Int4({'name': '4-byte int'});
i4.getName();

All Functions are Function Objects and all Objects are Associative Arrays.


All functions are actually Function objects, and all objects are associative arrays. There are also Arrays, which are not functions but are also associative and also objects. Sometimes you want Arrays to be arrays, and sometimes you actually want Objects to be arrays. Confused yet?
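A minimal sketch of that overlap, using only built-in behavior. The names greet, list, and table are made up for illustration:

```javascript
// Functions are objects: they can carry properties like any other object.
function greet() { return "hi"; }
greet.callCount = 0;                    // attach an arbitrary property
console.log(typeof greet);              // "function"
console.log(greet instanceof Object);   // true

// Arrays are objects too; typeof cannot tell them apart from plain objects.
var list = [10, 20, 30];
list.label = "three numbers";           // a non-index key does not change length
console.log(typeof list);               // "object"
console.log(Array.isArray(list));       // true
console.log(list.length);               // 3
console.log(Object.keys(list));         // ["0", "1", "2", "label"]

// A plain object used as an associative array.
var table = {};
table["size"] = 4;
console.log(table.size);                // 4
```

Array.isArray is the reliable way to ask "is this really an array?", since typeof reports "object" for both.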

Scoping Rules and Variable Definition Rules that Lead to Subtle Bugs


Scoping rules are a bit confusing, since there are at least three ways to declare variables: assignment, var, and let. Of course, all of these have different semantics. The biggest problem for me was that creating a variable by assignment adds it to the global scope, while using var keeps it in function scope. When a global and a local variable share a name, a missing var in one function makes that function silently use the global variable instead of a local one. Using the wrong variable leads to lots of frustrating errors.

The solution is to always "use strict" to force variable definitions. Of course, doing this globally will break some existing libraries you are using. Such is life.
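As a hedged sketch of what strict mode buys you: the directive can also be placed inside a single function, which may be one way to avoid breaking third-party code that shares the page. The function names below are hypothetical.

```javascript
// Strict mode enabled for this function only: assigning to an
// undeclared name throws instead of silently creating a global.
function strictCounter() {
    "use strict";
    try {
        runningTotal = 1;   // missing var
        return "no error";
    } catch (e) {
        return e.name;
    }
}

// The corrected version declares the variable, keeping it in function scope.
function declaredCounter() {
    "use strict";
    var runningTotal = 1;
    return runningTotal;
}

console.log(strictCounter());    // "ReferenceError"
console.log(declaredCounter());  // 1
```

Without the directive, the first function would quietly create a global named runningTotal, which is exactly the class of bug described above.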

Type Coercion With the Equality Operator (==)


It's amazing what is considered equal in JavaScript via ==. Instead of restating all these absurdities, I'll just link to someone else who has:
http://javascriptweblog.wordpress.com/2011/02/07/truth-equality-and-javascript/

When I started my project, I didn't realize that the Strict Equality operator (===) existed. It should be used anywhere you would expect == to work. It seems more sane to have == be Strict Equality, and another Coercive Equality operator (something like ~= or ~~), but what is done is done.
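A few comparisons make the difference concrete. This is a small sketch; compare is a hypothetical helper that reports both results side by side:

```javascript
// Report the result of loose (==) and strict (===) equality for one pair.
function compare(a, b) {
    return { loose: a == b, strict: a === b };
}

console.log(compare(0, ""));            // { loose: true,  strict: false }
console.log(compare(0, "0"));           // { loose: true,  strict: false }
console.log(compare("", "0"));          // { loose: false, strict: false }
console.log(compare(null, undefined));  // { loose: true,  strict: false }
console.log(compare(1, true));          // { loose: true,  strict: false }
console.log(compare(NaN, NaN));         // { loose: false, strict: false }
```

Note that == is not even transitive here: 0 equals both "" and "0", yet "" does not equal "0".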

Problems Modularizing and Importing Code


C/C++ has #include, Python has import, JavaScript has... terrible hacks. There is sadly no standard way to import new code in a .js file, making modularization of your code difficult. I resorted to simply including prerequisite scripts in the HTML where they will be used, but I wish there were a way to include JavaScript from JavaScript.

Browser Compatibility Issues


Not all browsers have Object.create. Not all browsers have console.log in all situations. Not all browsers support "use strict". Turns out every browser is slightly different in a way that will subtly break your code, but of course the main culprit is usually IE.
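One common way to cope is feature detection. The sketch below is an assumption about how one might guard these two features, not a complete shim; createFrom and safeLog are made-up names, and the fallback only covers the single-argument form of Object.create.

```javascript
// Use Object.create when present, otherwise fall back to a
// temporary constructor whose prototype is the desired parent.
function createFrom(proto) {
    if (typeof Object.create === "function") {
        return Object.create(proto);
    }
    function Temp() {}
    Temp.prototype = proto;
    return new Temp();
}

// Log only when a console is actually available; report whether it was.
function safeLog(message) {
    if (typeof console !== "undefined" && typeof console.log === "function") {
        console.log(message);
        return true;
    }
    return false;
}

var parent = { size: 4 };
var child = createFrom(parent);
safeLog(child.size);   // 4, inherited through the prototype chain
```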

Wednesday, March 20, 2013

Solution to Printing Blank Pages Problem in Linux

This isn't an overly technical post but I hope it saves someone hours of frustration printing on Linux. 

In my case the problem was a combination of broken generic printer drivers and a bad default value for the "Print Quality" setting. As a word of caution, according to the Anna Karenina Principle, odds are your problem is its own unique snowflake and this won't help you print.

Problem 

  • You are trying to print from Linux. 
  • The printer starts, makes printing noises, but only a blank page (i.e. one with no ink on it) comes out.
  • You verified your printer works by printing from another OS. If you have not, do this. If your printer still prints blanks on Windows/MacOS, you have a printer problem, not a Linux problem.

Solution

The solution has two parts; both were needed to actually see ink on paper.
  1. Install printer-specific software.

    The drivers that came with CUPS and claimed to support my printer didn't work. For HP printers, you need to sudo apt-get install hplip, and run hp-setup. If you have another brand printer, look here for help.

  2. Change the "Print Quality" setting to something else.

    The setting is in the CUPS web interface. Go to http://localhost:631 (you may need to log in with a local account) -> Administration -> Manage Printers -> Your Printer's Name -> Administration Selection Box, pick "Set Default Options". Clicking that will take you to the following page:
Change the Print Quality setting to something else. Try all the values. For me Normal Grayscale worked, Normal Color did not.

Try all the Print Quality options. Hopefully one of them prints. Yes, the setting is obscure and hard to find, but hey, at least you didn't have to edit another config file!

My next post may be about trying to get network printer sharing to work between Linux and Mac OS X Mountain Lion, which was its own struggle.

Monday, February 18, 2013

Your Missing Package: When Address Correction Fails

Amazon address correction is wrong for large parts of Chicago. This leads to late and missing packages. This handy map shows areas most affected by address correction failure. To avoid delivery problems always use your full ZIP+4 when placing online orders. You can find the full ZIP+4 for your address via the USPS ZIP Code (TM) Lookup Tool.

I don't mean to pick on Amazon -- this problem has happened with several other retailers. I used Amazon because it was easy to cross-check their address verification with USPS. If you are an online retailer, make sure you have a working address correction system. If Amazon can get it wrong, what makes you think yours works? Bad address correction is costing you customers.

The Problem

Have your Amazon packages ever been late or missing?
Have you ever gotten a "notice left" email but no notice?
Did USPS confirm delivery but there was no package?
Do you only use a 5 digit ZIP code when filling out your address?

You may be a victim of address correction failure. And you are not alone.

Here is how to check:

First, go to "manage addresses" and look at your address on Amazon.
Now, go to the USPS ZIP Code (TM) Lookup Tool and check your address.

If the full 9 digit ZIP Codes do not match, there is a problem. If you live in Chicago, I made a heat map of where verification failures are most likely to occur.
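The comparison itself is simple enough to script. The following is a rough sketch with hypothetical helper names (parseZip, compareZips) and made-up ZIP+4 values; it only compares two strings you have already looked up and does not query USPS or any retailer.

```javascript
// Split "60601-0001" or "606010001" into its 5-digit ZIP and +4 parts.
// Returns null when the input is not a 5- or 9-digit ZIP Code.
function parseZip(zip) {
    var m = /^(\d{5})(?:-?(\d{4}))?$/.exec(String(zip).trim());
    if (!m) { return null; }
    return { zip5: m[1], plus4: m[2] || null };
}

// Classify how the USPS result and the retailer's corrected ZIP relate.
function compareZips(uspsZip, retailerZip) {
    var a = parseZip(uspsZip);
    var b = parseZip(retailerZip);
    if (!a || !b || !a.plus4 || !b.plus4) { return "incomplete"; }
    if (a.zip5 !== b.zip5) { return "zip5 differs"; }
    if (a.plus4 !== b.plus4) { return "plus4 differs"; }
    return "match";
}

// The ZIP+4 values below are invented for illustration.
console.log(compareZips("60601-0001", "60601-0001"));  // "match"
console.log(compareZips("60601-0001", "60601-0002"));  // "plus4 differs"
console.log(compareZips("60601-0001", "60601"));       // "incomplete"
```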

Address Verification Failures

Mailers validate your address prior to shipment to save money on shipping costs. The address validation step is called Delivery Point Validation (DPV), and it requires a complete mailing address including a full 9 digit ZIP Code. Since few people know their full ZIP Codes, a suite of software called Coding Accuracy Support System (CASS) will correct an address into one that can be checked via DPV. The correction step can fail, and "correct" your address to a different building. To find out why, it's time for a quick lesson on DPV, CASS, and ZIP Codes.

Note: I am not an expert on mailing; this information is what I have learned from judicious searching. It may be wrong. If it is, please correct me.

DPV and CASS

Mailers use DPV to ensure an address is deliverable before passing the mail to USPS. In return, they receive discounted postage rates for reducing the work USPS has to do. From The History of Worksharing Discounts and CASS Certified™ Software:

In 1983, the United States Postal Service (USPS) implemented a program that provided mailers a postage discount for sharing the work to prepare the mail for processing. This allowed the USPS to provide more cost-efficient mail processing based on the advance work performed by the mailer in providing high-quality addresses for their mail.

People are notoriously bad typers and spellers, and tend to omit information. Before a delivery point is verified, an address has to go through a Coding Accuracy Support System (CASS) check. The CASS software will fix an address to one that can be validated by DPV. From the Wikipedia page:

The input of:
1 MICROWSOFT
REDMUND WA
Produces the output of:
1 MICROSOFT WAY
REDMOND WA 98052-8300

CASS software has to be certified by the USPS and has to undergo certification testing every two years. The caveat is that CASS validation only checks address matching, not the accuracy of the matched address. From the USPS:

However, CASS processing does not measure the accuracy of ZIP + 4, delivery point, 5-digit ZIP, or carrier route codes in a mailer’s address file.

If the mailer's ZIP+4 database is wrong, CASS can't fix it.

Why do ZIP+4 Codes matter?

In a city, a ZIP+4 will determine the building or even the floor or group of apartments a piece of mail goes to. From the USPS website (emphasis mine):

The ZIP+4 Code was introduced in 1983. The extra four numbers allow mail to be sorted to a specific group of streets or to a high-rise building. In 1991, two more numbers were added so that mail could be sorted directly to a residence or business. Today, the use of ZIP Codes extends far beyond the mailing industry, and they are a fundamental component in the nation’s 911 emergency system.

If the ZIP+4 code is wrong, your mail goes to the wrong building. Your mailman might not catch this. Mail with electronic mailing information (i.e. pretty much all packages from online retailers) is automatically sorted and binned by machines. On busy urban routes the mailman doesn't know everyone and isn't going to check every single piece of mail. They're going to take the machine-sorted mail bin, deposit it at the address they always do, and move on. If you're lucky, you may get a redelivery notice.

... but Amazon ships via UPS/FedEx?

UPS and FedEx may do hand-off to USPS for final delivery. This is a part of USPS work-share programs that UPS calls a mailing innovation.


The Address Verification Failure Map

The following map shows differences between ZIP+4 Codes returned by USPS and ZIP+4 Codes corrected by Amazon for 1,857 addresses in the City of Chicago. Green markers mean a match, blue markers represent ZIP+4 Codes from USPS, and yellow markers represent ZIP+4 Codes from Amazon. A red connecting line associates the USPS and Amazon results for the same address.


There are correction mistakes throughout the City, with the most mistakes in the Loop and the area immediately to the north and northwest. This correlates pretty well with the number of large apartments and condos, and hence specificity of ZIP+4 codes.

I chose Chicago addresses because that's where I live. The addresses were a random sampling from the City of Chicago business license holders. The City of Chicago has an excellent open data site at https://data.cityofchicago.org/. This research would not have been possible without it.

I sampled 2000 addresses out of a possible 381677. Of these, 143 (~7%) addresses were not found -- that is, either the USPS or Amazon had a failure in obtaining a ZIP+4 for the address. There were 519 (~26%) addresses with a different ZIP+4 between USPS and Amazon, and 1338 (67%) addresses with the same ZIP+4.
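As a quick sanity check on those figures, here is a small sketch that recomputes the percentages from the counts in the paragraph above (the variable names are mine):

```javascript
// Counts taken from the text: failures, ZIP+4 differences, ZIP+4 matches.
var sampled = 2000;
var notFound = 143;
var different = 519;
var same = 1338;

// Percentage of the full sample, rounded to the nearest whole percent.
function percent(count) {
    return Math.round((count / sampled) * 100);
}

console.log(notFound + different + same);  // 2000, the buckets cover the whole sample
console.log(different + same);             // 1857, the addresses shown on the map
console.log(percent(notFound));            // 7
console.log(percent(different));           // 26
console.log(percent(same));                // 67
```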

I am making available the addresses used to generate this map.

File            Metadata     Description
zip_diffs.txt   41KB, text   ZIP+4 Differences
zip_equals.txt  100KB, text  ZIP+4 Matches
zip_fails.txt   11KB, text   Failure to get ZIP+4 for an address

My verification scripts would select the first suggested address or the automatically corrected address (assuming no address was suggested) given by Amazon. For some streets, the suggested address was very far from the initial input. No human would have selected it, so the most egregious correction errors would likely have been caught. The places where the yellow and blue markers are close together are the most dangerous: the difference is likely only in the +4 digits, which most users (like myself) would never notice.

To map ZIP+4 addresses to latitude/longitude and to create the map, I used the MapQuest API. MapQuest may seem like an odd choice, but it had great documentation and examples, and it was the first service I could find with support for mapping a ZIP+4 to latitude/longitude.

Backstory

I recently moved to Chicago with only what I could fit in my car, which meant I had to buy a lot of household items. I do most of my shopping online since I hate the crowds, salesmen, and poor selection at brick and mortar stores. This means I buy a lot of stuff on Amazon.

I first became suspicious when I received the following email:

Fool me once, shame on you.
It is impossible to leave an unattended package at my address. The building has 24/7 front desk staff and a dedicated package receiving room. I dutifully filled out the re-delivery form, and received my package a few days later. I thought nothing of it until I received this second email:

Fool me twice, shame on me.
Around the same time my fiancee had several packages (not from Amazon, but other vendors) never arrive, despite USPS confirming delivery. Something was wrong; it was time to investigate.

The addresses on the re-delivered package labels, order confirmation, and amazon.com all seemed correct. The front desk staff hadn't noticed any delivery attempts, and no packages had been left for me.

I was stumped and considered just not shopping online, until I had a thought: USPS re-delivery worked, but original delivery sent it to a mystery address. Was there a difference between the USPS address and the Amazon.com address?

Sure enough, there was. The ZIP+4 code had the wrong +4 digits. Searching online for the ZIP+4 Code from USPS results only in matches with my building's address. Searching for the ZIP+4 Code from Amazon results only in matches from buildings a few numbers down, with no front desk staff.

Mystery solved.

I immediately emailed Amazon with the problem. This was in mid January. As of February 18th, my address is still corrected to the wrong ZIP+4 Code.

A Bigger Problem

Did I just live at the wrong address, making this an isolated case, or was there a more systematic address correction problem?

That is why I made the map. Turns out some areas are more affected than others, and that my address is not the only one. I hope that by exposing this publicly I can help others avoid the hassle and headaches of online ordering. 

Conclusion

Major vendors, including Amazon, get address correction wrong. In my sample of Chicago business addresses, 26% had a ZIP+4 that did not match the one returned by USPS.

If you are an online retailer, please check your CASS and DPV software. Don't just assume it works, but write some scripts to test it yourself. Your customers will thank you. If your customers complain about missing packages, check that their address corrects properly.

If you buy things online, memorize your ZIP+4 Code and use the full code where you can. If you live in an urban area, and the vendor only accepts a 5 digit ZIP, shop somewhere else because you may never get what you bought.

Saturday, January 5, 2013

The Internet Sign


The Internet. It enhances communication, enables global commerce, and has become an indispensable part of people's daily lives. The Internet disseminates information around the globe and helps bypass censorship in repressive regimes. It is a great force for good, and some have said, has resulted in the largest legal creation of wealth on the planet.

What commemorates the creation of the Internet? There is a plaque at Stanford University. And near a "No Parking" sign outside the former ARPA building in Arlington County, Virginia there is a sign.

The Internet Sign.

I refer to the sign as the Internet Sign to make its significance more obvious, but more technically it is the ARPANET Sign.

The sign is near the corner of Oak St. and Wilson Boulevard in Arlington, Virginia. It is not (yet) visible on Google Street View. The location of the sign is the old ARPA building. ARPA moved to the Wilson Boulevard location from the Pentagon, then as DARPA it moved to 3701 N. Fairfax Dr. DARPA recently moved again, still within Arlington County, to 675 N. Randolph Street.

The following text appears on the sign:

ARPANET
THE ARPANET, A PROJECT OF THE
ADVANCED RESEARCH PROJECTS AGENCY
OF THE DEPARTMENT OF DEFENSE,
DEVELOPED THE TECHNOLOGY THAT
BECAME THE FOUNDATION FOR THE
INTERNET AT THIS SITE FROM 1970 TO
1975. ORIGINALLY INTENDED TO SUPPORT
MILITARY NEEDS, ARPANET TECHNOLOGY
WAS SOON APPLIED TO CIVILIAN USES,
ALLOWING INFORMATION TO BE RAPIDLY
AND WIDELY AVAILABLE. THE INTERNET,
AND SERVICES SUCH AS E-MAIL,
E-COMMERCE AND THE WORLDWIDEWEB,
CONTINUES TO GROW AS THE UNDER-
LYING TECHNOLOGIES EVOLVE. THE
INNOVATIONS INSPIRED BY THE
ARPANET HAVE PROVIDED GREAT
BENEFITS FOR SOCIETY.
ERECTED IN 2008 BY ARLINGTON COUNTY, VIRGINIA 

Below the main text is a smaller plaque with binary digits:

The binary (01000001 01010010 01010000 01000001 01001110 01000101 01010100) spells ARPANET in ASCII.
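The decoding is easy to verify. This short sketch converts each 8-bit group to its character code:

```javascript
// The seven 8-bit groups from the plaque, as written above.
var bits = "01000001 01010010 01010000 01000001 01001110 01000101 01010100";

// Parse each group as a base-2 number and map it to its ASCII character.
var text = bits.split(" ").map(function (group) {
    return String.fromCharCode(parseInt(group, 2));
}).join("");

console.log(text);  // "ARPANET"
```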

The Internet Sign wasn't actually erected in 2008; the unveiling ceremony happened in 2011. ARLnow has reasons for the delay:
According to Arlington spokeswoman Diana Sun, the county was unable to get permission from the building owner to put the sign on their property, so they had to go through a lengthy process of getting the sign installed in the public right-of-way (sidewalk). By the time all the pieces were in place, and by the time they could organize a small ceremony at a County Board meeting, it was 2011 — three years later than originally planned.

Which building owner didn't want the sign on their property? A glance at Google Maps will show the adjacent land is used by the US State Department. Why would the State Department refuse to commemorate a tool that has allowed uncensored information to reach the oppressed masses? I imagine security concerns about tourists congregating so close to a government building.

While I am sure the State Department's reasons for not hosting the Internet Sign are sound, the result is a rather sad commemoration. Surely there is a more tactful way to acknowledge the creation of the Internet than by a sign on the sidewalk.

Friday, November 23, 2012

Bitsquatting PCAP Analysis Part 4: Source Country Distribution

This is part 4 of a multipart series; the previous post is Bitsquatting PCAP Analysis Part 3: Bit-error distribution.

This blog post will examine the source country distribution of packets in the bitsquatting PCAPs. To map a source IP address to a physical location, we will use MaxMind's free GeoLite Data (available at http://dev.maxmind.com/geoip/geolite) as the data source, and write a quick Python script using pygeoip to do the IP-to-location translation.

IP to Location Translation


First, let's download and decompress the free GeoLite City Database provided by MaxMind:

$ wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
$ gunzip GeoLiteCity.dat.gz

Next, we will install pygeoip. The installation procedures for Python packages vary, but it's likely that pygeoip can be installed by setuptools:

# easy_install pygeoip

The pygeoip page on GitHub provides all the necessary usage examples to create an IP-to-country script. My script, which reads IPv4 addresses line by line from a file (or stdin) and outputs an "ip:country:city" mapping, is available here: ip_to_city_country.py.

The example usage:

$ ./ip_to_city_country.py --help
usage: ip_to_city_country.py [-h] [-d GEOIPDB] [ipfile]

Show city and country of IP addresses using MaxMind GeoIP Database

positional arguments:
  ipfile      a file from which to read IP addresses (default: stdin)

optional arguments:
  -h, --help  show this help message and exit
  -d GEOIPDB  Path to the GeoIPCity database (default: GeoLiteCity.dat)


$ echo '8.8.8.8' | ./ip_to_city_country.py
8.8.8.8:US:Mountain View

Source Address Frequency


The first step to mapping source country frequency is to identify source address frequency. While the source address frequency is only an intermediate step to gather source country distribution, it is very handy for a manual analysis of where queries are coming from.

$ tshark -n -r completelog.pcap -o column.format:'"SOURCE", "%s"' | sort | uniq -c | sort -rn > analysis/ips_all.txt

A read-filter can be applied to get the source IPs with the 0mdn.net outliers removed:

$ tshark -n -r completelog.pcap -R '!(dns.qry.name contains 0mdn.net)' -o column.format:'"SOURCE", "%s"' | sort | uniq -c | sort -rn > analysis/ips_nomdn.txt

The results for the frequency of all source IPs (ips_all.txt, 848KB, text) and only IPs not requesting 0mdn.net (ips_nomdn.txt, 740KB, text) are available for download.

These intermediate results show how many packets were received from each IP. The list is interesting in its own right. The top few results are an unresponsive IP in Poland, IPs with PTR records pointing to subdomains of rscott.org (possibly related to http://rscott.org/dns/ ?), an open recursive nameserver at a Russian ISP, a resolver for LeaseWeb, and an MTA for WindStream Communications. Feel free to investigate more on your own.

Source Country Frequency



To find the frequency of source countries, each address will be mapped to its origin country. Only unique addresses, not how many packets were received from each address, will be counted for the distribution. Some shell commands and the ip_to_city_country.py script will identify the source countries. In the commands below, gcut, the GNU version of cut, is used since the default cut on Mac OS X cannot handle non-ASCII characters.

$ awk '{print $2}' analysis/ips_all.txt | ./ip_to_city_country.py > analysis/ip_all_location_mapping.txt

$ gcut -f 2 -d ':' analysis/ip_all_location_mapping.txt | sort | uniq -c | sort -rn  > analysis/all_country_frequency.txt

$ awk '{print $2}' analysis/ips_nomdn.txt | ./ip_to_city_country.py > analysis/ip_nomdn_location_mapping.txt

$ gcut -f 2 -d ':' analysis/ip_nomdn_location_mapping.txt | sort | uniq -c | sort -rn > analysis/nomdn_country_frequency.txt
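For readers who prefer a script to a shell pipeline, the counting step (the cut | sort | uniq -c | sort -rn part) could look roughly like the sketch below. It assumes input lines in the "ip:country:city" form with IPv4 addresses only; countryFrequency is a hypothetical name, and apart from the 8.8.8.8 example the sample lines are invented using reserved documentation addresses.

```javascript
// Count how many lines fall in each country and sort by descending count.
// Assumes IPv4 input, so the country is always the second ':' field.
function countryFrequency(lines) {
    var counts = Object.create(null);
    lines.forEach(function (line) {
        if (line.trim() === "") { return; }   // skip blank lines
        var country = line.split(":")[1];
        counts[country] = (counts[country] || 0) + 1;
    });
    return Object.keys(counts).map(function (country) {
        return { country: country, count: counts[country] };
    }).sort(function (a, b) { return b.count - a.count; });
}

var sample = [
    "8.8.8.8:US:Mountain View",
    "192.0.2.1:US:Chicago",
    "192.0.2.2:US:Arlington",
    "198.51.100.7:PL:Warsaw",
    "198.51.100.8:PL:Krakow",
    "203.0.113.9:RU:Moscow"
];
console.log(countryFrequency(sample));
// [ { country: "US", count: 3 }, { country: "PL", count: 2 }, { country: "RU", count: 1 } ]
```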

The all country frequency table (all_country_frequency.txt, 1.5KB, text) and the frequency table sans requests for 0mdn.net (nomdn_country_frequency.txt, 1.5KB, text) have very similar distributions; only the magnitude changes. This is easier to see in graph form:

Number of packets vs. source country (all queries)


Number of DNS Packets vs. Source Country (excluding 0mdn.net)


The <error> field means the MaxMind GeoLite database did not have an entry for the particular IP address.

The large number for the US is likely due to the US-centric nature of many of the domains I bitsquatted, such as fbcdn.net, and the fact that the US just has considerably more IP allocations than other countries. The extensive world coverage of bitsquatting queries is really quite amazing; there are queries from 192 of the 250 countries in the MaxMind database.


Sunday, November 18, 2012

Bitsquatting PCAP Analysis Part 3: Bit-error distribution

This is the third post in a multi-post series. The previous post is here.

Which bits are more likely to be affected by bit-errors? What does the bit-error distribution look like?  In this blog post, I will attempt to answer those questions by looking at bit-errors in the requested record type field of DNS queries.

This post actually raises more questions than it answers: the bit-errors of the record type field are not distributed uniformly (the distribution one would expect from a random process), but instead mainly occur in bit 6 of the requested record type. I don't know why this is the case. I also don't know if this is only true for the record type field, or if this extends to the query name field as well. If you have any good suggestions, please contact me.

Bit-errors in the requested record type: A records


Astute readers will have noticed that in the previous post I didn't describe some of the top 15 requested record types. As a refresher, let's take another look at the top 15 most requested record types:

Rank  Query Count  Record Type
1     550892       a
2     509605       aaaa
3     358926       mx
4     26829        any
5     25039        soa
6     7729         cname
7     4835         513
8     4728         ns
9     2597         txt
10    1148         srv
11    698          1025
12    232          257
13    222          a6
14    143          ptr
15    138          spf

The 7th most popular record type is 513. Type 513 is not mentioned in the Wikipedia list of record types, and it is not in Wireshark's record type list. Why are there 4835 requests for an undefined record type?

The answer is clearer when we look at 513 in binary (zero-extended to 16 bits):

0000 0010 0000 0001

This value is only one bit away from 1, the A record request type. Other requested record types in the top 15 share this similarity: type 1025 and type 257 are both one bit away from type 1. In the full query types table there are other requests with this property, such as requests for types 65, 2049, and 16385.
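The one-bit-away property can be checked mechanically with XOR. The sketch below uses two hypothetical helpers, singleBitFlips and flippedBit, and numbers bits from the most significant bit (bit 0) of the 16-bit field, which is the convention the tables in this post use.

```javascript
// Return the 16 values reachable from `type` by flipping exactly one bit.
// Bit 0 is the most significant bit of the 16-bit record type field.
function singleBitFlips(type) {
    var flips = [];
    for (var bit = 0; bit < 16; bit++) {
        flips.push({ bit: bit, value: type ^ (1 << (15 - bit)) });
    }
    return flips;
}

// Return which bit differs between two 16-bit values,
// or -1 if they are not exactly one bit apart.
function flippedBit(expected, observed) {
    var diff = (expected ^ observed) & 0xffff;
    for (var bit = 0; bit < 16; bit++) {
        if (diff === (1 << (15 - bit))) { return bit; }
    }
    return -1;
}

console.log(flippedBit(1, 513));   // 6
console.log(flippedBit(1, 1025));  // 5
console.log(flippedBit(1, 257));   // 7
console.log(flippedBit(1, 65));    // 9
console.log(flippedBit(1, 28));    // -1, A (1) and AAAA (28) differ in several bits
console.log(singleBitFlips(1).map(function (f) { return f.value; }));
// [32769, 16385, 8193, 4097, 2049, 1025, 513, 257, 129, 65, 33, 17, 9, 5, 3, 0]
```

Some of those sixteen values are themselves assigned record types (33 is SRV, 17 is RP, 9 is MR, 5 is CNAME, 3 is MD), so a bit-error landing on them cannot be told apart from a legitimate query.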

The requested record types one bit away from type 1, including binary representation and how often they were requested, are represented below:

Bit Flipped  Binary Value         RR Type  Count  Unique Count  Note
0            1000 0000 0000 0001  32769    0      0
1            0100 0000 0000 0001  16385    5      2
2            0010 0000 0000 0001  8193     0      0
3            0001 0000 0000 0001  4097     0      0
4            0000 1000 0000 0001  2049     22     7
5            0000 0100 0000 0001  1025     698    25
6            0000 0010 0000 0001  513      4835   142
7            0000 0001 0000 0001  257      232    50
8            0000 0000 1000 0001  129      0      0
9            0000 0000 0100 0001  65       128    37
10           0000 0000 0010 0001  33                            overlaps SRV
11           0000 0000 0001 0001  17       2      1             overlaps RP
12           0000 0000 0000 1001  9        0      0             overlaps MR
13           0000 0000 0000 0101  5                             overlaps CNAME
14           0000 0000 0000 0011  3        0      0             overlaps MD
15           0000 0000 0000 0000  0        2      1

Note: some entries are blank due to overlap with other popular record types. Unpopular/deprecated record types, such as RP, were included in the count for bit errors. All query type overlaps are noted in the notes column.

The count column is how often each record type was requested. The unique count column is how often each record type was requested from a unique source IP. This was done to minimize the effect of one bit-error repeatedly manifesting itself via many repeated requests.

A visualization of the unique count column:


To obtain the unique count column, first we must get all the unique (source IP, query type) pairs (and disregard any queries for 0mdn.net):

$ tshark -n -r completelog.pcap -R '!(dns.qry.name contains 0mdn.net)' -o column.format:'"SOURCE", "%s", "QTYPE", "%Cus:dns.qry.type"' | sort -u > analysis/src_and_qtype.txt


After getting the (source IP, query type) pairs, a bash for loop can show us how many unique source IPs requested a certain record type.

$ for qt in 32769 16385 8193 4097 2049 1025 513 257 129 65 SRV RP MR CNAME MD unused; do echo "$qt:" `grep -i " $qt$" analysis/src_and_qtype.txt | wc -l`; done
32769: 0
16385: 2
8193: 0
4097: 0
2049: 7
1025: 25
513: 142
257: 50
129: 0
65: 37
SRV: 80
RP: 1
MR: 0
CNAME: 635
MD: 0
unused: 1

Counting only record requests by unique source IP address shows that the same error-prone query is repeated many times from the same source, but the overall distribution stays the same.

There has been much speculation about bit-error distribution and whether any bits are more likely to be affected. Judging by bit-errors of the query type field, some bits are considerably more likely to be affected: bit 6 accounts for the vast majority of bit-errors, with the error rate dropping sharply with distance from bit 6. This distribution is evident in the query type field; I have not verified if it still holds in the query name field.

I don't know why the distribution is as skewed as it is. Maybe the distribution is an artifact of the query type field and typical allocation alignments? Other thoughts and ideas are welcome.

Bit-errors in the requested record type: AAAA records


There are nearly as many AAAA record requests as there are A record requests. Do bit-errors of AAAA requests exhibit the same distribution?

Bit Flipped  Binary Value         Value  Count  Unique Count  Note
0            1000 0000 0001 1100  32796  0      0
1            0100 0000 0001 1100  16412  0      0
2            0010 0000 0001 1100  8220   0      0
3            0001 0000 0001 1100  4124   0      0
4            0000 1000 0001 1100  2076   0      0
5            0000 0100 0001 1100  1052   0      0
6            0000 0010 0001 1100  540    4      1
7            0000 0001 0001 1100  284    4      1
8            0000 0000 1001 1100  156    0      0
9            0000 0000 0101 1100  92     0      0
10           0000 0000 0011 1100  60     0      0
11           0000 0000 0000 1100  12                          overlaps PTR
12           0000 0000 0001 0100  20     0      0
13           0000 0000 0001 1000  24     0      0
14           0000 0000 0001 1110  30     0      0
15           0000 0000 0001 1101  29     0      0

Despite there being a nearly identical number of queries for each record type (when excluding queries for 0mdn.net), there are almost no bit-errors for AAAA record queries. The errors that do exist, though, correspond to errors in bit 6 and bit 7. Some of the discrepancy between the number of bit-errors in A and AAAA queries can be explained since there are simply fewer sources of AAAA queries:

$ tshark -n -r completelog.pcap -R '(dns.qry.type == AAAA) and !(dns.qry.name contains 0mdn.net)' -o column.format:'"SOURCE", "%s"' | sort -u > analysis/aaaa_sources.txt

$ tshark -n -r completelog.pcap -R '(dns.qry.type == A) and !(dns.qry.name contains 0mdn.net)' -o column.format:'"SOURCE", "%s"' | sort -u > analysis/a_sources.txt

$ wc -l analysis/aaaa_sources.txt analysis/a_sources.txt
    7206 analysis/aaaa_sources.txt
   29833 analysis/a_sources.txt

There are only ~24% as many sources of AAAA requests as there are of A requests. Still, this would only account for ~76% of the difference in error rate.
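The ratio follows directly from the two wc -l counts above. A minimal sketch of the arithmetic:

```javascript
// Unique source counts from the wc -l output above.
var aaaaSources = 7206;
var aSources = 29833;

// Fraction of A-query sources that also appear as AAAA-query volume.
var ratio = aaaaSources / aSources;

console.log((ratio * 100).toFixed(1) + "%");        // "24.2%"
console.log(((1 - ratio) * 100).toFixed(1) + "%");  // "75.8%"
```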

Conclusion


The bit-error distribution, at least with respect to the requested record type field, is not uniform. It is centered at bit 6 and sharply falls off with distance from bit 6. I don't have an explanation as to why, but I suspect it might have to do with packet alignment in memory. Other possibilities include errant networking equipment or software somewhere on the Internet. Any ideas and suggestions, especially testable ones, are most welcome.

There are also more bit-errors in A record requests than AAAA record requests. The fact that there are fewer sources of AAAA records accounts for a part of this discrepancy, but does not completely eliminate it.

If you have any insight, please contact me.

Update:
Part 4 is now up, Bitsquatting PCAP Analysis Part 4: Source Country Distribution.

Wednesday, November 14, 2012

Bitsquatting PCAP Analysis Part 2: Query Types, IPv6


This is the second post in a multi-part series. The previous post is here.

In this installment of Bitsquatting PCAP analysis we will make an educated guess about the prevalence of IPv6 on the Internet, which services DNS is used for, and identify some mysteries in the bitsquatting PCAPs.

All of this information is going to come from just one field: the requested record type of each DNS query.

Background


First, some background on DNS record types. DNS is essentially a distributed hierarchical database. Values are retrieved by specifying a location and a record type. The location is a fully qualified domain name. The record type is one of several defined record types. The most commonly requested record type is A, which means IPv4 address. When you are using IPv4 and translate www.google.com to an IP address, you are retrieving the A record for www.google.com.

The dig command is used to manually query for DNS records. The following command will retrieve the A record for www.google.com:

$ dig +short www.google.com a
173.194.75.99
173.194.75.147
173.194.75.104
173.194.75.103
173.194.75.105
173.194.75.106

The above command says: ask my local name server (usually specified in /etc/resolv.conf) for the A record for www.google.com, and output the result in short form. Note: the IP addresses returned for you will likely be different. Google attempts to direct you to a physically closer server based on the GeoIP location of the requesting DNS server. This is one part of how most content delivery networks work. More in a future blog post.

One more common record type is AAAA, which is used to retrieve IPv6 addresses. Why is the record type called AAAA? Because IPv4 addresses are 32 bits wide, and IPv6 addresses are 128 bits wide. If A is 32-bit, then AAAA would be 32+32+32+32=128-bit. Interestingly, there used to be another record type for retrieving IPv6 addresses, A6, that has since been deprecated. Even if you are using IPv4, you can still retrieve the AAAA record of www.google.com:

$ dig +short www.google.com aaaa
2607:f8b0:400c:c01::67
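
The 32-bit versus 128-bit arithmetic can be checked with Python's standard ipaddress module. This is only a sketch, using two of the addresses that dig returned above as sample input:

```python
import ipaddress

# Sample addresses taken from the dig output above
v4 = ipaddress.ip_address("173.194.75.99")
v6 = ipaddress.ip_address("2607:f8b0:400c:c01::67")

# max_prefixlen is the total number of bits in the address
print(v4.version, v4.max_prefixlen)
print(v6.version, v6.max_prefixlen)

# Four 32-bit "A"-sized chunks make up one 128-bit "AAAA" address
print(4 * v4.max_prefixlen == v6.max_prefixlen)
```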

What is DNS used for?


By tallying the frequency of requested record types, we can determine the popularity of DNS uses. The requested record type is specified by the query type field of each DNS request. We can retrieve the query type from each packet using tshark. Let's get a list of all requested record types, and how often each record type was requested:

$ tshark -n -r completelog.pcap  -o column.format:'"QTYPE", "%Cus:dns.qry.type"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > analysis/all_qtypes.txt

The full record type frequency table is available:  (all_qtypes.txt, 408B, text).
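
For readers without the PCAP at hand, the tr | sort | uniq -c | sort -rn idiom at the end of that pipeline can be sketched in Python with collections.Counter. The qtypes list below is made-up sample input, not data from the capture:

```python
from collections import Counter

# Hypothetical sample of query types in mixed case (not real capture data)
qtypes = ["AAAA", "A", "aaaa", "MX", "A", "AAAA", "a6"]

# Lowercase first (the tr step), then count occurrences (sort | uniq -c)
counts = Counter(q.lower() for q in qtypes)

# most_common() orders by descending count, like sort -rn
for qtype, count in counts.most_common():
    print(count, qtype)
```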

The table below shows the top 15 requested record types. Amazingly, the most requested DNS record type is IPv6 address resolution! Considering that other places measure IPv6 DNS traffic at only 15% of web traffic, something is definitely amiss. More on this after the discussion of DNS use.

Rank  Query Count  Record Type
1     2050660      aaaa
2     1132372      a
3     359779       mx
4     47335        a6
5     38404        any
6     25954        soa
7     8155         cname
8     5130         ns
9     4835         513
10    2622         txt
11    1149         srv
12    698          1025
13    232          257
14    144          ptr
15    141          spf


Name resolution is by far the most popular use of DNS. Name resolution is responsible for the first, second, fourth, and seventh most frequently requested record types. Amazingly, there is a very high frequency of requests for deprecated A6 records. Can there really be that many old BIND servers out there?

The second most popular use of DNS is for email related services. The third most requested record type is MX, which is used for determining the incoming mail servers for a domain. MX records can be viewed from the command line as well:

$ dig +short gmail.com mx
10 alt1.gmail-smtp-in.l.google.com.
30 alt3.gmail-smtp-in.l.google.com.
5 gmail-smtp-in.l.google.com.
20 alt2.gmail-smtp-in.l.google.com.
40 alt4.gmail-smtp-in.l.google.com.

Along with MX, the other records commonly used for email are TXT (to hold SPF and DKIM data) which is the tenth most frequently requested, and SPF (used for SPF data) which is the fifteenth most frequent.

The fifth, sixth, and eighth most frequently requested record types are all used for DNS infrastructure purposes. The ANY record type simply retrieves all available records, the SOA record type specifies who is the primary source for information about the domain, and the NS type specifies nameservers that can be used to answer queries about the domain.

The next most commonly requested record type, SRV, is used for custom protocol related records. In practice, most SRV queries are used to retrieve information for Jabber/XMPP and other messaging services, including VoIP/Videoconferencing services.

Finally, PTR records are used for reverse DNS lookups. A reverse lookup is performed when you want to map an IP address to a domain name. This is one of the few (maybe the only?) times when you will encounter the .arpa TLD. ARPA originally stood for the Advanced Research Projects Agency, the US Government agency that funded the creation of the Internet. These days .arpa has been backronymed to Address and Routing Parameter Area, and what used to be ARPA is now DARPA.

To request a PTR record for an IPv4 address, the octets of the IP are reversed, and .in-addr.arpa is appended. This is because IP addresses are hierarchical from left to right but DNS is hierarchical from right to left. For example, to see what domain 173.194.75.99 (one of the IPs for www.google.com) corresponds to, we would use the following command:


$ dig +short 99.75.194.173.in-addr.arpa ptr
ve-in-f99.1e100.net.
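
The octet reversal does not have to be done by hand. As a small sketch, Python's standard ipaddress module exposes it as the reverse_pointer attribute, shown here next to the manual construction for comparison:

```python
import ipaddress

# One of the A records returned for www.google.com above
addr = ipaddress.ip_address("173.194.75.99")

# reverse_pointer builds the name to query for the PTR record
ptr_name = addr.reverse_pointer
print(ptr_name)

# The same thing done manually: reverse the octets, append .in-addr.arpa
manual = ".".join(reversed(str(addr).split("."))) + ".in-addr.arpa"
print(manual)
```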

The returned domain is not www.google.com, but this is due to Google's infrastructure. There is a clever easter egg in the domain: 1e100 means 1.0 × 10^100, which is one googol.

What can we learn about the prevalence of IPv6?


Before we jump to conclusions about IPv6, we should remember that there are outliers in the bitsquatting PCAPs. If you recall from the previous post, there were numerous queries for 0mdn.net because that domain was an authoritative name server. Queries for 0mdn.net might be affecting the record type distribution. Let's filter out these queries:

$ tshark -n -r completelog.pcap -R '!(dns.qry.name contains 0mdn.net)' -o column.format:'"QTYPE", "%Cus:dns.qry.type"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > analysis/nomdn_qtypes.txt

The full list of record types and their frequencies is available: (nomdn_qtypes.txt, 379B, text).

This command works using the -R option of tshark. The -R option specifies a Wireshark display filter that is applied when reading PCAPs. The filter !(dns.qry.name contains 0mdn.net) will match all packets where the query name field does not contain 0mdn.net. Let's examine the new results:

Rank  Query Count  Record Type
1     550892       a
2     509605       aaaa
3     358926       mx
4     26829        any
5     25039        soa
6     7729         cname
7     4835         513
8     4728         ns
9     2597         txt
10    1148         srv
11    698          1025
12    232          257
13    222          a6
14    143          ptr
15    138          spf

The new table paints a much different picture with regard to IPv6, but there is still a large number of AAAA record requests.

Lesson Learned: There are enough AAAA record requests to indicate IPv6 connectivity is important. If you are attempting to re-do the bitsquatting experiment, have IPv6 connectivity and answer AAAA requests!

What is the nature of IPv6 traffic (AAAA record requests)?


Why were there so many AAAA record requests for the authoritative nameservers, and how do these compare to other domains? Let's use tshark to retrieve all AAAA record requests, along with the domain each request was for:

$ tshark -n -r completelog.pcap -R '(dns.qry.type == AAAA)' -o column.format:'"QTYPE", "%Cus:dns.qry.name"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > analysis/AAAA_queries.txt

The full list of AAAA query frequencies is available: (AAAA_queries.txt, 17KB, text).

AAAA Queries  Domain
794921        ns2.0mdn.net
774496        ns1.0mdn.net
77181         static.ak.dbcdn.net
77053         support.doublechick.net
66595         gmaml.com
58634         g.mic2osoft.com
28107         s0.0mdn.net
16327         www.amazgn.com
13401         mail.gmaml.com
6367          www.micro3oft.com
5678          amazgn.com
4924          www.mic2osoft.com
4789          www.eicrosoft.com
4578          pop.gmaml.com
4346          static.ak.fbgdn.net


The two authoritative name servers receive the most AAAA requests, but there are other domains with numerous IPv6 lookups. Maybe these domains are just popular?

Ratio of IPv4 to IPv6 address lookups

The ratio of IPv4 address resolutions to IPv6 address resolutions will show the proportion of IPv6 traffic for each domain. This measurement should completely disregard popularity, as it uses ratios instead of absolute numbers. My hypothesis was that the ratios should be approximately the same for all domains, as none of the domains I bitsquatted were IPv6 related. Let's calculate the ratios.

Step 1: Calculate A record frequency

The following command will tabulate the frequency of A record requests for each domain:

$ tshark -n -r completelog.pcap -R '(dns.qry.type == A)' -o column.format:'"QTYPE", "%Cus:dns.qry.name"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > analysis/A_queries.txt

The full list of A query frequencies is available: (A_queries.txt, 99KB, text).

Step 2: Massage Data

The following commands will prepare both the A record frequency and AAAA record frequency tables to be joined on the domain name field.

$ sort -f -k2 analysis/A_queries.txt  > a_q_for_join.txt
$ sort -f -k2 analysis/AAAA_queries.txt  > aaaa_q_for_join.txt

Step 3: Calculate the ratio of A to AAAA record requests

Amazingly, the POSIX standard specifies a relational join command that operates on specially delimited text files. The join command below will join the first file on the second field (-1 2), with the second file also on the second field (-2 2). The second field of both files is the domain name. The output of join is then piped to awk to calculate the ratio of A to AAAA record requests.

$ join -1 2 -2 2 a_q_for_join.txt aaaa_q_for_join.txt | awk '{printf "%d\t%2.2f\t%s\n", $2+$3, $2/$3, $1}' | sort -rn >analysis/ratio_of_a_to_aaaa.txt

The full list of A:AAAA ratios is available: (ratio_of_a_to_aaaa.txt, 18KB, text).
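
The join and awk steps amount to an inner join on the domain name followed by a sum and a division. A rough Python equivalent is sketched below. The AAAA counts are copied from the AAAA query table earlier in this post. The A counts are not listed anywhere in this post, so the ones used here are back-calculated as total minus AAAA from the results table that follows, and should be treated as illustrative:

```python
# AAAA counts as listed in the AAAA query table
aaaa_counts = {
    "ns1.0mdn.net": 774496,
    "mail.gmaml.com": 13401,
    "pop.gmaml.com": 4578,
}

# A counts back-calculated as (total - AAAA); illustrative only
a_counts = {
    "ns1.0mdn.net": 321267,
    "mail.gmaml.com": 56739,
    "example.invalid": 10,  # hypothetical domain with no AAAA queries
}

# Inner join: keep only domains present in both tables, as join does
rows = []
for domain in a_counts.keys() & aaaa_counts.keys():
    a, aaaa = a_counts[domain], aaaa_counts[domain]
    rows.append((a + aaaa, a / aaaa, domain))

# Sort by total query count, largest first (the sort -rn step)
rows.sort(reverse=True)

for total, ratio, domain in rows:
    print("{}\t{:.2f}\t{}".format(total, ratio, domain))
```

Domains that appear in only one of the two inputs (pop.gmaml.com and the hypothetical example.invalid here) drop out of the result, which is also what join does by default.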

Total Query Count  A to AAAA Query Ratio  Domain
1095763            0.41                   ns1.0mdn.net
1072642            0.35                   ns2.0mdn.net
93208              0.40                   gmaml.com
80862              0.05                   static.ak.dbcdn.net
77147              0.00                   support.doublechick.net
70140              4.23                   mail.gmaml.com
59500              0.01                   g.mic2osoft.com
53969              2.31                   www.amazgn.com
43270              6.62                   amazgn.com
28694              0.02                   s0.0mdn.net
20575              8.63                   micro3oft.com
13585              9.32                   miarosoft.com
12175              0.91                   www.micro3oft.com
10762              1.19                   www.mic2osoft.com
9032               26.62                  u2s.micro3oft.com

Different domains exhibit wildly different ratios of IPv4 to IPv6 lookups! Some actually have more IPv6 resolutions than IPv4 resolutions. The mystery is, why is this the case?

Conclusion


IPv6 connectivity is important. When removing outliers, there were almost as many IPv6 resolution requests as IPv4 requests. When investigating in more detail, some domains actually receive more IPv6 resolution requests than IPv4 resolution requests. I do not know why. If you have suggestions, please contact me.

Update:
Part 3 is now up, Bitsquatting PCAP Analysis Part 3: Bit-error distribution.