Analyzing Let's Encrypt statistics via Map/Reduce
I've been supplying the statistics for Let's Encrypt since they've launched. In Q4 of 2016 their volume of certificates exceeded the ability of my database server to cope, and I moved it to an Amazon RDS instance.
Amazon's RDS service is really excellent, but paying out of pocket hurts.
I've been re-building my existing Golang/MySQL tools into a Golang/Python toolchain at slowly over the past few months. This switches from a SQL database with flexible, queryable columns to a EBS volume of folders containing certificates.
The general structure is now:
/ct/state/Y3QuZ29vZ2xlYXBpcy5jb20vaWNhcnVz /ct/2017-08-13/qEpqYwR93brm0Tm3pkVl7_Oo7KE=.pem /ct/2017-08-13/oracle.out /ct/2017-08-13/oracle.out.offsets /ct/2017-08-14/qEpqYwR93brm0Tm3pkVl7_Oo7KE=.pem
/ct/state exists the state of the log-fetching utility, which is now a Golang tool named
ct-fetch. It's mostly the same as the
ct-sql tool I have been using, but rather than interact with SQL, it simply writes to disk.
This program creates folders for each
notAfter date seen in a certificate. It appends each certificate to a file named for its issuer, so the path structure for cert data looks like:
The Map step
A Python 3 script
ct-mapreduce-map.py processes each date-named directory. Inside, it reads each
.cer file, decoding all the certificates within and tallying them up. When it's done in the directory, it writes out a file named
oracle.out containing the tallies, and also a file named
oracle.out.offsets with information so it can pick back up later without starting over.
The tallies contain, for each issuer:
- The number of certificates issued each day
- The set of all FQDNs issued
- The set of all Registered Domains (eTLD + 1 label) issued
oracle.out is large, but much smaller than the input data.
The Reduce step
Another Python 3 script
ct-mapreduce-reduce.py finds all
oracle.out files in directories whose names aren't in the past. It reads each of these in and merges all the tallies together.
The result is the aggregate of all of the per-issuer data from the Map step.
This will get converted then into the data sets currently available at https://ct.tacticalsecret.com/
I'm not yet to the step of synthesizing the data sets; each time I compare this data with my existing data sets there are discrepancies - but I believe I've improved my error reporting enough that after another re-processing, the data will match.
Right now I'm trying to do all this with a free-tier AWS EC2 instance and 100 GB of EBS storage; it's not currently clear to me whether that will be able to keep up with Let's Encrypt's issuance volume. Nevertheless, even an EBS-optimized instance is much less expensive than an RDS instance, even if the approach is less flexible.