During the months I worked on Let's Encrypt's operations team I got fairly used to being the go-to man for any question that a database query could solve. I also threw together the code for the Let's Encrypt Stats page. All said, I'd gotten quite attached to being able to query the Let's Encrypt data set. Now, however, I don't have access to the datacenters or the databases within. But I do have access to Certificate Transparency logs, as well as other data sources like Censys.io.

I've been coding up a Golang tool called ct-sql that imports CT and Censys.io certificate exports into a MariaDB database to help me perform population statistics on the state of TLS on the public Internet.

With this tool I imported the Google Pilot, Certly.io, and Izenpe.com certificate transparency logs into a local MariaDB database:

go get -u github.com/jcjones/ct-sql  
ct-sql -dbConnect "mysql+tcp://ctsql@localhost:3306/ctdb" -log "https://log.certly.io"  
ct-sql -dbConnect "mysql+tcp://ctsql@localhost:3306/ctdb" -log "https://ct.izenpe.com"  
ct-sql -dbConnect "mysql+tcp://ctsql@localhost:3306/ctdb" -log "https://ct.googleapis.com/pilot"

# Create the views from https://gist.github.com/jcjones/f140919a4d2d41216bee
curl "https://gist.githubusercontent.com/jcjones/f140919a4d2d41216bee/raw/86a284f3b659834f11e936debc635df204def30f/letsencrypt-ct-sql-views.sql" > letsencrypt-ct-sql-views.sql  
mysql < letsencrypt-ct-sql-views.sql  

I've been really interested in one question since we started opening up the private beta at IETF94 last November:

Is Let's Encrypt prompting people to secure domains which had not been secure before, or are people changing from their existing CA?

Since I was first introduced to the Let's Encrypt concept in 2014, I had a feeling that free, automated certificates would primarily be added to new sites, and that, for the most part, existing properties would stick with their status quo.

Using my tools, I did some analysis and posted some of the results to Twitter, which got a lot of positive responses; particularly this chart from Certificate Transparency data showing how many of the domain names in Let's Encrypt's certificates had not appeared in Certificate Transparency before. Certificate Transparency logs are a limited data set, though, and Ryan Hurst rightly called me out on that.

At his recommendation, I patched ct-sql to understand the Censys.io certificate JSON export format, and then expanded my data set by importing the 92.4GB export from 2016-02-16 (free account required). This took about a day, given the speed of iSCSI on my LAN.

ct-sql -dbConnect "mysql+tcp://ctsql@localhost:3306/ctdb" -censysJson "/mnt/nas/certificates.20160216T112902.json"  

I set up a couple of useful SQL views to simplify queries and then got started. First, I wanted to find out how big the overlap actually is between Censys.io and CT.
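The view definitions themselves are in the gist linked above; to give the flavor, a minimal sketch of an `unexpired_certificate` view (assuming the certificate table carries notBefore and notAfter columns) could look like:

```sql
-- Sketch only: limit the certificate table to certificates
-- that are currently within their validity window.
-- Column names (notBefore, notAfter) are assumptions.
CREATE OR REPLACE VIEW unexpired_certificate AS
    SELECT *
    FROM certificate
    WHERE notBefore <= NOW()
      AND notAfter >= NOW();
```

Defining this once as a view keeps the validity-window logic out of every downstream query.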

Certificates in Censys vs CT vs Both

-- Determine the quantity of certificates which are
-- in Censys but not in CT, CT but not in Censys, or
-- in both. Do this by using LEFT JOINs and counting the
-- combinations of NULL fields.
SELECT 
    DATE_FORMAT(c.notBefore, "%Y-%m") AS issueMonth,
    SUM(ct.certID IS NOT NULL
        AND censys.certID IS NULL) AS onlyCT,
    SUM(censys.certID IS NOT NULL
        AND ct.logID IS NULL) AS onlyCensys,
    SUM(ct.logID IS NOT NULL
        AND censys.certID IS NOT NULL) AS inBoth
FROM
    unexpired_certificate AS c
        LEFT JOIN
    censysentry AS censys ON censys.certID = c.certID
        LEFT JOIN
    ctlogentry AS ct ON ct.certID = c.certID
GROUP BY issueMonth;  

The output is here: CertsInCTversusCensys.tsv.

With the data, summed over those months, it turns out that while Censys.io has a larger data set, 82% of their crawled certificates have already been imported into at least one of those CT logs.
Certs in CT versus Censys.io
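For the single overall figure, a query along these lines (a sketch, not necessarily the exact query I ran) computes the percentage of Censys-crawled certificates present in at least one imported CT log:

```sql
-- Sketch: percentage of Censys-crawled certificates that
-- also appear in at least one imported CT log. EXISTS
-- avoids double-counting certs present in multiple logs.
SELECT 
    100 * SUM(EXISTS( SELECT 1
            FROM ctlogentry AS ct
            WHERE ct.certID = censys.certID)) / COUNT(*) AS pctInCT
FROM censysentry AS censys;
```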

This is useful to anyone doing analysis on CT, whether it's via tools like mine, or via Rob Stradling's fabulous crt.sh. Generally speaking, most certs are in CT, and it won't be very difficult to back-fill the million certs which aren't yet. (Maybe next week...)

So that leads to an opportunity to use this additional data to re-examine the first question: is Let's Encrypt being used to secure new domains, or as a cost-saving replacement CA for existing domains?

Domain Names Secured By Other CAs

-- For every domain name on every cert issued by 
-- Let's Encrypt, find how many trusted issuers have
-- previously issued certificates with that name.
SELECT 
    (SELECT 
            COUNT(DISTINCT issuerID)
        FROM
            name AS n2
                JOIN
            certificate AS c2 ON n2.certID = c2.certID
        WHERE
            n2.name = n.name) AS numIssuers,
    COUNT(*) AS numNames
FROM
    le_certificate AS c
        JOIN
    name AS n ON c.certID = n.certID
GROUP BY numIssuers;  

The output is here: NamesIssuedElsewhere.tsv.

This query takes about 4 hours on my home server: the name table has over 97 million rows, and the certificate table almost another 14 million. This is the same query I ran to produce the tweeted graph, but here's the new result, including one more CT log (Izenpe) and the Censys.io export:

Domains Issued Elsewhere than Let's Encrypt

As Ryan expected, with a larger dataset of certificates, we did find some more instances of sites changing from another CA to Let's Encrypt. Overall, the percentage of sites which have only had a publicly-trusted certificate issued by Let's Encrypt went from 94.3% down to 93.9%.

There are still some caveats:

  1. Censys.io's export contains many historical / expired certificates, but it's unknown how many certificates existed which are not in the data set.
  2. When importing the Censys.io dataset, I only imported certificates which Censys had flagged with nss_valid=true, so upstream provided the filtering as to which certificates from that dataset were publicly trusted. This is also true for CT, as each log restricts what issuers are permitted to be entered.
  3. My ct-sql tool throws away certificates which do not parse in Golang's x509 code. Some old certificates are discriminated against due to this, though the total number of affected certs is < 1000.
  4. The ct-sql tool enforces a database constraint that for a given Issuer, certificate serial numbers should be unique. This is also in the Baseline Requirements, but I had ~100 issuer/serial collisions during the imports, again mostly from old certificates.
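On caveat 4, the uniqueness rule is enforced as a database constraint; conceptually it amounts to something like this (the table and column names here are my illustration, not necessarily ct-sql's actual schema):

```sql
-- Sketch: at most one certificate row per (issuer, serial)
-- pair, mirroring the Baseline Requirements' rule that an
-- issuer must not reuse serial numbers. Names are illustrative.
ALTER TABLE certificate
    ADD CONSTRAINT uniqueIssuerSerial UNIQUE (issuerID, serial);
```

With this in place, a colliding import fails loudly rather than silently duplicating a certificate.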

All said, I believe Ryan Hurst is correct that this is about the most complete data set we can readily obtain. In the first quarter of its operation, then, Let's Encrypt has far and away been used more to secure previously-unsecured (or at least untrusted) websites than simply as a cost-savings measure.

Most Observed Unexpired Certificates by Issuer

One last question I often get asked is:

How large is Let's Encrypt compared to other Certificate Authorities?

How you ask this question is important. Most Certificate Authorities issue a lot of TLS certificates for corporations' internal use, not to mention email and code-signing certificates. Also, there have been two decades of SSL/TLS, so we'll need to limit this to unexpired certificates.

If we stick with just unexpired certificates observed on the Web (via Certificate Transparency and Censys.io), we can figure that out pretty readily:

-- For each known issuer, count the number of unexpired
-- certificates issued by the issuer.
SELECT 
    i.issuerID,
    (SELECT 
            COUNT(1)
        FROM
            unexpired_certificate AS c
        WHERE
            c.issuerID = i.issuerID) AS numCerts
FROM
    issuer AS i
ORDER BY numCerts DESC;  

The output is here: UnexpiredCertsByIssuer.tsv.

This is a relatively fast query, as my database contains only 2,383 unique issuing certificates.

Most Common Issuing Certificates on the Public Web

I made a best-effort attempt at grouping all the various intermediate issuers to match them with the actual organization that operates the Certificate Authority. You can see the grouping I used here: UnexpiredCertsGroupedByCA.tsv. I apologize in advance if I made mistakes; I'm going to be adding that information on a per-issuer basis into my tables to make this a little easier, and will (of course) publish the list.

Slicing the Data

As I write this, I'm importing WoSign, Symantec, DigiCert, Venafi, and Google's other logs into the database. There are always more ways to pull out data.

If you have some questions you think my database can answer, feel free to suggest them to me via Twitter.