The Privacy Problems (?) of Outsourcing the Dragnet

Both Ed Felten

I am reminded of the scene in Austin Powers where Dr. Evil, in exchange for not destroying the world, demands the staggering sum of “… one MILLION dollars.” In the year 2014, billions of records is not a particularly large database, and searching through billions of records is not an onerous requirement. The metadata for a billion calls would fit on one of those souvenir thumb drives they give away at conferences; or if you want more secure, backed up storage, Amazon will rent you what you need for $3 a month. Searching through a billion records looking for a particular phone number seems to take a few minutes on my everyday laptop, but that is only because I didn’t bother to build a simple index, which would have made the search much faster. This is not rocket science.

And Tim Edgar have started thinking about how to solve the dragnet problem.

One helpful technique, private information retrieval, allows a client to query a server without the server learning what the query is.  This would allow the NSA to query large databases without revealing their subjects of interest to the database holder, and without collecting the entire database.  Recent advances should allow such private searches across multiple, very large databases, a key requirement for the program.  The use of these cryptographic techniques would make the need for a separate consortium that holds the data unnecessary.  I discussed this in more detail in my testimonybefore the Senate Select Committee on Intelligence last fall.  Seny Kamara of Microsoft Researchpoints out these techniques were first outlined over fifteen years ago, while the state of the art is outlined in “Useable, Secure, Private Search” from IEEE Security and Privacy.

But I want to consider something both point to that President Obama said in his speech which both Felten and Edgar consider.

Relying solely on the records of multiple providers, for example, could require companies to alter their procedures in ways that raise new privacy concerns.

I’m admittedly obsessed by this, but one processing step the NSA currently uses on dragnet data seems to pose particularly significant privacy concerns: the data integrity role, in which high volume numbers — pizza joints, voice mail access numbers, and telemarketers, for example — are “defeated” before anyone starts querying the database.

This training module from 2011 (and therefore before some apparent additions to the data integrity role, as I’ll lay out in a future post) describes three general technical roles, the first of which would be partly eliminated if the telecoms kept the data.

  • Ensuring production meets the terms of the order and destroying that which exceeds it (5)
  • Ensuring the contact-chaining process works as promised to FISC (much of this description is redacted) (7)
  • Ensuring that all BR and PR/TT queries are tagged as such, as well as several other redacted tasks (this tagging feature was added after the 2009 problems) (9)

The first and third are described as “rarely coming into contact with human intelligible” metadata (the first function would likely see more intelligible data on intake of completed queries from the telecoms). But — assuming a parallel structure across these three descriptions — the redacted description on page 8 suggests that the middle function — what elsewhere is called the data integrity function — has “direct and continual access and interaction” with human intelligible metadata.

And indeed, the 2009 End-to-End Review and later primary orders describe the data integrity analysts querying the database with non-RAS approved identifiers to determine whether they’re high volume identifiers that should be taken out of the dragnet.

Those analysts are not just accessing data in raw form. They’re making analytic judgments about it, as this description from the E-2-E report explains.

As part of their Court-authorized function of ensuring BR FISA metadata is properly formatted for analysis, Data Integrity Analysts seek to identify numbers in the BR FISA metadata that are not associated with specific users, e.g., “high volume identifiers.” [Entire sentence redacted] NSA determined during the end-to-end review that the Data Integrity Analysts’ practice of populating non-user specific numbers in NSA databases had not been described to the Court.

(TS//SI//NT) For example, NSA maintains a database, [redacted] which is widely used by analysts and designed to hold identifiers, to include the types of non-user specific numbers referenced above, that, based on an analytic judgment, should not be tasked to the SIGINT system. Read more