NSA Has a Database Problem

Back in 2009 when the government released what we now know is a FISA Court of Review decision ordering Yahoo to cooperate in PRISM, I questioned a passage of the decision that relied on the government’s claim that it doesn’t keep a database of incidentally collected conversations involving US persons.

In this post, I just want to point to a passage that deserves more scrutiny:

The government assures us that it does not maintain a database of incidentally collected information from non-targeted United States persons, and there is no evidence to the contrary. On these facts, incidentally collected communications of non-targeted United States persons do not violate the Fourth Amendment.(26)

To translate, if the government collects information from a US citizen (here or abroad), a legal permanent US resident, a predominantly US organization, or a US corporation in the course of collecting information on someone it is specifically targeting, it claims it does not keep that in a database (I’ll come back and parse this in a second). In other words, if the government has a tap on your local falafel joint because suspected terrorists live off their falafels, and you happen to call in a take out order, it does not have that in a database.

There are reasons to doubt this claim.

In the rest of the post, I showed how a response from Michaels Mukasey and McConnell to Russ Feingold’s efforts to protect US person incidental collection during the FISA Amendments Act had made it clear having access to this incidentally collected data was part of the point, meaning the government’s reassurances to the FISCR must have been delicate dodges in one way or another. (Feingold’s Amendments would have prevented 3 years of Fourth Amendment violative collection, by the way.)

Did the court ask only about a database consisting entirely of incidentally collected information? Did they ask whether the government keeps incidentally collected information in its existing databases (that is, it doesn’t have a database devoted solely to incidental data, but neither does it pull the incidental data out of its existing database)? Or, as bmaz reminds me below but that I originally omitted, is the government having one or more contractors maintain such a database? Or is the government, rather, using an expansive definition of targeting, suggesting that anyone who buys falafels from the same place that suspected terrorist does then, in turn, becomes targeted?

McConnell and Mukasey’s objections to Feingold’s amendments make sense only in a situation in which all this information gets dumped into a database that is exposed to data mining. So it’s hard to resolve their objections with this claim–as described by the FISA Appeals Court.

Which is part of the reason I’m so intrigued by this passage of John Bates’ October 3, 2011 decision ruling some of NSA’s collection and retention practices violated the Fourth Amendment. In a footnote amending a passage explaining why the retention of entirely US person communications with the permissive minimization procedures the government had proposed is a problem, Bates points back to that earlier comment.

The Court of Review plainly limited its holding regarding incidental collection to the facts before it. See In re Directives at 30 (“On these facts, incidentally collected communications of non-targeted United States persons do not violate the Fourth Amendment.” (emphasis added)). The dispute in In re Directives involved the acquisition by NSA of discrete to/from communications from an Internet Service Provider, not NSA’s upstream collection of Internet transactions. Accordingly, the Court of Review had no occasion to consider NSA’s acquisition of MCTs (or even “about” communications, for that matter). Furthermore, the Court of Review noted that “[t]he government assures us that it does not maintain a database of incidentally collected information from non-targeted United States persons, and there is no evidence to the contrary.” Id. Here, however, the government proposes measures that will allow NSA to retain non-target United States person information in its databases for at least five years.

Ultimately, Bates’ approval for the government to query on US person identifiers on existing incidentally collected Section 702 material (see pages 22-23) shows that he hasn’t really thought through what happens to US person incidental collection; he actually has a shocking (arguably mis-) understanding of how permissive the existing minimization rules are, and therefore how invasive his authorization for searching on incidentally collected information will actually be.

But his complaint with the proposed minimization procedures shows what he believes they should be.

The measures proposed by the government for MCTs, however, largely dispense with the requirement of prompt disposition upon initial review by an analyst. Rather than attempting to identify and segregate information “not relevant to the authorized purpose of the acquisition” or to destroy such information promptly following acquisition, NSA’s proposed handling of MCTs tends to maximize the retention of such information, including information of or concerning United States persons with no direct connection to any target.

As Bates tells it, so long as he’s paying close attention to an issue, the government should ideally destroy any US person data it collects that is not relevant to the authorized purpose of the acquisition. (His suggestion to segregate it actually endorses Russ Feingold’s fix from 2008.)

But the minimization rules clearly allow the government to keep such data (after this opinion, they made an exception only for the multiple communication transactions in question, but not even for the other search identifiers involving entirely domestic communication so long as that’s the only communication in the packet).

All the government has to do, for the vast majority of the data it collects, is say it might have a foreign intelligence or crime or encryption or technical data or threat to property purpose, and it keeps it for 5 years.

In a database.

Back when the FISCR used this language, it allowed the government the dodge that, so long as it didn’t have a database dedicated solely to incidentally collected US person communications, it was all good. But the language Bates used should make all the US person information sitting in databases for 5 year periods (which Bates seems not to understand) problematic.

Not least, the phone dragnet database, which — after all — includes the records of 310 million people even while only 12 people’s data has proved useful in thwarting terrorist plots.

Update: Fixed the last sentence to describe what the Section 215 dragnet has yielded so far.

26 replies
  1. bsbafflesbrains says:

    There doesn’t seem to be any control or oversight that is effective because the collection of the data IS warrantless. What about all the other Govt and Private contractors who have or had the same access as Snowden? Snowden acted from altruism or good conscience but how many people have acted for invidious or corrupt motives? Seems like the data has a life of its own now that it exists.

  2. peasantparty says:

    Excellent Point!

    I’m not a techie, so I don’t know how the major ISPs work. However, I would think they have a separate or divisional area for what is American based communications, and what is for China, or Middle Eastern service areas. I would also think the same would be true for cell phone providers.

    The US Domestic dragnet of communications both via internet and cell or landline would have to be presented to the FISA court for review with explicit reference to terror suspects. I do not see how any Judge could legally agree to the mass dragnet of Domestic communications plus the storage of them. Especially not under the guise of the Patriot Act!

    Are there more MEMOS left to be shared? Are there more OLC opinions that make these programs appear legal?

  3. Saul Tannenbaum says:

    Thinking that the NSA has a “dedicated database” for anything misunderstands what databases have become and what we know about NSA technology.

    We actually know a lot about how the NSA databases and queries work because the NSA open-sourced its software, and its developers left the NSA to become entrepreneurs and commercialize their software. (I’ve written that up here: http://cctvcambridge.org/sqrrl ).

    The NSA database structure isn’t a database in the sense most people think about databases. Rather than being rows of data representing people and columns representing attributes about them (like a spreadsheet), its database structure is something called a key-value store. Think of it this way: If you have a standard database with names, birthdates, and hair color, and rows of data for people, the way you turn this into a key-value store is by using a key of say “emptywheel|birthdate” with the value of Marcy’s birthdate. The advantage of the key-value approach is that these databases can be massive, distributed across the globe. And, you don’t have to define a schema for them, that is, decide what columns you want. You want to add a new column for phone numbers, you just start storing keys with, say “emptywheel|phone”.

    If I were running the NSA and I had this technology, I’d be building one massive database to store it all. There’s really no good reason not to, and every reason to pursue this goal.
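To make the row-versus-key-value point in the comment above concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not a description of any NSA system: the function name `to_key_value`, the second record, and every placeholder value are assumptions made up for the example.

```python
def to_key_value(rows, id_column):
    """Flatten spreadsheet-style rows into a flat dict keyed by 'row_id|column'."""
    store = {}
    for row in rows:
        row_id = row[id_column]
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier becomes part of every key instead
            store[f"{row_id}|{column}"] = value
    return store


# Hypothetical rows in a conventional layout; all values are placeholders.
rows = [
    {"name": "emptywheel", "birthdate": "PLACEHOLDER-DATE", "hair": "PLACEHOLDER-COLOR"},
    {"name": "example_user", "birthdate": "PLACEHOLDER-DATE-2", "hair": "PLACEHOLDER-COLOR-2"},
]

store = to_key_value(rows, "name")

# Adding a "new column" requires no schema change: just start storing a new key.
store["emptywheel|phone"] = "PLACEHOLDER-PHONE"

for key in sorted(store):
    print(key, "=>", store[key])
```

Nothing in the flat dict records which columns exist, which is why a new attribute can be added for one person without touching anyone else's entries.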

  4. emptywheel says:

    @Saul Tannenbaum: So when they say they have the 215 dragnet data someplace special what does that mean?

    And what does it mean that in Feb 2012 NSA found records from that database on a server used by technical people massaging the dragnet data?

  5. Saul Tannenbaum says:

    @emptywheel: Maybe they mean that all the keys for the 215 dragnet data start with “specialplace” and those keys are thus considered to be in a special place. Maybe they mean they’re using the Accumulo cell level security to keep that stuff more private.
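Both guesses in the paragraph above can be sketched in a few lines of Python. This is a toy model under loud assumptions: `TinyStore`, the `specialplace|` prefix, and the `dragnet` label are all hypothetical, and a single label per cell is a deliberate simplification of the richer visibility rules that Accumulo supports. It shows only how a "special place" could be nothing more than a key prefix, and how a per-cell label can hide entries from readers who lack the matching authorization.

```python
from bisect import bisect_left


class TinyStore:
    """Toy sorted key-value store with one optional visibility label per cell."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._cells = {}  # key -> (value, required_label)

    def put(self, key, value, label=""):
        if key not in self._cells:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._cells[key] = (value, label)

    def scan_prefix(self, prefix, authorizations=frozenset()):
        """Yield (key, value) for keys starting with prefix that the reader may see."""
        start = bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # sorted order: keys sharing a prefix are contiguous
            value, label = self._cells[key]
            if label == "" or label in authorizations:
                yield key, value


store = TinyStore()
store.put("specialplace|record1", "a", label="dragnet")  # restricted cell
store.put("specialplace|record2", "b")                   # unrestricted cell
store.put("elsewhere|record3", "c")

unprivileged = list(store.scan_prefix("specialplace|"))
privileged = list(store.scan_prefix("specialplace|", {"dragnet"}))
print(unprivileged)
print(privileged)
```

In this model the restricted record sits in the same store as everything else; whether a reader sees it depends entirely on the authorizations passed to the scan.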

    When they’re talking about this stuff, to the normal obfuscation, you add the reality that a really, really small number of people understand how these technologies work. It’s unlikely that the lawyers writing this stuff understand what they’re being told by the technical people.

    Or, to put it another way, I have no idea what they’re really doing, and am cautioning that the words they’re using don’t have a very good mapping to what we know about their technology. Thus, drawing inferences from the common sense meaning of their words is likely misleading.

    As to finding records on a server, it’s unlikely that they’ve got their Accumulo database wired directly into whatever they’re capturing data from. You just wouldn’t want to expose it that way. That means there are intermediate servers and I assume they mean that somebody forgot to clean everything up on one of them.

  6. omphaloscepsis says:

    @Saul Tannenbaum:

    “add the reality that a really, really small number of people understand how these technologies work. It’s unlikely that the lawyers writing this stuff understand what they’re being told by the technical people.”

    So true. Then consider, if a Senate or House Intel committee member has someone on staff who does understand the technologies (we can hope they do, but who knows?), then it still doesn’t enable oversight if the staff member isn’t invited along to the closed door meetings, or isn’t allowed to view notes or presentations from those meetings. And an awful lot of the news reports imply that’s the case.

    Started collecting some basic tutorials on Internet details a few months back, before Snowden, based on a few scare stories about reaching maximum capacity on the Internet. Along with guilt pangs from streaming so much Netflix and other video content.

    Here’s a 10 year old tutorial from Harvard Law School that begins by explaining how e-mail travels, and follows with much more:


    Same paper in PDF form:


    Can post some other links to material on Internet Exchange Points (IXPs) — effectively the I/O ports for whole countries — fiber optic cable networks, and much more, if anyone is interested in that level of detail.

    Probably not. And it shouldn’t be necessary for the average citizen to know what’s under the hood in order to understand that they’ve been sold a lemon.

  7. Peasantparty says:

    @Saul Tannenbaum: thanks, I think. LOL

    Let me clarify my comments a little better. I was wondering if on a daily basis they can separate communications from France, from those from the US, and also the ones from Africa. Is there a delineation by country? Or is it all globbed together when they get it in?

    I guess what I’m trying to say is: does the NSA spend any time separating communications by country?

  8. Peasantparty says:

    @omphaloscepsis: Thank you as well. I’m not really into the details, just the by country issue. If they can see by country, then the reasons they would have to dragnet the entire US are still bull hockey!

  9. earlofhuntingdon says:

    It seems likely that the special Sec. 215 “database”, if it exists, would be virtual, not physical. It might even exist only as a query or search term, meaning that it’s called into existence from a wider ocean of data only when searched for with the correct terms.
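The idea of a database that "exists only as a query" can be illustrated with a short Python sketch. Everything here is hypothetical (the `VirtualView` class, the `source` field, the program names); it is one way such a virtual database could behave, not a claim about how any real system is built.

```python
class VirtualView:
    """A 'database' that stores nothing: it is a saved filter over a larger pool."""

    def __init__(self, backing, predicate):
        self._backing = backing      # the wider ocean of data (shared, not copied)
        self._predicate = predicate  # the query that defines the view

    def __iter__(self):
        # Records are selected only at the moment someone iterates the view.
        return (rec for rec in self._backing if self._predicate(rec))


# Hypothetical pool mixing records from different collection programs.
all_records = [
    {"source": "program_a", "id": 1},
    {"source": "program_b", "id": 2},
]

view = VirtualView(all_records, lambda rec: rec["source"] == "program_a")

# A record added to the pool after the view was defined still shows up in it.
all_records.append({"source": "program_a", "id": 3})

ids = [rec["id"] for rec in view]
print(ids)
```

Because the view holds no data of its own, asking whether "the database" contains something only has an answer relative to the query that defines it.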

  10. orionATL says:

    @Saul Tannenbaum:

    one implication of your very helpful comment on the meaning of terminology is that sincerely earnest or deliberately evasive and dishonest political leaders can reassure the public on the use of this data call-up-and-analyze software and be protected from subsequent criticisms of having misled the citizenry.

    i say this with sen feinstein’s repeated comment “i have been told that…” in mind.

    put another way, it follows that few if any assurances to us citizens by our political leaders may be truly reflective of the reality of the nsa data collection/storage/retrieval/analysis technology capability.

    our leaders are as blind as we.

  11. orionATL says:


    my operating assumption is to put whatever i run across that seems relevant out here for all to read.

    who knows when something may click with a reader with specialized knowledge.

    i’ve seen that happen over and again on internet weblogs.

  12. Saul Tannenbaum says:

    @Peasantparty: For a data stream, the one thing you can be absolutely sure of is where it was just before it gets to you, the previous hop. So, if you’re monitoring traffic at the edge of the US, you can do a pretty good job of selecting by where the previous hop was. But the actual origin? That gets harder, especially if you’re assuming the folks you’re most interested in are going to employ technology to cloak their origin.

  13. jerryy says:

    @Saul Tannenbaum: … “The advantage of the key-value approach is that these databases can be massive, distributed across the globe”.

    When you add in how relational databases work, you also get that the things are in essence automagically updated as events happen.

  14. omphaloscepsis says:



    A tutorial on Internet Exchange Points (IXPs), the Internet portal(s) to a given country:


    List of IXPs by country — note that half the countries in the world don’t have an IXP, hence import the signal across a border with one that does:


    Underwater fiber optic cables:


    Global map on pg. 16 of this file — the rest is interesting, but many readers here may not care:


    Global map on pg. 8 of this file, with tales of several severe outages:


    The cables and the IXPs are a large portion of the “backbone of the Internet”. They are constructed and maintained by corporations, maybe sometimes in partnership with governments, and recover their investment by leasing their use. Not unlike toll roads and bridges.

    Currently a good portion of many telephone calls is conducted over the Internet, using VoIP, with conversion to analog only at the end points. There is a push in the US to convert the entire US telephone system to an Internet-based one within 5 years, kind of like switching to digital TV. So right now, and more so in the future, one can monitor a lot of phone calls by intercepting Internet packets. Maybe not the easiest method, but doable.

  15. Peasantparty says:

    @Saul Tannenbaum: It still does not give the credibility of being selective, nor within the laws. Warrantless with reason, or any stupid OLC jargon still does not jibe with this.

  16. Saul Tannenbaum says:

    @Peasantparty: At the detailed technical level, this is all horribly messy. And there is always, even in the most innocuous situations, pressure to gloss over the messiness as details make it up the management chain. So, you don’t have to posit a deliberate effort to avoid statutory and constitutional issues by whitewashing to understand how this happens.

    @earlofhuntingdon: Yes, this.
