Fear not, says the NSA, we “touch” only 1.6% of daily internet traffic. If, as they say, the net carries 1,826 petabytes of information per day, then the NSA “touches” about 29 petabytes a day. They don’t say what “touch” means. Ingest? Store? Analyze? Inquiring minds want to know.
For context, Google in 2010 said it had indexed only 0.004% of the data on the net. So by inference from the percentages, does that mean that the NSA is equal to 400 Googles? Better math minds than mine will correct me if I’m wrong.
Seven petabytes of photos are added to Facebook each month. That’s .23 petabytes per day. So that means the NSA is 126 Facebooks.
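For anyone who wants to check the napkin, here is a rough sketch of the arithmetic so far. Every input is a figure quoted in this post, and the 30-day month is my own assumption.

```python
# Figures quoted in the post; nothing here is measured independently.
DAILY_NET_TRAFFIC_PB = 1826        # petabytes per day carried by the net
NSA_TOUCH_SHARE = 0.016            # the NSA's "1.6%"
GOOGLE_INDEX_SHARE = 0.00004       # Google's "0.004%" (2010)
FACEBOOK_PHOTOS_PB_PER_MONTH = 7   # photos added to Facebook each month

nsa_pb_per_day = DAILY_NET_TRAFFIC_PB * NSA_TOUCH_SHARE
googles = NSA_TOUCH_SHARE / GOOGLE_INDEX_SHARE
facebook_pb_per_day = FACEBOOK_PHOTOS_PB_PER_MONTH / 30  # assumes a 30-day month
facebooks = nsa_pb_per_day / facebook_pb_per_day

print(f"NSA touches ~{nsa_pb_per_day:.1f} PB/day")
print(f"~{googles:.0f} Googles, ~{facebooks:.0f} Facebooks")
```

Without rounding along the way, the Facebook figure comes out at about 125 rather than 126. The difference comes from rounding to 29 and .23 before dividing.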
Keep in mind that most of the data passing on the net is not email or web pages. It’s media. According to Sandvine data for the U.S. fixed net from 2013, real-time entertainment accounted for 62% of net traffic, P2P file-sharing for 10.5%. The NSA needn’t watch all those episodes of Homeland (or maybe they should) or listen to all that Coldplay, though I’m sure the RIAA and MPAA are dying to know what the NSA knows about who’s “stealing” what since that “stealing” allegedly accounts for 23.8% of net traffic.
HTTP — the web — accounts for only 11.8% of aggregated up- and download traffic in the U.S., Sandvine says. Communications — the part of the net the NSA really cares about — accounts for 2.9% in the U.S.
So by very rough, beer-soaked-napkin numbers, the NSA’s 1.6% of net traffic would be half of the communication on the net. That’s a fuckuvalota “touching.”
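The "half" is just one division. A small sketch, with the loud caveat that it mixes the NSA's share of all traffic with Sandvine's U.S.-only share for communications:

```python
NSA_TOUCH_PERCENT = 1.6        # NSA's stated share of all daily traffic
COMMUNICATIONS_PERCENT = 2.9   # Sandvine's share for communications (U.S. only)

# Assumes, for the sake of the napkin, that everything the NSA "touches"
# is communications and that the U.S. share holds worldwide.
share_of_communications = NSA_TOUCH_PERCENT / COMMUNICATIONS_PERCENT
print(f"{share_of_communications:.0%} of communications traffic")
```

That works out to a bit more than half, under those assumptions.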
And keep in mind that by one estimate 68.8% of email is spam.
And, of course, metadata doesn’t add up to much data at all; it’s just a few bits per file — who sent what to whom — and that’s where the NSA finds much of its incriminating information. So these numbers are meaningless when it comes to looking at how much the NSA knows about who’s talking to whom. A few weeks ago on Twitter, I showed that with the NSA’s clearance to go three hops out from a suspect, it doesn’t take very long at all before this law of large numbers encompasses us all and our cats.
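Here is a minimal sketch of how fast three hops grows. The contact counts are hypothetical, and the function assumes nobody shares a contact with anybody else, so it gives an upper bound rather than a real estimate.

```python
def reach_within_hops(contacts_per_person: int, hops: int) -> int:
    """Upper bound on people reachable within `hops` steps of one suspect.

    Assumes every person has the same number of contacts and that no two
    people share a contact, so real networks would come in lower.
    """
    return sum(contacts_per_person ** hop for hop in range(1, hops + 1))

for contacts in (50, 100, 200):  # hypothetical address-book sizes
    print(contacts, reach_within_hops(contacts, 3))
```

With 100 contacts apiece, three hops from a single suspect already takes in about a million people.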
If you have better data (and better math) than I have, please do share it.
* “Reach out and touch someone” art inspired by Josh Stearns

Excellent perspective, Jeff. Thank you.
The Sandvine data you show is from over a year ago. The latest data from March 2013 shows about the same percentage for Netflix, but YouTube in second place for downloads with 17%, and much less HTTP and BitTorrent. I think that increases how much non-media traffic the NSA could be collecting.
The NSA’s choice of the word “touch” interests me.
“Touching” implies some sort of active engagement.
What the NSA is *not* saying (I’d be willing to bet) is that it has to “examine” everything, at the TCP/IP packet level, before it knows what subset it wants to “touch”.
And if the NSA is talking about individual TCP/IP packets (which are beyond vast in quantity), it would make it very easy for them to imply that mathematically they’re only “touching” a very, very tiny fraction of what the Internet carries.
Which makes what they’re saying just more disingenuous, duplicitous bullshit.
How to lie with statistics? Let me count the ways …
Most web traffic is duplicative. If a web page is accessed a million times, they only need to “touch” the page once, then just make note of who is accessing it. The numbers are horribly misleading.
“For context, Google in 2010 said it had indexed only 0.004% of the data on the net. So by inference from the percentages, does that mean that the NSA is equal to 400 Googles? Better math minds than mine will correct me if I’m wrong.”
So, no, this is actually a completely inapt comparison. Google indexes stored data, the NSA is more interested in transmitted data and, furthermore, mostly data which is not available on the public internet. You’re doing a bit of an apples-to-orangutans comparison here.
Given Google’s market share in the email arena and their willingness to index your emails for ad-serving purposes, it stands to reason that the NSA is some small, fractional part of “A Google”, even using the most damning figures and interpretations of the information we have.
“to touch” our data, as in “a pederast likes to touch schoolboys”
This article is starting from a faulty premise. It assumes that NSA ‘sees’ 100% of Internet activity and then chooses to ‘touch’ 1.6%, which would be communications (not real-time entertainment). I think it is much more likely that the statement that NSA ‘touches’ 1.6% of daily traffic actually suggests that its starting data pool is 1.6% of daily traffic, including porn, spam, and episodes of Homeland. That may still be ‘fuckovalota’ information, but not on the scale of some of the comparisons above.
http://fotki.yandex.ru/users/pashenko-ecolog/view/1293060/?page=0
Anecdotal … http://t.co/lTHDFclrqo
good numbers!
Particularly when looking at web traffic, most data is incredibly redundant and the copies needn’t be stored. For example, a thousand people downloading the same web page. As long as the NSA has one copy of that page, stores the differences when it changes, and knows who downloaded it when, they know what we’ve been reading and writing—not just on the open web, but in private forums and pretty much any browser-based communications app.
Email is often very redundant too. They can detect redundant quotation of prior emails, web pages, and other documents, and not store the extra copies.
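The store-one-copy idea can be sketched in a few lines. This is only an illustration of deduplication by content hash; the class and method names are made up for the example and say nothing about how the NSA actually does it.

```python
import hashlib
from collections import defaultdict


class DedupStore:
    """Hypothetical store: each distinct content is kept once, accesses are logged."""

    def __init__(self):
        self.blobs = {}                       # content hash -> bytes, stored once
        self.access_log = defaultdict(list)   # content hash -> [(user, timestamp)]

    def record(self, user: str, timestamp: int, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)          # keep only the first copy
        self.access_log[digest].append((user, timestamp))
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
page = b"<html>the same page for everyone</html>"
for i in range(1000):
    digest = store.record(f"user{i}", i, page)

print(len(store.blobs), store.stored_bytes(), len(store.access_log[digest]))
```

A thousand downloads cost one stored copy plus a thousand small log entries, which is the point about redundancy. Storing only the differences when a page changes would need a diffing step this sketch leaves out.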
In principle, all they need to store is the NEW content that we generate: not much more than what people type and say, as fast as they type and talk. (Assume they do speech-to-text conversion on audio and the audio tracks of video.)
If my math is right, 29 petabytes a day is about four megabytes per person per day, for every person on earth.
People just do not type and talk anywhere near that fast, so they could easily be storing everything everyone says and writes, all the time, with lots of capacity to spare.
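A quick check of the per-person figure, as a sketch. The world population of 7 billion is my assumption (roughly right for 2013), and a petabyte is taken as 10^15 bytes.

```python
NSA_PB_PER_DAY = 29                # from the post: 1.6% of 1,826 PB
BYTES_PER_PB = 10**15              # decimal petabyte
WORLD_POPULATION = 7_000_000_000   # assumption: roughly 7 billion in 2013

bytes_per_person = NSA_PB_PER_DAY * BYTES_PER_PB / WORLD_POPULATION
mb_per_person = bytes_per_person / 10**6
print(f"{mb_per_person:.1f} MB per person per day")
```

Under those assumptions it lands just over four megabytes per person per day.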
I don’t get it. They represent us. Who’s f*** us. For once I don’t want to pay taxes to be spied on PERIOD.
Great Article Jeff!
BTW, the highlight of my hump day afternoon is your pre-show talk with Leo and Gina on TWIG.
Keep up the good work!!! :)
So if they’re not actually looking at all the stuff sent from various piratish sites around the world, does that mean I can send my plans for terrorist bombings via a “Game of Thrones” upload? Because if that’s true the operation is both incredibly expensive and incredibly ineffective.
“If my math is right, 29 petabytes a day is about four megabytes per person per day, for every person on earth.”
Including guys in Somalia who just learned how to text.
It also includes a zillion people who don’t have access at all, meaning the average for people who do is significantly higher.
I looked up the bit rate of telephone-quality audio, and it’s only 8 kiloBITS per second, or 1 kiloBYTE per second. A megabyte is a thousand kilobytes, so that’s 1000 seconds per megabyte—about 20 minutes.
If they use 3 of the 4 megabytes per person on capturing audio—a smart thing to do IMO—that’s about an hour of audio a day.
Most people spend less than that amount of time talking on the phone or audio or video chat, so I’d guess they’re capturing EVERYTHING EVERYONE SAYS on the net/phone and running it through a speech-to-text translator so that it can be stored much more compactly and searched with free text queries.
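The audio arithmetic can be sketched like this. It takes the 8 kilobits per second figure from the comment as given and uses decimal megabytes.

```python
BITRATE_BITS_PER_SEC = 8_000                # figure quoted in the comment
BYTES_PER_SEC = BITRATE_BITS_PER_SEC / 8    # 1,000 bytes per second
BYTES_PER_MB = 1_000_000                    # decimal megabyte


def minutes_of_audio(megabytes: float) -> float:
    """Minutes of audio that fit in the given number of megabytes."""
    return megabytes * BYTES_PER_MB / BYTES_PER_SEC / 60


print(f"{minutes_of_audio(1):.1f} min per MB, {minutes_of_audio(3):.0f} min in 3 MB")
```

That gives roughly 17 minutes per megabyte and 50 minutes for three megabytes, in line with the comment's rounded figures.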
They presumably fingerprint every video file that’s posted to see if what you’re posting is just a copy of something already out there, which is easier to do than you think, even if you’ve cut/edited the video in basic ways.
Then they can do a fuzzy fingerprint to see if it’s a modified version of something, e.g., an .mkv rip from a DVD, and how it’s been changed, and look to see if those changes look like (straightforwardly or cryptographically) encoded information. (E.g., “steganography”—look it up.)
If so, they probably store it and set some processors to work on it.
If they check if it’s just a copy then they’re “touching” it by any reasonable definition. If all I have to do is dirty up a “Game of Thrones” vid a bit, add in some steganography and then allow people to download it then they’re really not very effective. They wouldn’t be able to automatically tell the difference between a slightly corrupt copy and one with steganography without looking into it carefully, which they can’t do for every upload.
It’s not difficult to tell the difference between a corrupted file and one with a bunch of data encoded in it steganographically: the varying bits are distributed differently. (In steganography you generally change the “least significant” bits in a way that looks like background noise, and is hard to distinguish from the kinds of noise that are already there, e.g. sensor noise and/or film grain noise. Corrupted files generally do not suffer from selective corruption in the low order bits. They typically have flaws in noticeable bits that visibly affect the image and/or metadata bits that make the file not play correctly.)
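The least-significant-bit idea can be shown with a toy example. This is a sketch on a made-up run of sample bytes, not a real image or video codec, and the function names are hypothetical.

```python
def embed_lsb(cover: bytes, message_bits: list[int]) -> bytes:
    """Hide one message bit in the lowest bit of each cover byte."""
    if len(message_bits) > len(cover):
        raise ValueError("message is longer than the cover data")
    out = bytearray(cover)
    for i, bit in enumerate(message_bits):
        out[i] = (out[i] & 0xFE) | bit   # clear the low bit, then set it
    return bytes(out)


def extract_lsb(stego: bytes, n_bits: int) -> list[int]:
    """Read the hidden bits back out of the lowest bit of each byte."""
    return [byte & 1 for byte in stego[:n_bits]]


def changed_bit_positions(a: bytes, b: bytes) -> set[int]:
    """Bit positions (0 = least significant) where the two byte strings differ."""
    positions = set()
    for x, y in zip(a, b):
        diff = x ^ y
        positions.update(pos for pos in range(8) if diff & (1 << pos))
    return positions


cover = bytes(range(200, 216))   # 16 stand-in sample values
message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
stego = embed_lsb(cover, message)

print(extract_lsb(stego, len(message)) == message)
print(changed_bit_positions(cover, stego))
```

Every change lands in bit position 0, so each sample moves by at most one. Random corruption would instead scatter changes across all eight positions, and that difference in distribution is what a detector would look for.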
But if you know what a corrupted file looks like you can encode the file with bits wrong that look like that. Sure it’s not immune to breakage, but it needs actual examination to tell the difference, which, according to the NSA they do 1.6% of the time.
Jeff, thanks for this article. We also enjoy your perspective on TWiT’s ‘This Week in Google.’ Keep up the good investigative analysis and reporting with your ever-inquiring mind.
Yes, you can encode some information in what looks like corruption. But not a lot of information—AIUI, most corruption isn’t a whole bunch of random bits that could encode concealed information. It’s little glitches that make bigger stretches of data unreadable in the way they were intended to be read, or big gaps from lost or corrupted packets, or whatever.
So I’d expect fake corruption to be okay for communicating little messages like “meet at the usual place at 3 PM Sunday,” but not for big things like a gigabyte of documents. (But I’m no expert.)
I’d also expect that if you kept posting a bunch of apparently corrupted stuff, the NSA would notice: why is this person posting so many corrupt files? Even if they couldn’t decode the messages hidden in apparent corruption, they could flag you as somebody who’s likely trying to conceal something, and it would get you extra scrutiny by other means.
Great idea! Thanks for the suggestion. I’ll be updating the post text and be sure to include it up there, thanks.