One of the features of SpiderOak is that if you back up the same file twice, on the same computer or on different computers within your account, the second copy doesn't take up any additional space. The same applies if you have several versions of a file as it evolves over time -- we only need to save the new data blocks.
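As a rough sketch of the underlying idea (illustrative only -- fixed-size blocks and made-up names, not our actual client code):

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1 MiB; a real client might use variable-size blocks

def new_blocks(path, stored_hashes):
    """Yield only the blocks of `path` whose hashes aren't already stored,
    so a second copy or a new version costs only its changed blocks."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in stored_hashes:
                stored_hashes.add(digest)
                yield digest, block
```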
Some storage companies take this de-duplication to a second level and do a similar form of de-duplication across all the data from all their customers. It's a great deal for the company: they can sell the same bytes of storage to every user at full price while incurring zero additional cost. In some ways it's helpful to the user too -- uploads are certainly faster when you don't have to transfer the data!
HOW DOES CROSS USER DATA DE-DUPLICATION EVEN WORK?
The entire process of a server de-duplicating files that haven't even been uploaded to the server yet is a bit magical, and works through the properties of cryptographic hash functions. These allow us to make something like a fingerprint of any file. Like people, no two files should have the same fingerprints, right? The server can just keep a database of file fingerprints and compare any new data to these fingerprints.
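In rough terms (a sketch of the general idea, not any particular vendor's protocol), the exchange might look like this: the client hashes locally, asks the server whether the fingerprint is already known, and only uploads when it isn't.

```python
import hashlib

# Server side: one fingerprint index shared across every customer's data.
known_fingerprints = set()

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def server_has(fp: str) -> bool:
    """The server answers from its fingerprint database alone --
    it never needs to see the file to say 'already stored'."""
    return fp in known_fingerprints

# Client side: ask first, transfer the bytes only if the data is genuinely new.
def backup(data: bytes) -> None:
    fp = fingerprint(data)
    if not server_has(fp):
        # ...upload the actual file content here...
        known_fingerprints.add(fp)
```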
SO THE SERVER CAN DE-DUPLICATE AND STORE MY FILES KNOWING ONLY THE FINGERPRINTS. HOW DOES THIS AFFECT MY PRIVACY AT ALL?
Knowing only a file's fingerprint, there's no practical way to reconstruct the file it was made from. We could even prepend some random data to files before fingerprinting them, so the resulting fingerprints would not match outside databases of common files and their fingerprints.
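One way such a scheme might look (a sketch assuming a single service-wide secret value; the names are ours): mixing that secret into every fingerprint makes them useless against outside lookup tables, while identical files still match each other inside the service.

```python
import hashlib
import os

# One random secret for the whole service; it must be the same for every
# user, or identical files would stop de-duplicating against each other.
SERVICE_SECRET = os.urandom(32)

def salted_fingerprint(data: bytes) -> str:
    return hashlib.sha256(SERVICE_SECRET + data).hexdigest()
```

Note that this only defeats outsiders with precomputed tables of known files; it does nothing about the subpoena scenario below, where the comparison file is handed to the service itself.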
However, imagine a scenario like this. Alice has operated the best BBQ restaurant in Kansas City for decades. No one can match Alice's amazing sauce. Suddenly Mallory opens a BBQ joint right across the street, with better prices and sauce that's just as good! Alice is pretty sure Mallory stole the recipe right off her computer. Her attorney convinces a court to issue a subpoena to SpiderOak: does Mallory have a copy of her recipe? "How would we know? We have no knowledge of his data beyond its billable size." Exasperated, the court rewrites its subpoena: "Does Mallory's data include a file whose fingerprints match the recipe file provided here in exhibit A?" If we keep a de-duplication database, that is a question we can answer, and one we would be required to answer. As much as we enjoyed Alice's BBQ, we never wanted to support her cause by answering a third party's questions about customer data.
Imagine more everyday scenarios: a divorce case; a patent, trademark, or copyright dispute; a political case where a prosecutor wants to establish that a high-level defendant "had knowledge of" the topic. Establishing that the defendant kept a document about the topic in their personal online storage account might be very interesting to the attorneys. Is it a good idea for us even to be capable of betraying our customers like that?
BONUS: DEDUPING VIA CRYPTOGRAPHIC FINGERPRINTS ENABLES THE ULTIMATE SIN
The ultimate sin for a storage company isn't simply losing customer data. That's far too straightforward a blunder to deserve much credit, really.
The ultimate sin is when a storage company accidentally presents Bob's data to Alice as if it were her own. At once Bob is betrayed and Alice is frightened. This is what can happen if Bob and Alice each have different files that happen to have the same fingerprints.
Actually, cryptographic hashes are more like DNA evidence at a crime scene than real fingerprints -- people with identical DNA markers can and do exist. Cryptographers have invented many clever ways to reduce the likelihood of a collision, but those ways tend to make the calculations more expensive and the database larger, so some non-zero level of acceptable collision risk must be chosen. In a large enough population of data, collisions happen.
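For a feel of that trade-off, the standard birthday-bound approximation puts the chance of at least one collision among n items hashed into 2^b possible values at roughly n^2 / 2^(b+1). A quick back-of-envelope calculation (ours, not from any vendor's documentation):

```python
def collision_probability(n_items: float, hash_bits: int) -> float:
    """Birthday-bound approximation: P(collision) ~ n^2 / 2^(bits + 1)."""
    return min(1.0, n_items ** 2 / 2 ** (hash_bits + 1))

# A trillion stored blocks against a full 256-bit hash: vanishingly unlikely.
print(collision_probability(1e12, 256))   # ~4.3e-54
# The same population against a cheaper, truncated 64-bit fingerprint: effectively certain.
print(collision_probability(1e12, 64))    # 1.0
```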
This all makes for an entertaining conversation between Alice and Bob when they meet each other this way. Hopefully they'll tell the operators of the storage service, which will otherwise have no way of even knowing this error has happened. Of course, it's still rather unlikely to happen to you...
THERE'S A PUBLIC INFORMATION LEAK ANYONE CAN EXPLOIT
Any user of the system can check if a file is already contained within the global storage set. They do this simply by adding the file to their own storage account, and observing the network traffic that follows. If the upload completes without transferring the content of the file, it must be in the backup somewhere already.
For a small amount of additional work, they could arrange to shut down the uploading program as soon as they observe enough network traffic to know the file is not a duplicate, then check again later. In this way, they could probe repeatedly over time and learn when a given file enters the global storage set.
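The probe itself is almost trivial. A rough sketch, assuming a hypothetical client library whose upload call reports the number of bytes it actually transferred (not a real SpiderOak API):

```python
import os

def already_in_global_store(path: str, client) -> bool:
    """If a 'successful upload' moves almost none of the file's bytes,
    some other user must already have stored that exact content."""
    size_on_disk = os.path.getsize(path)
    bytes_sent = client.upload(path)        # hypothetical: returns bytes transferred
    return bytes_sent < 0.01 * size_on_disk
```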
If you wanted to, you could check right now if a particular file was already backed up by many online storage services.
How might someone be able to maliciously use this property of a global de-duplication system to their advantage?
- You could send a new file to someone using the storage service and know for sure when it had arrived in their possession
- You could combine this with the Canary Trap method to expose the specific person who is leaking government documents or corporate trade secrets to journalists
- You could determine whether your copyrighted work exists on the backup service, and then sue the storage service for information on users storing the work
There are also categories of documents that only a particular user is likely to have, so merely confirming that such a file exists in the storage set can reveal who is storing it.
HOW MUCH SPACE SAVINGS ARE WE REALLY TALKING ABOUT?
Surely more than a few users have the same Britney Spears mp3s and other predictable duplicates. Across a large population, might 30%, 40%, or perhaps even 50% of the data be redundant? (Of course, the likelihood of matches grows as the total population increases, but the effect diminishes: de-duplication gains more as the data set grows from 1 user to 10,000 users than from 10,000 users to 20,000 users, and so on.)
In our early planning phase with SpiderOak, and during the first few months while we operated privately before launch, we did a study with a population of cooperative users who were willing to share fingerprints, anonymized as much as was practical. Of course, our efforts suffered from obvious selection bias, and probably numerous other drawbacks that make them unscientific. However, even when we plotted the equations up to very large populations, we found that the savings were unlikely to be as much as 20%. We chose to focus instead on developing other cost advantages, such as building our own backend storage clustering software.
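The estimate itself needs nothing but the anonymized fingerprints. Roughly speaking (our illustration, not the actual study code, and ignoring differences in block size):

```python
from collections import Counter

def estimated_savings(fingerprints):
    """Fraction of stored blocks a cross-user de-duplicating store would
    eliminate, given one fingerprint per block across the sampled users."""
    counts = Counter(fingerprints)
    total = sum(counts.values())   # blocks as stored without global de-duplication
    unique = len(counts)           # blocks a de-duplicating store would keep
    return (total - unique) / total if total else 0.0
```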
WHAT IF SPIDEROAK SUDDENLY DECIDES TO START DOING THIS IN THE FUTURE?
We probably won't... and even if we did, it would not be possible to apply it retroactively to data already stored. Suppose we were convinced someday; here are some ways we might minimize the dangers:
- We would certainly discuss it with the SpiderOak community first and incorporate the often-excellent suggestions we receive
- It would be configurable according to each user's preference
- We would share some portion of the space savings with each customer
- We would only de-duplicate commonly shared and traded file types, such as mp3s, where it's most likely to be effective and least likely to be harmful